May 19


Real-Time Character Animation with Machine Learning - An Overview of PFNN, MANN, and LMM

An in-depth look at modern machine learning-based real-time character animation technologies. We examine the PFNN, MANN, and LMM architectures, their principles and pipelines, data requirements, and network structures.

Georgiy Markelov

3D Graphics and Machine Learning Developer

We implemented Learned Motion Matching, developed a plugin for Unreal Engine 5, and integrated trained models into a real-time animation pipeline. Next, we will review existing character animation solutions based on neural networks.

Introduction

Until recently, the standard approaches in character animation included keyframe animation or procedural animation — a family of fundamentally different methods based on inverse kinematics, ragdoll physics, or more complex systems (e.g., GTA IV — Euphoria). However, despite their widespread use, these methods have significant drawbacks: unrealistic results, high cost, limited expressiveness, reliance on manual labor, and difficulty maintaining a consistent artistic style. Then came motion matching, which delivered an entirely different level of animation quality, but only AAA-scale studios could afford such systems. Moreover, motion matching is extremely demanding on memory due to the need to store the entire animation library in RAM.

Some of these shortcomings can be naturally addressed by applying machine learning, thanks to its low memory consumption, scalability with respect to data, and ability to generalize. Today, a new shift can be observed: more and more tasks related to character movement, facial expressions, and behavior are being handed over to machine learning models. The reason is simple — games, VR/AR systems, virtual actors, and interactive simulations require not just beautiful animation, but realistic real-time behavior that adapts to the environment and user actions, which traditional systems cannot provide.

Neural networks can learn from large motion capture datasets, predict movement for upcoming frames, synthesize transitions between poses, control gait, balance, obstacle reactions, and even facial expressions synchronized with speech. The result is animation that looks natural while being generated on the fly — without pre-recorded clips. However, new capabilities come with new challenges: performance, stability, control over results, data quality requirements, and integration into existing pipelines.

In this article, I will present some of the existing machine learning-based solutions for character animation. Since the approaches discussed below are closely related to motion matching technology, I will begin with a brief overview of it.

Motion matching is a search algorithm that selects the pose from an animation database that best fits the current context. At present, this technology delivers the most responsive and plausible animations. It has been used in the following projects: The Last of Us, For Honor, Fortnite.

Instead of describing animation through a state graph, motion matching allows animators to specify pose parameters (features), and the best-matching pose is automatically selected using a nearest-neighbor search. This means that animation quality and variety depend directly on the size of the database. Since the algorithm takes poses from the database as-is, applying only blending and post-processing, the quality of the original animation is preserved, and animators retain control over the result. The motion matching process can be summarized as follows:

1. Compute the desired movement trajectory
2. Search the database for the most suitable pose, considering pose parameters and the movement trajectory
3. Assign the closest matching pose to the character
4. Apply post-processing
5. Repeat steps 1–4

The process of adding new animation material to the database is straightforward and can be performed and debugged in real time. However, memory consumption grows linearly with the amount of data and the number of pose search parameters, so the task ultimately comes down to balancing result quality, memory consumption, and computational performance.

Thus, the following key disadvantages of motion matching can be identified:

The need for a massive dataset: in The Last of Us Part II, for example, Ellie and Abby alone have a total of 2,627 animation clips, or 2 hours and 17 minutes of animation
Lack of generativity: the algorithm can only return poses that already exist in the database

Typical Pipeline

The pipeline is conceptually the same for all three architectures and looks as follows:

Preprocessing: at this stage, training data is prepared, controller parameters are automatically extracted, and terrain is matched to movement using a separate heightmap
Training: the neural network learns to output character movement at each frame using controller parameters
Real-time inference: input data for the neural network (controller and environment) is collected and fed into the system, which determines the character's movement

The source training data is standard motion capture, recorded as a long sequence describing various gaits and movement directions. During preprocessing, the entire sequence is mirrored to augment the dataset.
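Mirroring a pose can be sketched as follows. This is only an illustration under stated assumptions: joint positions are stored as `(x, y, z)` tuples, X is the lateral axis, and `swap_pairs` (a hypothetical name) lists the indices of matching left and right joints. The actual preprocessing code of the papers may differ.

```python
def mirror_pose(positions, swap_pairs):
    """Mirror one pose across the character's sagittal plane.

    positions:  list of (x, y, z) joint positions, X assumed to be the lateral axis
    swap_pairs: list of (left_index, right_index) for paired joints (assumed layout)
    """
    # Flip the lateral coordinate of every joint.
    mirrored = [(-x, y, z) for x, y, z in positions]
    # Exchange left and right joints so that, e.g., the left foot takes the
    # flipped position of the right foot.
    for left, right in swap_pairs:
        mirrored[left], mirrored[right] = mirrored[right], mirrored[left]
    return mirrored


# Toy skeleton: root, left foot, right foot.
pose = [(0.0, 1.0, 0.0), (0.2, 0.0, 0.1), (-0.2, 0.0, -0.1)]
mirrored_pose = mirror_pose(pose, [(1, 2)])
```

Applying the function to every frame of a sequence doubles the amount of training data; rotations and velocities would need an analogous flip.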

Bone position and velocity data is used in an autoregressive manner — the computed result from the previous frame is used as input for the next one.

Phase-functioned neural networks for character control (PFNN)

PFNN is a neural network architecture that operates by generating the weights of a regression network at each animation frame as a function of phase — a variable representing the movement cycle time. These weights are then used to perform regression from controller parameters at the current frame to the corresponding character pose. The proposed architecture is suitable for real-time operation thanks to its computational speed and lightweight memory footprint. Additionally, by sacrificing compactness, performance can be further improved through precomputation of the phase function.

A significant improvement in animation expressiveness is achieved through the dynamic adjustment of neural network weights based on the phase function — this approach allows the network to learn from a high-dimensional dataset where environmental geometry and human motion data are interconnected. The PFNN architecture avoids mixing data from different phases, and the regression function changes smoothly over time depending on the phase. After training, the system can automatically generate appropriate and expressive movements for a character traversing rough terrain, jumping, and avoiding obstacles — in both natural and urban environments.

The input to the neural network consists of the previous pose and controller input, while the output includes the phase change, the current character pose, and some additional parameters.

Along with the animation, controller parameters are provided as training data, consisting of the movement phase, gait semantic labels, character movement trajectory, and a heightmap along the trajectory.

Phase labeling: performed semi-automatically. Foot contact with the surface is computed automatically via velocity and then manually corrected. Once contact data is obtained, phases are assigned: when the right foot contacts the surface, phase = 0, then the left foot = $\pi$ , right foot again = $2\pi$ ( $0 \leq p \leq 2\pi$ ). Interpolation is performed between frames.
Gait labeling: performed manually, represented as a binary vector to disambiguate similar movement types and describe specific movement scenarios.
Trajectory and heightmap: the root transformation of the character describing the movement trajectory is extracted. Then, along the entire trajectory and perpendicular to it on both sides, the surface height is computed.
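The phase assignment described above can be sketched as follows, assuming the foot contacts have already been detected and corrected. The function name and the `(frame, foot)` input format are illustrative assumptions, and contacts are assumed to alternate strictly between the right and left foot.

```python
import math


def label_phase(num_frames, contacts):
    """Assign a phase in [0, 2*pi) to each frame between the first and last contact.

    contacts: sorted list of (frame, foot) with foot "R" or "L", strictly alternating.
    Frames outside the contact range are left as None.
    """
    phase = [None] * num_frames
    for (f0, foot0), (f1, _) in zip(contacts, contacts[1:]):
        # Right foot contact starts at phase 0, left foot contact at pi.
        start = 0.0 if foot0 == "R" else math.pi
        for f in range(f0, f1):
            # Linear interpolation over half a cycle (pi) between two contacts.
            phase[f] = start + math.pi * (f - f0) / (f1 - f0)
    last_frame, last_foot = contacts[-1]
    phase[last_frame] = 0.0 if last_foot == "R" else math.pi  # 2*pi wraps to 0
    return phase


phases = label_phase(25, [(0, "R"), (10, "L"), (20, "R")])
```

In this toy example frame 5 lies halfway between a right and a left contact and receives phase $\pi/2$.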

To describe the character's state, local (relative to root transformation) bone positions and velocities at the current animation frame are used. For trajectory construction, 5 frames from the future and 6 from the past are sampled, covering a total of 1 second of past movement and 0.9 seconds of future movement; together with the current frame this gives the $s = 12$ samples used below. For each sample, the position and direction of the trajectory relative to the root transformation are extracted, along with a binary vector describing the gait and the surface height under and beside the trajectory at a distance of 25 cm. Thus, the full input vector for the neural network for a single animation frame is:

$x_i = \left\{ t_i^p t_i^d t_i^h t_i^g j_{i-1}^p j_{i-1}^v \right\} \in R^n$

$i$ — current frame
$i-1$ — previous frame
$s$ — number of sampled frames (12)
$b$ — number of bones in the skeleton
$t_i^p \in R^{2s}$ — trajectory positions in the 2D horizontal plane
$t_i^d \in R^{2s}$ — trajectory directions in the 2D horizontal plane
$t_i^h \in R^{3s}$ — heights at points to the left, right, and center of the trajectory
$t_i^g \in R^{5s}$ — semantic labels describing the character's gait and other information
$j_{i-1}^p \in R^{3b}$ — local bone positions
$j_{i-1}^v \in R^{3b}$ — local bone velocities

The original paper used the following labels for $t_i^g$ :

idle
walking
running
jumping
crouching

The output of the neural network inference is the vector:

$y_i = \left\{ t_{i+1}^p t_{i+1}^d j_i^p j_i^v j_i^a \dot{r_i^x} \dot{r_i^z} \dot{r_i^a} \dot{p_i} c_i \right\} \in R^m$

$s$ — number of sampled frames (12)
$b$ — number of bones in the skeleton
$t_{i+1}^p \in R^{2s}$ — predicted trajectory positions at the next frame
$t_{i+1}^d \in R^{2s}$ — predicted trajectory directions at the next frame
$j_i^p \in R^{3b}$ — local bone positions
$j_i^v \in R^{3b}$ — local bone velocities
$j_i^a \in R^{3b}$ — bone angles expressed in exponential map form
$\dot{r_i^x} \in R$ — linear velocity of the root transformation along the X axis relative to the "forward" direction
$\dot{r_i^z} \in R$ — linear velocity of the root transformation along the Z axis relative to the "forward" direction
$\dot{r_i^a} \in R$ — angular velocity of the root transformation around the vertical axis
$\dot{p_i} \in R$ — phase change
$c_i \in R^4$ — foot contact information (toe and heel) with the surface

The neural network consists of 3 layers:

$\Phi(x; a) = W_2 \, \mathrm{ELU}(W_1 \, \mathrm{ELU}(W_0 x + b_0) + b_1) + b_2$

The number of neurons in each layer is 512. The network weights $a$ are computed depending on parameters $\beta$ at each frame by a separate phase function $a=\Theta(p;\beta)$ . This function can be another neural network or a Gaussian process, but the paper proposes using a cyclic cubic Catmull-Rom spline with 4 control points. This approach means that each control point $a_k$ represents a specific configuration of network weights $a$ , and the function $\Theta$ performs smooth interpolation between these configurations.

$\Theta(p; \beta) = a_{k_1}$

$+w(\frac{1}{2}a_{k_2} - \frac{1}{2}a_{k_0})$

$+w^2(a_{k_0} - \frac{5}{2}a_{k_1} + 2a_{k_2} - \frac{1}{2}a_{k_3})$

$+w^3(\frac{3}{2}a_{k_1} - \frac{3}{2}a_{k_2} + \frac{1}{2}a_{k_3} - \frac{1}{2}a_{k_0})$

$w = \frac{4p}{2\pi} \pmod 1$

$k_n = \left\lfloor \frac{4p}{2\pi} \right\rfloor + n - 1 \pmod 4$
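The spline above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming each control point is a flat list of weights (a real implementation would interpolate whole weight matrices, typically on the GPU); the function name is an assumption, not taken from the paper's code.

```python
import math


def phase_function(p, control_points):
    """Cyclic cubic Catmull-Rom interpolation of 4 weight configurations.

    p:              phase in [0, 2*pi]
    control_points: list of 4 flat weight lists a_0 .. a_3 of equal length
    """
    t = 4.0 * p / (2.0 * math.pi)
    w = t % 1.0                                            # position inside the segment
    k = [(math.floor(t) + n - 1) % 4 for n in range(4)]    # cyclic control point indices
    a0, a1, a2, a3 = (control_points[i] for i in k)
    return [
        y1
        + w * (0.5 * y2 - 0.5 * y0)
        + w ** 2 * (y0 - 2.5 * y1 + 2.0 * y2 - 0.5 * y3)
        + w ** 3 * (1.5 * y1 - 1.5 * y2 + 0.5 * y3 - 0.5 * y0)
        for y0, y1, y2, y3 in zip(a0, a1, a2, a3)
    ]


# Toy example with a single weight per control point.
toy_points = [[0.0], [1.0], [0.0], [-1.0]]
weights_at_quarter = phase_function(math.pi / 2, toy_points)
```

At phases that are multiples of $\pi/2$ the function returns one of the control points exactly, and in between it blends the four neighbors smoothly. Because the indices wrap around modulo 4, the result at $p = 2\pi$ equals the result at $p = 0$.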

Training the neural network amounts to solving the optimization problem for the phase function parameters $\beta = \left\{ a_0 a_1 a_2 a_3 \right\}$ . The following loss function is used:

$\mathrm{Cost}(X, Y, P, \beta) = \lVert Y - \Phi \left(X; \Theta\left(P; \beta\right)\right) \rVert_2^2 + \gamma \lvert \beta \rvert_1$

$\lVert Y - \Phi \left(X; \Theta\left(P; \beta\right)\right) \rVert_2^2$ — mean squared error
$\gamma \lvert \beta \rvert_1$ — regularization introduced to prevent the weights from becoming too large ( $\gamma = 0.01$ )

Adam was chosen as the optimizer.
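For a single training sample, the network and the cost can be sketched as below. This is an illustration only: plain Python lists stand in for tensors, the weights are passed in directly rather than produced by the phase function, and `beta_flat` (an assumed name) is a flat list of all control point parameters. In practice an autodiff framework would be used so that Adam can update $\beta$.

```python
import math


def elu(v):
    """Exponential linear unit applied element-wise."""
    return [x if x > 0 else math.exp(x) - 1.0 for x in v]


def linear(W, b, x):
    """Compute W x + b for a matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def network(x, weights):
    """Phi(x; a): three linear layers with ELU after the first two."""
    (W0, b0), (W1, b1), (W2, b2) = weights
    return linear(W2, b2, elu(linear(W1, b1, elu(linear(W0, b0, x)))))


def cost(x, y, weights, beta_flat, gamma=0.01):
    """Squared error of one sample plus the L1 penalty on the parameters."""
    prediction = network(x, weights)
    error = sum((yi - yh) ** 2 for yi, yh in zip(y, prediction))
    return error + gamma * sum(abs(v) for v in beta_flat)


# Toy check with identity layers: the prediction equals the (positive) input.
identity = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
toy_cost = cost([1.0, 2.0], [1.0, 2.0], [identity] * 3, beta_flat=[1.0, -1.0])
```

With a perfect prediction only the regularization term remains, so the toy cost is $\gamma \cdot 2$.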

In real time, at each animation frame, the phase $p$ and vector $x$ are fed as input to the neural network. The velocity and direction of movement received from the controller for each future frame are blended with the data predicted by the neural network at the previous frame $(t_{i+1}^p t_{i+1}^d)$ using the following formula:

$\mathrm{TrajectoryBlend}(a_0, a_1, t, \tau) = \left(1 - t^\tau\right) a_0 + t^\tau a_1$

$0 \leq t \leq 1$
$\tau$ — additional bias controlling the character's responsiveness
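The blending formula translates directly into code. The sketch below works on scalars for brevity; a real controller would apply it per component to each future trajectory sample.

```python
def trajectory_blend(a0, a1, t, tau):
    """Blend a0 toward a1 with weight t**tau, where 0 <= t <= 1."""
    weight = t ** tau
    return (1.0 - weight) * a0 + weight * a1


halfway = trajectory_blend(0.0, 10.0, 0.5, 1.0)   # linear blend
biased = trajectory_blend(0.0, 10.0, 0.5, 2.0)    # larger tau keeps the result closer to a0
```

With $\tau = 1$ the blend is linear in $t$. Larger values of $\tau$ reduce the weight of $a_1$ for intermediate $t$, which is how the bias changes the responsiveness.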

Mode-adaptive neural networks for quadruped motion control (MANN)

This work extends the PFNN architecture to the animation of quadruped characters. Due to fundamental differences in movement patterns, it is impossible to define a single phase for all four limbs when the gait changes. Manual labeling of an unstructured dataset also becomes impractical. Therefore, the phase function is replaced by an additional neural network. The resulting system includes a motion prediction network and a gating network that selects experts, similar to the mixture-of-experts architecture. The first network predicts the character's state from the previous frame's state and the controller input. The second network dynamically updates the weights of the first by blending several sets of "expert weights", each of which corresponds to a specific movement pattern. This architecture allows the networks to learn from data without labeled gaits, completely eliminating the phase labeling stage.

Movement classification is performed manually to identify 6 movement classes: locomotion, sitting, standing, waiting, lying down, and jumping. This is done so that in real time the user can specify the movement class via the controller.

The paper examines four gait types: walk, pace, trot, and gallop. Although the system does not require these labels for real-time character control, an analysis of the distribution of these types in the dataset is performed based on velocity calculations.

Input and output data are generally similar to PFNN: 5 future frames and 6 past frames are also sampled, the character's root transformation and movement direction are computed.

The input data vector:

$x_i = \left\{ t_i^p t_i^d t_i^v t_i^{\hat v} t_i^a j_{i-1}^p j_{i-1}^r j_{i-1}^v \right\} \in R^n$

$i$ — current frame
$i-1$ — previous frame
$s$ — number of sampled frames (12)
$b$ — number of bones
$t_i^p \in R^{2s}$ — trajectory positions in the 2D horizontal plane
$t_i^d \in R^{2s}$ — trajectory directions in the 2D horizontal plane
$t_i^v \in R^{2s}$ — velocities at trajectory points in the 2D horizontal plane
$t_i^{\hat v} \in R^{1s}$ — desired velocity at trajectory points
$t_i^a \in R^{6s}$ — one-hot vector of movement classes at trajectory points
$j_{i-1}^p \in R^{3b}$ — local bone positions
$j_{i-1}^r \in R^{6b}$ — local bone rotations
$j_{i-1}^v \in R^{3b}$ — local bone velocities

Adding bone rotations to the input vector allowed for more responsive animation.

The output of the neural network inference is the vector:

$y_i = \left\{ t_{i+1}^p t_{i+1}^d t_{i+1}^v j_i^p j_i^r j_i^v \dot r_i^x \dot r_i^z \dot r_i^a \right\} \in R^m$

$i$ — current frame
$i+1$ — next frame
$s$ — number of sampled frames (12)
$b$ — number of bones
$t_{i+1}^p \in R^{2s}$ — trajectory positions
$t_{i+1}^d \in R^{2s}$ — trajectory directions
$t_{i+1}^v \in R^{2s}$ — velocities at trajectory points
$j_i^p \in R^{3b}$ — local bone positions
$j_i^r \in R^{6b}$ — local bone rotations
$j_i^v \in R^{3b}$ — local bone velocities
$\dot r_i^x \in R$ — linear velocity of the root transformation along the X axis
$\dot r_i^z \in R$ — linear velocity of the root transformation along the Z axis
$\dot r_i^a ∈ R$ — angular velocity of the root transformation in the 2D horizontal plane

Bone rotations are represented as relative "up" and "forward" vectors to avoid quaternion interpolation issues during neural network training.
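To illustrate why six numbers per bone are sufficient, the sketch below rebuilds an orthonormal basis (that is, a rotation) from a "forward" and an "up" vector. The axis order and handedness used here are assumptions made for the example, not something the paper prescribes.

```python
import math


def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return [c / length for c in v]


def cross(a, b):
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def basis_from_forward_up(forward, up):
    """Return orthonormal (right, up, forward) axes from two possibly noisy vectors."""
    f = normalize(forward)
    r = normalize(cross(up, f))   # perpendicular to both inputs
    u = cross(f, r)               # re-orthogonalized up vector
    return r, u, f


right, up, forward = basis_from_forward_up([0.0, 0.0, 2.0], [0.0, 1.0, 1.0])
```

Because the network outputs are only approximately unit length and orthogonal, a re-orthogonalization step of this kind is a natural post-process before the rotation is applied to a bone.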

The motion prediction network architecture is similar to PFNN, but the weights $a$ are computed by mixing $K$ expert weights $\beta = \left\{ a_1, \dots, a_K \right\}$ as $a = \sum_{i=1}^K \omega_i a_i$ . Here $K$ is a tunable meta-parameter depending on the complexity and size of the training data, and $\omega = \left\{ \omega_1, \dots , \omega_K \right\}$ are the mixing coefficients computed by the gating network.

The gating network architecture consists of 3 layers:

$\Omega(\hat x; \mu) = \sigma(W_2^{'} \, \mathrm{ELU}(W_1^{'} \, \mathrm{ELU}(W_0^{'}\hat x + b_0^{'}) + b_1^{'}) + b_2^{'})$

$\hat x \in R^{19}$ — a subset of $x$ including foot bone velocities, current movement classes, and the desired character velocity.
$\sigma(\cdot)$ — softmax function, which normalizes the gating outputs so that they sum to 1, as required for the subsequent linear mixing

The network parameters $\mu$ are defined as follows:

$\mu = \left\{ W_0^{'} \in R^{h^{'}\times{19}}, W_1^{'} \in R^{h^{'}\times{h^{'}}}, W_2^{'} \in R^{K\times{h^{'}}}, b_0^{'} \in R^{h^{'}}, b_1^{'} \in R^{h^{'}}, b_2^{'} \in R^K \right\}$

$h^{'}$ — number of neurons in hidden layers (32)
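The interaction between the gating output and the expert weights can be sketched as follows. The hidden layers of the gating network are omitted here: the example starts from its final pre-softmax activations (called `logits`, an assumed name) and uses flat lists for the expert weights.

```python
import math


def softmax(v):
    """Numerically stable softmax; the outputs are positive and sum to 1."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def blend_experts(omega, experts):
    """Compute a = sum_i omega_i * a_i for flat expert weight lists a_i."""
    size = len(experts[0])
    return [sum(w * a[j] for w, a in zip(omega, experts)) for j in range(size)]


logits = [0.0, 0.0]                      # two experts with equal gating activations
omega = softmax(logits)
blended = blend_experts(omega, [[0.0, 2.0], [2.0, 4.0]])
```

The blended list then plays the role of the weights $a$ of the motion prediction network for the current frame, so both networks are evaluated once per frame.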

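Training the network amounts to learning the mapping from the inputs $X$ to the corresponding outputs $Y$ , which is a standard regression task. The following loss function is used (MSE between prediction and ground truth), where $\Theta$ here denotes the motion prediction network evaluated with the blended weights:

$\mathrm{Cost}(X, Y; \beta, \mu) = \lVert Y - \Theta(X, \Omega(\hat X; \mu); \beta) \rVert_2^2$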

AdamWR was chosen as the optimizer.

Learned motion matching (LMM)

LMM is based on Ubisoft's own implementation of motion matching. Classic motion matching consists of 3 stages: Projection, Stepping, Decompression — for each of which a trained neural network is proposed as an alternative.

Ubisoft's motion matching implementation

For the locomotion scenario, a 27-element vector per animation frame is proposed as pose parameters:

$x = \left\{ t^t t^d f^t \dot{f^t} \dot{h^t} \right\} \in R^{27}$

$t^t \in R^6$ — trajectory positions in 2D projected onto the surface at 20, 40, 60 frames in the future (at 60 FPS)
$t^d \in R^6$ — trajectory direction at 20, 40, 60 frames in the future
$f^t \in R^6$ — local positions of foot bones
$\dot{f^t} \in R^6$ — linear velocities of foot bones
$\dot{h^t} \in R^3$ — linear velocity of the hip bone

Next, a vector containing all pose information for each animation frame is defined:

$y = \left\{ y^t y^r \dot{y^t} \dot{y^r} \dot{r^t} \dot{r^r} o^* \right\}$

$y^t \in R^3$ — local bone positions
$y^r \in R^4$ — local bone rotations in axis-angle representation
$\dot{y^t} \in R^3$ — linear bone velocities
$\dot{y^r} \in R^3$ — angular bone velocities
$\dot{r^t} \in R^3$ — linear velocities of root transformation
$\dot{r^r} \in R^3$ — angular velocities of root transformation
$o^*$ — task-specific additional data (e.g., foot contact information)

These vectors are computed for each frame, combined into matrices $X = \left[ x_0, x_1, \dots, x_{n-1} \right], Y = \left[ y_0, y_1, \dots, y_{n-1} \right]$ , called the matching database and animation database respectively, and used in the training algorithm. In real time, every N frames or when the controller input changes significantly, a query vector $\hat x$ is constructed — analogous to the pose parameter vector — and fed as input to the Projection stage. Once a new frame is found, animation playback starts from it and a transition is inserted.

Projection: nearest-neighbor search to find the pose parameter vector in $X$ that best matches $\hat x$ .
Stepping: advances the index in the matching database.
Decompression: retrieves the pose from the animation database corresponding to the current index in the matching database.
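The three stages can be sketched over toy databases as shown below. This is a simplified illustration: the search runs on a fixed schedule (`search_every` is an assumed parameter), the trigger on controller input changes and the inserted transitions are omitted, and clip boundaries are ignored when stepping.

```python
def projection(query, X):
    """Brute-force nearest-neighbor search over the matching database."""
    return min(
        range(len(X)),
        key=lambda k: sum((q - f) ** 2 for q, f in zip(query, X[k])),
    )


def run_motion_matching(queries, X, Y, search_every):
    """Return the pose chosen for each frame."""
    index, poses = 0, []
    for frame, query in enumerate(queries):
        if frame % search_every == 0:
            index = projection(query, X)           # Projection
        else:
            index = min(index + 1, len(X) - 1)     # Stepping
        poses.append(Y[index])                     # Decompression
    return poses


X = [[0.0], [1.0], [2.0], [3.0]]        # matching database (one feature per frame)
Y = ["p0", "p1", "p2", "p3"]            # animation database (pose placeholders)
poses = run_motion_matching([[1.1], [9.0], [9.0], [0.2]], X, Y, search_every=3)
```

Both `X` and `Y` must stay in memory for this loop to work, which is exactly the cost that the learned variant tries to remove.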

By replacing each of these stages with a neural network, the need to store the matching database and animation database in memory is eliminated. To this end, 4 neural networks are introduced:

Decompressor: eliminates the need to store $Y$ in memory, takes $x$ and a latent vector $z$ as input
Compressor: acts as an encoder for finding $z$ by compressing $y$ into a lower-dimensional vector
Stepper: together with the projector, eliminates the need to store $X$ in memory; learns the dynamics of the system by computing changes in $x_i$ and $z_i$ values to obtain $x_{i+1}$ and $z_{i+1}$ for the next frame
Projector: emulates the nearest-neighbor search, takes $\hat x$ as input and predicts the closest $x_{k^*}$ and $z_{k^*}$

By combining these 4 neural networks, the result is learned motion matching, shown in the figure below.

Since the pose parameter vector typically does not contain enough information to derive the corresponding pose, a latent space $Z$ is introduced. Its values are obtained through the Compressor network — a mapping from $y_i$ to the corresponding $z_i$ . This vector is then concatenated with $x_i$ and fed into the Decompressor, which attempts to reconstruct the original pose $y_i$ . In this way, the network discovers what information is missing from the pose parameter vector $x$ and encodes it in $z$ .

A key aspect of Decompressor training is the loss function designed to minimize the visual perception of error, which uses forward kinematics to evaluate the error in character space. Velocity-aware loss terms are also added so that the result changes smoothly over time.

Pseudocode for Compressor (C) + Decompressor (D) training algorithm

$Function \hspace{2mm} TrainDecompressor(X, Y, \Theta_C, \Theta_D):$

$\hspace{1cm} \text{//} \hspace{2mm} Compute \hspace{2mm} forward \hspace{2mm} kinematics$

$\hspace{1cm} Q \leftarrow ForwardKinematics(Y)$

$\hspace{1cm} \text{//} \hspace{2mm} Generate \hspace{2mm} latent \hspace{2mm} variables \hspace{2mm} Z$

$\hspace{1cm} Z \leftarrow C(\left[ YQ \right]^T; \Theta_C)$

$\hspace{1cm} \text{//} \hspace{2mm} Reconstruct \hspace{2mm} pose \hspace{2mm} \tilde Y$

$\hspace{1cm} \tilde Y \leftarrow D (\left[XZ \right]^T; \Theta_D)$

$\hspace{1cm} \text{//} \hspace{2mm} Recompute \hspace{2mm} forward \hspace{2mm}kinematics$

$\hspace{1cm} \tilde Q \leftarrow ForwardKinematics(\tilde Y)$

$\hspace{1cm} \text{//} \hspace{2mm} Compute \hspace{2mm} latent \hspace{2mm} regularization \hspace{2mm} losses$

$\hspace{1cm} \mathcal{L}_{lreg} \leftarrow w_{lreg} \lVert Z \rVert _2^2$

$\hspace{1cm} \mathcal{L}_{sreg} \leftarrow w_{sreg} \lVert Z \rVert _1$

$\hspace{1cm} \mathcal{L}_{vreg} \leftarrow w_{vreg} \left\Vert \frac {Z_0 - Z_1}{\delta t} \right\Vert_1$

$\hspace{1cm} \text{//} \hspace{2mm} Local \hspace{2mm} \& \hspace{2mm} character \hspace{2mm} space \hspace{2mm} losses$

$\hspace{1cm} \mathcal{L}_{loc} \leftarrow w_{loc} \lVert Y \ominus \tilde Y \rVert_1$

$\hspace{1cm} \mathcal{L}_{chr} \leftarrow w_{chr} \lVert Q \ominus \tilde Q \rVert_1$

$\hspace{1cm} \text{//} \hspace{2mm} Local \hspace{2mm} \& \hspace{2mm} character \hspace{2mm} space \hspace{2mm} velocity \hspace{2mm} losses$

$\hspace{1cm} \mathcal{L}_{lvel} \leftarrow w_{lvel} \left\Vert \frac {Y_0 \ominus Y_1}{\delta t} - \frac {\tilde Y_0 \ominus \tilde Y_1}{\delta t} \right\Vert_1$

$\hspace{1cm} \mathcal{L}_{cvel} \leftarrow w_{cvel} \left\Vert \frac {Q_0 \ominus Q_1}{\delta t} - \frac {\tilde Q_0 \ominus \tilde Q_1}{\delta t} \right\Vert_1$

$\hspace{1cm} \text{//} \hspace{2mm} Update \hspace{2mm} network \hspace{2mm} parameters$

$\hspace{1cm} \Theta_C \Theta_D \leftarrow RAdam(\Theta_C \Theta_D, \nabla \sum\mathop{}_{*} \mathcal {L}_{*})$

end

After training, the Compressor network is not required for real-time operation, since it is only needed to compute $Z$ used for training the other networks.
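The three latent regularization terms from the pseudocode can be written out for a single pair of consecutive frames as follows. This is a sketch under assumptions: the norms are taken as plain sums over both frames, and the weights default to 1, whereas in training they would be chosen as described in the text.

```python
def latent_regularization(z0, z1, dt, w_lreg=1.0, w_sreg=1.0, w_vreg=1.0):
    """Return (L_lreg, L_sreg, L_vreg) for latent vectors of two consecutive frames."""
    both = list(z0) + list(z1)
    l_lreg = w_lreg * sum(v * v for v in both)                       # keeps latents small
    l_sreg = w_sreg * sum(abs(v) for v in both)                      # encourages sparsity
    l_vreg = w_vreg * sum(abs(a - b) / dt for a, b in zip(z0, z1))   # penalizes fast changes
    return l_lreg, l_sreg, l_vreg


losses = latent_regularization([1.0, -1.0], [0.0, 1.0], dt=0.5)
```

The velocity term is what keeps the latent trajectory smooth over time, which in turn makes it easier for the Stepper to predict.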

The Stepper network is trained to take vectors $x_i$ and $z_i$ of the current frame as input and output a delta added to them to obtain vectors $x_{i+1}$ and $z_{i+1}$ for the next frame. A small window of $X$ and $Z$ vectors is taken, and the next values of pose parameters and latent variables are predicted and fed to the next frame.

Pseudocode for Stepper (S) training algorithm

$Function \hspace{2mm} TrainStepper(X, Z, s, \Theta_S):$

$\hspace{1cm} \text{//} \hspace{2mm} Set \hspace{2mm} initial \hspace{2mm} states$

$\hspace{1cm} \tilde X_0, \tilde Z_0 \leftarrow X_0, Z_0$

$\hspace{1cm} \text{//} \hspace{2mm} Predict \hspace{2mm} \tilde X \hspace{2mm} and \hspace{2mm} \tilde Z \hspace{2mm} over \hspace{2mm} a \hspace{2mm} window \hspace{2mm} of \hspace{2mm} s \hspace{2mm} frames$

$\hspace{1cm} for \hspace{2mm} i \leftarrow 1 \hspace{2mm} to \hspace{2mm} s \hspace{2mm} do$

$\hspace{2cm} \text{//} \hspace{2mm} Predict \hspace{2mm} deltas \hspace{2mm} for \hspace{2mm} \tilde X \hspace{2mm} and \hspace{2mm} \tilde Z$

$\hspace{2cm} \delta \tilde x, \delta \tilde z \leftarrow S([\tilde X_{i-1} \tilde Z_{i-1}]^T; \Theta_S)$

$\hspace{2cm} \tilde X_i \leftarrow \tilde X_{i-1} + \delta \tilde x$

$\hspace{2cm} \tilde Z_i \leftarrow \tilde Z_{i-1} + \delta \tilde z$

$\hspace{1cm} end$

$\hspace{1cm} \text{//} \hspace{2mm} Compute \hspace{2mm} losses$

$\hspace{1cm} \mathcal{L}_{xval} \leftarrow w_{xval} ||X - \tilde X ||_1$

$\hspace{1cm} \mathcal{L}_{zval} \leftarrow w_{zval} ||Z - \tilde Z ||_1$

$\hspace{1cm} \mathcal{L}_{xvel} \leftarrow w_{xvel} \left\Vert \frac {X_{0 \rightarrow s-1} - X_{1 \rightarrow s}}{\delta t} - \frac { \tilde X_{0 \rightarrow s-1} - \tilde X_{1 \rightarrow s}}{\delta t} \right\Vert_1$

$\hspace{1cm} \mathcal{L}_{zvel} \leftarrow w_{zvel} \left\Vert \frac {Z_{0 \rightarrow s-1} - Z_{1 \rightarrow s}}{\delta t} - \frac { \tilde Z_{0 \rightarrow s-1} - \tilde Z_{1 \rightarrow s}}{\delta t} \right\Vert_1$

$\hspace{1cm} \text{//} \hspace{2mm} Update \hspace{2mm} network \hspace{2mm} parameters$

$\hspace{1cm} \Theta_S \leftarrow RAdam(\Theta_S, \nabla \sum\mathop{}_{*} \mathcal {L}_{*})$

end
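The autoregressive rollout at the heart of this algorithm can be sketched as follows. The `stepper` argument is a stand-in callable that returns the two deltas; in the real system it would be the trained network, and the loss would be differentiated by an autodiff framework.

```python
def rollout(stepper, x0, z0, steps):
    """Predict features and latents for `steps` frames, feeding each output back in."""
    xs, zs = [list(x0)], [list(z0)]
    for _ in range(steps):
        dx, dz = stepper(xs[-1], zs[-1])
        xs.append([a + d for a, d in zip(xs[-1], dx)])
        zs.append([a + d for a, d in zip(zs[-1], dz)])
    return xs, zs


def l1_distance(seq_a, seq_b):
    """Sum of absolute differences between two sequences of vectors."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(seq_a, seq_b)
        for a, b in zip(row_a, row_b)
    )


def constant_stepper(x, z):
    """Hypothetical stepper that always predicts the same deltas."""
    return [1.0] * len(x), [0.5] * len(z)


xs, zs = rollout(constant_stepper, [0.0], [0.0], steps=3)
value_loss = l1_distance(xs, [[0.0], [1.0], [2.0], [4.0]])
```

Training over a window rather than a single step exposes the network to its own accumulated errors, which is what makes it stable when run frame after frame.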

Finally, the Projector network completely eliminates the need to store $X$ and $Z$ in memory. For its training, a vector $x$ is taken from the matching database, a Gaussian noise vector $n$ is sampled and scaled by a uniformly sampled magnitude $n^\sigma$ , the result is added to $x$ to obtain $\hat x$ , and the index $k^*$ of the nearest entry in $X$ is found via nearest-neighbor search. The Projector is then trained to output the corresponding pose parameter vector $x_{k^*}$ and latent variables $z_{k^*}$ .

Pseudocode for Projector (P) training algorithm

$Function \hspace{2mm} TrainProjector(x, X, Z, \Theta_{\mathcal{P}}):$

$\hspace{1cm} \text{//} \hspace{2mm} Sample \hspace{2mm} uniform \hspace{2mm} noise \hspace{2mm} magnitude \hspace{2mm} n^\sigma$

$\hspace{1cm} n^\sigma \sim \mathcal{U}(0, 1)$

$\hspace{1cm} \text{//} \hspace{2mm} Sample \hspace{2mm} gaussian \hspace{2mm} noise \hspace{2mm} vector \hspace{2mm} n$

$\hspace{1cm} n \sim \mathcal{N}(0, 1)$

$\hspace{1cm} \text{//} \hspace{2mm} Add \hspace{2mm} noise \hspace{2mm} to \hspace{2mm} feature \hspace{2mm} vector$

$\hspace{1cm} \hat x \leftarrow x + n^\sigma n$

$\hspace{1cm} \text{//} \hspace{2mm} Find \hspace{2mm} nearest \hspace{2mm} neighbor$

$\hspace{1cm} k^* = Nearest(\hat x, X)$

$\hspace{1cm} \text{//} \hspace{2mm} Project \hspace{2mm} feature \hspace{2mm} vector$

$\hspace{1cm} \tilde x, \tilde z \leftarrow \mathcal{P}(\hat x; \Theta_{\mathcal{P}})$

$\hspace{1cm} \text{//} \hspace{2mm} Compute \hspace{2mm} losses$

$\hspace{1cm} \mathcal{L}_{xval} \leftarrow w_{xval} \lVert x_{k^*} - \tilde x \rVert_1$

$\hspace{1cm} \mathcal{L}_{zval} \leftarrow w_{zval} \lVert z_{k^*} - \tilde z \rVert_1$

$\hspace{1cm} \mathcal{L}_{dist} \leftarrow w_{dist} \left\Vert \lVert \hat x - x_{k^*} \rVert_2^2 - \lVert \hat x - \tilde x \rVert_2^2 \right \Vert_1$

$\hspace{1cm} \text{//} \hspace{2mm} Update \hspace{2mm} network \hspace{2mm} parameters$

$\hspace{1cm} \Theta_{\mathcal{P}} \leftarrow RAdam(\Theta_{\mathcal{P}}, \nabla \sum\mathop{}_{*} \mathcal {L}_{*})$

end

By sampling noise of varying magnitude, the Projector is made robust to perturbations at different scales.
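Constructing one training sample for the Projector can be sketched as below. The function name and the use of Python's `random.Random` are choices made for this illustration, and the brute-force search stands in for whatever accelerated nearest-neighbor structure is used in practice.

```python
import random


def make_projector_sample(x, X, Z, rng):
    """Return (x_hat, x_target, z_target) for one noisy query."""
    n_sigma = rng.uniform(0.0, 1.0)                     # noise magnitude
    noise = [rng.gauss(0.0, 1.0) for _ in x]            # Gaussian noise vector
    x_hat = [xi + n_sigma * ni for xi, ni in zip(x, noise)]
    # Nearest neighbor of the noisy query in the matching database.
    k = min(
        range(len(X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(x_hat, X[i])),
    )
    return x_hat, X[k], Z[k]


rng = random.Random(0)
X = [[0.0, 0.0], [100.0, 100.0]]     # toy matching database
Z = [[1.0], [2.0]]                   # toy latent variables
x_hat, x_target, z_target = make_projector_sample(X[0], X, Z, rng)
```

The network receives `x_hat` as input and is penalized for deviating from `x_target` and `z_target`, so at runtime it can map an arbitrary query to something that looks like a database entry.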

For all loss functions in all training algorithms, the coefficients $w_*$ are chosen to equalize the initial values (at the 1st training iteration).

RAdam was chosen as the optimizer.

The number of layers, neurons, and activation functions for the networks are shown in the figure below:

The runtime operation of the architecture proceeds as follows:

1. $\hat x$ is formed and fed as input to the Projector, which outputs $x_{k^*}$ and $z_{k^*}$
2. The found $x_{k^*}$ and $z_{k^*}$ are fed as input to the Stepper, which advances them in time
3. The result from step 2 is fed as input to the Decompressor, which outputs the final character pose
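The runtime loop can be sketched with the three networks passed in as plain callables. This is an illustration under assumptions: the Projector is assumed to run only every `search_every` frames (mirroring the periodic search in classic motion matching described earlier), the Stepper advances the state on the remaining frames, and all three callables are stand-ins for trained models.

```python
def lmm_runtime(queries, projector, stepper, decompressor, search_every):
    """Return one pose per frame using only the three learned components."""
    x, z, poses = None, None, []
    for frame, query in enumerate(queries):
        if frame % search_every == 0:
            x, z = projector(query)                    # replaces the database search
        else:
            dx, dz = stepper(x, z)                     # replaces index stepping
            x = [a + d for a, d in zip(x, dx)]
            z = [a + d for a, d in zip(z, dz)]
        poses.append(decompressor(x, z))               # replaces the pose lookup
    return poses


# Hypothetical stand-ins used only to exercise the loop.
poses = lmm_runtime(
    queries=[[5.0], [0.0], [0.0]],
    projector=lambda q: ([q[0]], [0.0]),
    stepper=lambda x, z: ([1.0], [0.1]),
    decompressor=lambda x, z: (x[0], round(z[0], 1)),
    search_every=3,
)
```

Note that neither the matching database nor the animation database appears anywhere in the loop: the only state carried between frames is the pair of vectors `x` and `z`.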

Platform limitations prevent embedding a demonstration here; video materials can be found via the original sources.

Conclusion

This article did not include performance and quality evaluations of the discussed approaches — you can find those in the original papers in more detail.

Additionally, some important components (trajectory construction, character-to-heightmap matching) were not covered as they fall outside the scope of this article. In the next article, I plan to cover some of the omitted details.

References:

Motion matching: Motion Matching and The Road to Next-Gen Animation
The Last of Us Part II: Bringing Allies to Life in the 'Last of Us Part II'
PFNN: Phase-Functioned Neural Networks for Character Control
MANN: Mode-Adaptive Neural Networks for Quadruped Motion Control
LMM: Learned Motion Matching
