Deepfake video from a single frame / Geek magazine
First Order Motion Model
Is it possible to make an entire movie from a single photograph? Or to record the movements of one person and then replace that person with another in the video? The answers matter a great deal for fields such as cinema, photography, and computer game development. One possible solution is digital image processing with specialized software. Specialists in the field call this problem automatic video synthesis, or image animation.

The key point predictor outputs heatmaps, one for each key point. The final layer of the decoder uses softmax to predict the heatmaps, which can be interpreted as confidence maps for key point detection.
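To make the idea of a softmax heatmap concrete, here is a minimal sketch in plain Python. It normalizes a small grid of raw scores into a confidence map and then reads off a key point position as the confidence-weighted average of pixel coordinates. The 3x3 score grid and the function names are hypothetical, and reading the position as a weighted average is an assumption about one common way to use such a map, not a statement about the authors' code.

```python
import math

def spatial_softmax(scores):
    """Turn a 2D grid of raw scores into a heatmap whose values sum to 1."""
    peak = max(max(row) for row in scores)  # subtract the max for numerical stability
    exps = [[math.exp(v - peak) for v in row] for row in scores]
    total = sum(sum(row) for row in exps)
    return [[v / total for v in row] for row in exps]

def expected_position(heatmap):
    """Confidence-weighted average of pixel coordinates, returned as (x, y)."""
    x = sum(p * col for row in heatmap for col, p in enumerate(row))
    y = sum(p * r for r, row in enumerate(heatmap) for p in row)
    return x, y

# Hypothetical decoder output for one key point, peaked at column 2, row 1.
scores = [[0.0, 0.0, 0.0],
          [0.0, 0.0, 9.0],
          [0.0, 0.0, 0.0]]
heatmap = spatial_softmax(scores)
x, y = expected_position(heatmap)
```

Because the heatmap sums to 1, each value can be read as the confidence that the key point sits at that pixel, and the weighted average lands close to the peak.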
The authors use a convolutional neural network to estimate the dense motion field $\hat{T}_{S \leftarrow D}$, which maps each pixel of the driving frame $D$ to its location in the source frame $S$. The network takes as input the set of Taylor approximations of $T_{S \leftarrow D}(z)$ at the key points $p_1, \dots, p_K$ and the source frame $S$. It is important that local patterns in $\hat{T}_{S \leftarrow D}$, such as edges or texture, are aligned pixel by pixel with $D$, but not with $S$. So that the input data is already aligned with $\hat{T}_{S \leftarrow D}$, we warp the source frame and get $K$ transformed images $S^1, \dots, S^K$, each of which is aligned with $\hat{T}_{S \leftarrow D}$ in the vicinity of a key point. The heatmaps $H_k$ and the transformed images are combined and processed in a U-Net.
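The dense motion field $\hat{T}_{S \leftarrow D}$ is expressed by the formula:

$$\hat{T}_{S \leftarrow D}(z) = M_0 z + \sum_{k=1}^{K} M_k \left( T_{S \leftarrow R}(p_k) + J_k \left( z - T_{D \leftarrow R}(p_k) \right) \right)$$

Here $M_k$ is a mask highlighting the neighborhood of the key point for which this transformation holds ($M_0$ adds the background), $R$ is an abstract reference frame, and $J_k$ is expressed by the formula:

$$J_k = \left( \frac{d}{dp} T_{S \leftarrow R}(p) \Big|_{p = p_k} \right) \left( \frac{d}{dp} T_{D \leftarrow R}(p) \Big|_{p = p_k} \right)^{-1}$$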
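As a rough illustration of how masked local motions can be blended into one dense field, the sketch below evaluates a single pixel: a background mask weights the identity motion, and one mask per key point weights a first-order (affine) motion around that key point. All numbers, names, and the choice of two key points are hypothetical; in the real model the masks come from the U-Net and are not set by hand.

```python
def local_motion(z, kp_source, kp_driving, jacobian):
    """First-order motion near one key point: kp_source + J (z - kp_driving)."""
    dx, dy = z[0] - kp_driving[0], z[1] - kp_driving[1]
    (a, b), (c, d) = jacobian
    return (kp_source[0] + a * dx + b * dy,
            kp_source[1] + c * dx + d * dy)

def dense_motion_at(z, masks, kps_source, kps_driving, jacobians):
    """Blend the background (identity) motion and the K local motions at pixel z.

    masks[0] weights the background; masks[k] weights key point k (k = 1..K).
    """
    x = masks[0] * z[0]
    y = masks[0] * z[1]
    for m, ks, kd, jac in zip(masks[1:], kps_source, kps_driving, jacobians):
        lx, ly = local_motion(z, ks, kd, jac)
        x += m * lx
        y += m * ly
    return x, y

# Hypothetical values for one pixel and K = 2 key points.
identity = ((1.0, 0.0), (0.0, 1.0))
z = (0.5, 0.5)
masks = [0.2, 0.8, 0.0]  # mask values at z; they sum to 1
kps_source = [(0.4, 0.5), (-0.3, 0.1)]
kps_driving = [(0.6, 0.5), (-0.3, 0.2)]
jacobians = [identity, identity]

source_xy = dense_motion_at(z, masks, kps_source, kps_driving, jacobians)
background_only = dense_motion_at(z, [1.0, 0.0, 0.0], kps_source, kps_driving, jacobians)
```

With the background mask set to 1, the pixel maps to itself, which is how static regions stay in place.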
Recall that the source image $S$ is not aligned pixel by pixel with the generated image $\hat{D}$. To cope with this, the authors use a feature warping strategy. After two down-sampling blocks, we get a feature map $\xi$. Then we warp $\xi$ in accordance with $\hat{T}_{S \leftarrow D}$. When there are occlusions in $S$, optical flow may not be enough to generate $\hat{D}$. Here we introduce an occlusion map $\hat{O}_{S \leftarrow D}$ to mark the areas of the feature map that need to be inpainted because they are missing from the image $S$. The new feature map looks like this:

$$\xi' = \hat{O}_{S \leftarrow D} \odot f_w(\xi, \hat{T}_{S \leftarrow D})$$

where $f_w(\cdot, \cdot)$ denotes the back-warping operation and $\odot$ is the Hadamard product (element-wise multiplication of two arrays of the same shape).

We estimate the occlusion map from the sparse key point representation by adding a channel to the final layer of the dense motion network. The masked feature map $\xi'$ is then fed to the subsequent layers of the image generation module to render the resulting frame.
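The warping and masking step can be sketched on a tiny single-channel feature map. The code below is only an illustration under simplifying assumptions: it uses nearest-neighbour sampling where a real implementation would typically interpolate, it treats the flow as absolute pixel coordinates in the source, and every name and value in it is hypothetical.

```python
def back_warp(features, flow):
    """For every output pixel, fetch the feature at the source location given by flow.

    Nearest-neighbour sampling; locations outside the map yield 0.
    """
    height, width = len(features), len(features[0])
    warped = []
    for r in range(height):
        row = []
        for c in range(width):
            src_c, src_r = flow[r][c]
            src_c, src_r = round(src_c), round(src_r)
            inside = 0 <= src_r < height and 0 <= src_c < width
            row.append(features[src_r][src_c] if inside else 0.0)
        warped.append(row)
    return warped

def apply_occlusion(warped, occlusion):
    """Hadamard (element-wise) product of the warped features and the occlusion map."""
    return [[o * w for o, w in zip(o_row, w_row)]
            for o_row, w_row in zip(occlusion, warped)]

features = [[1.0, 2.0, 3.0],
            [4.0, 5.0, 6.0]]
# flow[r][c] = (x, y) in the source; here every pixel looks one column to the left.
flow = [[(c - 1, r) for c in range(3)] for r in range(2)]
# Values near 0 mark regions that are missing from the source and must be inpainted.
occlusion = [[0.0, 1.0, 1.0],
             [0.0, 1.0, 0.5]]

warped = back_warp(features, flow)
new_features = apply_occlusion(warped, occlusion)
```

The first column has no source pixel to copy from, so both the warp and the occlusion map leave it empty for later layers to fill in.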
Loss functions
The network is trained end to end, combining several loss functions. The reconstruction loss is based on Johnson's perceptual loss function and uses the pre-trained VGG-19 network; it serves as the main driving loss. The reconstruction loss formula is presented below:

$$L_{rec}(\hat{D}, D) = \sum_{i=1}^{I} \left| N_i(\hat{D}) - N_i(D) \right|$$

where
$\hat{D}$ - the reconstructed frame,
$D$ - the driving frame (the frame with the original motion),
$N_i(\cdot)$ - the i-th channel feature extracted from a specific layer of VGG-19,
$I$ - the number of feature channels in this layer.
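A minimal sketch of such a channel-wise reconstruction loss is shown below. It does not run VGG-19; the two small lists are hypothetical activations standing in for the features of one layer, and averaging the absolute differences inside each channel is an assumption made here to keep the numbers readable.

```python
def reconstruction_loss(features_generated, features_driving):
    """Sum over channels of the mean absolute difference between two feature stacks.

    Each argument is a list of channels; each channel is a flat list of activations.
    """
    total = 0.0
    for chan_gen, chan_drv in zip(features_generated, features_driving):
        diffs = [abs(g - d) for g, d in zip(chan_gen, chan_drv)]
        total += sum(diffs) / len(diffs)
    return total

# Hypothetical activations standing in for one VGG-19 layer with two channels.
feat_generated = [[0.0, 1.0, 2.0, 3.0], [1.0, 1.0, 1.0, 1.0]]
feat_driving = [[0.0, 1.0, 2.0, 5.0], [0.0, 1.0, 1.0, 1.0]]

loss = reconstruction_loss(feat_generated, feat_driving)
zero_loss = reconstruction_loss(feat_driving, feat_driving)
```

The loss is zero only when the generated frame produces the same features as the driving frame, which is what pushes the generator toward the driving motion.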
Imposing an equivariance constraint
The key point predictor does not require any key point annotations during training, which can lead to unstable results. The equivariance constraint is one of the most important factors guiding the location of key points in unsupervised learning. It forces the model to predict key points that are consistent with known geometric transformations. Since the motion estimation module predicts not only key points but also Jacobians, we extend the equivariance loss to additionally include constraints on the Jacobians.
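The authors assume that an image $X$ undergoes a spatial deformation $T_{X \leftarrow Y}$, which can be either an affine transformation or a thin plate spline. After this deformation, we get a new image $Y$. Applying the extended motion estimator to both images, we get a set of local approximations for $T_{X \leftarrow R}$ and $T_{Y \leftarrow R}$. The standard equivariance constraint is written as

$$T_{X \leftarrow R} \equiv T_{X \leftarrow Y} \circ T_{Y \leftarrow R}$$

After expanding both sides of the equation in a Taylor series, we obtain the following constraints (here $\mathbf{1}$ is the identity matrix):

$$T_{X \leftarrow R}(p_k) \equiv T_{X \leftarrow Y} \circ T_{Y \leftarrow R}(p_k)$$

$$\mathbf{1} \equiv \left( \frac{d}{dp} T_{X \leftarrow R}(p) \Big|_{p = p_k} \right)^{-1} \left( \frac{d}{dp} T_{X \leftarrow Y}(p) \Big|_{p = T_{Y \leftarrow R}(p_k)} \right) \left( \frac{d}{dp} T_{Y \leftarrow R}(p) \Big|_{p = p_k} \right)$$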
To constrain the positions of key points, the $L_1$ loss function is used. The authors use equal weights when combining loss functions in all experiments, since the model is not sensitive to the relative weights of the reconstruction loss and the two equivariance losses.
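The key point part of the equivariance idea can be sketched as follows: if an image is deformed by a known transformation, the key points found in the original image should coincide with the key points found in the deformed image once they are mapped back through that transformation. The sketch uses an affine transformation and hand-written key points in place of a real detector; all names and values are hypothetical, and the Jacobian constraint is left out for brevity.

```python
def apply_affine(matrix, offset, point):
    """Apply x -> A x + b to a 2D point."""
    (a, b), (c, d) = matrix
    x, y = point
    return (a * x + b * y + offset[0], c * x + d * y + offset[1])

def equivariance_loss(kps_x, kps_y, matrix, offset):
    """Mean L1 distance between key points in X and key points in Y mapped into X."""
    total = 0.0
    for px, py in zip(kps_x, kps_y):
        mapped = apply_affine(matrix, offset, py)
        total += abs(px[0] - mapped[0]) + abs(px[1] - mapped[1])
    return total / len(kps_x)

# Known transformation from Y coordinates to X coordinates: scale by 2, then shift.
matrix = ((2.0, 0.0), (0.0, 2.0))
offset = (0.5, -0.5)

# Hypothetical detector outputs on the deformed image Y.
kps_y = [(0.0, 0.0), (0.25, 0.5)]
# A detector that respects the transformation, and one whose output drifts by 0.5 in x.
consistent_kps_x = [apply_affine(matrix, offset, p) for p in kps_y]
drifted_kps_x = [(x + 0.5, y) for x, y in consistent_kps_x]

perfect = equivariance_loss(consistent_kps_x, kps_y, matrix, offset)
drifted = equivariance_loss(drifted_kps_x, kps_y, matrix, offset)
```

A consistent detector gets zero loss, while any drift is penalized in proportion to its size, which is what steers the key points without annotations.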
To animate an object from the source frame $S_1$ using the frames of a driving video $D_1, \dots, D_T$, each frame $D_t$ is processed independently to obtain $S_t$. To do this, we transfer to $S_1$ the relative motion between $D_1$ and $D_t$. That is, we apply the transformation $T_{D_t \leftarrow D_1}(p)$ in the vicinity of each key point $p_k$. It is important to note that this introduces a limitation: the objects in frames $S_1$ and $D_1$ should have similar poses.
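The idea of transferring relative rather than absolute motion can be sketched on key point positions alone. In the sketch below, each source key point is shifted by the displacement of the matching driving key point between the first driving frame and the current one. This is a simplified assumption for illustration: it ignores the Jacobians entirely, and the function name and coordinates are hypothetical.

```python
def transfer_relative_motion(kps_source, kps_driving_first, kps_driving_t):
    """Shift each source key point by the displacement of the matching driving key point."""
    moved = []
    for s, d1, dt in zip(kps_source, kps_driving_first, kps_driving_t):
        moved.append((s[0] + dt[0] - d1[0], s[1] + dt[1] - d1[1]))
    return moved

# Hypothetical key points for the source image and two driving frames.
kps_source = [(0.0, 0.0), (0.5, 0.25)]
kps_driving_first = [(0.25, 0.0), (0.75, 0.25)]
kps_driving_t = [(0.25, 0.5), (0.5, 0.25)]

animated = transfer_relative_motion(kps_source, kps_driving_first, kps_driving_t)
unchanged = transfer_relative_motion(kps_source, kps_driving_first, kps_driving_first)
```

When the driving frame equals the first driving frame, the source key points do not move, so the animation starts from the source pose. This also shows why the first driving frame and the source should have similar poses: only the change since the first frame is transferred.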
The model sets records!
The model was trained and tested on 4 different data sets:
VoxCeleb - a dataset of faces from 22496 videos taken from YouTube;
UvA-Nemo - a dataset for facial analysis, consisting of 1240 videos;
BAIR robot pushing - a dataset of videos collected by a Sawyer robotic arm pushing various objects on a table. It has 42880 training and 128 test videos;
TaiChi - 280 tai chi videos from YouTube.
The results were compared with X2Face and Monkey-Net, since they are the only existing solutions for animating images without a prior model of the object.
As can be seen from the table, the First Order Motion Model outperforms the other approaches in all respects.
The long-awaited examples of
Now try it yourself! It is very simple: everything has been prepared here.