Playing Mortal Kombat with TensorFlow.js
While experimenting with improvements to the Guess.js prediction model, I started looking closely at deep learning: at recurrent neural networks (RNNs), in particular LSTMs, because of their "unreasonable effectiveness" in the domain where Guess.js operates. At the same time, I started playing with convolutional neural networks (CNNs), which are also often applied to time series. CNNs are commonly used for image classification, recognition, and detection.

MK.js with TensorFlow.js
The source code for this article and for MK.js is on my GitHub. I have not published the training data set, but you can build your own and train the model as described below!
Having played with CNNs, I remembered an experiment I ran a few years ago, when browsers shipped the getUserMedia API. In it, the user's camera served as a controller for a small JavaScript clone of Mortal Kombat 3. You can find that game in my GitHub repository. As part of the experiment, I implemented a basic positioning algorithm that classifies an image into the following classes:

- Punch with the left or right hand
- Kick with the left or right leg
- Steps to the left and right
- Squatting
- None of the above

The algorithm is simple enough to explain in a few sentences. It first takes a photo of the background. As soon as the user appears in the frame, it calculates the difference between the background and the current frame with the user in it; that difference gives the position of the user's figure. The next step renders the user's body in white on black. After that, vertical and horizontal histograms are built by summing the pixel values per column and per row. Based on these histograms, the algorithm determines the current body position.

The video shows how the program works; the source code is on GitHub.
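The article does not include the code of that old experiment, but the idea is easy to sketch. Below is a minimal, hypothetical TypeScript illustration of the background-difference and histogram steps described above; the function names and the threshold value are assumptions, not code from the original project:

function diffMask(background: Uint8ClampedArray,
                  frame: Uint8ClampedArray,
                  width: number,
                  height: number,
                  threshold = 50): Uint8Array {
  // 1 where the current frame differs enough from the background, 0 elsewhere.
  const mask = new Uint8Array(width * height);
  for (let i = 0; i < width * height; i++) {
    const d =
      Math.abs(frame[i * 4] - background[i * 4]) +
      Math.abs(frame[i * 4 + 1] - background[i * 4 + 1]) +
      Math.abs(frame[i * 4 + 2] - background[i * 4 + 2]);
    mask[i] = d > threshold ? 1 : 0;
  }
  return mask;
}

function histograms(mask: Uint8Array, width: number, height: number) {
  // Sum the "white" pixels per column and per row; the shape of these two
  // histograms hints at the pose (arm extended, leg raised, squatting, ...).
  const vertical = new Array<number>(width).fill(0);
  const horizontal = new Array<number>(height).fill(0);
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      const v = mask[y * width + x];
      vertical[x] += v;
      horizontal[y] += v;
    }
  }
  return { vertical, horizontal };
}

A set of hand-written rules over these two histograms is what the old experiment relied on; the rest of this article replaces that logic with a trained model.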
[Video: the getUserMedia experiment in action]

This time the goal is to classify each frame into one of three classes:

- Punches
- Kicks
- Other

For this experiment, two volunteers helped me collect photos (@Lili_vs and @Gsamokovarov). We recorded 5 QuickTime videos on my MacBook Pro, each containing 2-4 punches and 2-4 kicks.

Then we use ffmpeg to extract individual frames from the videos and save them as jpg images:

ffmpeg -i video.mov $filename%03d.jpg

To run the command above, you first need to install ffmpeg on your computer.

If we want to train a model, we must provide input data and the corresponding output data, but at this stage we only have a bunch of images of three people in different poses. To structure the data, we classify the frames into three categories: punches, kicks, and others. For each category a separate directory is created, and all relevant images are moved into it.

In the end there should be about 200 images in each directory, similar to the ones shown below:

[Image: sample frames from the punches, kicks, and others directories]
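The command above extracts frames from a single video. To process all five recordings in one go, one could wrap it in a small Node.js script. This is only a convenience sketch; the directory layout and file names are assumptions, not part of the original setup:

import { execSync } from 'child_process';
import { readdirSync } from 'fs';

// Extract frames from every .mov file in ./videos into ./frames,
// prefixing each frame file with the name of its source video.
for (const file of readdirSync('./videos').filter(f => f.endsWith('.mov'))) {
  const name = file.replace(/\.mov$/, '');
  execSync(`ffmpeg -i ./videos/${file} ./frames/${name}-%03d.jpg`);
}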

Because the training set is small, the data is augmented with additional images generated by the imgaug Python package:

import numpy as np
import imgaug as ia
from imgaug import augmenters as iaa
from scipy import misc, ndimage

np.random.seed(44)
ia.seed(44)

def main():
    for i in range(1, 191):
        draw_single_sequential_images(str(i), "others", "others-aug")
    for i in range(1, 191):
        draw_single_sequential_images(str(i), "hits", "hits-aug")
    for i in range(1, 191):
        draw_single_sequential_images(str(i), "kicks", "kicks-aug")

def draw_single_sequential_images(filename, path, aug_path):
    image = misc.imresize(ndimage.imread(path + "/" + filename + ".jpg"), (56, 100))
    sometimes = lambda aug: iaa.Sometimes(0.5, aug)
    seq = iaa.Sequential(
        [
            iaa.Fliplr(0.5),  # horizontally flip 50% of all images
            # crop images by -5% to 10% of their height/width
            sometimes(iaa.CropAndPad(
                percent=(-0.05, 0.1),
                pad_mode=ia.ALL,
                pad_cval=(0, 255)
            )),
            sometimes(iaa.Affine(
                scale={"x": (0.8, 1.2), "y": (0.8, 1.2)},  # scale images to 80-120% of their size, individually per axis
                translate_percent={"x": (-0.1, 0.1), "y": (-0.1, 0.1)},  # translate by -10 to +10 percent (per axis)
                rotate=(-5, 5),  # rotate by -5 to +5 degrees
                shear=(-5, 5),  # shear by -5 to +5 degrees
                order=[0, 1],  # use nearest neighbor or bilinear interpolation (fast)
                cval=(0, 255),  # if mode is constant, use a cval between 0 and 255
                mode=ia.ALL  # use any of scikit-image's warping modes
            )),
            iaa.Grayscale(alpha=(0.0, 1.0)),
            iaa.Invert(0.05, per_channel=False),  # invert color channels
            # execute 0 to 5 of the following (less important) augmenters per image
            iaa.SomeOf((0, 5),
                [
                    iaa.OneOf([
                        iaa.GaussianBlur((0, 2.0)),  # blur images with a sigma between 0 and 2.0
                        iaa.AverageBlur(k=(2, 5)),  # blur image using local means with kernel sizes between 2 and 5
                        iaa.MedianBlur(k=(3, 5)),  # blur image using local medians with kernel sizes between 3 and 5
                    ]),
                    iaa.Sharpen(alpha=(0, 1.0), lightness=(0.75, 1.5)),  # sharpen images
                    iaa.Emboss(alpha=(0, 1.0), strength=(0, 2.0)),  # emboss images
                    iaa.AdditiveGaussianNoise(loc=0, scale=(0.0, 0.05 * 255), per_channel=0.5),  # add noise to images
                    iaa.Add((-10, 10), per_channel=0.5),  # change brightness of images (by -10 to 10 of original value)
                    iaa.AddToHueAndSaturation((-20, 20)),  # change hue and saturation
                    iaa.OneOf([
                        iaa.Multiply((0.9, 1.1), per_channel=0.5),
                        iaa.FrequencyNoiseAlpha(
                            exponent=(-2, 0),
                            first=iaa.Multiply((0.9, 1.1), per_channel=True),
                            second=iaa.ContrastNormalization((0.9, 1.1))
                        )
                    ]),
                    iaa.ContrastNormalization((0.5, 2.0), per_channel=0.5),  # improve or worsen the contrast
                ],
                random_order=True
            )
        ],
        random_order=True
    )

    im = np.zeros((16, 56, 100, 3), dtype=np.uint8)
    for c in range(0, 16):
        im[c] = image

    # apply the augmentation pipeline to the 16 copies of the image
    grid = seq.augment_images(im)
    for im in range(len(grid)):
        misc.imsave(aug_path + "/" + filename + "_" + str(im) + ".jpg", grid[im])
The script's main function contains three for loops, one for each image category. In each iteration of each loop we call draw_single_sequential_images: the first argument is the file name, the second is the source directory, and the third is the directory where the result should be saved.

After that, we read the image from disk and apply a number of transformations to it. Most of the transformations are documented in the code snippet above, so I will not repeat them here.

For each image, 16 augmented variations are created. Here is an example of what they look like:

[Image: augmented samples of one frame]

Note that in the script we scale the images to 100x56 pixels. We do this to reduce the amount of data and, accordingly, the number of calculations that the model performs during training and evaluation.

Building a model

Now let's build the classification model!

Since we are dealing with images, we use a convolutional neural network (CNN). This network architecture is known to be well suited to image recognition, object detection, and classification.

Transfer learning

The image below shows VGG-16, a popular CNN used for image classification.

[Image: VGG-16 architecture]

VGG-16 recognizes 1000 image classes. It has 16 layers (not counting the pooling and output layers). Such a multi-layer network is hard to train in practice: it requires a large data set and many hours of training.

The hidden layers of a trained CNN recognize various elements of the images in the training set, starting with edges and moving on to more complex features such as shapes, individual objects, and so on. A CNN in the style of VGG-16, trained on a large set of images, will therefore have hidden layers that have learned many features from its training set. Such features are common to most images and can consequently be reused for different tasks.

Transfer learning lets us reuse an existing, already trained network. We can take the output of any layer of the existing network and feed it as input to a new neural network. By training the newly created network, over time it can learn to recognize new higher-level features and correctly classify images from classes the original model has never seen before.

[Image: transfer learning from a pre-trained CNN]

For our purposes we will take the MobileNet neural network from the @tensorflow-models/mobilenet package. MobileNet is as powerful as VGG-16, but it is much smaller, which speeds up forward propagation (network activation) and reduces load time in the browser. MobileNet was trained on the ILSVRC-2012-CLS image classification data set.

When developing a model with transfer learning, we have two choices to make:

- The output of which layer of the source model to use as input for the target model
- How many layers of the target model to train, if any

The first point is quite significant. Depending on the chosen layer, we get features at a lower or higher level of abstraction as input for our network.

We are not going to train any layers of MobileNet. We select the output of global_average_pooling2d_1 and pass it as input to our tiny model. Why this particular layer? Empirically. I ran some tests, and this layer works well enough.

Defining the model
The initial task was to classify images into three classes: punches, kicks, and other movements. Let's first solve a smaller problem: determining whether there is a punch in the frame or not. This is a typical binary classification task. For that purpose we can define the following model:

import * as tf from '@tensorflow/tfjs';

const model = tf.sequential();
model.add(tf.layers.inputLayer({ inputShape: [1024] }));
model.add(tf.layers.dense({ units: 1024, activation: 'relu' }));
model.add(tf.layers.dense({ units: 1, activation: 'sigmoid' }));
model.compile({
  optimizer: tf.train.adam(1e-6),
  loss: tf.losses.sigmoidCrossEntropy,
  metrics: ['accuracy']
});

This code defines a simple model with a layer of 1024 units and ReLU activation, plus one output unit passed through a sigmoid activation function. The sigmoid produces a number from 0 to 1, representing the probability that the given frame contains a punch.

Why did I choose 1024 units for the second layer and a learning rate of 1e-6? Well, I tried several different options and saw that these parameters work best. Trial and error may not seem like the best approach, but to a large extent this is how hyperparameter tuning in deep learning works: based on our understanding of the model, we use intuition to update orthogonal parameters and empirically check how the model behaves.

The compile method compiles the layers together, preparing the model for training and evaluation. Here we declare that we want to use the adam optimization algorithm, that the loss will be computed from cross entropy, and that we want to track the model's accuracy. TensorFlow.js then calculates the accuracy with the formula:

Accuracy = (True Positives + True Negatives) / (Positives + Negatives)
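The article only tracks accuracy during training, but the same compiled model can also be scored on held-out data with model.evaluate. This is a small illustrative sketch; xsTest and ysTest are hypothetical tensors, not data from the article:

// Hypothetical held-out tensors with the same shapes as the training xs and ys.
const [testLoss, testAccuracy] = model.evaluate(xsTest, ysTest) as tf.Scalar[];
console.log('loss:', testLoss.dataSync()[0], 'accuracy:', testAccuracy.dataSync()[0]);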
To transfer learning from the original MobileNet model, we first need to get hold of it. Since it is impractical to train our model on more than 3000 images in the browser, we will use Node.js and load the network from a file.

You can download MobileNet here. The directory contains the file model.json, which describes the model architecture (layers, activations, and so on); the remaining files contain the model parameters. You can load the model from a file with this code:

export const loadModel = async () => {
  const mn = new mobilenet.MobileNet(1, 1);
  mn.path = `file://PATH/TO/model.json`;
  await mn.load();
  return (input): tf.Tensor1D =>
    mn.infer(input, 'global_average_pooling2d_1')
      .reshape([1024]);
};

Note that in loadModel we return a function that takes an input tensor and returns mn.infer(input, Layer), reshaped into a one-dimensional tensor with 1024 elements. The infer method takes a tensor and a layer as arguments; the layer determines which hidden layer we want the output from. If you open model.json and search for global_average_pooling2d_1, you will find that name among the layers.

Now we need to build a data set for training the model. To do that, we run every image through MobileNet's infer method and assign it a label: 1 for images with a punch and 0 for images without one:
const punches = require('fs')
  .readdirSync(Punches)
  .filter(f => f.endsWith('.jpg'))
  .map(f => `${Punches}/${f}`);

const others = require('fs')
  .readdirSync(Others)
  .filter(f => f.endsWith('.jpg'))
  .map(f => `${Others}/${f}`);

const ys = tf.tensor1d(
  new Array(punches.length).fill(1)
    .concat(new Array(others.length).fill(0)));

const xs: tf.Tensor2D = tf.stack(
  punches
    .map((path: string) => mobileNet(readInput(path)))
    .concat(others.map((path: string) => mobileNet(readInput(path))))
) as tf.Tensor2D;

In the code above, we first read the files from the directory with punches and from the directory with other images. Then we define a one-dimensional tensor with the output labels: if we have n images with punches and m other images, the tensor contains n elements with the value 1 followed by m elements with the value 0.

In xs we stack the results of calling infer on the individual images. Notice that for each image we call the readInput method. Here is its implementation:
export const readInput = img => imageToInput(readImage(img), TotalChannels);

const readImage = path => jpeg.decode(fs.readFileSync(path), true);

const imageToInput = image => {
  const values = serializeImage(image);
  return tf.tensor3d(values, [image.height, image.width, 3], 'int32');
};

const serializeImage = image => {
  const totalPixels = image.width * image.height;
  const result = new Int32Array(totalPixels * 3);
  for (let i = 0; i < totalPixels; i++) {
    // copy the RGB channels and skip the alpha channel produced by jpeg-js
    result[i * 3] = image.data[i * 4];
    result[i * 3 + 1] = image.data[i * 4 + 1];
    result[i * 3 + 2] = image.data[i * 4 + 2];
  }
  return result;
};

readInput first calls the readImage function and then delegates to imageToInput. readImage reads an image from disk and decodes the jpg from the buffer using the jpeg-js package. In imageToInput we turn the image into a three-dimensional tensor.

As a result, for each i from 0 to TotalImages, ys[i] should equal 1 if xs[i] corresponds to an image with a punch, and 0 otherwise.

Training the model

The model is now ready for training! Call the fit method:
await model.fit(xs, ys, {
  epochs: Epochs,
  batchSize: parseInt(((punches.length + others.length) * BatchSize).toFixed(0)),
  callbacks: {
    onBatchEnd: async (_, logs) => {
      console.log('Cost: %s, accuracy: %s', logs.loss.toFixed(5), logs.acc.toFixed(5));
      await tf.nextFrame();
    }
  }
});
fit is called with three arguments: xs, ys, and a configuration object. In the configuration object we set how many epochs the model trains for, the batch size, and a callback that TensorFlow.js invokes after processing each batch.

The batch size determines how large a subset of xs and ys is used in a single step of an epoch. For each epoch, TensorFlow.js selects a subset of xs and the corresponding elements of ys, performs forward propagation, gets the output of the layer with the sigmoid activation, and then, based on the loss, performs optimization with the adam algorithm.

After starting the training script you will see output similar to the one below:

Cost: …, accuracy: …
eta=0.0 ==========> 293ms 14675us/step - acc=… loss=…
Epoch 3/50
Cost: …, accuracy: …
eta=0.0 ==========> 304ms 15221us/step - acc=… loss=…

Notice how the accuracy increases and the loss decreases over time.

On my data set, the model reached an accuracy of 92% after training. Keep in mind that the accuracy may not be very high because of the small training data set.
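The next section loads this trained model in the browser from http://localhost:5000/model.json, so the Node.js training script has to export it first. That export step is not shown in the article; a minimal sketch, assuming @tensorflow/tfjs-node is installed so that the file:// handler is available, could look like this:

import '@tensorflow/tfjs-node';

// Writes model.json plus the weight files into ./model, ready to be served
// over HTTP (for example with a static file server on port 5000).
await model.save('file://./model');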
Running the model in the browser

In the previous section we trained a binary classification model. Now let's run it in the browser and connect it to the game MK.js!

const video = document.getElementById('cam');
const Layer = 'global_average_pooling2d_1';
const mobilenetInfer = m => (p): tf.Tensor => m.infer(p, Layer);
const canvas = document.getElementById('canvas');
const scale = document.getElementById('crop');

const ImageSize = {
  Width: 100,
  Height: 56
};

navigator.mediaDevices
  .getUserMedia({
    video: true,
    audio: false
  })
  .then(stream => {
    video.srcObject = stream;
  });

The code above contains several declarations:

- video holds a reference to the HTML5 video element on the page
- Layer holds the name of the MobileNet layer whose output we want to pass as input to our model
- mobilenetInfer is a function that accepts a MobileNet instance and returns another function; the returned function takes an input and returns the output of the specified MobileNet layer
- canvas points to the HTML5 canvas element that we use to extract frames from the video
- scale is another canvas, used to scale the individual frames

After that we get the video stream from the user's camera and set it as the source of the video element.

The next step is to implement a grayscale filter that takes a canvas and converts its contents:

const grayscale = (canvas: HTMLCanvasElement) => {
  const imageData = canvas.getContext('2d').getImageData(0, 0, canvas.width, canvas.height);
  const data = imageData.data;
  for (let i = 0; i < data.length; i += 4) {
    const avg = (data[i] + data[i + 1] + data[i + 2]) / 3;
    data[i] = avg;
    data[i + 1] = avg;
    data[i + 2] = avg;
  }
  canvas.getContext('2d').putImageData(imageData, 0, 0);
};
As a next step, let's connect the model with MK.js:

let mobilenet: (p: any) => tf.Tensor;
tf.loadModel('http://localhost:5000/model.json').then(model => {
  mobileNet
    .load()
    .then((mn: any) => mobilenet = mobilenetInfer(mn))
    .then(startInterval(mobilenet, model));
});

In the code above we first load the model that we trained earlier, and then load MobileNet. We pass MobileNet to mobilenetInfer to get a function that computes the output of the chosen hidden layer of the network. After that we call startInterval with the two networks as arguments.

const startInterval = (mobilenet, model) => () => {
  setInterval(() => {
    canvas.getContext('2d').drawImage(video, 0, 0);

    scale
      .getContext('2d')
      .drawImage(
        canvas, 0, 0, canvas.width,
        canvas.width / (ImageSize.Width / ImageSize.Height),
        0, 0, ImageSize.Width, ImageSize.Height);
    grayscale(scale);

    const [punching] = Array.from((
      model.predict(mobilenet(tf.fromPixels(scale))) as tf.Tensor1D)
      .dataSync() as Float32Array);

    const detect = (window as any).Detect;
    if (punching >= 0.4) detect && detect.onPunch();
  }, 100);
};
The most interesting part happens in startInterval! First we start an interval that invokes an anonymous function every 100 ms. In it we first render the video's current frame onto the canvas. Then we reduce the frame to 100x56 pixels and apply the grayscale filter to it.

The next step is to pass the frame to MobileNet, get the output of the desired hidden layer, and pass that as input to the predict method of our model. predict returns a tensor with a single element. With dataSync we read the value out of the tensor and assign it to the constant punching.

Finally, we check: if the probability of a punch exceeds 0.4, we call the onPunch method of the global Detect object. MK.js exposes a global object with three methods, onKick, onPunch, and onStand, which we can use to control one of the characters.
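To make the TypeScript side compile cleanly against that global, one could declare its shape. The typing below is only an illustration of the contract described above; the method names come from the article, the signatures are assumptions:

// Hypothetical ambient typing for the object MK.js exposes on window.
declare global {
  interface Window {
    Detect: {
      onPunch(): void;
      onKick(): void;
      onStand(): void;
    };
  }
}
export {};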
Done! Here is the result!

[GIF: controlling MK.js with punches recognized by the model]

Recognizing punches and kicks with N-ary classification

In the next section we will build a smarter model: a neural network that distinguishes punches, kicks, and everything else. This time, let's again start by preparing the training set:
const punches = require('fs')
  .readdirSync(Punches)
  .filter(f => f.endsWith('.jpg'))
  .map(f => `${Punches}/${f}`);

const kicks = require('fs')
  .readdirSync(Kicks)
  .filter(f => f.endsWith('.jpg'))
  .map(f => `${Kicks}/${f}`);

const others = require('fs')
  .readdirSync(Others)
  .filter(f => f.endsWith('.jpg'))
  .map(f => `${Others}/${f}`);

const ys = tf.tensor2d(
  new Array(punches.length)
    .fill([1, 0, 0])
    .concat(new Array(kicks.length).fill([0, 1, 0]))
    .concat(new Array(others.length).fill([0, 0, 1])),
  [punches.length + kicks.length + others.length, 3]
);

const xs: tf.Tensor2D = tf.stack(
  punches
    .map((path: string) => mobileNet(readInput(path)))
    .concat(kicks.map((path: string) => mobileNet(readInput(path))))
    .concat(others.map((path: string) => mobileNet(readInput(path))))
) as tf.Tensor2D;
As before, we first read the directories with images of punches, kicks, and other images. This time, unlike before, we form the expected output as a two-dimensional tensor rather than a one-dimensional one. If we have n images with punches, m images with kicks, and k other images, the ys tensor will have n elements with the value [1, 0, 0], m elements with the value [0, 1, 0], and k elements with the value [0, 0, 1].

A vector of n elements in which n - 1 elements have the value 0 and one element has the value 1 is called a one-hot vector.

After that we form the input tensor xs by stacking the MobileNet output for each image.

We also have to update the model definition:

const model = tf.sequential();
model.add(tf.layers.inputLayer({ inputShape: [1024] }));
model.add(tf.layers.dense({ units: 1024, activation: 'relu' }));
model.add(tf.layers.dense({ units: 3, activation: 'softmax' }));
await model.compile({
  optimizer: tf.train.adam(1e-6),
  loss: tf.losses.sigmoidCrossEntropy,
  metrics: ['accuracy']
});

The only two differences from the previous model are:

- The number of units in the output layer
- The activation in the output layer

There are three units in the output layer because we have three different categories of images:

- Punch
- Kick
- Other

The softmax activation is applied over these three units, transforming their parameters into a tensor of three values. Why three units in the output layer? Each of the three classes could in principle be encoded with two bits: 00, 01, 10. But the values of the tensor produced by softmax sum to 1, which means we could never get 00, so with only two units we would be unable to classify images of one of the classes.

After training this model for 500 epochs, I reached an accuracy of about 92%! That's not bad, but don't forget that the training was done on a small data set.

The next step is to run the model in the browser! Since the logic is very similar to running the binary classification model, let's look only at the last step, where an action is chosen based on the model's output:
const [punch, kick, nothing] = Array.from((model.predict(
  mobilenet(tf.fromPixels(scaled))
) as tf.Tensor1D).dataSync() as Float32Array);

const detect = (window as any).Detect;
if (nothing >= 0.4) return;

// Note: 0.35 below is an assumed threshold value.
if (kick > punch && kick >= 0.35) {
  detect.onKick();
  return;
}
if (punch > kick && punch >= 0.35) detect.onPunch();

First we call MobileNet with the reduced grayscale frame, then we pass the result to our trained model. The model returns a one-dimensional tensor, which we convert to a Float32Array with dataSync. In the next step we use Array.from to cast the typed array to a JavaScript array. We then extract the probabilities that the frame contains a punch, a kick, or neither.

If the probability of the third outcome exceeds 0.4, we return. Otherwise, if the probability of a kick is above its threshold and higher than the probability of a punch, we send the kick command to MK.js. If the probability of a punch is above its threshold and higher than the probability of a kick, we send the punch action.

In general, that's all! The result is shown below:
[GIF: controlling MK.js with punches and kicks recognized by the three-class model]

Action recognition

If you collect a large and diverse data set of people punching and kicking, you can build a model that performs very well on individual frames. But is that enough? What if we want to go further and distinguish two different types of kicks: a kick with a turn and a back kick?

As the frames below show, at a certain moment in time, from a certain angle, the two kicks look the same:

[Image: single frames of the two kicks]

But if you watch the full execution, the movements are quite different:

[Animation: the two kicks performed side by side]

So how do we train a neural network to analyze a sequence of frames rather than a single frame?

For this purpose we can look at another class of neural networks: recurrent neural networks (RNNs). RNNs are a great fit for working with time series, for example:

- Natural language processing (NLP), where each word depends on the words before and after it
- Predicting the next page a user will visit based on their browsing history
- Recognizing an action in a sequence of frames

Implementing such a model is beyond the scope of this article, but let's look at an example architecture to get an idea of how it would all work together; a small sketch follows after the diagram description.

The power of RNNs

The diagram below shows an action recognition model:

[Diagram: a CNN applied to each frame, feeding an RNN]

We take the last n frames from the video and pass each of them to a CNN. The CNN output for each frame is passed as input to the RNN. The recurrent network determines the dependencies between the individual frames and recognizes which action they correspond to.
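As a rough illustration of that diagram, a model of this shape could be sketched in TensorFlow.js as follows. This is only a sketch under stated assumptions: the number of frames, the layer sizes, and the use of per-frame MobileNet features are choices made for the example, not code from the article:

import * as tf from '@tensorflow/tfjs';

// Assumption: each of the last Frames frames has already been passed through
// MobileNet, producing a 1024-dimensional feature vector per frame.
const Frames = 16;        // hypothetical number of frames per sequence
const FeatureSize = 1024; // size of the per-frame CNN feature vector
const Actions = 3;        // e.g. kick with a turn, back kick, other

const model = tf.sequential();
// The LSTM consumes the sequence of per-frame CNN features...
model.add(tf.layers.lstm({ units: 64, inputShape: [Frames, FeatureSize] }));
// ...and a dense softmax layer classifies the whole sequence into an action.
model.add(tf.layers.dense({ units: Actions, activation: 'softmax' }));
model.compile({
  optimizer: tf.train.adam(1e-4),
  loss: 'categoricalCrossentropy',
  metrics: ['accuracy']
});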
Conclusion

In this article we developed a model for image classification. To do that, we collected a data set: we extracted video frames and manually divided them into three categories. We then augmented the data with additional images generated by imgaug.

After that, we explained what transfer learning is and reused the trained MobileNet model from the @tensorflow-models/mobilenet package. We loaded MobileNet from a file in a Node.js process and trained an extra dense layer fed from one of MobileNet's hidden layers. After training, we reached an accuracy of over 90%!

To use the model in the browser, we loaded it together with MobileNet and classified a frame from the user's webcam every 100 ms. We connected the model to the game MK.js and used its output to control one of the characters.

Finally, we looked at how the model could be improved by combining it with a recurrent neural network to recognize actions.

I hope you enjoyed this tiny project as much as I did!