Face detection in video: Raspberry Pi and Neural Compute Stick
About a year ago Intel Movidius released a device for efficient inference of convolutional neural networks: the Movidius Neural Compute Stick (NCS). This device lets you use neural networks for object recognition or detection under tight power constraints, including in robotics. The NCS has a USB interface and consumes no more than 1 watt. In this article I will describe my experience of using the NCS with the Raspberry Pi for face detection in video, covering both training the Mobilenet-SSD detector and running it on the Raspberry.
All the code can be found in my two repositories: training of the detector and the face detection demo.
In my first article I already wrote about face detection with the NCS: back then it was about the YOLOv2 detector, which I converted from the Darknet format to the Caffe format and then ran on the NCS. The conversion turned out to be non-trivial: since the two formats define the last layer of the detector differently, the network output had to be parsed separately, on the CPU, using a piece of code from Darknet. In addition, that detector did not satisfy me either in speed (up to 5.1 FPS on my laptop) or in accuracy; later I became convinced that, because of its sensitivity to image quality, it is hard to get a good result from it on the Raspberry Pi.
In the end, I decided simply to train my own detector. The choice fell on an SSD detector with a Mobilenet encoder: Mobilenet's lightweight convolutions allow high speed without much loss in quality, and the SSD detector is not inferior to YOLO and works on the NCS out of the box.
How does the Mobilenet-SSD detector work?
Let's start with Mobilenet. In this architecture, a full convolution (over all channels) is replaced by two lightweight ones: first a depthwise convolution, separate for each channel, and then a full 1x1 (pointwise) convolution. After each convolution, BatchNorm and a nonlinearity (ReLU) are applied. The very first convolution of the network, which receives the image, is usually left full. This architecture significantly reduces the computational cost at the price of a slight drop in prediction quality. There is also a more advanced version, but I have not tried it yet.
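A quick back-of-the-envelope comparison (standard Mobilenet arithmetic, not specific to this detector): a full k x k convolution with M input channels and N output channels over a W x H feature map costs about

    k^2 * M * N * W * H

multiplications, while the depthwise + pointwise pair costs

    k^2 * M * W * H + M * N * W * H

which is a fraction 1/N + 1/k^2 of the original; for k = 3 and a large N, that is roughly a 9-fold saving.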
SSD (Single Shot Detector) works as follows: two convolutional layers are attached to the outputs of several of the encoder's layers: one predicts the class probabilities, the other the bounding box coordinates. There is also a third layer, which outputs the coordinates and positions of the default boxes at the current level. The idea is this: the output of any convolutional layer divides naturally into cells; closer to the end of the network there are fewer cells (here, due to convolutions with stride = 2), and the receptive field of each cell grows. For each cell on each of the several selected layers we define several default boxes of different sizes and with different aspect ratios, and the additional convolutional layers correct the coordinates and predict the class probabilities for each such box. Therefore, an SSD detector (just like YOLO) always considers the same number of boxes. The same object can be detected on different layers: during training, the signal is sent to all boxes that intersect strongly enough with the object, and during inference the detections are merged by non maximum suppression (NMS). The final layer combines the detections from all layers, computes their absolute coordinates, applies the probability threshold and performs NMS.
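For example (illustrative numbers): a 19x19 feature map with two default boxes per cell contributes 19 * 19 * 2 = 722 boxes, a 10x10 map with two boxes per cell adds another 200, and so on; the total number of boxes is fixed by the architecture and does not depend on the image content.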
The code for training the detector is located here.
I decided to take the ready Mobilenet-SSD detector, trained on PASCAL VOC0712, and fine-tune it to detect faces. Firstly, a pretrained network trains much faster; secondly, there is no need to reinvent the wheel.
The original project included the script gen.py, which literally assembled the .prototxt model file, substituting the input parameters. I moved it into my project, slightly extending its functionality. This script can generate four types of configuration files:
- train : at the input, the training LMDB database; at the output, a layer computing the loss function and its gradients; BatchNorm is present
- test : at the input, the test LMDB database; at the output, a layer computing the mean average precision; BatchNorm is present
- deploy : at the input, an image; at the output, a layer with predictions; BatchNorm is absent
- deploy_bn : at the input, an image; at the output, a layer with predictions; BatchNorm is present
I added the latter option later, so that in scripts I could load and convert a network with BatchNorm without touching the LMDB database; otherwise, in the absence of the database, nothing worked. (In general, it seems strange to me that in Caffe the data source is specified in the network architecture; this is at the very least impractical.)
The resulting network architecture looks like this:
- Full convolution conv0 : 32 channels, stride = 2
- Mobilenet convolutions conv1 - conv11 : 64, 128, 128, 256, 256, 512 ... 512 channels, some with stride = 2
- Mobilenet convolutions conv12 - conv13 : 1024 channels, conv12 with stride = 2
- Full convolutions conv14_1 - conv14_2 : 256, 512 channels; the first with kernel_size = 1, the second with stride = 2
- Full convolutions conv15_1 - conv15_2 : 128, 256 channels; the first with kernel_size = 1, the second with stride = 2
- Full convolutions conv16_1 - conv16_2 : 128, 256 channels; the first with kernel_size = 1, the second with stride = 2
- Full convolutions conv17_1 - conv17_2 : 64, 128 channels; the first with kernel_size = 1, the second with stride = 2
- The final layer is DetectionOutput
I slightly adjusted the network architecture. The list of changes:
- Obviously, the number of classes was changed to 1 (not counting the background).
- The constraints on the aspect ratio of the patches cropped during training were tightened (I decided to simplify the task a little and not train the network on overly stretched images).
- Of the default boxes, only the square ones remained, two per cell. I greatly reduced their sizes, because faces are significantly smaller than the objects in the classical object detection task.
Caffe calculates the sizes of the default boxes as follows: given a minimum box size s_min and a maximum s_max, it creates a small box with side s_min and a large one with side sqrt(s_min * s_max). Since I wanted to detect the smallest possible faces, I calculated the full stride for each detection layer and set the minimum box size equal to it. With such parameters, the small default boxes are located right next to each other without intersecting, so there is at least a guarantee that any object will intersect some box. I set the maximum size to twice the minimum. For the conv16_2 and conv17_2 layers I chose the sizes by eye, the same for both.
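For illustration (hypothetical numbers, assuming a 300x300 input): the conv11 output is 19x19, i.e. a full stride of about 16 pixels, so its square default boxes would get a minimum size of 16 and a maximum of 32; neighboring small boxes then sit exactly side by side, 16 pixels apart, without overlapping.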
I used two datasets: WIDER Face and FDDB. WIDER contains a lot of pictures with very small and blurry faces, while FDDB leans more towards large face images (and is an order of magnitude smaller than WIDER). They differ slightly in annotation format, but those are details.
For training, I did not use all the data: I threw out faces that were too small (less than six pixels, or less than 2% of the image width), threw out all images with an aspect ratio less than 0.5 or greater than 2, and threw out all images marked as "blurry" in the WIDER dataset, since they mostly corresponded to very small faces, and I needed to somehow even out the ratio of small and large faces. After that, I made all the boxes square by expanding the smaller side: I decided that the proportions of a face did not interest me much, and this simplifies the task for the network a little. I also threw out all the black-and-white images, of which there were few, and on which the database-building script crashed.
To use the data for training and testing, you need to build LMDB databases from it. This is done as follows:
- For each image, an .xml file with the markup is created
- a train.txt file is created with lines of the form "path/to/image.png path/to/labels.xml"; the same is done for the test set
- a test_name_size.txt file is created with lines of the form "test_image_name height width"
- a labelmap.prototxt file is created with the numerical correspondences for the labels
- ssd-caffe/scripts/create_annoset.py is run (example from the Makefile):
python3 /opt/movidius/ssd-caffe/scripts/create_annoset.py --anno-type=detection \
    --label-map-file=$(wider_dir)/labelmap.prototxt --min-dim=0 --max-dim=0 \
    --resize-width=0 --resize-height=0 --check-label --encode-type=jpg --encoded \
    --redo $(wider_dir) \
    $(wider_dir)/trainval.txt $(wider_dir)/WIDER_train/lmdb/wider_train_lmdb ./data
The markup files use the PASCAL VOC XML format; a minimal example looks roughly like this (the field values here are made up for illustration):
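<?xml version="1.0" ?>
<annotation>
  <filename>image.png</filename>
  <size>
    <width>640</width>
    <height>480</height>
    <depth>3</depth>
  </size>
  <object>
    <name>face</name>
    <bndbox>
      <xmin>120</xmin>
      <ymin>80</ymin>
      <xmax>220</xmax>
      <ymax>180</ymax>
    </bndbox>
  </object>
</annotation>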
The training itself lives in a Colab Notebook. I trained in Google Colaboratory, since my laptop barely coped even with testing the network, and training it would hang the machine entirely. Colaboratory let me train the network quickly enough and for free. The only catch is that I had to write a script for compiling SSD-Caffe under Colaboratory (which includes such oddities as compiling boost and editing the sources), and it runs for about 40 minutes. For more details, see my previous publication.
Colaboratory has one more peculiarity: after 12 hours the machine dies, irrevocably erasing all the data. The easiest way to avoid data loss is to mount your Google Drive into the system and save the network weights every 500-1000 training iterations.
As for my detector, only part of the full training schedule fit into a single Colaboratory session; it trained fully in two sessions.
The best model reached a solid mean average precision on the test set I assembled (merged WIDER and FDDB with the constraints listed earlier). To measure the mAP of the weights saved during training, there is a script in the repository.
The detector at work on a (rather odd) example from the dataset:
Running on the NCS
The demo program with face detection is here.
To compile a neural network for the Neural Compute Stick, you need the Movidius NCSDK: it contains utilities for compiling and profiling neural networks, as well as C++ and Python APIs. Note that a second version was released recently, incompatible with the first: all API functions were renamed for some reason, the internal neural network format was changed, FIFOs were added for interacting with the NCS, and (finally) automatic conversion between 32-bit and 16-bit float appeared, which had been sorely missing in C++. I updated all my projects to the second version, but kept a couple of crutches for compatibility with the first.
After training the detector, it is worth merging the BatchNorm layers into the neighboring convolutions to speed up the network. This is done by the merge_bn.py script, which I also borrowed from the Mobilenet-SSD project.
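The folding itself is simple arithmetic (this is standard BatchNorm folding, not something specific to that script): for a convolution with weights W and bias b, followed by a BatchNorm with accumulated mean mu, variance sigma^2, scale gamma and shift beta, the merged convolution uses

    W' = W * gamma / sqrt(sigma^2 + eps)
    b' = (b - mu) * gamma / sqrt(sigma^2 + eps) + beta

so at inference time a single convolution replaces the pair.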
Then you need to call the mvNCCompile utility, for example:
mvNCCompile -s 12 -o graph_ssd -w ssd-face.caffemodel ssd-face.prototxt
For this, the project's Makefile has a graph_ssd target. The resulting graph_ssd file is a description of the neural network in a format that the NCS understands.
Now about how to interact with the device itself. The process is not very complicated, but it requires a fair amount of code. The sequence of actions is roughly as follows:
- Get the device handle by its ordinal number
- Open the device
- Read the compiled neural network file into a buffer (as a binary file)
- Create an empty compute graph for the NCS
- Allocate the graph on the device using the data from the file, and create input/output FIFOs for it; the buffer with the file contents can then be freed
- Get an image from the camera (or from any other source)
- Process it: scale it to the required size, convert it to float32 and map it into the range [-1, 1] (see the sketch after this list)
- Load the image to the device and request the inference
- Request the result (the program blocks until the result is ready)
- Parse the result and extract the object boxes (more on the format below)
- Display the image with the predictions
- Release all resources: destroy the FIFOs and the compute graph, close the device and destroy its handle
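As an illustration of the processing step above, here is a minimal sketch with OpenCV (my own example, not code from the demo; it assumes a 300x300 network input and an 8-bit BGR frame):

#include <opencv2/opencv.hpp>

// Resize the frame and map pixel values from [0, 255] to [-1, 1] as float32.
// The scale 1/127.5 and shift -1.0 implement x / 127.5 - 1.0.
cv::Mat preprocess(const cv::Mat& frame, int input_size = 300)
{
    cv::Mat resized, input;
    cv::resize(frame, resized, cv::Size(input_size, input_size));
    resized.convertTo(input, CV_32FC3, 1.0 / 127.5, -1.0);
    return input; // (float*)input.data can then be sent to the NCS
}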
Practically every action with the NCS has its own function, and in C++ this looks rather cumbersome, while you have to carefully track the release of all resources. In order not to bloat the code, I wrote a wrapper class for working with the NCS. In it, all the initialization work is hidden in the constructor and the load_file function, resource release happens in the destructor, and working with the NCS comes down to calling two or three class methods. There is also a convenient function for explaining the errors that occur.
We create the wrapper, passing the input size and output size (in number of elements) to the constructor:

NCSWrapper NCS(NETWORK_INPUT_SIZE*NETWORK_INPUT_SIZE*3, NETWORK_OUTPUT_SIZE);
We load the compiled neural network file, initializing along the way everything we need:

if (!NCS.load_file("./models/face/graph_ssd"))
We convert the image to float32 (a cv::Mat in CV_32FC3 format) and load it to the device:

if (!NCS.load_tensor_nowait((float*)image.data))
We get the result (result is just a float pointer; the result buffer is maintained by the wrapper); the program blocks until the computation finishes:

if (!NCS.get_result(result))
In fact, the wrapper also has a method that loads the data and gets the result in one call: load_tensor((float*)image.data, result). I chose not to use it for a reason: with the separate methods, the code can be made slightly faster. After loading the image, the CPU would otherwise sit idle until the result arrives from the NCS (about 100 ms in this case), and during that time it can do useful work: read the next frame and convert it, and draw the previous detections. This is how the demo program is implemented, and in my case it slightly increases the FPS. You can go further and run image processing and face detection asynchronously in two different threads; that actually works and gives a little extra speed-up, but it is not implemented in the demo program.
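A rough sketch of that overlapped loop (my own illustration using the wrapper methods named above; the camera handling and parsing are simplified, and preprocess() is the helper sketched earlier):

cv::VideoCapture cap(0);   // camera (OpenCV variant)
cv::Mat frame;
float* result = NULL;      // the buffer behind this pointer is owned by the wrapper
// NCSWrapper NCS(...) is assumed to be initialized as shown above

cap.read(frame);
cv::Mat input = preprocess(frame);
NCS.load_tensor_nowait((float*)input.data);      // start inference on the stick
for (;;)
{
    cap.read(frame);                             // CPU work while the NCS is busy
    cv::Mat next = preprocess(frame);
    if (!NCS.get_result(result)) break;          // blocks until the NCS finishes
    NCS.load_tensor_nowait((float*)next.data);   // queue the next frame
    // ...parse result, draw the previous detections, show the image...
}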
As its result, the detector returns a float array of size 7 * (keep_top_k + 1). Here keep_top_k is a parameter specified in the model's .prototxt file that sets how many detections (in decreasing order of confidence) should be returned. This parameter, as well as the parameter responsible for filtering detections by a minimum confidence value and the non maximum suppression parameters, can be configured in the very last layer of the model's .prototxt file. It is worth noting that while Caffe returns as many detections as were found in the image, the NCS always returns exactly keep_top_k detections, so that the array size is constant.
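For example (illustrative value): with keep_top_k = 100, the returned array always contains 7 * (100 + 1) = 707 floats, however many faces were actually found.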
The result array itself is laid out as follows: if we view it as a matrix with keep_top_k + 1 rows and 7 columns, then the first element of the first row holds the number of detections, and starting from the second row come the detections themselves in the format "garbage, class_index, class_probability, x_min, y_min, x_max, y_max". The coordinates are in the range [0, 1], so they have to be multiplied by the image height/width. The rest of the array contains garbage. Non maximum suppression is performed automatically, even before the result is returned (apparently, right on the NCS). In the demo, the array is parsed by this function:
void get_detection_boxes(float* predictions, int w, int h, float thresh,
                         std::vector<float>& probs, std::vector<cv::Rect>& boxes)
{
    int num = predictions[0]; // the number of detections is in the first element
    for (int i = 1; i < num + 1; i++) {
        float score = predictions[i * 7 + 2];
        float cls = predictions[i * 7 + 1];
        if (score > thresh && cls <= 1) {
            probs.push_back(score);
            boxes.push_back(cv::Rect(predictions[i * 7 + 3] * w, predictions[i * 7 + 4] * h,
                            (predictions[i * 7 + 5] - predictions[i * 7 + 3]) * w,
                            (predictions[i * 7 + 6] - predictions[i * 7 + 4]) * h));
        }
    }
}
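Typical usage might look like this (my own example; result comes from the wrapper, frame is the image being displayed):

std::vector<float> probs;
std::vector<cv::Rect> boxes;
get_detection_boxes(result, frame.cols, frame.rows, 0.5f, probs, boxes);
for (size_t j = 0; j < boxes.size(); j++)
    cv::rectangle(frame, boxes[j], cv::Scalar(0, 255, 0), 2); // draw each face box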
Specifics of running on the Raspberry Pi
The demo program itself can be run either on an ordinary computer or laptop with Ubuntu, or on a Raspberry Pi with Raspbian Stretch. I use a Raspberry Pi 2 model B, but the demo should work on other models too. The project's Makefile contains two targets for switching the mode: make switch_desk for the computer/laptop and make switch_rpi for the Raspberry Pi. The only fundamental difference in the program code is that in the first case OpenCV is used to read data from the camera, and in the second, the RaspiCam library. To run the demo on the Raspberry, RaspiCam must be compiled and installed.
Now a very important point: installing the NCSDK. If you follow the standard installation instructions on the Raspberry Pi, nothing good will come of it: the installer will try to pull in and compile SSD-Caffe and TensorFlow. Instead, the NCSDK must be compiled in API-only mode. In this mode, only the C++ and Python APIs are available (meaning it is impossible to compile or profile neural network graphs). The graph must therefore be compiled on an ordinary computer first and then copied to the Raspberry. For convenience, I added two compiled files to the repository, one for YOLO and one for SSD.
Another interesting point is the purely physical connection of the NCS to the Raspberry. It would seem easy to just plug it into a USB connector, but you need to remember that its case will then block the other three connectors (the case is quite bulky, since it doubles as a radiator). The easiest way out is to connect it via a USB cable. But keep in mind that the cable introduces an additional delay in data transfer; not very large, but noticeable. I tried two different cables, one 2 m long and one 30 cm, and both contributed about the same delay.
Now about powering the NCS. According to the documentation, it consumes up to 1 watt (at 5 volts on the USB connector, that is up to 200 mA; for comparison, the Raspberry camera consumes up to 250 mA). When powered by a normal 5 volt, 2 amp charger, everything works fine. However, when you try to connect two or more NCS to the Raspberry, problems may appear. In that case, it is recommended to use a USB hub with external power.
On the Raspberry, the demo runs slower than on the computer/laptop: 7.2 FPS versus 10.4 FPS. This is due to several factors: first, the computations on the CPU cannot be avoided, and the Raspberry's CPU is much slower; second, the data transfer speed matters (remember the USB cable).
Also, for comparison, I tried running the YOLOv2 face detector from my first article on the Raspberry, but it did not work very well: at 3.6 FPS, it misses a lot of faces even on simple frames. Apparently, it is very sensitive to the input image parameters, and the image quality of the Raspberry camera is far from ideal. SSD works much more stably, although I had to tweak the video parameters in the RaspiCam settings a little. It also sometimes misses faces in the frame, but does so quite rarely. To increase stability in real applications, you can add a simple centroid tracker.
By the way, the same can be reproduced in Python; there is a tutorial on PyImageSearch (using Mobilenet-SSD for the object detection task).
I also tried out a couple of ideas for speeding up the neural network:
The first idea: you can keep only the detections from the conv13 layer and remove all the extra layers. The result is a detector that finds only small faces and works slightly faster. On the whole, it is not worth it.
The second idea was interesting, but it did not work: I tried to throw out of the network the convolutions whose weights were close to zero, hoping it would get faster. However, there were few such convolutions, and removing them only slightly slowed the network down (my only guess: this happens because the number of channels stopped being a power of two).
I had been thinking for a long time about face detection on the Raspberry as a subtask of my robotics project. The classical detectors did not suit me in their speed-to-quality ratio, so I decided to try neural network methods, and at the same time test the Neural Compute Stick; the result is two projects on GitHub and three articles on Habr (including this one). Overall, the result suits me: most likely, this is the detector I will use in my robot (perhaps there will be another article about it). It is worth noting that my solution may not be optimal; after all, this is a study project, done partly out of curiosity about the NCS. Nevertheless, I hope this article proves useful to someone.