Removing the background with the help of deep learning
A translation of "Background removal with deep learning".
Throughout our last few years in machine learning, we have wanted to build real products based on it.
A few months ago, after passing an excellent course Fast.AI , the stars coincided, and we had such an opportunity. Modern advances in deep learning technologies have made it possible to accomplish much of what previously seemed impossible, new tools have emerged that have made the implementation process more accessible than ever.
We have set ourselves the following goals:
Improve our deep learning skills.
Improve our skills in implementing products based on AI.
Create a useful product with prospects in the market.
Have fun (and help to have fun for our users).
Removing the background manually, or almost manually, is fairly easy (see example). However, fully automated background removal is quite a challenging task, and as far as we know, there is still no product that has achieved acceptable results (although there are those who try).
What background will we remove? This question was important, because the more specific the model is in terms of objects, angles and so on, the higher the quality of the separation between background and foreground. When we started our work, we thought big: a comprehensive background removal tool that automatically identifies the foreground and background in every type of image. But after training our first model, we realized it was better to focus our efforts on a specific set of images. So we decided to focus on selfies and portraits of people.
Removing the background from a photo of a person (almost).
A selfie is an image:
with a prominent and focused foreground (one or more "people"), which guarantees a good separation between the object (face + upper body) and the background,
and with a constant angle and always the same object (a person).
Given these assumptions, we embarked on research and implementation, spending many hours of training, to create an easy-to-use one-click background removal service.
The main part of our work was training the model, but we could not afford to underestimate the importance of proper deployment. Good segmentation models are still not as compact as image classification models (for example, SqueezeNet), and we actively studied deployment options both on the server side and in the browser.
If you want to read more about the process of deploying our product, you can check out our posts about the server-side and client-side implementations.
If you want to learn about the model and the process of learning it, keep reading here.
When studying deep learning and computer vision tasks similar to ours, it is easy to see that the best option for us is semantic segmentation.
Other strategies exist, such as separation by depth detection, but they did not seem mature enough for our purposes.
Semantic segmentation is a well-known computer vision task, one of the three most important, along with classification and object detection. Segmentation is, in fact, a classification problem, in the sense of assigning every pixel to a class. Unlike image classification or detection models, a segmentation model really does demonstrate some "understanding" of images: it not only says "there is a cat in this image", but also indicates, at the pixel level, where that cat is.
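The difference is easy to see in terms of output shapes; this NumPy sketch uses made-up logits just to contrast the two tasks:

```python
import numpy as np

# Hypothetical scores for a tiny 4x4 image and 3 classes.
H, W, C = 4, 4, 3
rng = np.random.default_rng(0)

# A classifier produces one score vector for the whole image...
classifier_logits = rng.normal(size=(C,))
image_label = int(np.argmax(classifier_logits))

# ...while a segmentation model produces a score vector for every pixel,
# so taking the argmax yields a class map of the same spatial size.
segmentation_logits = rng.normal(size=(H, W, C))
class_map = np.argmax(segmentation_logits, axis=-1)

assert class_map.shape == (H, W)  # one class label per pixel
```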
So, how does segmentation work? To better understand, we will need to study some of the early works in this area.
The very first idea was to adapt early classification networks such as VGG and AlexNet. In 2014, VGG (Visual Geometry Group) was the state-of-the-art model for image classification, and it is still very useful today thanks to its simple and clear architecture. When examining VGG's early layers, you may notice high activations around the object being categorized. Deeper layers have even stronger activations, but they are coarse in nature due to the repeated pooling operations. With all this in mind, it was hypothesized that networks trained for classification can, with some modifications, also be used for segmentation.
The first semantic segmentation results appeared alongside classification algorithms. In this post you can see some of the coarse segmentation results obtained with VGG:
The results of deeper layers:
Segmentation of a bus image; light purple (29) is the school bus class.
After bilinear upsampling:
These results are obtained by simply converting (or keeping) the fully connected layer into its original shape, preserving its spatial characteristics and obtaining a fully convolutional neural network. In the example above, we feed a 768x1024 image into VGG and get a 24x32x1000 layer. 24x32 is the image after pooling (by 32), and 1000 is the number of ImageNet classes, from which we can derive the segmentation above.
To improve the predictions, researchers simply used a bilinear upsampling layer.
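The upsampling step can be illustrated in NumPy; `bilinear_upsample` below is our own toy function, not the network layer, but it shows how a 24x32 VGG output map is stretched back to 768x1024:

```python
import numpy as np

def bilinear_upsample(grid, factor):
    """Upsample a 2D score map by `factor` using bilinear interpolation."""
    h, w = grid.shape
    ys = np.linspace(0, h - 1, h * factor)   # fractional source rows
    xs = np.linspace(0, w - 1, w * factor)   # fractional source cols
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]                  # vertical blend weights
    wx = (xs - x0)[None, :]                  # horizontal blend weights
    top = grid[y0][:, x0] * (1 - wx) + grid[y0][:, x1] * wx
    bot = grid[y1][:, x0] * (1 - wx) + grid[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

# VGG pools the input 5 times by 2, i.e. by 32 overall:
coarse = np.random.default_rng(0).normal(size=(24, 32))  # 768/32 x 1024/32
fine = bilinear_upsample(coarse, 32)
assert fine.shape == (768, 1024)
```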
In the FCN paper, the authors improved on this idea. They connected several layers to obtain richer interpretations, called FCN-32, FCN-16 and FCN-8 according to the upsampling stride:
Adding some skip connections between layers allowed the predictions to encode finer details of the original image. Further training improved the results even more.
This technique proved to be not as bad as one might think, and demonstrated that semantic segmentation with deep learning really does have potential.
FCN unlocked the concept of segmentation, and researchers began testing different architectures for this task. The basic idea remains unchanged: the use of well-known architectures, upsampling and skip connections is still present in more recent models.
You can read about advances in this field in several good posts: here, here and here. You can also notice that most architectures keep an "encoder-decoder" scheme.
Returning to our project
After some research, we settled on three models available to us: FCN, Unet and Tiramisu, all very deep "encoder-decoder" architectures. We also had some thoughts about Mask R-CNN, but implementing it turned out to be outside the scope of our project.
FCN did not seem relevant, since its results were not as good as we wanted (even as a starting point), but the other two models showed quite good results: the main advantages of Unet and Tiramisu on the CamVid dataset were their compactness and speed. Unet was quite easy to implement (we used Keras), and Tiramisu was also quite feasible. To get started, we used a good implementation of Tiramisu, described in the last lesson of Jeremy Howard's deep learning course.
We began training these two models on several datasets. I must say that after we first tried Tiramisu, its results showed much greater potential for us, since the model could capture sharp edges in the image. Unet, in turn, was not good enough, and its results looked a bit blurry.
Blurred Unet results.
Having settled on a model, we began searching for suitable datasets. Segmentation data is not as common as classification or even detection data, and manually annotating images was not an option. The most popular segmentation datasets were: COCO, which includes about 80 thousand images in 90 categories, Pascal VOC with 11 thousand images in 20 classes, and the more recent ADE20K.
We decided to work with COCO because it includes many more images of the "person" class, which was what interested us.
Considering our task, we wondered whether to use only relevant images or a more "general" dataset. On the one hand, a more general dataset with more images and classes would let the model cope with more scenarios and tasks. On the other hand, one night of training let us process ~150 thousand images. If we fed the model the entire COCO dataset, it would see each image twice (on average), so it was better to trim the data a little. Besides, the model would be better tuned for our task.
One more thing worth mentioning: the Tiramisu model was originally trained on the CamVid dataset, which has some drawbacks, the main one being the strong monotony of its images: photos of roads taken from cars. As you can understand, training on such a dataset (even if it contains people) did us no good, so after a short trial we moved on.
Images from the CamVid dataset.
The COCO dataset comes with a fairly simple API that lets us know exactly which objects are in which image (according to 90 predefined classes).
After some experimentation, we decided to trim the dataset: first, we filtered only the images containing a person, leaving 40,000 pictures. Then we discarded all images with several people, keeping only photos with 1-2 people, since our product is intended for exactly such situations. Finally, we kept only images where a person occupies 20%-70% of the area, removing pictures with a person who is too small, or with some strange monstrosities (unfortunately, we could not remove all of them). Our final dataset consisted of 11,000 images, which, we felt, was enough at this stage.
Left: a suitable image. Center: too many people. Right: the object is too small.
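The filtering described above can be sketched as a predicate over COCO-style annotation records; in the real pipeline the records would come from the COCO API (pycocotools), and the helper name here is our own:

```python
def keep_image(image, annotations):
    """Decide whether a COCO image fits our selfie/portrait dataset.

    `image` carries 'height'/'width'; `annotations` is the list of instance
    annotations for that image, each with 'category_id' and 'area' (pixels).
    """
    persons = [a for a in annotations if a["category_id"] == 1]  # 1 = person
    if not (1 <= len(persons) <= 2):      # drop empty images and crowds
        return False
    image_area = image["height"] * image["width"]
    person_area = sum(a["area"] for a in persons)
    return 0.2 <= person_area / image_area <= 0.7  # person fills 20%-70%

# Toy example: a 100x100 image with one person covering 30% of it.
img = {"height": 100, "width": 100}
anns = [{"category_id": 1, "area": 3000}]
assert keep_image(img, anns)
```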
Although the full name of the Tiramisu model ("100-layer Tiramisu") implies a giant model, it is in fact quite economical, using only 9 million parameters. For comparison, VGG16 uses more than 130 million parameters.
The Tiramisu model is based on DenseNet, a recent image classification model in which all layers are interconnected. In addition, Tiramisu adds skip connections to the upsampling layers, as in Unet.
If you recall, this architecture is consistent with the idea presented in FCN: using a classification architecture, upsampling, and adding skip connections for refinement.
This is what the Tiramisu architecture looks like:
The DenseNet model can be seen as a natural evolution of ResNet, but instead of "remembering" each layer only until the next one, DenseNet remembers all layers throughout the entire model. Such connections are called highway connections. This leads to a growing number of filters, governed by the "growth rate". Tiramisu has a growth rate of 16, that is, with each layer we add 16 new filters until we reach layers of 1072 filters. You might expect 1600 filters, since this is the 100-layer Tiramisu, but the upsampling layers drop some filters.
Diagram of the DenseNet model: early filters are stacked throughout the entire model.
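The filter arithmetic is worth spelling out (the numbers are the ones quoted above):

```python
# Growth-rate arithmetic for the 100-layer Tiramisu.
growth_rate = 16    # new filters contributed by every dense layer
num_layers = 100

# If no filters were ever dropped, the count would simply accumulate:
naive_filters = growth_rate * num_layers
assert naive_filters == 1600

# In practice the upsampling (transition-up) layers discard some of the
# accumulated feature maps, which is why the deepest layer ends up with
# 1072 filters rather than 1600.
deepest_layer_filters = 1072
assert deepest_layer_filters < naive_filters
```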
We trained our model following the schedule described in the original paper: standard cross-entropy loss, the RMSProp optimizer with a learning rate of 1e-3, and a slight decay. We split our 11,000 images into three parts: 70% for training, 20% for validation and 10% for testing. All images below are taken from our test dataset.
To keep our training schedule consistent with the one in the original paper, we set the epoch size to 500 images. This also allowed us to periodically save the model with every improvement in results, since we trained it on much more data than in the paper (the CamVid dataset used in that article contains fewer than 1 thousand images).
In addition, we trained our model on only two classes, background and person, while the original paper had 12 classes. At first we tried training on some of COCO's classes, but noticed that this did not improve the results.
Problems with the data
Some shortcomings of the dataset lowered our score:
Animals. Our model sometimes segmented animals, which of course leads to a low IoU (intersection over union). Adding animals to the main class, or as a separate class, would probably have affected our results.
Body parts. Since we filtered our data programmatically, we could not tell whether the "person" class was really a person rather than a body part such as a hand or a foot. These images were of no interest to us, but they still popped up here and there.
An animal, a body part, a handheld object.
Handheld objects. Many images in the dataset are sports-related. Baseball bats, tennis rackets and snowboards were everywhere. Our model was somehow "confused" about how to segment them. As with animals, adding them as part of the main class (or as a separate class) would, in our opinion, help improve the model's performance.
A sports image with an object.
Coarse ground truth. The COCO dataset was annotated not pixel by pixel but with polygons. Sometimes this is enough, but in other cases the ground truth is too coarse, which may prevent the model from learning the finer details.
The image itself and the (very) coarse ground truth.
Our results were satisfying, though not perfect: we reached an IoU of 84.6 on our test dataset, while the current state of the art is 85 IoU. However, the specific value varies depending on the dataset and class. Some classes are inherently easier to segment, for example houses or roads, where most models easily reach 90 IoU. Harder classes are trees and people, on which most models achieve around 60 IoU. So we helped our network by focusing on a single class and a limited type of photo.
We still do not feel that our work is "release-ready", as we would like it to be, but we believe it is a good time to stop and discuss our results, since about 50% of the photos give good results.
Here are some good examples that will give you a feel for the app's capabilities:
Image, ground truth, our results (from our test dataset).
Debugging and logging
Debugging is a very important part of training neural networks. At the start of our work, it was very tempting to dive right in: grab the data and the network, start training and see what happens. However, we found that it is extremely important to track every step, examining the results at each stage.
Here are commonly encountered problems and our solutions:
Early problems. The model may not start learning at all. This may be due to some inherent problem or to a preprocessing error, for example forgetting to normalize some chunk of the data. In any case, simple visualization of the results can be very helpful. Here is a good post on this topic.
Debugging the network itself. In the absence of serious problems, training begins with predefined losses and metrics. In segmentation, the main criterion is IoU, the ratio of the intersection to the union. It took us several sessions to start using IoU as the main criterion for our models (rather than cross-entropy loss). Another useful practice was plotting our model's prediction at every epoch. Here is a good article on debugging machine learning models. Note that IoU is not a standard metric/loss in Keras, but you can easily find one online, for example here. We also used this gist to plot the loss and some predictions at each epoch.
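IoU itself is simple to state: for binary person masks, it is the overlapping area divided by the combined area. A minimal NumPy version (the Keras metric linked above wraps the same arithmetic in tensor ops) might look like this:

```python
import numpy as np

def iou(pred, target, eps=1e-7):
    """Intersection over union for binary masks (person vs. background)."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return (intersection + eps) / (union + eps)  # eps guards empty masks

# A prediction covering half of a square ground-truth mask:
target = np.zeros((8, 8), dtype=bool); target[2:6, 2:6] = True  # 16 px
pred = np.zeros((8, 8), dtype=bool);   pred[2:6, 2:4] = True    # 8 px inside
assert abs(iou(pred, target) - 0.5) < 1e-6
```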
Machine learning version control. When training a model, there are many parameters, and some of them are quite tricky. I must say we have not yet found the perfect method, other than diligently recording all our configurations (and automatically saving the best models with Keras callbacks, see below).
Debugging tool. After doing all of the above, we could analyze our work at every step, but not without difficulty. Therefore, the most important step was to combine the steps above and load the data into a Jupyter Notebook, which allowed us to easily load every model and every image and then quickly examine the results. This way we could see the differences between models and spot pitfalls and other problems.
Here are examples of improving our model, achieved by setting parameters and additional training:
To save the model with the best IoU result (Keras conveniently provides very nice callbacks):
callbacks = [keras.callbacks.ModelCheckpoint(hist_model, verbose=1, save_best_only=True, monitor='val_IOU_calc_loss'), plot_losses]
In addition to normal debugging of code errors, we noticed that model errors were "predictable": for example, "cutting off" body parts that were not counted as a body, "gaps" in large segments, unnecessary extensions of body parts, poor lighting, poor image quality, and too much detail. Some of these errors were handled by adding specific images from other datasets, while for others no solution has been found yet. To improve the results in the next version of the model, we will use augmentation on images that are "hard" for our model.
We have already mentioned this above (in the section on dataset problems), but let us now look at some of the difficulties in more detail:
Clothing. Very dark or very light clothing was sometimes interpreted as background.
"Gaps". Results that were otherwise good sometimes had gaps in them.
Clothing and gaps.
Lighting. Poor lighting and darkness are common in photographs, but not in the COCO dataset. Models generally find such pictures difficult, and our model was not prepared for such hard images. This could be addressed by adding more data and by using data augmentation. In the meantime, it is better not to test our application at night :)
An example of poor lighting.
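The lighting augmentation mentioned above can be sketched in a few lines. This is our own illustration of lighting jitter (the gamma range is a hypothetical choice, not one we tuned), not the augmentation pipeline actually used:

```python
import numpy as np

def augment_lighting(image, rng):
    """Randomly darken or brighten an image (values in [0, 1]) via gamma jitter."""
    gamma = rng.uniform(0.5, 2.0)          # gamma > 1 darkens, < 1 brightens
    return np.clip(image ** gamma, 0.0, 1.0)

rng = np.random.default_rng(0)
image = np.full((4, 4, 3), 0.5)            # a flat mid-gray test image
jittered = augment_lighting(image, rng)
assert jittered.shape == image.shape       # augmentation preserves shape
```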
Options for further improvement
Continuation of training
Our results were obtained after about 300 training epochs, after which over-fitting began. We achieved these results very close to release, so we did not have a chance to apply the standard practice of data augmentation.
We trained the model on images resized to 224x224. Training on more data and on larger images (COCO images are originally about 600x1000) should further improve the results.
CRF and other improvements
At some point, we noticed that our results were a bit "noisy" at the edges. A model that can cope with this is the CRF (Conditional Random Field). In this post, the author gives a simplified example of using a CRF.
However, it was not of much use to us, perhaps because this model generally helps when the results are coarser.
Even with our current results, the segmentation is not perfect. Hair, fine clothing, tree branches and other small objects will never be segmented perfectly, if only because the ground-truth segmentation does not contain these nuances. The task of separating such delicate segmentation is called matting, and it poses other challenges. Here is an example of state-of-the-art matting, published earlier this year at an NVIDIA conference.
An example of matting; the input includes a trimap.
The matting task differs from other image processing tasks in that the input includes not only the image but also a trimap, a contour of the edges of the image, which makes matting a "semi-supervised" learning problem.
We experimented a bit with matting, using our segmentation as the trimap, but did not achieve significant results.
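Deriving a trimap from a segmentation mask is typically done by eroding the mask to get a confident foreground and dilating it to mark an "unknown" band around the edge. Here is a minimal NumPy sketch with naive morphology (a real pipeline would use OpenCV's `cv2.erode`/`cv2.dilate`):

```python
import numpy as np

def dilate(mask, r=1):
    """Naive binary dilation with a (2r+1)x(2r+1) square structuring element."""
    h, w = mask.shape
    padded = np.pad(mask, r)                  # treat outside as background
    out = np.zeros_like(mask)
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def erode(mask, r=1):
    """Erosion expressed as dilation of the complement."""
    return ~dilate(~mask, r)

def trimap_from_mask(mask, r=1):
    """255 = confident foreground, 0 = background, 128 = unknown edge band."""
    fg = erode(mask, r)                       # confidently inside the person
    unknown = dilate(mask, r) & ~fg           # uncertain strip along the edge
    trimap = np.zeros(mask.shape, dtype=np.uint8)
    trimap[unknown] = 128
    trimap[fg] = 255
    return trimap

mask = np.zeros((8, 8), dtype=bool)
mask[2:6, 2:6] = True                         # a toy 4x4 "person"
trimap = trimap_from_mask(mask)
assert trimap[4, 4] == 255                    # center stays foreground
```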
Another problem was the lack of suitable training data.
As we said at the beginning, our goal was to create a meaningful deep learning product. As you can see in Alon's posts, deployment is becoming easier and faster all the time. Training a model, on the other hand, is trickier: training, especially when done overnight, requires careful planning, debugging and recording of results.
It is not easy to balance research and trying new things with routine training and improvement. Because we use deep learning, there is always the feeling that a better model, or exactly the model we need, is just around the corner, and that one more Google search or one more article will lead us to it. But in practice, our actual improvements came from "squeezing" more and more out of our original model. And we still feel we can squeeze much more.
We had a lot of fun doing this work, which a few months ago seemed like science fiction.