Computer Vision and What You Eat It With

As computing power has grown and a variety of image processing technologies have emerged, one question comes up more and more often: is it possible to teach a machine to see and recognize images? For example, to tell a cat from a dog, or even a bloodhound from a basset? Recognition accuracy hardly needs discussing: our brains grasp what is in front of us incomparably faster, provided we have previously received enough information about the object. That is, even when we see only part of a dog, we can say with certainty that it is a dog. And if you are a dog breeder, you can easily determine the breed. But how do you teach a machine to tell them apart? What algorithms exist? Is it possible to fool the machine? (Spoiler: of course you can! Just like our brain.) Let's try to make sense of all these questions and, where possible, answer them. So, let's get started.
It should be noted that the neural network here is not a plain convolutional neural network (CNN) but an improved one, a Convolutional Neural Network with Refinement (CNN-R).
Now let's work through the principles of this algorithm: why a convolutional neural network is needed and which parameters it works with. And, of course, it is best to do this hands-on right away. Assistant, to the studio!
This wonderful husky will be our assistant, and we will experiment on her. She is already beaming with happiness and wants to get into the details as soon as possible.
So, in simple terms: a convolutional neural network works with two "commands" (a gross oversimplification, but it makes things easier to follow). The first is to select a fragment; the second is to try to find the whole object. Why the whole object? With a set of the smallest patterns containing the contours of whole objects, it is easier for the network to extract characteristic features later and compare them with the image to be recognized. A diagram of the algorithm in its final form will help us here. For those who want to study it in detail, with tons of math, here is the proof link. And we move on.
The following algorithm is not that hard to explain. It is the complete algorithm of a convolutional neural network with a refinement module (the one drawn with a dashed outline, present after each convolution step).
Step one: saw the image in half (we did this with the husky too, but it looks less impressive than with the elephants in the picture). Then in half again, and again, and again, until we get down to the cockroaches, that is, to blocks of 14x14 and smaller (the minimum size, as already mentioned, is 3x3). The main magic happens between the steps: obtaining a mask of the whole object and the probability that the whole object is present on the sawn-off piece. After all these manipulations we have a set of pieces and a set of masks for them. What do we do with them now? The answer is simple: as the image shrinks, neighboring pixels degrade, which simplifies building a mask in simplified form and makes it more likely that transitions between objects can be distinguished.
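To make the "sawing in half" more concrete, here is a minimal sketch of building such a pyramid of ever smaller images. It is only an illustration under assumptions: the 224x224 input size, the 2x2 averaging, and the names `halve` and `build_pyramid` are hypothetical and are not taken from the network described above.

```python
def halve(image):
    """Downsample a 2D grayscale image by averaging each 2x2 block."""
    rows, cols = len(image) // 2, len(image[0]) // 2
    return [
        [
            (image[2 * r][2 * c] + image[2 * r][2 * c + 1]
             + image[2 * r + 1][2 * c] + image[2 * r + 1][2 * c + 1]) / 4.0
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def build_pyramid(image, min_size=3):
    """Return the image plus its successive halvings.

    Halving stops before either side would drop below min_size.
    """
    levels = [image]
    while min(len(levels[-1]), len(levels[-1][0])) // 2 >= min_size:
        levels.append(halve(levels[-1]))
    return levels


# Hypothetical 224x224 input filled with a simple gradient.
image = [[float(r + c) for c in range(224)] for r in range(224)]
pyramid = build_pyramid(image)
sizes = [(len(level), len(level[0])) for level in pyramid]
print(sizes)
```

With this input the levels run from 224x224 through 14x14 down to 3x3, which matches the block sizes mentioned in the text. Averaging 2x2 blocks is also a simple way to see the "degradation of neighboring pixels": fine detail is blurred away at every level.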
Independent initial descriptions are built from the analysis of the reduced images, and the result is then averaged over all masks. It should be noted, however, that this is not the best option for object segmentation. An averaged mask is not always usable when the color difference between objects and the background is small (which is exactly why this test image was chosen).
The proposed approach is more or less workable, but only for objects that lie entirely within the image and contrast with the background. Separating two identical or similar objects (we need not go far for an example: a flock of sheep, all with white wool) will cause considerable difficulty.
To obtain an adequate pixel mask and separate the objects, the algorithm needs to be polished.
And now it is time to ask: why do we need this "refinement module" if everything already seems good enough?
At each step we add not only mask construction but also the use of all the information obtained from low-level features, combined with knowledge of the high-level objects recognized in the upper layers of the network. Alongside the independent results, generation and coarse recognition are performed at each step. The result is a semantically complete multichannel feature map into which only positive information from the early layers is integrated.
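The averaging over masks can be sketched roughly as follows. This is a toy example under assumptions: the three masks are made up, nearest-neighbour upsampling is just one possible way to bring them to a common size, and `upsample` and `average_masks` are hypothetical helper names.

```python
def upsample(mask, factor):
    """Nearest-neighbour upsampling of a 2D mask by an integer factor."""
    return [
        [mask[r // factor][c // factor] for c in range(len(mask[0]) * factor)]
        for r in range(len(mask) * factor)
    ]


def average_masks(masks, size):
    """Upsample square masks to size x size and average them per pixel."""
    resized = [upsample(m, size // len(m)) for m in masks]
    return [
        [sum(m[r][c] for m in resized) / len(resized) for c in range(size)]
        for r in range(size)
    ]


# Made-up masks from three scales (1.0 = object, 0.0 = background).
coarse = [[1.0]]
medium = [[1.0, 0.0],
          [1.0, 0.0]]
fine = [[1.0, 1.0, 0.0, 0.0],
        [1.0, 0.0, 0.0, 0.0],
        [1.0, 1.0, 0.0, 0.0],
        [1.0, 1.0, 1.0, 0.0]]

averaged = average_masks([coarse, medium, fine], 4)
# Threshold the averaged probabilities to get a binary mask.
binary = [[1 if p >= 0.5 else 0 for p in row] for row in averaged]
for row in binary:
    print(row)
```

Note how the hole in the fine mask at row 1, column 1 is filled in after averaging, because both coarser masks vote "object" there. That smoothing is useful on contrasting objects and is exactly what hurts when the object and background differ only slightly.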
What is the difference between a classical convolutional neural network and an improved convolutional neural network?
Each step that reduces the original image yields an encoded mask, generated going from the full image to the reduced one. The improved convolutional neural network, meanwhile, also moves in the opposite direction, from the smaller image to the larger one, picking up characteristic features (points). The result is a mask produced by the bidirectional fusion of features and characteristic points.
That sounds quite confusing. Let's try applying it to our image.
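The movement from the smaller image back to the larger one can be illustrated with a toy sketch. A real refinement module would learn how to combine the coarse mask with the features of the finer level; here a fixed weighted average stands in for that learned step, and the names `refine` and `top_down`, the `weight` parameter, and all the numbers are hypothetical.

```python
def upsample2x(grid):
    """Double the resolution of a 2D grid by repeating each value."""
    return [
        [grid[r // 2][c // 2] for c in range(2 * len(grid[0]))]
        for r in range(2 * len(grid))
    ]


def refine(coarse_mask, skip_features, weight=0.5):
    """One top-down step: upsample the mask and blend it with the
    features saved at the same resolution on the way down."""
    up = upsample2x(coarse_mask)
    return [
        [weight * up[r][c] + (1.0 - weight) * skip_features[r][c]
         for c in range(len(up[0]))]
        for r in range(len(up))
    ]


def top_down(coarsest_mask, skips):
    """Apply refinement steps from the coarsest level to the finest."""
    mask = coarsest_mask
    for features in skips:  # ordered from coarse to fine
        mask = refine(mask, features)
    return mask


coarsest = [[0.8]]  # made-up 1x1 "object is here" score
skips = [
    [[1.0, 0.0],
     [1.0, 0.0]],
    [[1.0, 1.0, 0.0, 0.0],
     [1.0, 0.0, 0.0, 0.0],
     [1.0, 1.0, 0.0, 0.0],
     [1.0, 1.0, 1.0, 0.0]],
]
refined = top_down(coarsest, skips)
for row in refined:
    print([round(p, 2) for p in row])
```

The point of the sketch is only the data flow: the coarse score says roughly where the object is, and each finer level sharpens the boundary with its own detail.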
Impressive. This is called DeepMask: rough outlines of the objects in the image. Let's try to figure out why it came out this way. Let's start with something simple, the nose. When the image is degraded the contrast is obvious, so the nose is picked out as an independent object. The same goes for the nostrils: at certain convolution levels they became independent objects because they fit entirely within a fragment. In addition, the muzzle of the second dog is outlined separately (what, you don't see it? It's right there in front of you!). A piece of the hand was recognized as background. Well, what can you do? Its color does not contrast with the main part of the background. But the "hand-T-shirt" transition is highlighted. There are also errors: a "hand-grid" spot, and a large spot that covers both the background and the husky's head.
Well, an interesting result! How can we help the algorithm cope with the task better? Only by tormenting our assistant. First, let's try converting the image to grayscale. Hmm, the photo looks worse than a passport picture. After that the lab assistant had nothing left to lose, so I simply turned on one filter after another: sepia, followed by solarization, overlaying and subtracting layers, blurring the background and adding images together. In short, whatever I saw, I applied; the main thing is that the objects become more visible in the image. As they say, a picture in Photoshop will put up with anything. Enough tormenting; time to see how the image is recognized now.
The neural network said: "There is nobody here that can be classified in any way." That makes sense: only the nose and eyes are contrasting, which is not very characteristic given a small training set.
Wow, what a horror (sorry, friend)! But here we seriously increased the contrast of the objects in the image. How? Take the image and duplicate it (several times if you like). On one copy we crank up the contrast, on another the brightness, on a third we burn the colors (make them unnaturally bright), and then we add it all together. And finally we shove the long-suffering husky into processing once more.
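The duplicate-and-combine trick can be sketched per pixel like this. It is a rough illustration, not the actual Photoshop recipe: the factors 1.5, 0.1 and 2.0 are arbitrary, the copies are averaged rather than literally summed so that values stay in range, and all function names are hypothetical. Pixels are RGB tuples with channels in the range 0.0 to 1.0.

```python
import colorsys


def clamp(value):
    """Keep a channel value inside the valid 0.0..1.0 range."""
    return max(0.0, min(1.0, value))


def adjust_contrast(pixel, factor):
    """Push channel values away from mid-gray (0.5)."""
    return tuple(clamp(0.5 + factor * (ch - 0.5)) for ch in pixel)


def adjust_brightness(pixel, offset):
    """Shift every channel by the same amount."""
    return tuple(clamp(ch + offset) for ch in pixel)


def adjust_saturation(pixel, factor):
    """'Burn' the colors by scaling saturation in HSV space."""
    h, s, v = colorsys.rgb_to_hsv(*pixel)
    return colorsys.hsv_to_rgb(h, clamp(s * factor), v)


def enhance(pixel):
    """Make three modified copies of the pixel and average them."""
    variants = [
        adjust_contrast(pixel, 1.5),
        adjust_brightness(pixel, 0.1),
        adjust_saturation(pixel, 2.0),
    ]
    return tuple(sum(v[i] for v in variants) / len(variants) for i in range(3))


def enhance_image(image):
    """Apply the enhancement to every pixel of a 2D image."""
    return [[enhance(pixel) for pixel in row] for row in image]


# Tiny made-up image: one grayish pixel and one reddish pixel.
image = [[(0.5, 0.5, 0.5), (0.8, 0.4, 0.4)]]
result = enhance_image(image)
print(result)
```

The gray pixel barely changes, while the reddish one becomes both more contrasting and more saturated, which is the effect the preprocessing is after.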
I must say it got better. Not exactly "oh, how cool", but better. The number of incomplete objects decreased, and contrast between objects appeared. Further experiments with preprocessing will give more contrast, both object to object and object to background. We get 4 segments instead of 8. When processing a large stream of images (150 images per minute), it is better not to bother with preprocessing at all. It is more something to play around with on a home computer.
Let's move on. SharpMask will be practically no different. SharpMask builds a refined mask of the objects. The refinement algorithm is the same.
The main problem with DeepMask is that the model uses a simple feedforward network, which successfully creates "rough" masks but does not perform segmentation with pixel accuracy. We skip the example for this step, since the husky has had a hard enough time already.
The last step is an attempt to recognize what we ended up with after the mask was refined.
We launch a hastily assembled demo and get the result: a whole 70% that it is a dog. Not bad.
But how did the machine know it was a dog? We sawed the image into pieces and got nice masks, their matrices, and sets of features. What next? From here everything is simple: we have a trained network with reference sets of features, masks, and so on and so forth. Fine, they exist, but then what? Here is our husky with its set of features, and here is a reference spherical dog in a vacuum with its set of features. A naive head-on comparison will not work, because features missing from the image being recognized would lead to an error. So what do we do? For this, a wonderful technique called dropout was invented: randomly dropping network parameters. It means the following: both sets are taken, and features are randomly removed from each (in other words, given sets of 10 features and dropout = 0.1, one feature is discarded and then we compare). And the result? The result is PROFIT.
Let me answer right away why the second dog is not a dog and the hand is not a person. The test sample was only 1000 images of cats and dogs, and the network was trained for only two evolution steps.
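The comparison with dropout, as described in the paragraph above, can be sketched like this. The feature values are made up, cosine similarity is only one possible way to compare the two sets, and `drop_features` and `similarity` are hypothetical names. Keep in mind that in deep learning frameworks dropout is normally applied to the network's units during training; the sketch follows the simplified description given here.

```python
import random


def drop_features(features, rate, rng):
    """Randomly discard round(rate * n) features.

    Returns a dict {index: value} of the features that were kept.
    """
    n_drop = round(rate * len(features))
    dropped = set(rng.sample(range(len(features)), n_drop))
    return {i: v for i, v in enumerate(features) if i not in dropped}


def similarity(a, b):
    """Cosine similarity over the indices that survived in both sets."""
    shared = sorted(set(a) & set(b))
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sum(a[i] ** 2 for i in shared) ** 0.5
    norm_b = sum(b[i] ** 2 for i in shared) ** 0.5
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Made-up sets of 10 features each.
husky = [0.9, 0.1, 0.8, 0.7, 0.0, 0.6, 0.9, 0.2, 0.5, 0.4]
reference_dog = [1.0, 0.0, 0.9, 0.8, 0.1, 0.5, 1.0, 0.1, 0.6, 0.5]

kept_husky = drop_features(husky, 0.1, rng)
kept_reference = drop_features(reference_dog, 0.1, rng)
score = similarity(kept_husky, kept_reference)
print(len(kept_husky), len(kept_reference), score)
```

With dropout = 0.1 and 10 features, exactly one feature is discarded from each set, as in the example in the text. Because a different feature is dropped on each run, no single missing feature can decide the outcome on its own.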

Instead of conclusions

So, we got a picture and a result saying that it is a dog (obvious to us, but not so obvious to a neural network). Training took just two steps, and the model was not evolved (which is very important). The image was loaded as is, without preprocessing. In the future, we plan to check whether the neural network can recognize dogs from a minimal set of features.
Pros:
Training ran on a single machine. If necessary, the files can be copied straight to other machines, and those machines will be able to do the same.
High recognition accuracy once the network has been trained and evolved (which takes a very long time).
The network can be evolved for different recognition algorithms and "pre-trained".
Huge image databases, COCO and VOC (updated annually).
Cons: dancing with a tambourine around every framework.
P.S. No husky was harmed during the experiments. How we assembled what we assembled, how many rakes we stepped on and where, and so on: all that is in the next article of our series "Machine vision for housewives".
P.P.S. And the very last thing: there is no silver bullet, but there is a Faustpatrone that can be finished off with a file. For a quick start, the following frameworks were used: