Optical trackers: ASEF and MOSSE

One of the important subtasks of video analytics is object tracking in video. The problem is not so trivial that one must descend to the per-pixel level, yet not so hard that it unambiguously demands a multilayer neural network. Tracking can be an end in itself, or a component of other algorithms:

- Counting unique people who entered a certain zone or crossed a border in the frame

- Identifying typical routes of cars in a parking lot or of people in a store

- Automatically rotating a CCTV camera when the object moves

Even without consulting the literature, I can say with confidence that the best way to solve the task is to use neural networks. In principle, nothing more would need to be written, but it is not always possible to throw a pair of GTX 1080Ti cards at the problem. If you are curious how to track objects in video in such cases, welcome under the cut. I will try not just to explain how the ASEF and MOSSE trackers work, but to lead you to the solution so that the formulas seem obvious.

This idea is proposed in the paper Average of Synthetic Exact Filters (ASEF):

$$H^*_{ASEF} = \frac{1}{N}\sum_{i=1}^{N} H^*_i$$

Here we use the linearity of the Fourier transform: by adding the frequencies, as shown above, we effectively average the weights of several filters.

Alternatively, you can look for filter frequencies that fit all the images on average at once, much as in least-squares regression:

$$\min_{H^*} \sum_{i=1}^{N} \left| F_i \odot H^* - G_i \right|^2$$

To find the minimum, we take the derivative with respect to the filter elements and set it to zero:

$$\frac{\partial}{\partial H^*_{uv}} \sum_{i=1}^{N} \left| F_{i,uv}\, H^*_{uv} - G_{i,uv} \right|^2 = 0$$

An honest derivation can be found in the paper Visual Object Tracking using Adaptive Correlation Filters, which introduces Minimum Output Sum of Squared Error filters (MOSSE filters). The result is:

$$H^*_{MOSSE} = \frac{\sum_i G_i \odot \bar{F_i}}{\sum_i F_i \odot \bar{F_i}}$$

(here $\bar{F_i}$ denotes the element-wise complex conjugate, and the division is element-wise as well).

Hmm, similar terms appear in both formulas. Indeed, for $N = 1$ the formulas for ASEF and MOSSE coincide exactly.

The exact filter for a single image can be written as

$$H^*_i = \frac{G_i \odot \bar{F_i}}{F_i \odot \bar{F_i}}$$

Substituting this into the ASEF formula gives

$$H^*_{ASEF} = \frac{1}{N}\sum_{i=1}^{N} \frac{G_i \odot \bar{F_i}}{F_i \odot \bar{F_i}}$$

Aha! Now it is much easier to see that ASEF and MOSSE differ only in how the filters are averaged: MOSSE sums the numerators and denominators before dividing, while ASEF divides per image and then averages. It is claimed that MOSSE produces better filters than ASEF, which sounds logical: tuning to the whole set of images at once is better than averaging per-image filters.
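The difference between the two averaging schemes can be sketched in a few lines of NumPy. This is an illustration, not the article's code; the function names and the regularizing `eps` are mine:

```python
import numpy as np

def asef_filter(imgs, resps, eps=1e-5):
    """ASEF: build an exact filter per image, then average the filters."""
    filters = []
    for img, resp in zip(imgs, resps):
        F = np.fft.fft2(img)
        G = np.fft.fft2(resp)
        filters.append(G * np.conj(F) / (F * np.conj(F) + eps))
    return np.mean(filters, axis=0)

def mosse_filter(imgs, resps, eps=1e-5):
    """MOSSE: pool the numerators and denominators, then divide once."""
    A = np.zeros(imgs[0].shape, dtype=complex)
    B = np.zeros(imgs[0].shape, dtype=complex)
    for img, resp in zip(imgs, resps):
        F = np.fft.fft2(img)
        G = np.fft.fft2(resp)
        A += G * np.conj(F)
        B += F * np.conj(F)
    return A / (B + eps)
```

For a single training image the two functions return the same filter, as noted above.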

Once we have obtained $H^*$, we compute the response in the frequency domain, $G = F \odot H^*$, transform it back to the spatial domain, and search for the maximum of the resulting matrix $g$. The location of the maximum is the new position of the object.
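This step can be sketched as follows (a minimal illustration in NumPy; the function name and argument layout are mine, and `H_conj` is assumed to already hold the conjugate filter frequencies):

```python
import numpy as np

def track_step(patch, H_conj):
    """Correlate a preprocessed patch with the filter in the frequency
    domain and return the peak location (row, col) of the response."""
    F = np.fft.fft2(patch)
    G = F * H_conj                   # response in the frequency domain
    g = np.real(np.fft.ifft2(G))     # back to the spatial domain
    peak = np.unravel_index(np.argmax(g), g.shape)
    return peak, g
```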

### Additional details

The terms in the denominators of the formulas have an interesting physical meaning: $F_i \odot \bar{F_i}$ is the energy spectrum of the rectangle containing the $i$-th image.

Pay attention to the interesting symmetry: to obtain the response, we multiplied the filter frequencies by the image frequencies; now, to obtain the filter frequencies, we multiply the response frequencies by the image frequencies (and normalize).

In real life, the element-wise division can run into division by zero, so a regularizing constant $\epsilon$ is usually added to the denominator:

$$H^* = \frac{\sum_i G_i \odot \bar{F_i}}{\sum_i F_i \odot \bar{F_i} + \epsilon}$$

It is argued that such regularization forces the filter to pay more attention to low frequencies, which improves its ability to generalize.

When processing real video, you usually want to keep the information about the tracked object accumulated from previous frames. When moving to the next frame, you need not compute $H^*$ from scratch; you can update the previous one instead. The update formula for ASEF is

$$H^*_t = \eta\, H^*_{new} + (1 - \eta)\, H^*_{t-1}$$

For MOSSE the numerator and denominator must be accumulated separately:

$$A_t = \eta\, G_t \odot \bar{F_t} + (1 - \eta)\, A_{t-1}$$
$$B_t = \eta\, F_t \odot \bar{F_t} + (1 - \eta)\, B_{t-1}$$
$$H^*_t = \frac{A_t}{B_t + \epsilon}$$

where $\eta$ is the learning rate.
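The MOSSE running update fits in a few lines (an illustrative sketch; the function name is mine):

```python
import numpy as np

def mosse_update(A, B, F, G, eta, eps=1e-5):
    """Exponential running average of the MOSSE numerator and denominator.

    F, G are the frequency-domain patch and desired response; eta is the
    learning rate. Returns the updated A, B and the conjugate filter H*.
    """
    A = eta * (G * np.conj(F)) + (1.0 - eta) * A
    B = eta * (F * np.conj(F)) + (1.0 - eta) * B
    H_conj = A / (B + eps)
    return A, B, H_conj
```

With `eta = 1` the accumulators are simply replaced, which is what happens on the very first frame.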

It is important to remember that computing the correlation via the Fourier transform is not quite the same as computing it honestly, as described at the beginning of the article. In the FFT the missing elements are not zeroed out; instead they wrap around from the opposite side, as if the image were looped from right to left and from bottom to top. But at the very beginning of the article we already decided to darken the edges of the window, so this problem has no noticeable effect.

As mentioned above, cross-correlation has an unpleasant property: a bright filter can give a strong response on bright areas of the image even when their contrasting details do not match. The problem does not end there. Even a single matching pixel with a strongly positive or strongly negative value can confuse the filter if the sample is generally low-contrast. To smooth out this effect, the preprocessing must include a non-linear transformation of the image pixels that "presses" too-bright and too-dark areas toward the middle. Thanks to this, matches between genuinely contrasting parts contribute more to the metric. The ASEF and MOSSE papers use the logarithm:

$$I' = \ln(I + 1)$$

where the pixels of $I$ range from 0 to 255. In my opinion this is too harsh, and it ignores the mirror problem of a strong response of a *dark* filter on *black* areas. The following scheme works better:

After that comes normalization, so that most of the elements end up centered around zero.
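A preprocessing routine in the spirit of the original MOSSE paper might look like this (this sketch uses the paper's log transform rather than the author's modified scheme, whose formula is not reproduced here):

```python
import numpy as np

def preprocess(patch):
    """MOSSE-style preprocessing: log transform to compress bright areas,
    zero-mean unit-variance normalization, then a cosine (Hanning) window
    that gently darkens the patch edges."""
    p = np.log1p(patch.astype(np.float64))         # press bright values down
    p = (p - p.mean()) / (p.std() + 1e-5)           # center around zero
    win = np.outer(np.hanning(p.shape[0]), np.hanning(p.shape[1]))
    return p * win
```

The cosine window is also what suppresses the FFT wrap-around artifacts mentioned earlier.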

How can we detect, with such an algorithm, that the tracked object has disappeared from the frame? A more detailed analysis of the response obtained from the next frame helps here. The creators of MOSSE propose the PSR metric, the Peak to Sidelobe Ratio. Let $g_{max}$ be the maximum element of the response $g$, corresponding to the new position of the object. We exclude from consideration a small square (11×11 in the original paper) around this maximum and compute the mean $\mu$ and the standard deviation $\sigma$ of the remaining pixels. Then

$$PSR = \frac{g_{max} - \mu}{\sigma}$$

If this value is above a certain threshold, the detection is considered successful. The threshold is usually taken between 3 and 10; for reliable detections the PSR usually stays above 20.

(Note that PSR here does not mean what it usually means in signal processing theory, so do not google its usual definition; it will not help.)
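The PSR computation is straightforward (an illustrative sketch; the function name and the `exclude` parameter are mine):

```python
import numpy as np

def psr(response, peak, exclude=11):
    """Peak-to-Sidelobe Ratio: the peak value against the mean/std of the
    response outside an exclusion window around the peak."""
    r, c = peak
    half = exclude // 2
    mask = np.ones_like(response, dtype=bool)
    mask[max(0, r - half):r + half + 1, max(0, c - half):c + half + 1] = False
    sidelobe = response[mask]
    return (response[r, c] - sidelobe.mean()) / (sidelobe.std() + 1e-5)
```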

The algorithm is extremely simple: one tracking step on a Core i7 with 320x400 images, using the OpenCV implementation, takes from 0.5 to 2 milliseconds depending on the size of the tracked object.

### The MOSSE algorithm

Let us put everything together.

**Persistent state:**

- the filter frequency matrix $H^*$;
- the auxiliary matrices $A$ and $B$ used to compute the filter frequencies;
- the frequency matrix $G$ of the desired ideal response;
- the learning rate $\eta$ used during tracking;
- the rectangle $R$ of the current position of the object;
- the number of training transformations $N$;
- the response threshold $PSR_{min}$;
- the auxiliary preprocessing function described above (crop, log transform, normalization, windowing mask).

**Training**. Input: image $I$, current learning rate $\eta_{cur}$.

While $N$ transformations have not yet been applied:

1. Apply a small random affine transformation to the image, centered at the center of $R$.
2. Cut out the rectangle $R$ with the object from the image.
3. Apply the mask that gently darkens the edges of the patch.
4. Transform the patch into the frequency domain: $F = \mathcal{F}(patch)$.
5. If this is the very first training sample, set $A$ and $B$ directly to $G \odot \bar{F}$ and $F \odot \bar{F}$. Otherwise update them: $A \leftarrow \eta_{cur}\, G \odot \bar{F} + (1 - \eta_{cur})\, A$, $B \leftarrow \eta_{cur}\, F \odot \bar{F} + (1 - \eta_{cur})\, B$.

Calculate the filter frequencies: $H^* = \frac{A}{B + \epsilon}$.

**Initialization**. Input: image $I$, rectangle $R$ around the position of the object.

1. Prepare the desired response $g$. Usually this is a matrix filled with zeros, with a small Gaussian in the center.
2. Transform it into the frequency domain: $G = \mathcal{F}(g)$.
3. Run **Training** on $I$ with $\eta_{cur} = 1$.

**Tracking**. Input: image $I$.

1. Cut the rectangle $R$ out of $I$ at the previously known position of the object.
2. Apply the mask that gently darkens the edges of the patch.
3. Transform it into the frequency domain: $F = \mathcal{F}(patch)$.
4. Compute the response and transform it into the spatial domain: $g = \mathcal{F}^{-1}(F \odot H^*)$.
5. Find the maximum of $g$: its value $g_{max}$ and its position $(x, y)$.
6. Compute the response strength $PSR$.
7. If $PSR < PSR_{min}$, exit with a failure.
8. Update the position $R$ so that it is centered on the maximum. Adjust $R$ if it went outside the image, or if it was decided that the object has grown or shrunk.
9. Run **Training** on $I$ with the regular learning rate $\eta$.
10. Return $R$.

Implementation details may vary. For example:

- Only the rectangle $R$ can be preprocessed, rather than the entire image.
- $g$ can be recreated for each image transformation, with a varying windowing function and response width.
- You can simultaneously train several different filters at several scales of the object, to detect movement toward and away from the camera.
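The whole loop can be collected into a compact sketch. This is a minimal single-scale illustration under several assumptions of mine: no affine-augmentation training, no scale handling, grayscale frames, and default values for `sigma`, `eta` and `psr_min` that are plausible but not taken from the article:

```python
import numpy as np

EPS = 1e-5

class Mosse:
    """Minimal single-scale MOSSE tracker sketch; bbox is (x, y, w, h)."""

    def __init__(self, frame, bbox, sigma=2.0, eta=0.125, psr_min=8.0):
        self.bbox, self.eta, self.psr_min = bbox, eta, psr_min
        x, y, w, h = bbox
        # Desired response: zeros with a small Gaussian at the center.
        ys, xs = np.mgrid[0:h, 0:w]
        g = np.exp(-((xs - w // 2) ** 2 + (ys - h // 2) ** 2)
                   / (2 * sigma ** 2))
        self.G = np.fft.fft2(g)
        F = np.fft.fft2(self._patch(frame))
        # First sample: replace A and B rather than blend.
        self.A = self.G * np.conj(F)
        self.B = F * np.conj(F)

    def _patch(self, frame):
        """Crop R, then log-transform, normalize and window the patch."""
        x, y, w, h = self.bbox
        p = np.log1p(frame[y:y + h, x:x + w].astype(np.float64))
        p = (p - p.mean()) / (p.std() + EPS)
        return p * np.outer(np.hanning(h), np.hanning(w))

    def update(self, frame):
        """One tracking step; returns the new bbox or None on failure."""
        x, y, w, h = self.bbox
        F = np.fft.fft2(self._patch(frame))
        g = np.real(np.fft.ifft2(F * (self.A / (self.B + EPS))))
        r, c = np.unravel_index(np.argmax(g), g.shape)
        # PSR: peak against the sidelobe outside an 11x11 window.
        mask = np.ones_like(g, dtype=bool)
        mask[max(0, r - 5):r + 6, max(0, c - 5):c + 6] = False
        psr = (g[r, c] - g[mask].mean()) / (g[mask].std() + EPS)
        if psr < self.psr_min:
            return None  # target lost
        self.bbox = (x + c - w // 2, y + r - h // 2, w, h)
        # Online update of the numerator and denominator.
        F = np.fft.fft2(self._patch(frame))
        self.A = self.eta * self.G * np.conj(F) + (1 - self.eta) * self.A
        self.B = self.eta * F * np.conj(F) + (1 - self.eta) * self.B
        return self.bbox
```

A real implementation would also clip the rectangle to the image bounds and add the augmented-training step; see the OpenCV implementation mentioned below.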

## What it looks like

To begin with, several examples of the two-dimensional Fourier transform.

Some simple examples:

Recall that the result of the transform is complex-valued. The image groups below show: the image itself, the absolute values of its frequency domain on a linear scale, and the same on a logarithmic scale.

Vertical lines:

The picture changes from left to right in the same way at every vertical position, and the change is periodic, with a clear period and pattern. Therefore in the frequency picture you see only frequencies along the horizontal axis.

Grid:

Note that, as expected, there are series of frequencies along both the horizontal and vertical axes, but also strange parasitic frequencies. They appear because, firstly, the picture is finite, whereas the Fourier transform decomposes into a beautiful sum only for an infinite periodic signal; secondly, the image does not form an exact period at its edges.

Inclined lines:

Again, both the frequencies corresponding to the main direction and the parasitic frequencies are visible.

Inclined lines plus distortion:

Several characteristic directions can be seen in the frequency image, but it is already difficult to reconstruct the picture from them intuitively.

For real-world images it is even harder to imagine the picture from its frequencies:

(the frequencies in the center are masked out so that they do not outshine the rest of the spectrum)

Now let us move on to a real working example:

A bunch of pictures:

The image with the marked object:

The cropped and preprocessed object, with its spectrum on linear and logarithmic scales:

The desired response:

The filter frequencies on linear and logarithmic scales:

The explicit filter weights (computed without the preprocessing transformations):

Note that the explicit weights do not participate anywhere in the algorithm; you compute them only out of curiosity. Also note that *the filter looks like who knows what*. One might expect the filter to resemble the original image, but this is not always true: a filter similar to the image itself would hardly produce the desired Gaussian response.

Response from the next frame:

Although it is not as clean as the desired response, the maximum is easy to locate on it.

The same example with a narrower desired response:

A bunch of pictures:

Narrower:

Narrower still:

With a very narrow desired maximum, an eye becomes clearly visible on the filter instead of a dark spot.

The responses for the previous three cases:

With 16 transformations of the input image used for training:

Another bunch of pictures:

Wide maximum:

Medium maximum:

Narrow maximum:

The more transformations, the less the filter clings to random details. It is especially noticeable that the random black-and-white speckles in the middle of the filter have disappeared. On the other hand, for a narrow Gaussian, training on several pictures can be a minus: look at the "ringing" that forms in the filter around the eye.

If you want to see how this looks live, you can download my test repository with a MOSSE implementation that outputs debug images. On GitHub you can find more MOSSE variants. It is also implemented in OpenCV.

## Conclusion

Thank you for your attention, dear Habr readers. MOSSE and ASEF tracking are not the most complex algorithms in the world, which makes it all the easier not only to apply them effectively, but also to understand what guided their creators. I hope my explanation helped you get inside the researchers' heads and follow their train of thought. This can be useful: machine learning is not a static field of knowledge; it has room for creativity and research of your own. Try digging into some well-established algorithm: saw off unneeded parts to speed it up, or add a couple so that it works better in your particular case. You will like it!

*The article was written with the support of DSSL.*

*Author: weber. Published 10-09-2018 in Machine learning / Mathematics / Algorithms.*