Multi-armed bandits in recommendations
Hello! My name is Misha Kamenshchikov, I do data science and microservice development in the recommendations team at Avito. In this article, I'll talk about our recommendations of similar ads and how we improve them with multi-armed bandits. I gave a talk on this topic at the HighLoad++ Siberia conference and at the «Data & Science: Marketing» event.
Recommendations on Avito
The similar-ads service
Rankers and configs
A/B tests and multi-armed bandits
Choosing the target actions
Recommendations on Avito
First, a brief overview of all the types of recommendations that exist on Avito.
Personalized User-Item recommendations
User-Item recommendations are built from the user's actions and are designed to help them find what they need faster, or to show them something potentially interesting. They can be seen on the main page of the site and apps, or in email digests. To build such recommendations, we use two types of models: offline and online.
The offline model uses matrix factorization algorithms that are trained on all the data over several days, and the recommendations are written to a fast store for serving to users. Online models compute recommendations on the fly, based on the content of ads from the user's history. The advantage of offline models is that they produce more accurate, higher-quality recommendations, but they cannot take into account the most recent user actions, which happened after the training sample was fixed. Online models react to user actions immediately, and the recommendations can change with every action.
Every recommender system faces the so-called «cold start» problem: the models described above cannot recommend anything to a new user with no history of actions. In such cases, it is better to introduce the user to what is on the site, choosing not just random ads but, for example, ads from categories that are popular in the user's region.
For users who often use search, we build recommended search shortcuts: they take the user to a particular category and set the filters. Such recommendations can be found at the top of the main page in the apps.
Item-Item recommendations appear on the site as the block of similar ads on an ad page. They can be seen on all platforms below the ad description. The current model is purely content-based and does not use information about user actions, so for a new ad we can immediately find similar ones among the active ads on the site. It is this kind of recommendation that I discuss in the rest of the article.
We have already written about all the types of recommendations on Avito in more detail in our blog; read it if you are interested.
The similar-ads service
Here's how the block with similar ads looks:
This type of recommendation appeared on the site first, and its logic was implemented on top of the Sphinx search engine. Essentially, the model is a linear combination of various factors: word overlap, distance, price difference, and other parameters.
Based on the parameters of the target ad, a query to Sphinx is generated, and its built-in rankers are used for ranking.
```sql
SELECT
  weight AS ranker_weight,
  ranker_weight * 10 +
    (metro_id = 42) * 5 +
    (location_id = 2525) * 10 +
    (110000 / (abs(price - 1100000) + 110000)) * 5 AS rel
WHERE MATCH('@title bike | scott | scale | 700 | premium') AND microcat_id = 100
ORDER BY rel DESC, sort_time DESC
LIMIT ??? OPTION max_matches = ???,
  ranker = expr('sum(word_count)')
```
If we describe this approach more formally, we can represent the rankers as a set of weights $w_i$, and the ad parameters as $x_i^s$ (source, the original ad) and $x_i^t$ (target, the candidate ad). For each parameter we introduce a similarity function $f_i$. These functions can be binary (for example, whether the two ads are in the same city), discrete (the distance between the ads), or continuous (for example, the relative price difference). The final score for a pair of ads is then expressed by the formula

$$score(s, t) = \sum_i w_i \, f_i(x_i^s, x_i^t)$$
With the help of the search engine, this formula can be computed quickly for a fairly large number of ads, and the output can be ranked in real time.
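To make the formula concrete, here is a minimal sketch of how such a score could be computed; the similarity functions and weights are illustrative assumptions, not our production values.

```python
def price_similarity(source_price, target_price):
    # Continuous similarity: 1.0 for equal prices, decaying with the difference.
    return source_price / (abs(source_price - target_price) + source_price)

def score(source, target, weights):
    # Linear combination of per-parameter similarity functions f_i.
    features = {
        "same_city": float(source["city"] == target["city"]),         # binary
        "same_metro": float(source["metro"] == target["metro"]),      # binary
        "price": price_similarity(source["price"], target["price"]),  # continuous
    }
    return sum(weights[name] * value for name, value in features.items())

# Illustrative weights, analogous to the multipliers in the Sphinx query above.
weights = {"same_city": 10.0, "same_metro": 5.0, "price": 5.0}
source = {"city": "Novosibirsk", "metro": 42, "price": 1_100_000}
target = {"city": "Novosibirsk", "metro": 42, "price": 990_000}
print(score(source, target, weights))  # higher score = more similar
```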
This approach performed quite well in its original form, but it had significant shortcomings.
Problems of the approach
First, the weights were initially chosen by experts and were never questioned. Sometimes the weights were changed, but on the basis of isolated feedback rather than an analysis of user behavior as a whole.
Secondly, the weights were hardcoded into the site's code, and changing them was very inconvenient.
Step towards automation
The first step toward improving the recommendation model was to extract all the logic into a separate Python microservice. This service was owned entirely by our recommendations team, and we could run experiments much more often.
Each experiment is characterized by a so-called «config»: a set of weights for the rankers (a hypothetical example is sketched below). We want to generate configs and select the best of them based on user actions. But how can we generate these configs?
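For illustration only, a config could be a plain mapping from ranking factors to weights; the field names here are made up for the example:

```python
# A hypothetical config: one weight per ranking factor.
config = {
    "word_match": 10.0,     # title word overlap
    "same_metro": 5.0,      # bonus if both ads are near the same metro station
    "same_location": 10.0,  # bonus for matching location
    "price": 5.0,           # weight of the price-similarity term
}
```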
The first option, which we had from the very beginning, is expert generation of configs. That is, using domain knowledge, we assume, for example, that people shopping for a car are interested in options close in price to the one they are viewing, not just similar models that may cost much more.
The second option is random configs. We assign a distribution to each parameter and then simply sample from it. This method is faster because we do not have to think through every parameter for every category.
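A minimal sketch of random config generation, assuming one uniform distribution per weight (the ranges are made up):

```python
import random

# Hypothetical per-weight distributions: (low, high) bounds for uniform sampling.
PARAM_RANGES = {
    "word_match": (0.0, 20.0),
    "same_metro": (0.0, 10.0),
    "same_location": (0.0, 20.0),
    "price": (0.0, 10.0),
}

def random_config():
    # Sample each weight independently from its distribution.
    return {name: random.uniform(low, high)
            for name, (low, high) in PARAM_RANGES.items()}

configs = [random_config() for _ in range(10)]
```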
An even more complicated option, which requires a sufficiently long history of experiments, is to use machine learning. We can represent a set of parameters as a feature vector and the target metric as the target variable, and then search for the set of parameters that, according to the model, maximizes our target metric.
In our case, we settled on the first two approaches, plus genetic algorithms in their simplest form: keep the best, remove the worst.
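That «keep the best, remove the worst» step could look like this sketch, reusing random_config from above; the metric callable is a placeholder for whatever target metric the experiment measures:

```python
def evolve(configs, metric, keep=5, spawn=5):
    # Rank configs by the observed target metric, keep the strongest,
    # and refill the pool with fresh random configs.
    ranked = sorted(configs, key=metric, reverse=True)
    return ranked[:keep] + [random_config() for _ in range(spawn)]
```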
Now we come to the most interesting part of the article: we can generate configs, but how do we run experiments quickly and efficiently?
Carrying out the experiments
An experiment can be run in the A/B test format. To do this, we generate N configs, wait for statistically significant results, and finally choose the best config. This can take quite a long time, and for the duration of the test a fixed group of users may receive poor-quality recommendations, which is quite possible with randomly generated configs. Also, if we want to add a new config to the experiment, we have to restart the test. But we want to run experiments quickly and be able to change the experimental conditions on the fly, so that users do not suffer from obviously bad configs. This is where multi-armed bandits help us.
The name of the method comes from «one-armed bandits», the casino slot machines with a lever you pull hoping for a win. Imagine you are in a hall with a dozen such machines and have N free attempts to play any of them. Naturally, you want to win as much as possible, but you do not know in advance which machine pays out the most. The multi-armed bandit problem is exactly this: finding the most profitable machine with minimal losses (from playing unprofitable ones).
Formulated in terms of our recommendations: the machines are configs, each attempt is the choice of a config to generate recommendations, and the payoff is some target metric that depends on user feedback.
The advantage of bandits over A/B tests
Their main advantage is that they let us change the share of traffic depending on how successful each config is. This solves exactly the problem of people getting bad recommendations for the duration of the experiment: if users do not click on the recommendations, the bandit quickly learns this and stops choosing that config. In addition, we can always add a new config to the experiment or simply remove one of the current ones. Together, this gives us flexibility in running experiments and solves the problems of the A/B testing approach.
There are many strategies for solving the multi-armed bandit problem. Next, I'll describe a few classes of easy-to-implement strategies that we tried to apply to our task. But first we need to understand how to compare the effectiveness of strategies. If we knew in advance which arm gives the maximum reward, the optimal strategy would be to always pull that arm. Fix the number of steps and compute the optimal reward $r_{opt}$. For a given strategy, we likewise compute its reward $r$ and introduce the notion of regret:

$$regret = r_{opt} - r$$

Strategies can then be compared by this value. Note that the way $regret$ changes over time can differ: one strategy may be better for a small number of steps and another for a larger one.
Greedy strategies

As you can guess from the name, greedy strategies are based on one simple principle: always choose the arm that gives the greatest reward on average. Strategies of this class differ in how they explore the environment to determine which arm that is. There is, for example, the $\varepsilon$-greedy strategy. It has a single parameter, $\varepsilon$, which determines the probability of choosing a random arm instead of the best one, thereby exploring the environment. The probability of exploration can also be reduced over time; that variant is called the $\varepsilon$-decreasing strategy. Greedy strategies are very simple to implement and understand, but not always effective.
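A minimal sketch of the $\varepsilon$-greedy strategy (a toy implementation, not our production code):

```python
import random

class EpsilonGreedy:
    """Pick the best arm so far with probability 1 - epsilon, else a random arm."""

    def __init__(self, n_arms, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = [0] * n_arms    # pulls per arm
        self.values = [0.0] * n_arms  # running mean reward per arm

    def select_arm(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.counts))  # explore
        return max(range(len(self.counts)), key=lambda j: self.values[j])  # exploit

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental update of the mean reward.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```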
UCB1

This strategy is completely deterministic: the arm is uniquely determined by the information available at the moment. Here's the formula:

$$j = \arg\max_{j} \left( \bar{x}_j + \sqrt{\frac{2 \ln n}{n_j}} \right)$$

The $\bar{x}_j$ part of the formula is the mean reward of the j-th arm and is responsible for exploitation, just as in greedy strategies. The second part of the formula is responsible for exploration: $n$ is the total number of actions, and $n_j$ is the number of times the j-th arm was pulled. For this strategy there is a proven bound on the regret; you can read about it here.
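A matching sketch of UCB1 under the same toy interface:

```python
import math

class UCB1:
    """Deterministic strategy: mean reward plus an exploration bonus."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms

    def select_arm(self):
        # Play every arm once first so the exploration bonus is well-defined.
        for j, c in enumerate(self.counts):
            if c == 0:
                return j
        n = sum(self.counts)
        ucb = [self.values[j] + math.sqrt(2 * math.log(n) / self.counts[j])
               for j in range(len(self.counts))]
        return max(range(len(ucb)), key=ucb.__getitem__)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```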
Thompson sampling

In this strategy, we associate each arm with a probability distribution, and at each step we sample a number from each distribution and choose the arm with the maximum. Based on the feedback, we update the distribution parameters so that the best arms correspond to distributions with a larger mean, whose variance decreases with the number of actions. More details about this strategy can be found in an excellent article.
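A sketch of Thompson sampling for Bernoulli rewards with Beta priors, in the same toy interface:

```python
import random

class ThompsonSampling:
    """Beta-Bernoulli Thompson sampling: one Beta(alpha, beta) per arm."""

    def __init__(self, n_arms):
        self.alpha = [1.0] * n_arms  # successes + 1
        self.beta = [1.0] * n_arms   # failures + 1

    def select_arm(self):
        # Sample from each arm's posterior and pick the largest draw.
        samples = [random.betavariate(a, b)
                   for a, b in zip(self.alpha, self.beta)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, reward):
        # Bernoulli reward: 1 counts as a success, 0 as a failure.
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward
```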
Comparison of strategies
Let's simulate the strategies described above in an artificial environment with three arms, each following a Bernoulli distribution with its own success probability. Our strategies are $\varepsilon$-greedy, UCB1, and Thompson sampling.

The graph shows that the $regret$ of the greedy strategy grows linearly, while for the other two strategies it grows logarithmically, and Thompson sampling performs noticeably better than the rest: its $regret$ barely grows with the number of steps.
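A sketch of such a simulation, reusing the three strategy classes from the sketches above; the arm probabilities and step count are made-up values, so the numbers will differ from the graph:

```python
import random

ARM_PROBS = [0.3, 0.5, 0.7]  # made-up Bernoulli success probabilities
STEPS = 10_000

def simulate(strategy, probs, steps):
    # Accumulate expected regret: the gap between the best arm and the chosen one.
    best, regret = max(probs), 0.0
    for _ in range(steps):
        arm = strategy.select_arm()
        reward = 1.0 if random.random() < probs[arm] else 0.0
        strategy.update(arm, reward)
        regret += best - probs[arm]
    return regret

for strategy in (EpsilonGreedy(3, epsilon=0.1), UCB1(3), ThompsonSampling(3)):
    print(type(strategy).__name__, simulate(strategy, ARM_PROBS, STEPS))
```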
I have introduced you to multi-armed bandits, and now I can tell you how we used them.
Multi-armed bandits at Avito
So, we have several configs (sets of parameters), and we want to choose the best ones with the help of multi-armed bandits. For the bandits to learn, we need one important ingredient: user feedback. At the same time, the choice of the actions we train on must match our goals: we want better recommendations to lead to more deals.
Choosing the target actions
Our first approach to choosing target actions was simple and naive. As the target metric, we chose the number of views, and the bandits learned to optimize this metric well. But there was a problem: more views are not always a good thing. For example, in the «Auto» category we managed to increase the number of views by 15%, but the number of contact requests fell by about the same amount. On closer inspection, it turned out that the bandits had chosen an arm in which the region filter was turned off. So the block showed ads from all over Russia, where there are, of course, many more similar ads to choose from. This drove up the number of views: the recommendations looked better on the surface, but when users opened an ad, they realized the car was very far away from them and did not request the contact.
The second approach was to train on the conversion from a view of the block to a contact request. This seemed better than the previous one, if only because this metric is almost directly responsible for the quality of the recommendations. But there was another problem: daily seasonality. Depending on the time of day, conversion can differ by a factor of four (!), which is often more than the effect of a better config, so the arm that was lucky enough to be chosen during a high-conversion period kept getting chosen more often than the others.
Daily dynamics of conversion (the values on the Y axis are altered)
Finally, the third approach, which we use now. We select a reference group that always receives recommendations from the same algorithm and is not affected by the bandits. We then look at the absolute numbers of contacts in our target group and in the reference group. Their ratio is not subject to seasonality, since the reference group is selected at random, and this approach maps neatly onto Thompson sampling.
Our strategy

We select the reference and target groups as described above. Then we initialize N arms (each corresponding to a config) with the Beta(1, 1) distribution. At each step we:

sample a number from each arm's distribution and select the arm with the maximum;

count the number of actions in the target and reference groups over a certain quantum of time (in our case, 10 seconds) and update the distribution parameters of the selected arm:

$$\alpha_j \mathrel{+}= c_{target}, \qquad \beta_j \mathrel{+}= c_{reference}$$
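A minimal sketch of this loop, assuming the Beta parameterization above; serve_config and fetch_contact_counts are hypothetical placeholders for routing target-group traffic and reading the contact counters for the last time quantum:

```python
import random
import time

N_CONFIGS = 10  # hypothetical number of configs in the experiment

def serve_config(arm):
    """Placeholder: route target-group traffic to the chosen config."""

def fetch_contact_counts():
    """Placeholder: contacts in the target and reference groups for the last quantum."""
    return random.randint(0, 5), random.randint(0, 5)

alpha = [1.0] * N_CONFIGS  # Beta(1, 1) prior for every arm
beta = [1.0] * N_CONFIGS

while True:
    # Thompson sampling step: sample from each arm's Beta, serve the winner.
    samples = [random.betavariate(a, b) for a, b in zip(alpha, beta)]
    arm = max(range(N_CONFIGS), key=samples.__getitem__)
    serve_config(arm)

    time.sleep(10)  # the time quantum from the text
    c_target, c_reference = fetch_contact_counts()

    # Target-group contacts act as successes and reference-group contacts as
    # failures, so each arm's posterior mean tracks the target/reference ratio
    # and daily seasonality cancels out.
    alpha[arm] += c_target
    beta[arm] += c_reference
```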
Using this strategy, we optimize exactly the metric we need, the number of contact requests in the target group, and we avoid the problems described above. Moreover, in a global A/B test this approach showed the best results.
The global A/B test was structured as follows: all ads are randomly split into two groups. For the ads in one of them we show recommendations driven by the bandits, and for the other, the old expert algorithm. Then we measure the number of contact requests in each group (requests made after navigating to an ad from the similar-ads block). On average across all categories, the bandit group gets about 10% more target actions than the control group, and in some categories the difference reaches 30%.
The platform we built also lets us quickly change the logic, add configs to the bandits, and validate hypotheses.
If you have questions about our recommendation service or about multi-armed bandits, ask them in the comments. I will be glad to answer.