The history of a single query

Imagine your first day at a new job. The office is near the Kurskaya metro station, in an area completely unknown to you. Lunchtime is coming. You open the search app, type “Eat at the Kursk”, and get a selection of places where you can dine.

What is behind the query “Eat at the Kursk”, and how is it processed to find exactly what you need? In this article I will describe how the 2GIS Search team does everything possible to make life in cities more convenient and comfortable for its users.

It is important to understand that the text of a search query is only the tip of the iceberg, a small part of what the user interacts with. Besides the text, a search query carries a lot of other data: personalized information about the user's location, the map section they are looking at, their set of favorite records, and the active search mode (for example, search on the map, search inside a building, or a route search). This data is the key to good search functionality.

We pay great attention to our data and its structure. In fact, we call the search algorithm in 2GIS itself a structural search, because it is designed for efficient and fast search over our structured data. We specially prepare the search index and the data it is built from. I will not go into details; I will only say that the data is organized so that it is fairly simple to process, compresses well and, most importantly, can be processed quickly even on mobile devices.

Moreover, the search works offline, which places special demands on the speed and size of the search index. We pay great attention to this feature: we constantly compress the search index, measure performance on various platforms, and speed up functionality wherever a specific search case exceeds its time limit. We can even boast that a typical search query in 2GIS on a mobile device completes faster than the application draws the drop-down list of results.

Below I will lift the veil of secrecy covering the magic of our search. As an example, take the query “Eat at the Kursk” mentioned above. Let us walk through the stages of its processing and see what happens at each of them. So let's go!

Stage 1. Parsing
Input: the query “Eat at the Kursk”

First of all, we need to parse the text of the query. This is important because after parsing we can work not with the raw query text but with the set of tokens it is composed of. Tokens are the individual query words. In our case we get a set of three tokens: “Eat”, “On”, “Kursk”. It would seem that everything is simple, but the devil is in the details, and sometimes things are not so obvious: for example, in queries such as “wi-fi” or “2nd” we must understand that such combinations should be processed as a whole.

Each token contains the text of the word, its position in the query, whether a separator follows it, and some characteristics of the word: the case of its characters, and whether the word is a number, an ordinal, a phone number, an address, or other data.

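The token structure described above can be sketched as follows. This is a minimal illustration, not 2GIS code: the regular expression, function name, and field names are my own assumptions, covering only the two special cases the article mentions (hyphenated words and ordinals).

```python
import re

# Combinations that must survive as a single token (hypothetical patterns):
# hyphenated words like "wi-fi" and ordinals like "2nd", then plain words.
TOKEN_RE = re.compile(r"[a-z]+-[a-z]+|\d+(?:st|nd|rd|th)|\w+", re.IGNORECASE)

def tokenize(query):
    """Split a query into tokens, keeping text, position and simple traits."""
    return [
        {"text": m.group(0), "pos": m.start(), "is_number": m.group(0).isdigit()}
        for m in TOKEN_RE.finditer(query)
    ]
```

For a query like “free wi-fi 2nd floor” this yields the tokens “free”, “wi-fi”, “2nd”, “floor”, rather than splitting the hyphenated word or the ordinal apart.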
Stage 2. Dictionary search
Input: the tokens “Eat”, “On”, “Kursk”

With the list of query tokens ready, we proceed to the dictionary search stage, where for each token we find a list of word forms from our data. A word form is encoded information about the root of a word and its ending.

The dictionary search is an algorithm that walks a dictionary represented as a graph. Its nodes are letters, or rather characters: the graph consists of character nodes and transitions between them. The result of traversing this dictionary is the set of word forms that can be reached from the tokens produced at the previous stage. We try to find in the dictionary a sequence of characters that matches the current query token. But, as we all know, users make typos, skip letters, or even type in the wrong keyboard layout. Therefore, while traversing the dictionary we apply certain manipulations to account for the human factor, or to guess what the person is typing right now: various transformations of character chains such as insertions, replacements, appended characters, and the like.

As a result, for each query token we extract a set of word forms with information about the root and ending of the word, the number of characters in the word form, and a match score. The score characterizes the dictionary distance from the found character sequence to the desired one, that is, how much the found string differs from the query token.

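The real engine walks a character graph applying insertions, replacements and appended characters as it goes. The same idea can be sketched in a brute-force way with plain edit distance; all names here are my own, and the graph traversal is replaced by a linear scan over a toy dictionary.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def lookup(token, dictionary, max_dist=2):
    """Return (word, score) pairs close to the token, best matches first:
    a stand-in for the dictionary-graph traversal with transformations."""
    scored = [(w, edit_distance(token, w)) for w in dictionary]
    return sorted([p for p in scored if p[1] <= max_dist], key=lambda p: p[1])
```

A typo such as “kursl” would still reach “kursk” at distance 1, which becomes the match score attached to the word form.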
So for the tokens we find the word forms:

- 13 forms for “Eat”: “Eat”, “eat”, “paese”, “payot”, …
- 3 forms for “On”: “Na”, “nu”, “on”
- 48 forms for “Kursk”: “Kursk”, “Kursk”, “Kursk”, “Kursk”, “Kurako”, …

Next, the found word forms are filtered by their score. The best of them, that is, those closest to the words of the query, go into a list of terms. By a term we mean a word form together with its match score. In addition to the found word forms, terms produced by the rules of morphology are added to the list. An important part of morphology processing is assigning a morphological score. The point is that our search uses a strong morphology mechanism which, following the rules of natural language, lets us find what interests the user by meaning, not only by similarity of words.

So for the tokens the following terms will be created:

- “Eat”: “Eat”, “eat”, “eat”, “eat”, “eat”
- “On”: “On”, “na”, “nu”
- “Kursk”: “Kursk”, “Kursk”, “Kursk”, “Kursk”, “Kursk”

This is where the dictionary search ends. We have processed the query and now have, for each token, a list of terms that go on to further processing. These terms contain all the information about the word they represent, together with a score of how each of them was found.

Stage 3. Searching for occurrences in the data
Input: a set of terms for each part of the query

- “Eat”: “Eat”, “eat”, “eat”, “eat”, “eat”
- “On”: “On”, “na”, “nu”
- “Kursk”: “Kursk”, “Kursk”, “Kursk”, “Kursk”, “Kursk”

Having obtained from the previous stage a set of terms for each part of the query, we proceed to search for them in our index. Each document in the data has multiple titles and is written into an inverted index, so that we can easily find all references to a given term in the specific documents representing organizations, addresses or anything else.

For each part of the query, and for each of its terms, we look for documents containing the words encoded in the terms. For the parts of our query the following occurrences are found:

- “Eat”: 18 occurrences
- “On”: 4304 occurrences
- “Kursk”: 79 occurrences

Obviously, the preposition “On” occurs many times and therefore falls into a great variety of document titles: “coffee on takeaway”, “play on a console”, “putting cash registers on the record”. “Eat” and “Kursk” are also used repeatedly, and the second word, with its terms, is found in our data much more often than the first.

A hit is a match between a word from the search query and a word from a specific document. Hits are saved into a list that will be analyzed at the next stage. When adding a hit, we not only copy the term's data about the word in the document, but also compute the best estimate of how the word could have been found: we choose either the term's morphological score or its dictionary score, whichever is closer to the query token.

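The choice of the best score when a hit is recorded can be sketched like this. The class and field names are my own assumptions; lower scores are taken to mean a closer match to the query token.

```python
from dataclasses import dataclass

@dataclass
class Term:
    word: str
    dict_score: float   # distance found while traversing the dictionary
    morph_score: float  # distance implied by the morphology rules

def best_hit_score(term):
    """A hit keeps whichever score is closer to the query token."""
    return min(term.dict_score, term.morph_score)
```

For a term found with dictionary distance 1.0 but recognized by morphology as an exact inflection (score 0.0), the hit is recorded with score 0.0.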
This stage is a prelude to the search proper. We have prepared a set of hits in specific documents, on the basis of which the next algorithm will select and evaluate what should be shown to the user as a result.

Stage 4. The heart of the search
Input: the hit list

- “Eat”: 18 occurrences
- “On”: 4304 occurrences
- “Kursk”: 79 occurrences

In fact, the hit list in our implementation is a tricky container. It is important to understand that when hits are added to it, special nodes are created that store the hits themselves together with a link to the document those hits fell into.

Therefore, it would be more correct to present the input data as follows:
Input: a container of document nodes

- document 1: {hits, …}
- document 2: {hits, …}
- document 3: {hits, …}
- document 4: {hits, …}
- …

The search begins by traversing the document tree, and each node is handed to the analyzer, which evaluates whether the document in that node is good enough to make it into the output. To give a sense of the volumes the analyzer works with: at the start the container holds more than 3000 nodes, and nodes can be added during the traversal, so in practice even more are processed. It would not be an exaggeration to say that analysis is the most expensive part of the search and at the same time one of the most complex and largest functions of the project. Nevertheless, it runs extremely fast, even on mobile devices.

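Grouping hits into per-document nodes, as the container above does, can be sketched as follows. This is a simplification under my own naming: real nodes also carry structural links between documents and are added to during traversal.

```python
from collections import defaultdict

def group_hits_by_document(hits):
    """Build the node container: one node per document, holding its hits."""
    nodes = defaultdict(list)
    for doc_id, hit in hits:
        nodes[doc_id].append(hit)
    return nodes

hits = [(1, "eat"), (2, "on"), (1, "on"), (3, "kursk")]
nodes = group_hits_by_document(hits)
```

The analyzer can then iterate over document nodes (`nodes.items()`) rather than over a flat stream of hits.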
Stage 5. Analysis
Input: a document node: {hits, …}

 3r33450. 3r33434.
 3r33450. The analysis starts with a list of titles from the site. The title is a title and a list of hits that fell into this title of the document. These titles will be evaluated at the first stage. It is important for us to know the usefulness of each title. Utility can be good, weak and junk.
 3r33450.
For each title we select the best chain of hits: the best in length and in dictionary score, composed of similar hits. The utility of the title is then assessed from this best chain. To determine the utility of a chain we also use a mechanism based on the frequency of words in the documents. Roughly speaking, the more often a word occurs in a document, the more important it is (TF-IDF). So, if the chain contains a hit on an important word from the document title without strong morphological differences (for example, only a different number or gender), we consider the title useful. A title can also be useful if its hits completely cover the words of the document title.

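The frequency mechanism referred to above is the classic TF-IDF weighting. Below is a textbook sketch of it, not the production formula; the corpus is invented for illustration.

```python
import math

def tf_idf(word, doc_words, all_docs):
    """Weight a word: frequent in this document, but rare across the corpus."""
    tf = doc_words.count(word) / len(doc_words)
    containing = sum(1 for d in all_docs if word in d)
    idf = math.log(len(all_docs) / (1 + containing))
    return tf * idf

corpus = [
    ["coffee", "on", "takeaway"],
    ["eat", "on", "the", "go"],
    ["kursk", "station"],
]
```

With this weighting the preposition “on”, which occurs in most documents, gets a near-zero weight, while a distinctive word such as “coffee” scores noticeably higher.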
From the utilities of all titles we form a utility mask for the node. This mask tells us which query tokens are covered by the node under analysis, and with its help we largely decide whether to analyze the node further.

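The utility mask can be pictured as a bitmask over the query tokens. This is a sketch under my own assumptions: the real mask also distinguishes good, weak and junk utility rather than a single bit per token.

```python
def utility_mask(query_tokens, covered_tokens):
    """Set bit i when query token i is covered by the node's titles."""
    mask = 0
    for i, tok in enumerate(query_tokens):
        if tok in covered_tokens:
            mask |= 1 << i
    return mask
```

A node covering “eat” and “kursk” but not “on” gets mask `0b101`; only nodes whose mask covers enough of the query proceed to further analysis.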
As a result, we have not just a document from the index but a set of structural data representing a potential result, with information about how it can be found.

Stage 6. Scoring
Input: a document node: {hits, …}

Depending on the utility mask, we either process the node or move straight on to the next one. From the set of processed nodes we accumulate various information about them as a whole: for example, the set of node titles, the relationships between nodes, and some other data.

Next comes the analysis of the node titles in relation to each other. In effect, the set of nodes is combined into a graph of nodes, which we then evaluate.

From the titles of the graph's nodes we obtain a list of ranked titles. Simply put, from the set of nodes we compile a single list of titles, in which each element also receives a score and a combination of factors from the hit titles of all participating nodes.

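Compiling a single ranked list from the titles of many nodes could look roughly like this; the data layout and scoring direction (higher is better) are my own assumptions for illustration.

```python
def rank_titles(nodes):
    """Flatten the titles of all nodes into one list, best score first."""
    titles = [t for node in nodes for t in node["titles"]]
    return sorted(titles, key=lambda t: t["score"], reverse=True)

nodes = [
    {"titles": [{"text": "Kursk station cafe", "score": 0.9}]},
    {"titles": [{"text": "coffee on takeaway", "score": 0.4},
                {"text": "eat on the go", "score": 0.7}]},
]
ranked = rank_titles(nodes)
```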
A score is a structure with information about the number of query words involved in a title and many other factors about how a word was found and processed, starting from the dictionary search stage. Importantly, the scores from the ranked titles take part in the selection of the best scores: some of them are marked as selected and contribute to the final score of the result that the user sees.

The overall score provides the result with information that will be extremely important when sorting the entire output. It contains factors such as the number of missing query words: a measure of how many words were not covered by the node's structural information.

Based on the information about title utility, the clarity of the result is determined. Clarity can be good, weak or bad, and it is computed from all the selected titles of the node being processed. All of this data has a dramatic impact on the further fate of the results and the order in which they are shown.

Gradually we approach the end of the node analysis. Before a node finally leaves the analyzer and becomes a potential result, we perform a few more refining manipulations, for example checking the compatibility of the selected document titles.

A node that has passed through all the circles of the analyzer forms a result containing the selected titles and a document. A result that gives good coverage of the search query is sent on to post-processing.

Stage 7. Post-processing
Input: results constructed from nodes

The analyzer filters out many index records before they can become results. Even so, it can approve and pass on a great many potential results. To show the user only the most useful ones, in order of relevance, we need to cut off the worst options found by the analyzer.

The overall score mentioned at the previous stage allows us to cut off the worst results during post-processing. The grading works as follows: results that failed to cover some of the query tokens are obviously worse than results that cover the user's query completely, and results with worse clarity are less desirable than results with good clarity. The post-processor accumulates information about the incoming results, discards those that are obviously worse, and adds the rest to the list.

Before handing the information about cafes to our hungry user, we perform the final processing: sorting by relevance. Many factors take part in the sorting, computed and accumulated at different stages of the search. In the end, the query “Eat at the Kursk” yields more than 500 results. Many of them were found in the same way and therefore have the same score; such results are ordered by their popularity among users.
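The grading described in this stage can be condensed into a single sort key. This is a sketch with invented field names and an assumed encoding: lower is better for missing words and clarity, higher is better for popularity.

```python
GOOD, WEAK, BAD = 0, 1, 2  # clarity grades, best first (assumed encoding)

def sort_key(result):
    """Fewer missing query words first, then better clarity,
    then higher user popularity."""
    return (result["missing_words"], result["clarity"], -result["popularity"])

results = [
    {"name": "Cafe A", "missing_words": 0, "clarity": GOOD, "popularity": 90},
    {"name": "Bar B",  "missing_words": 1, "clarity": GOOD, "popularity": 99},
    {"name": "Cafe C", "missing_words": 0, "clarity": GOOD, "popularity": 95},
]
ranked = sorted(results, key=sort_key)
```

Results that cover the whole query come first, and results found equally well are ordered by popularity, as described above.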
Conclusion

We got an output full of cafes and restaurants and, happy, head off to lunch. All the results arrived in a fraction of a second, and we did not even need an internet connection.

The application stores the search indexes on the device. This scheme poses non-trivial tasks of optimizing data compression and processing speed, because the search must work quickly on a wide range of mobile devices. But that is a completely different story.

Today I tried to look under the hood of our search engine and show how we help users find what they need in the city, quickly and conveniently. I hope it was informative.