Integration of dovecot and Apache Solr

Good afternoon.
Today, email is still one of the key means of communication in the corporate segment. The volume of stored mail only grows, and over time it reaches hundreds of gigabytes or even a few terabytes. At that point users usually start to run into problems, for example with search. If you use a web client such as RoundCube and search all messages in all folders, including the message bodies, you often have to wait tens of seconds for the result, which is not very pleasant. That is why I decided it was time to configure the FTS plug-in in dovecot.
The tokenizer class describes how Solr breaks a sentence into words. This schema uses solr.ClassicTokenizerFactory. According to the documentation, the sentence
"Please, email [email protected] by 03-0? re: m37-xq."
is split into the following tokens:
Please, email, [email protected], by, 03-0?, re, m37-xq.
I am more than happy with this, but not everyone will agree, so you can pick a different tokenizer class that suits your system better. See the documentation link I gave above.
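To get a feel for this splitting before touching the schema, here is a rough Python approximation. It is only a sketch: the real ClassicTokenizer uses a much larger grammar, the function name approx_classic_tokenize is made up for this illustration, and the sample sentence uses a hypothetical address and date of my own.

```python
import re

# Rough approximation only; NOT the actual Lucene ClassicTokenizer grammar.
TOKEN_RE = re.compile(
    r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"  # an email address is kept as one token
    r"|\w+(?:[-.]\w+)*"              # a word, possibly joined by hyphens or dots
)


def approx_classic_tokenize(text):
    """Split text roughly the way the tokenizer description above suggests."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        candidate = match.group()
        if "@" in candidate or any(ch.isdigit() for ch in candidate):
            # Emails and words containing digits are kept whole, hyphens included.
            tokens.append(candidate)
        else:
            # Purely alphabetic hyphenated words are split at the hyphens.
            tokens.extend(candidate.split("-"))
    return tokens


# Hypothetical sample sentence, not taken from the Solr documentation.
print(approx_classic_tokenize("Please, email user@example.com by 03-09, re: m37-xq."))
# ['Please', 'email', 'user@example.com', 'by', '03-09', 're', 'm37-xq']
```

Punctuation is dropped, the address and the tokens containing digits survive intact, and a word like "well-known" would be split into "well" and "known".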
The filter class describes how the words that come out of the tokenizer are processed. Various filters and parameters can be specified here, and you can read about them at the link I gave. I will describe the main ones:
solr.EdgeNGramFilterFactory - generates prefix tokens from each word according to the minGramSize and maxGramSize parameters. I have them set to 1 and 40, which means that the word "domains" produces the tokens "d", "do", "dom", "doma", "domai", "domain", "domains". Such tokens are created up to 40 characters in length. There is a small nuance here: if a word is longer than 40 characters, for example 50, and the user enters a search query longer than 40 but shorter than 50 characters, the result will be empty. That is why I set such a large number: I have never seen an email address longer than 40 characters, and the longest word in Russian is only 25 characters.
solr.LowerCaseFilterFactory - converts all words to lower case, so that the search does not depend on the case of the characters entered.
solr.StopFilterFactory - tells Solr which words should not be indexed at all and are simply ignored. The words are listed in a file, which is specified via the words parameter.
solr.EnglishMinimalStemFilterFactory - a filter for handling English plurals: "dogs" is converted to "dog", and so on.
solr.EnglishPossessiveFilterFactory - also for English words; it removes possessive endings: "Man's" is converted to "Man".
solr.KeywordMarkerFilterFactory - a language-related filter, described in detail in the documentation. If I understand correctly, it defines exception words that Solr indexes without any preliminary modification, so to speak "as is".
These filters can be used both in the index analyzer and in the query analyzer. Naturally, the two analyzers can have different settings and do not affect each other. That completes the schema.
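The effect of the lowercase, stop-word and edge n-gram filters can be sketched in a few lines of Python. This is only an illustration of the idea and not Solr code: the function names, the order of the steps and the stop-word set are my assumptions.

```python
def edge_ngrams(token, min_gram=1, max_gram=40):
    """Return the prefixes of token from min_gram up to max_gram characters."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


def analyze(tokens, stop_words, min_gram=1, max_gram=40):
    """Lowercase each token, drop stop words, then expand into edge n-grams."""
    result = []
    for token in tokens:
        lowered = token.lower()
        if lowered in stop_words:
            continue  # stop words are not indexed at all
        result.extend(edge_ngrams(lowered, min_gram, max_gram))
    return result


print(edge_ngrams("domains"))
# ['d', 'do', 'dom', 'doma', 'domai', 'domain', 'domains']

# Hypothetical stop-word set; in Solr the words come from a file.
print(analyze(["The", "Dogs"], {"the"}, max_gram=3))
# ['d', 'do', 'dog']

# The nuance with long words: prefixes beyond max_gram are never indexed.
indexed = set(edge_ngrams("x" * 50))
print(("x" * 40) in indexed)  # True
print(("x" * 45) in indexed)  # False
```

The last two lines show why a 45-character query cannot match a 50-character word when maxGramSize is 40: no token of that length exists in the index.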
Now move on to solrconfig.xml. Starting with version 7, Solr uses the JSON format for communication by default, but the dovecot plugin uses XML. Therefore, we need to find several parameters in the file and fix them (this does not apply to Solr 6).
In the block (around line 745):

Change the «defaults» block to the following form:
