Lossless migration of ElasticSearch data

The academic approach to data warehouse design recommends keeping everything in normalized form, with links between tables. Changes are then applied by the rules of relational algebra, which gives reliable storage with transaction support. Atomicity, Consistency, Isolation, Durability, that is all. In other words, the storage is built specifically for safe data updates. But it is not optimal for searching, especially for broad searches across many tables and fields. That takes indexes, many indexes. Volumes grow and writes slow down. SQL LIKE is not served by an index, and JOIN with GROUP BY sends the query planner off to meditate.
 
The growing load on a single machine forces you to scale, either vertically up to the ceiling, or horizontally by buying a few more nodes. Fault-tolerance requirements cause the data to be spread across multiple nodes. The requirement of immediate recovery after a failure, without denial of service, forces you to configure the cluster so that at any moment any machine can serve both writes and reads. That is, it either already is a master, or becomes one automatically and immediately.
 
The problem of fast search was solved by installing a second storage optimized for indexing. Full-text search, faceted search, with stemming and blackjack. The second storage takes records from the tables of the first as input, analyzes them and builds an index. Thus, the data storage cluster was supplemented by one more cluster for searching the data, with a similar master configuration to match the overall SLA. All is well, the business is delighted, and the admins sleep at night, as long as the number of machines in the master-master cluster does not exceed three.
 
Elastic
 
The NoSQL movement has greatly expanded the scaling horizon for both small and big data. The nodes of a NoSQL cluster can distribute data among themselves so that the failure of one or more of them does not lead to denial of service for the whole cluster. The price of high availability of distributed data is the inability to guarantee full consistency on write at every moment. Instead, NoSQL talks about eventual consistency. That is, it is assumed that one day all the data will spread across the cluster nodes and become consistent in the long run.
 
So the relational model was supplemented by the non-relational one, which spawned many database engines that solve the problems of the CAP triangle with varying success. Developers got their hands on fashionable tools for building their own ideal persistence layer, for every taste, budget and load profile.
 
ElasticSearch is a representative of clustered NoSQL with a RESTful JSON API on top of the Lucene engine. It is an open source Java program that can not only build a search index but also store the original document. This trick makes you rethink the role of a separate database for storing the originals, or even abandon it completely. End of introduction.
 
Mapping
 
Mapping in ElasticSearch is something like a schema (a table structure, in SQL terms) that tells exactly how to index incoming documents (records, in SQL terms). A mapping can be static, dynamic, or absent. A static mapping does not allow itself to be changed. A dynamic one allows new fields to be added. If no mapping is specified, ElasticSearch creates one itself on receiving the first document to write: it analyzes the structure of the fields, makes some assumptions about the data types in them, applies the default settings and writes the document. Such schemaless behavior seems very convenient at first glance. But in fact it is better suited to experiments than to production, where it leads to surprises.
 
So, the data is indexed, and this is a one-way process. Once created, a mapping cannot be changed on the fly the way ALTER TABLE does it in SQL. That is because an SQL table stores the original document, to which a search index can be attached. In ElasticSearch it is the other way around: it is itself the search index, to which the original document can be attached. That is why the index schema is static. In theory, one could either add a field to the mapping or delete one. In practice, ElasticSearch only allows adding fields. An attempt to remove a field has no effect.
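
As a rough illustration of the difference between a static and a dynamic mapping, the sketch below builds the request body for creating an index. It rests on several assumptions: the typeless mapping layout of recent ElasticSearch versions (older versions nest `properties` under a type name), the `dynamic` mapping setting, where `"strict"` rejects documents with unknown fields and `true` lets new fields be added, and field names that are only examples. The helper `index_body` is hypothetical.

```python
import json
from typing import Any, Dict


def index_body(properties: Dict[str, str], static: bool) -> Dict[str, Any]:
    """Build a create-index body with an explicit mapping.

    `properties` maps a field name to its ElasticSearch type.
    """
    return {
        "mappings": {
            # "strict" refuses documents with unknown fields (static mapping);
            # True lets ElasticSearch add new fields on the fly (dynamic mapping).
            "dynamic": "strict" if static else True,
            "properties": {
                name: {"type": field_type}
                for name, field_type in properties.items()
            },
        }
    }


# Hypothetical fields, for illustration only.
body = index_body({"speech_number": "long", "text_entry": "text"}, static=True)
print(json.dumps(body, indent=2))
```

The resulting body would be sent when the index is created, so that the first incoming document does not decide the schema.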
 
Alias
 
An alias is an additional name for an ElasticSearch index. One index can have several aliases. Or one alias can point to several indexes; then the indexes are logically combined and look like a single index from the outside. An alias is very convenient for services that talk to an index throughout their lifetime. For example, the alias product can hide either products_v2 or products_v25 behind it, without any need to change names in the service. An alias is indispensable for data migration: once the data has been transferred from the old schema to the new one, the application has to be switched to the new index. Switching an alias from one index to another is an atomic operation. That is, it is performed in a single step, without losses.
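
The atomic switch can be sketched as a single request to the `_aliases` endpoint, where the `remove` and `add` actions travel in one body and are applied together. The helper name `alias_switch_body` is hypothetical, and the alias and index names are taken from the example above purely for illustration.

```python
import json
from typing import Any, Dict


def alias_switch_body(alias: str, old_index: str, new_index: str) -> Dict[str, Any]:
    """Build the body that moves `alias` from `old_index` to `new_index`."""
    return {
        "actions": [
            # Both actions go in one request, so clients never see
            # the alias pointing at nothing or at both indexes.
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


switch = alias_switch_body("product", "products_v2", "products_v25")
print(json.dumps(switch, indent=2))
```

The body would be sent with `POST /_aliases`; the service keeps using the alias name and does not notice the switch.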
 
Reindex API
 
The data schema, the mapping, tends to change from time to time. New fields are added, unneeded ones are removed. If ElasticSearch plays the role of the only storage, some tool is needed to change the mapping on the fly. For this there is a special command that transfers data from one index to another, the so-called _reindex API. It works with a prepared or an empty mapping of the destination index and runs on the server side, quickly indexing in batches of 1000 documents at a time.
 
Reindexing can perform simple field type conversions. For example, long to text and back to long, or boolean to text and back to boolean. But not every type can be converted to boolean; this is not PHP. On the other hand, type conversion is unsafe. A service written in a dynamically typed language may well forgive such a sin. But if reindexing cannot convert a type, the document simply is not written. In general, a data migration should proceed in three stages: add the new field, release the service that uses it, clean up the old field.
 
A field is added as follows. The schema of the source index is taken, the new property is added to it, and an empty index is created with that schema. Then reindexing is started:
 
    POST _reindex
    {
      "source": {
        "index": "test"
      },
      "dest": {
        "index": "test_clone"
      }
    }
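
The steps for adding a field can be sketched as a small planning function that returns the two requests to send: create the empty destination index with the extended mapping, then start reindexing. This is a minimal sketch under assumptions: the mapping uses the typeless layout of recent ElasticSearch versions, the source mapping has already been fetched, and `plan_add_field` and the field name `field4` are hypothetical.

```python
import copy
from typing import Any, Dict, List, Tuple

Request = Tuple[str, str, Dict[str, Any]]  # (HTTP method, path, JSON body)


def plan_add_field(
    source_mapping: Dict[str, Any],
    source_index: str,
    dest_index: str,
    field: str,
    field_type: str,
) -> List[Request]:
    """Plan the requests that clone an index with one extra field."""
    # Copy the mapping so the caller's dictionary is left untouched.
    mapping = copy.deepcopy(source_mapping)
    mapping.setdefault("properties", {})[field] = {"type": field_type}
    return [
        # 1. Create the empty destination index with the extended mapping.
        ("PUT", f"/{dest_index}", {"mappings": mapping}),
        # 2. Copy all documents from the source index on the server side.
        ("POST", "/_reindex", {
            "source": {"index": source_index},
            "dest": {"index": dest_index},
        }),
    ]


original = {"properties": {"field1": {"type": "text"}}}
add_plan = plan_add_field(original, "test", "test_clone", "field4", "long")
for method, path, payload in add_plan:
    print(method, path, payload)
```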

 

A field is removed in a similar way. The schema of the source index is taken, the field is removed from it, and an empty index is created. Then reindexing is started with a list of the fields to copy:


 
    POST _reindex
    {
      "source": {
        "index": "test",
        "_source": ["field1", "field3"]
      },
      "dest": {
        "index": "test_clone"
      }
    }
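
The list of copied fields does not have to be typed by hand; it can be derived from the mapping once the unwanted field is dropped. The sketch below assumes a flat, typeless mapping with top-level fields only, and the helper name `plan_drop_field` is hypothetical.

```python
import copy
from typing import Any, Dict, Tuple


def plan_drop_field(
    source_mapping: Dict[str, Any],
    source_index: str,
    dest_index: str,
    field: str,
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (create-index body, reindex body) for a clone without `field`."""
    mapping = copy.deepcopy(source_mapping)
    properties = mapping.setdefault("properties", {})
    properties.pop(field, None)  # dropping a missing field is a no-op
    reindex = {
        "source": {
            "index": source_index,
            # Only the remaining fields are copied from each document.
            "_source": sorted(properties),
        },
        "dest": {"index": dest_index},
    }
    return {"mappings": mapping}, reindex


three_fields = {"properties": {
    "field1": {"type": "text"},
    "field2": {"type": "long"},
    "field3": {"type": "boolean"},
}}
create_body, reindex_body = plan_drop_field(three_fields, "test", "test_clone", "field2")
print(reindex_body)
```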

 
For convenience, both cases are combined into the cloning function in Kaizen, a desktop client for ElasticSearch. Cloning can adapt to the mapping of the destination index. In the example, a partial clone is made from an index with three collections (types, in ElasticSearch terms): act, line and scene. Only line remains in the clone, with two fields, static mapping is turned on, and the field speech_number changes from text to long.
 

 
Migration
 
The reindex API has one unpleasant property: it cannot track changes in the source index. If something changes there after reindexing has started, the changes are not reflected in the destination index. To solve this problem, the ElasticSearch FollowUp Plugin was developed, which adds logging commands. The plugin can watch an index and return, in JSON format, the actions performed on documents in chronological order. It records the index, type and identifier of the document and the operation on it, INDEX or DELETE. The FollowUp Plugin is published on GitHub and compiled for almost all versions of ElasticSearch.
 
So, to migrate data without losses, FollowUp must be installed on the node where reindexing will be started. It is assumed that an alias for the index already exists and all applications work through it. Immediately before reindexing, the plugin is turned on. When reindexing completes, the plugin is turned off and the alias is switched to the new index. Then the recorded actions are replayed on the destination index, bringing it up to date. Despite the high speed of reindexing, two kinds of collisions may occur during replay:
 
 
- The new index no longer contains a document with the given _id. The document was deleted after the alias was switched to the new index.
- The new index contains a document with the given _id, but with a version number higher than in the source index. The document was updated after the alias was switched to the new index.
 
 
In these cases, the action must not be replayed on the destination index. All other changes are replayed.
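
The replay rule can be sketched as a filter over the recorded actions. Everything about the data shapes here is an assumption rather than the FollowUp plugin's actual output: a change is modeled as an `(_id, operation)` pair, document versions are passed in as plain dictionaries and are assumed to be comparable between the two indexes, and the set of documents deleted after the alias switch is assumed to be known to the caller. The function name `plan_replay` is hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

Change = Tuple[str, str]  # (document _id, "INDEX" or "DELETE")


def plan_replay(
    changes: Iterable[Change],
    source_versions: Dict[str, int],
    dest_versions: Dict[str, int],
    deleted_after_switch: Set[str],
) -> List[Change]:
    """Return the recorded changes that should be replayed, in order."""
    plan: List[Change] = []
    for doc_id, op in changes:
        # Collision 1: the document was deleted in the new index
        # after the alias switch, so the old action is obsolete.
        if doc_id in deleted_after_switch:
            continue
        # Collision 2: the copy in the new index is already newer
        # than the copy in the source index.
        src = source_versions.get(doc_id)
        dst = dest_versions.get(doc_id)
        if src is not None and dst is not None and dst > src:
            continue
        plan.append((doc_id, op))
    return plan


log = [("1", "INDEX"), ("2", "DELETE"), ("3", "INDEX"), ("4", "INDEX")]
replay = plan_replay(
    log,
    source_versions={"1": 3, "3": 2, "4": 5},
    dest_versions={"1": 2, "2": 1, "3": 4},
    deleted_after_switch={"4"},
)
print(replay)
```

In this made-up log, document 3 is skipped because its copy in the new index is newer, and document 4 is skipped because it was deleted after the switch; the other two actions are replayed.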
 
Happy coding!