Theory of Sharding

It seems we have plunged so deep into the jungle of highload development that we no longer think about basic problems. Take sharding, for example. What is there to understand, if you can just write something like shards = n in the database settings and everything gets done by itself? Well, yes, it mostly does, but when something goes wrong and resources really start running short, it would be nice to understand what the reason is and how to fix it.

In short, if you have contributed your own alternative hashing implementation to Cassandra, there are hardly any revelations for you here. But if the load on your services is already arriving and your knowledge of the system is not keeping up with it, then welcome. The great and terrible Andrei Aksyonov (Shodan), in his usual manner, will tell you that sharding is bad, not sharding is also bad, and how it is all arranged inside. And, quite by accident, one part of the story about sharding is not about sharding at all, but the devil knows what: how to map objects onto shards.

A photo of cats (although they happened to be puppies) already answers the question of why we do all this, but we will start from the beginning.

What is "sharding"?

If you google persistently, it turns out that there is a fairly blurred border between so-called partitioning and so-called sharding. Everyone calls things whatever they like. Some people distinguish horizontal partitioning from sharding. Others say that sharding is a particular kind of horizontal partitioning.

I did not find a single terminological standard approved by the founding fathers and certified by ISO. My personal inner conviction is roughly this: Partitioning, on average, is "cutting the base into pieces" in an arbitrary manner.

- Vertical partitioning is by column. For example, there is a giant table with a couple of billion records in 60 columns. Instead of keeping one such gigantic table, we keep 60 no less gigantic tables of 2 billion records each, and that is not a columnar database, but vertical partitioning (as an example of the terminology).
- Horizontal partitioning: we cut row by row, possibly across servers.

The awkward moment here is the subtle difference between horizontal partitioning and sharding. I can be cut into pieces, but I cannot tell you for sure what that difference is. There is a feeling that sharding and horizontal partitioning are about the same thing.

Sharding, in general, is when a large table in database terms, or a collection of documents or objects if you have a document store rather than a database, is cut up by objects. That is, out of 2 billion objects, pieces are picked out, no matter of what size. The objects themselves are not cut up internally; we do not lay them out into separate columns, we lay them out in different places in batches.

Link to the presentation, for the full picture.

Then come subtler terminological differences. For example, relatively speaking, Postgres developers may say that horizontal partitioning is when all the tables into which the main table is split live in the same schema, and when they live on different machines, that is sharding.

In a general sense, without being tied to the terminology of a specific database or a specific data management system, there is a feeling that sharding is just cutting into rows, documents and so on, and that is all:

Sharding ~= Horizontal Partitioning == typically.

I emphasize "typically". In the sense that we do all this not merely to cut 2 billion documents into 20 tables, each of which would be more manageable, but in order to distribute them across many cores, many disks, or many different physical or virtual servers.

The implication is that we do this so that every shard, every little piece of data, gets replicated many times. But really, no.

    INSERT INTO docs00
      SELECT * FROM documents WHERE (id % 16) = 0
    ...
    INSERT INTO docs15
      SELECT * FROM documents WHERE (id % 16) = 15

In fact, if you cut the data this way, you will generate 16 small tables from one giant SQL table on MySQL on your glorious laptop, without going beyond a single laptop, a single schema, a single database, etc. That is it, you already have sharding.

Remembering the illustration with the puppies, this gives us the following:

- Throughput (bandwidth) increases.
- Latency does not change: every worker or consumer, so to speak, gets its own share. It is not known what the puppies in the picture get, but requests are served in roughly the same time as if the puppy were alone.
- Or both, plus higher availability (replication).

Why bandwidth? We sometimes have volumes of data that simply do not fit, it is not even clear where exactly, but they do not fit, on 1 {core | disk | server | ...}. There just are not enough resources. To work with this big dataset, you need to cut it up.

Why latency? On one core, scanning a table of 2 billion rows is 20 times slower than scanning 20 tables on 20 cores in parallel. Data gets processed too slowly on a single resource.

Why high availability? Or we cut the data to get both of the above at once, and at the same time keep several copies of each shard: replication ensures high availability.


A simple example of "how to do it by hand"

This conditional sharding can be illustrated on the test table test.documents with 32 documents, by generating from it 16 test tables of approximately 2 documents each: test.docs00, 01, 02, ..., 15.

    INSERT INTO docs00
      SELECT * FROM documents WHERE (id % 16) = 0
    ...
    INSERT INTO docs15
      SELECT * FROM documents WHERE (id % 16) = 15

Why approximately? Because a priori we do not know how the ids are distributed: if they run from 1 to 32 inclusive, there will be exactly 2 documents each, otherwise not.

Here is why we do it: having made 16 tables, we can grab 16 of whatever we are short of. Whatever resource we are bottlenecked on, we can now parallelize across it. For example, if there is not enough disk space, it would make sense to spread these tables over separate disks.

All this, unfortunately, is not free. I suspect that in the case of the canonical SQL standard (I have not reread the SQL standard in a long time; perhaps it has not been updated in a long time), there is no official standardized syntax for telling any SQL server: "Dear SQL server, make me 32 shards and spread them over 4 disks." But individual implementations often do have specific syntax for doing essentially the same thing. PostgreSQL has partitioning mechanisms, MySQL has MariaDB, and Oracle surely did all this long ago.

However, if we do it by hand, without database support and staying within the standard, then we pay with the complexity of data access. Where there was a simple SELECT * FROM documents WHERE id = ?, there are now 16 x SELECT * FROM docsXX. And that is the good case, when we were fetching a record by key. It is much more interesting if we were fetching a range of records. Now (if we, I emphasize, stubbornly stay within the standard like fools) the results of these 16 SELECTs will have to be combined in the application.

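To make that routing cost concrete, here is a minimal Go sketch, assuming the 16 tables docs00..docs15 produced above; the tableFor helper is invented for illustration. A lookup by key can be routed to exactly one table with the same id % 16 rule that split the data, while a range query cannot.

    package main

    import "fmt"

    // tableFor routes a point lookup to the single table that holds the row,
    // reusing the same function that was used to split the data: id % 16.
    func tableFor(id int64) string {
        return fmt.Sprintf("docs%02d", id%16)
    }

    func main() {
        // A lookup by key touches exactly one of the 16 tables.
        fmt.Printf("SELECT * FROM %s WHERE id = %d\n", tableFor(123), 123)

        // A range scan has no such shortcut: every table gets the query,
        // and the application has to merge the 16 result sets itself.
        for i := 0; i < 16; i++ {
            fmt.Printf("SELECT * FROM docs%02d WHERE id BETWEEN 100 AND 200\n", i)
        }
    }
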
What performance changes to expect?

- Intuitively: linear.
- Theoretically: sublinear, because of Amdahl's law.
- Practically: maybe almost linear, maybe not.

In fact, the correct answer is unknown. With a skillful application of the sharding technique you can achieve significant superlinear degradation of your application, and the DBA will come running with a red-hot poker.

Let's see how this can be achieved. It is clear that simply setting shards = 16 in PostgreSQL and having everything take off by itself is not interesting. Let's think about how we could arrange it so that by sharding 16 ways we slow ourselves down 32 times; this is interesting from the point of view of how not to do it.

Our attempts to speed up or slow down will always run into the classics, the good old Amdahl's law, which says that there is no perfect parallelization of any query: there is always some sequential part.

Amdahl's law

There is always a serialized part.

There is always a part of the query execution that parallelizes, and there is always a part that does not. Even if it seems to you that a query is perfectly parallel, at the very least assembling the result row that you are going to send to the client, out of the rows received from each shard, is always there, and it is always sequential.

There is always some sequential part. It can be tiny, completely unnoticeable against the general background, or it can be gigantic and therefore strongly limit parallelization, but it is always there.

In addition, its influence shifts and can grow significantly. For example, if we cut our table (let's raise the stakes) of 64 records into 16 tables of 4 records each, this part will change. Of course, judging by such enormous data volumes, we are running on a mobile phone with an 86-family processor at 2 MHz, and we do not have enough file handles that can be kept open at the same time. Apparently, with inputs like these, we open one file at a time.

- It was: Total = Serial + Parallel. Here, for example, parallel is all the work inside the DB and serial is sending the result to the client.
- It became: Total2 = Serial + Parallel / N + Xserial. For example, with a general ORDER BY, Xserial > 0.

With this simple example I am trying to show that some Xserial appears. Besides the fact that there is always a serialized part, and the fact that we are trying to process the data in parallel, an extra part appears just to support this data slicing. Roughly speaking, we may need to:

- find these 16 tables in the internal database dictionary;
- open files;
- allocate memory;
- deallocate memory;
- merge the results;
- synchronize between the cores.

Some out-of-sync effects always show up. They can be insignificant and take a billionth of the total time, but they are always non-zero and always there. With their help we can dramatically lose performance after sharding.

This is the standard picture for Amdahl's law. It is not very readable, but what matters is that the lines, which ideally should be straight and grow linearly, run into an asymptote. And since the chart from the Internet is unreadable, I have made what I think are more vivid tables with numbers.

Suppose we have some serialized part of the query processing that takes only 5%: serial = 0.05 = 1/20.

Intuitively it would seem that with a serialized part taking only 1/20 of the query processing, if we parallelize the processing across 20 cores it should become roughly 20, at worst 10, times faster.

In reality, the math is heartless:

wall = 0.05 + 0.95/num_cores, speedup = 1 / (0.05 + 0.95/num_cores)

It turns out that if you actually do the calculation, with a serialized part of 5% the speedup is about 10 times (10.3), which is 51% of the theoretical ideal.

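As a quick cross-check, here is a tiny sketch that computes speedup and efficiency straight from that formula; the only assumption is the value of the serial fraction, and it reproduces the tables below.

    package main

    import "fmt"

    func main() {
        serial := 0.05 // 5% serial part; try 0.01 or 0.001 for the other tables
        for _, cores := range []int{8, 10, 20, 40, 128} {
            speedup := 1.0 / (serial + (1.0-serial)/float64(cores))
            efficiency := 100 * speedup / float64(cores)
            fmt.Printf("%3d cores: speedup %.1f, efficiency %.0f%%\n", cores, speedup, efficiency)
        }
    }
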
8 cores: speedup = 5.9, efficiency = 74%
10 cores: speedup = 6.9, efficiency = 69%
20 cores: speedup = 10.3, efficiency = 51%
40 cores: speedup = 13.6, efficiency = 34%
128 cores: speedup = 17.4, efficiency = 14%

Using 20 cores (20 disks, if you like) for a task that one used to handle, we will never get even a theoretical speedup of more than 20 times, and in practice much less. Moreover, as the degree of parallelism grows, the inefficiency grows strongly.

When only 1% of the work is serialized and 99% is parallelized, the speedup values improve somewhat:

8 cores: speedup = 7.5, efficiency = 93%
16 cores: speedup = 13.9, efficiency = 87%
32 cores: speedup = 24.4, efficiency = 76%
64 cores: speedup = 39.3, efficiency = 61%

For a completely thermonuclear query, which naturally runs for hours while the preparatory work and assembling the result take very little time (serial = 0.1%), we already see decent efficiency:

8 cores: speedup = 7.9, efficiency = 99%
16 cores: speedup = 15.8, efficiency = 99%
32 cores: speedup = 31.0, efficiency = 97%
64 cores: speedup = 60.2, efficiency = 94%

Note that we will never see 100%. In particularly good cases you may see something like 99.99%, but not exactly 100%.

How to shard and slow things down N times?

You can shard and slow yourself down exactly N times like this:

- Send the requests to docs00...docs15 sequentially rather than in parallel.
- In simple queries, select not by the key, e.g. WHERE something = 234.

In this case the serialized part (serial) takes up not 1% and not 5%, but roughly 20% in modern databases. You can even get a 50% serialized part if you access the database over a wildly efficient binary protocol or link it into a Python script as a dynamic library.

The rest of the processing time of a simple query will be taken by non-parallelizable operations: parsing the query, preparing the plan, and so on. That is, not by reading the record.

If we split the data into 16 tables and run the queries sequentially, as is customary in the PHP programming language, for example (it is not very good at launching asynchronous work), we get a slowdown of 16 times. And maybe even more, because network round-trips will be added on top.

Suddenly, when sharding, the choice of programming language matters.

Remember the choice of programming language, because if you send the queries to the database (or the search server) sequentially, then where would the speedup come from? Rather, there will be a slowdown.

A story from real life

If you choose C++, write against POSIX Threads and not against Boost I/O. I once saw an excellent library from experienced Oracle and MySQL developers who wrote a client for talking to the MySQL server on Boost. Apparently, at work they were forced to write in pure C, and then they finally got to cut loose, took Boost with asynchronous I/O and so on. One problem: this asynchronous I/O, which in theory should have driven 10 requests in parallel, for some reason had an imperceptible synchronization point inside. When you ran 10 requests in parallel, they executed exactly 20 times slower than one: 10 times for the requests themselves and as much again for the synchronization point.

Conclusion: write in languages that implement parallel execution of, and waiting on, multiple requests well. To be honest, I do not know what exactly to recommend other than Go. Not only because I love Go, but because I do not know anything more suitable.

Do not write in unsuitable languages, in which you will not be able to run 20 parallel queries to the database. Or, at every opportunity, do not do it all by hand: understand how it works, but do not do it manually.

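To make that advice concrete, here is a minimal Go sketch of fanning one logical query out to all shards at once instead of looping over them sequentially. The Shard and Row types and the queryShard helper are placeholders invented for the example; a real driver call would go inside queryShard.

    package shards

    import "context"

    // Placeholder types; in a real application these come from your driver.
    type Shard struct{ DSN string }
    type Row map[string]interface{}

    // queryShard is a hypothetical helper that runs one per-shard SELECT.
    func queryShard(ctx context.Context, s Shard, q string) ([]Row, error) {
        // ... run q against the shard at s.DSN with your favorite driver ...
        return nil, nil
    }

    // queryAllShards launches every per-shard query in its own goroutine, so the
    // network round-trips overlap; only the final merge loop stays sequential.
    func queryAllShards(ctx context.Context, shards []Shard, q string) ([]Row, error) {
        type result struct {
            rows []Row
            err  error
        }
        ch := make(chan result, len(shards))
        for _, s := range shards {
            s := s
            go func() {
                rows, err := queryShard(ctx, s, q)
                ch <- result{rows, err}
            }()
        }
        var all []Row
        for range shards {
            r := <-ch
            if r.err != nil {
                return nil, r.err
            }
            all = append(all, r.rows...)
        }
        return all, nil
    }
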
A story from an A/B test

Sometimes you can also slow down because you are used to everything just working and never noticed that the serialized part, first, exists and, second, is large.

- About 60 search index shards from the start, split by category. These are correct, proper shards, aligned with the subject domain.
- The per-shard limit used to be up to 1,000 documents, and it became 50,000 documents.

This is a story from production: the search queries were changed slightly and they started selecting many more documents from the 60 shards of the search index. Everything had been working fast, and by the principle of "it works, don't touch it" everyone forgot that there were actually 60 shards inside. The sampling limit for each shard was raised from a thousand to 50 thousand documents. Suddenly it started to slow down, and the parallelism was gone. The queries executed on the shards themselves flew along just fine; what slowed down was the stage where 50 thousand documents were collected from 60 shards. Those 3 million resulting documents were merged on one core, sorted, the top of the 3 million was selected and returned to the client. The very same serial part slowed things down, the same merciless Amdahl's law at work.

So maybe you should not shard by hand, but simply tell the database, like a decent human being: "Do it!"

Disclaimer: I do not really claim to know how to do it right. I am preaching from the wrong floor!

All my life I have been promoting a religion called "algorithmic fundamentalism". It is formulated briefly and very simply:

You do not really want to do anything by hand, but it is extremely useful to know how things are arranged inside, so that the moment something goes wrong in the database, you at least understand what went wrong there, how it is arranged inside and around, and how to fix it.

Let's look at the options:

- "By hand". Previously, we manually split the data into 16 virtual tables and rewrote all the queries by hand; this is extremely uncomfortable to do. If you can avoid sharding by hand, do not shard by hand! But sometimes that is not possible, for example, you have MySQL 3.23, and then you have to.
- "Automatic". It happens that you can shard automatically or almost automatically, when the database can distribute the data itself and you only have to write some specific setting somewhere. There are a lot of databases and they have a lot of different settings. I am sure that in every database where you can write shards = 16 (whatever the syntax), a whole trainload of other settings gets coupled to that one.
- "Semi-automatic": an absolutely cosmic, and in my opinion brutal, mode. That is, the database itself does not seem to know how, but there are external add-ons.

It is hard to say much about the automatic option other than sending you to the documentation of the corresponding database (MongoDB, Elastic, Cassandra, in general the so-called NoSQL). If you are lucky, you just flip the switch "make me 16 shards" and everything works. At the moment when it does not work by itself, the rest of this article may become necessary.

About semi-automatic

Here the delights of information technology mostly inspire chthonic horror. For example, MySQL out of the box had no sharding implementation up until certain versions, yet the sizes of the databases used in production grow to indecent values.

Suffering humanity, in the person of individual DBAs, has agonized for years and written several bad sharding solutions built on who knows what. After that, one more or less decent sharding solution gets written, under a name like ProxySQL (MariaDB/Spider, PG/pg_shard/Citus, ...). This is a well-known example of that very kind of add-on.

ProxySQL as a whole is, of course, a full enterprise-class open source solution for routing and so on. But one of the problems it solves is sharding for a database that cannot shard like a decent human being by itself. You see, there is no "shards = 16" switch: you either have to rewrite every query in the application, and there are many of them in all sorts of places, or put some intermediate layer between the application and the database, which looks at a query and thinks: "Hmm... SELECT * FROM documents? Yes, this has to be broken into 16 small SELECT * FROM server1.document1, SELECT * FROM server2.document2, to this server with this login/password, to that one with another. If one of them did not answer, then..." and so on.

Exactly this is what intermediate layers can do. They exist for slightly fewer than all databases. For PostgreSQL, as far as I understand, there are both some built-in solutions (Postgres Foreign Data Wrappers, in my opinion, are built into PostgreSQL itself) and external add-ons.

Configuring each specific add-on is a separate gigantic topic that does not fit into one talk, so we will only discuss the basic concepts.

Let's talk instead about the theory of the buzz.

Absolutely perfect automation?

The whole theory of the buzz, in the case of sharding, lies in this letter F(). The basic principle is always roughly the same: shard_id = F(object).

What is sharding all about? We have 2 billion records (or 64). We want to split them into several pieces. An unexpected question arises: how? By what principle should I scatter my 2 billion records (or 64) across the 16 servers available to me?

The latent mathematician in us should suggest that in the end there is always some magic function that, for each document (object, row, etc.), determines which piece to put it in.

If you go deeper into the math, this function always depends not only on the object itself (the row itself) but also on external settings such as the total number of shards. A function that has to tell us, for each object, where to put it cannot return a value greater than the number of servers in the system. And the functions differ slightly:

- shard_func = F1(object);
- shard_id = F2(shard_func, ...);
- shard_id = F2(F1(object), current_num_shards, ...).

But we will not dig further into this jungle of individual functions; let's just talk about what the magic functions F() can be.

What are F()?

One can come up with many different functions and many different implementation mechanisms. A rough summary:

- F = rand() % num_shards
- F = somehash(object.id) % num_shards
- F = object.date % num_shards
- F = object.user_id % num_shards
- F = shard_table[somehash() | object.date | ...]

An interesting fact: you can, of course, scatter all the data randomly; we throw each next record onto an arbitrary server, an arbitrary core, into an arbitrary table. There will not be much happiness in that, but it will work.

There are slightly smarter methods of sharding by a reproducible or even consistent hash function, or sharding by some attribute. Let's go through each method.

F = rand()

Scattering at random is not a very correct method. One problem: we scattered our 2 billion records across a thousand servers randomly, and we do not know where any record is. We need to pull out a particular user_id, but we do not know where it lives. We go to a thousand servers and grep through everything; somehow that is not efficient.

F = somehash()

Let's scatter users the grown-up way: compute a reproducible hash function of user_id, take the remainder of dividing by the number of servers, and go straight to the right server.

And why are we doing this? Because we have highload and nothing fits on one server anymore. If it did fit, life would be so simple.

Great, the situation has already improved: to get one record, we go to one server that is known in advance. But if we have a range of keys, then across that whole range we have to go through all the key values and, in the limit, go either to as many shards as we have keys in the range, or even to every server. The situation has improved, of course, but not for all queries. Some queries got hurt.

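A minimal sketch of this hash-and-modulo routing, assuming nothing beyond the Go standard library; FNV-1a here is just an example of a reproducible hash function, not a recommendation of any particular one.

    package shards

    import "hash/fnv"

    // shardForUser maps a user id to a shard number: reproducible hash, then
    // remainder by the number of shards. Point lookups go to exactly one shard;
    // range queries still have to visit many of them.
    func shardForUser(userID uint64, numShards int) int {
        h := fnv.New64a()
        var buf [8]byte
        for i := 0; i < 8; i++ {
            buf[i] = byte(userID >> (8 * i))
        }
        h.Write(buf[:])
        return int(h.Sum64() % uint64(numShards))
    }
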
Natural sharding (F = object.date % num_shards)

Sometimes, in fact often, 95% of the traffic and 95% of the load are requests that have some natural sharding. For example, 95% of conditionally social-analytics queries touch data only for the last 1 day, 3 days, 7 days, while the remaining 5% refer to the last few years. So 95% of requests are naturally sharded by date; the interest of the system's users is focused on the last few days.

In this case you can split the data by date, for example by day, and route the response to a request for a report for a certain day, or for an object from that day, to that day's shard.

Life improves: now we not only know the location of a particular object, we also know about ranges. If we are asked not for a range of dates but for a range of some other column, then of course we will have to go through all the shards. But by the terms of the game we only have 5% of such requests.

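A sketch of what such date routing could look like, under the assumption of one shard per calendar day (any other granularity works the same way):

    package shards

    import "time"

    // shardForDate maps an event date to a shard: here simply the number of
    // whole days since the Unix epoch, modulo the shard count. A report for
    // "yesterday" therefore touches exactly one shard.
    func shardForDate(t time.Time, numShards int) int {
        days := t.UTC().Unix() / 86400
        return int(days % int64(numShards))
    }
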
It seems we have come up with the perfect solution to everything, but there are two problems:

- This solution is tailored to a specific case, when 95% of requests involve only the last week.
- Since 95% of requests touch the last week, they will all land on the one shard that serves that last week. That shard will melt while all the others sit idle. At the same time you cannot throw them away; the archived data has to be stored too.

Not that this is a bad sharding scheme; we have cut off the hot data, but still something has to be done about the hottest shard.

The problem is solved with grimaces, jumps and poultices, that is, by increasing the number of replicas for the burning current day and then gradually reducing the number of replicas as that day becomes the past and moves into the archive. There is no perfect solution called "you just need to smear the data over the cluster with a magic hash function the right way".

Formally, we now know "everything". True, we do not yet know about one giant headache and two smaller ones.

1. Simple pain: badly smeared

This is a textbook example that almost never comes up in battle, but suddenly it might.

- As in the example with the date, only without the date!
- An unintended, uneven (perceptible) distribution.

We chose the sharding mechanism, and/or the data changed, and of course the PM did not communicate the requirements (we never have errors in the code, it is always the PM who fails to communicate the requirements), and the distribution became monstrously uneven. That is, we missed with the criterion.

To catch this, you have to look at the sizes of the shards. We will definitely see the problem the moment one of our shards either overheats or becomes 100 times bigger than the others. You can fix it simply by replacing the key or the sharding function.

This is a simple problem; honestly, I do not think even one person in a hundred will run into it in real life, but maybe it will suddenly help someone.

2. "Invincible" pain: aggregations, joins

How do you make selections that join a billion records from one table against a billion records from another table?

- How do you "quickly" count WHERE randcol BETWEEN aaa AND bbb?
- How do you cleverly do users_32shards JOIN posts_1024shards?

The short answer: don't do it, and you won't suffer!

If you distributed a billion records over a thousand servers in the first table so that they work faster, and did the same in the second table, then naturally a thousand servers have to talk to a thousand servers pairwise. A million connections will not work well. If we issue queries to the database (search, storage, document store or distributed file system) that fit the sharding badly, those queries will slow down wildly.

An important point: some requests will always be smeared unsuccessfully and will be slow. It is important to try to minimize their percentage. As a consequence, do not make giant joins of a billion records against a billion records. If it is possible to replicate a small table, against which you are joining a giant sharded table, to all nodes, you should do that. If the joins are actually local in some sense, for example, if it is possible to place a user and his posts side by side, shard them in the same way, and do all the joins within one machine, you should do exactly that.

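A sketch of that last idea, colocating joinable data: if posts are routed by the owner's user_id with the same function as users, a "user plus his posts" query never leaves one shard. The record types here are made up for illustration, and shardForUser is the hash-and-modulo helper sketched earlier in the same package.

    package shards

    // Hypothetical record types, just for illustration.
    type User struct{ ID uint64 }
    type Post struct {
        ID     uint64
        UserID uint64
    }

    // Both functions use the same routing key (the user id), so a user and all
    // of that user's posts always live on the same shard and join locally.
    func shardForUserRecord(u User, numShards int) int { return shardForUser(u.ID, numShards) }
    func shardForPost(p Post, numShards int) int       { return shardForUser(p.UserID, numShards) }
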
That by itself is a separate three-day lecture course, so let's move on to the last, hellish pain and to the different algorithms for dealing with it.

3. Difficult/long pain: resharding

Get ready: if you have sharded your data for the first time in your life, then on average you will reshard it about five more times.

No matter how you configure the cluster, you will still have to reshard.

If you are very smart and lucky, you will reshard at least once. But reshard you will, because at the moment you decide that 10 units are enough for a user, someone right then writes a request for 30, and there are plans for a request for 100 units of unknown resources. Shards are never enough. With the first sharding scheme you will miss in any case; you will always have to either increase the number of servers or do something else, in short, somehow re-lay the data.

It is nice if we have pretty powers of two: there were 16 shard servers, it became 32. It is more fun if there were 17 and it became 23, two prime numbers. How do databases do it, maybe they have some magic inside?

The correct answer: no, there is no magic inside, they have hell inside.

Next, let's look at what can be done "by hand"; maybe we will then understand how it is done "as a machine".

Head-on #1. Move everything

- For all objects we compute NewF(object) and move each one to its new shard.
- The probability that NewF() = OldF() is small.
- We move pretty much everything.
- Ouch.

Such hell as moving all 2 billion records from old shards to new ones exists, I hope, nowhere else. The naive approach is clear: there were 17 machines, 6 machines were added to the cluster, 2 billion records were combed through and moved from 17 machines to 23 machines. Once every 10 years you can probably even live with that. But overall it is a bad move.

Head-on #2. Move half

The next naive improvement: let's abandon such a stupid scheme, let's forbid resharding 17 machines into 23, and always reshard 16 machines into 32 machines! Then, according to the theory, we will have to move exactly half of the data, and in practice we can indeed do just that.

- For all objects we compute NewF(object) and move each one to its new shard.
- It was strictly 2^N shards, it became strictly 2^(N+1) shards.
- The probability that NewF() = OldF() is 0.5.
- We move about 50% of the data.
- Optimal, but it only works for powers of two.

In principle, everything is fine, apart from being tied to powers of two in the number of machines. Oddly enough, this naive approach can work.

Note that the additional splitting of the cluster by powers of two is also optimal in this case. In any case, when adding 16 machines to a cluster of 16, we are obliged to move half of the data, and we move exactly half.

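Why exactly half? For any hash h, the new shard h % 2N is either the old shard h % N or that shard plus N, with roughly equal probability. A tiny sketch, reusing the shardForUser helper from the earlier example, that checks this empirically:

    package shards

    import "fmt"

    // CountMoved reports how many of the given ids change shards when the
    // shard count doubles from n to 2n; it should come out close to half.
    func CountMoved(ids []uint64, n int) {
        moved := 0
        for _, id := range ids {
            oldShard := shardForUser(id, n)
            newShard := shardForUser(id, 2*n)
            // newShard is always either oldShard or oldShard+n
            if newShard != oldShard {
                moved++
            }
        }
        fmt.Printf("moved %d of %d (~%.0f%%)\n", moved, len(ids), 100*float64(moved)/float64(len(ids)))
    }
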
 3r31423.
 3r31423. Well, but really mankind has not invented anything else - the question arises for an inquiring mind.
 3r31423.
 3r31423.
More fun # 3. Consistent hashing
 3r31423. 3r31263.
 3r31423. Of course, a picture with a circle about consistent hashing is required here
 3r31423. 3r31149.
 3r31423.
If you google "consistent hashing", a circle will definitely come up; the whole results page is populated with circles.

The idea: let's draw the shard identifiers (hashes) on a circle, and on top of them mark the hashed server identifiers. When we need to add a server, we put a new point on the circle, and we move only what turned out to be close to it (and only what is close to it).

- When adding a shard: we look through not everything, but only the 2 "neighbours", and move on average 1/n of the data.
- When removing a shard: we look through only the shard being removed and move only it. Sort of optimal.

Very effective in terms of minimizing traffic when adding a shard, and absolutely disgusting in terms of normal data balancing. Because when we hash all these objects that we distribute onto a large number of machines, we do it rather unevenly: the points on the circle are spread unevenly, and the load on each individual node can differ wildly from the rest.

This problem is solved by the last trick: virtual nodes. Each node, each server on the circle is marked by more than one point. By adding a server, a shard and so on, we add several points. Every time we remove something, we correspondingly remove several points and move only a small portion of the data.

I am telling you about this space stuff with circles because, for example, this is exactly the scheme inside Cassandra. That is, when records start moving between nodes, know that the circle is looking at you and probably does not approve.

However, compared with the first approaches, life has improved: when adding or removing a shard we no longer go through all the records, but only a part, and we move only a part.

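For completeness, here is a toy sketch of such a ring with virtual nodes; it is not Cassandra's actual implementation, and FNV is an arbitrary choice of hash function.

    package shards

    import (
        "fmt"
        "hash/fnv"
        "sort"
    )

    // hash64 hashes an arbitrary string to a point on the ring.
    func hash64(s string) uint64 {
        h := fnv.New64a()
        h.Write([]byte(s))
        return h.Sum64()
    }

    // Ring is a toy consistent-hash ring with virtual nodes.
    type Ring struct {
        points []uint64          // sorted positions on the circle
        owner  map[uint64]string // position -> server name
    }

    func NewRing(servers []string, virtualNodes int) *Ring {
        r := &Ring{owner: map[uint64]string{}}
        for _, srv := range servers {
            for v := 0; v < virtualNodes; v++ {
                p := hash64(fmt.Sprintf("%s#%d", srv, v))
                r.points = append(r.points, p)
                r.owner[p] = srv
            }
        }
        sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
        return r
    }

    // Locate walks clockwise from the key's position to the first server point.
    func (r *Ring) Locate(key string) string {
        h := hash64(key)
        i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
        if i == len(r.points) {
            i = 0 // wrap around the circle
        }
        return r.owner[r.points[i]]
    }
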
Attention, question: can we improve further? And also improve the uniformity of shard loading? They say we can!

More fun #4. Rendezvous/HRW

The next idea is also simple (the material is a tutorial, after all, so nothing complicated): shard_id = arg max hash(object_id, shard_id).

Why it is called Rendezvous hashing I do not know, but I know why it is called Highest Random Weight. It is very easy to visualize:

We have, say, 16 shards. For every object (row) that needs to be put somewhere, we compute 16 hashes that depend on the object and the shard number. Whichever shard has the highest hash value wins.

This is so-called HRW hashing, also known as Rendezvous hashing. Dumb as a stick, this scheme for computing the shard number is, first, easier to eyeball than circles and, second, gives a uniform load.

The only downside is that adding a new shard got worse. There is a risk that when a new shard is added, some of the hashes will still change and it may be necessary to review everything. The shard removal technology did not change much.

Another problem: it is computationally heavy with a large number of shards.

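A minimal HRW sketch in Go (FNV again chosen arbitrarily); note the loop over all shards on every lookup, which is exactly the computational cost just mentioned.

    package shards

    import (
        "fmt"
        "hash/fnv"
    )

    // hrwShard returns the shard whose combined hash with the object id is the
    // highest: "highest random weight".
    func hrwShard(objectID string, numShards int) int {
        best, bestWeight := 0, uint64(0)
        for shard := 0; shard < numShards; shard++ {
            h := fnv.New64a()
            h.Write([]byte(fmt.Sprintf("%s#%d", objectID, shard)))
            if w := h.Sum64(); w > bestWeight {
                best, bestWeight = shard, w
            }
        }
        return best
    }
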
More fun #5. More techniques

Interestingly, research does not stand still, and Google publishes some new space technology every year:

- Jump Hash - Google '2014.
- Multi Probe - Google '2015.
- Maglev - Google '2016.

If you are interested in the topic, you can read plenty of dissertations. I bring up these names to make it clear that the problem is not solved; there is no super-solution that could be implemented in all databases. To this day people are defending dissertations on it.

More fun #6. Lists

For dessert, the simplest option: dumb lists. Why do we need all these mega-techniques? Well, we do not want, in order to manage 2 billion records, to keep in the cluster's memory on every node a giant list of 2 billion object_id identifiers that maps each object to its location.

But what if we take this list and thin it out? Maybe even not that drastically?

Let's at least do the arithmetic. I am sure this is used in some databases, I just do not know which ones. The math says it can work quite well, and frankly, you could even manage to do it by hand.

Let's estimate:

- There are 1 billion objects.
- We take the objects and, by identifiers/hashes/dates/whatever else, split them into a million intervals: min/max_id => shard_id.
- A million intervals with 8-byte hashes and 4-byte shard numbers (4 billion shards should be enough for everyone!) is 20 bytes per interval.
- To place a billion objects somewhere in the cluster, you need to keep about 20 MB in memory globally across the whole cluster; not such a large amount of data.
- 20 MB is a fairly granular data map of the cluster, with a rather small range of about a thousand records per interval.

Compare this with sharding 2 billion records by hash functions across at least 16 nodes: 100-odd million records per shard. And here the granularity of a block, the records that we place as a single bundle on a particular shard, is tiny, about a thousand records each. Any operation, both adding and removing a shard, can be made close to optimal.

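A minimal sketch of such an interval map, matching the 20-bytes-per-interval estimate above (two 8-byte bounds plus a 4-byte shard id); the lookup is a binary search over the sorted, non-overlapping intervals.

    package shards

    import "sort"

    // Interval maps a contiguous id range to a shard; roughly 20 bytes of data per entry.
    type Interval struct {
        MinID, MaxID uint64
        ShardID      uint32
    }

    // lookupShard finds the interval containing id via binary search;
    // a million intervals means about 20 comparison steps.
    func lookupShard(intervals []Interval, id uint64) (uint32, bool) {
        i := sort.Search(len(intervals), func(i int) bool { return intervals[i].MaxID >= id })
        if i < len(intervals) && intervals[i].MinID <= id {
            return intervals[i].ShardID, true
        }
        return 0, false
    }
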
I repeat, the option is simple as a stick, no complex space technology is needed, and surely it works somewhere, I just did not find where.


Conclusions

There is an important basic technique called sharding, named after Gaius Julius Caesar: "Divide and conquer, conquer and divide!" If the data does not fit into one server, you have to split it across 20 servers.

Having learned all this, you should come away with the impression that it would be better not to shard at all. If you decide that it would be better not to shard, that is the right feeling. If you can add $100 worth of memory to the server and not shard anything, then that is what you should do. With sharding, a complex distributed system appears, with data being shipped back and forth and stacked who knows where. If you can avoid it, avoid it.

It is better not to do it by hand; it is better if the "base" (the search engine, DFS, ...) can shard itself. In any case, sooner or later highload will arrive and the data will have to be split somehow. And it is not a given that even if the base can do it itself, you will not run into any problems. Remember the algorithmic fundamentalism: you need to understand how everything works inside.

When setting up sharding for the first time, choose F() carefully, think about the queries, the network and so on. But get ready: you will probably have to choose it twice, and at least once you will have to redo everything.

About the speaker

Usually we only mention the speaker in passing, but this time there is a reason for an exception. Andrei Aksyonov became one of the winners of the HighLoad++ award, and anyone for whom the associative chain Aksyonov - Sphinx - highload is not obvious should watch the video of the talk.


About the training session

This was the material of a talk at a training meetup of the Highload User Group. Let me briefly explain what that is and why.

We recently realized that the talks at the big HighLoad++ go deeper and deeper. At the conference we cannot keep going back and discussing, for example, questions of architecture over and over again. And there was no place where one could get acquainted with the basic concepts and systematize fragmentary knowledge. So we launched a series of training meetups on the development of highload applications, sites and services.

We will talk about scaling services and databases, load testing, machine learning and neural networks, and project architectures. Everything connected with developing non-standard solutions with increased requirements for performance, resilience, security and so on will gradually appear on the agenda.

At the third meetup, on January 24 in St. Petersburg, we will discuss the design patterns of highly loaded systems "Queue" and "Pipelining". The event is free, but you need to register via the link in the meetup description. Speakers and topics will be published a little later, or we will send them out in the mailing list.

The mailing list is also the easiest way to find out the news, for example, that on April 8 and 9 St. Petersburg will host its own HighLoad++, and that you can already submit a talk or book an early bird ticket.
