Theory and practice of using HBase

Good afternoon! My name is Danil Lipovoy; our team at Sbertech started using HBase as a data warehouse. In the course of studying it we accumulated experience that we wanted to systematize and describe (we hope much of it will be useful). All the experiments below were carried out with HBase versions ???-cdh??? and ???-cdh???-beta1.
General architecture
Writing data to HBase
Reading data from HBase
Caching of the data
Batch processing of MultiGet/MultiPut data
Strategy for splitting tables into regions (splitting)
Fault tolerance, compaction and data locality
Settings and performance
Load testing
In theory, data being written should not land in the cache, and indeed the parameters CACHE_DATA_ON_WRITE for the table and "Cache DATA on Write" for the RS are set to false. In practice, however, if we write data to MemStore, then flush it to disk (emptying the MemStore), then delete the resulting file, a get query will still successfully return the data. Moreover, even if you disable BlockCache entirely and hammer the table with new data, then flush the MemStore to disk, delete the files and request the data from another session, it will still be retrieved from somewhere. So HBase stores not only data, but also mysterious riddles.
hbase(main):001:0> create 'ns:magic', 'cf'
Created table ns:magic
Took ??? seconds
hbase(main):002:0> put 'ns:magic', 'key1', 'cf:c', 'try_to_delete_me'
Took ??? seconds
hbase(main):003:0> flush 'ns:magic'
Took ??? seconds
hdfs dfs -mv /data/hbase/data/ns/magic/* /tmp/trash
hbase(main):002:0> get 'ns:magic', 'key1'
cf:c timestamp=153444069021?, value=try_to_delete_me

The parameter "Cache DATA on Read" is also set to false. If you have any ideas about this, you are welcome to discuss them in the comments.

5. Batch processing of MultiGet/MultiPut

Handling single requests (Get/Put/Delete) is quite expensive, so you should batch them into a List&lt;Get&gt; or List&lt;Put&gt; whenever possible, which gives a significant performance boost. This is especially true for writes, but reading has the following pitfall. The chart below shows the time to read 5?000 records from MemStore. The reads were done in a single thread, and the horizontal axis shows the number of keys per request. You can see that as the number of keys per request grows into the thousands, execution time drops, i.e. speed increases. However, with the default MSLAB mode, beyond this threshold a radical performance drop begins, and the larger the record, the longer the run time.
The tests were performed on a virtual machine with 8 cores, HBase version ???-cdh???-beta1.
The MSLAB mode is designed to reduce heap fragmentation, which arises from mixing data of the new and old generations. To work around that problem, when MSLAB is enabled the data is placed in relatively small cells (chunks) and processed in batches. As a result, when the volume of a requested data batch exceeds the allocated size, performance drops sharply. On the other hand, turning this mode off is not advisable either, since it leads to GC pauses during intensive work with the data. A good way out is to increase the chunk size when there is active writing via put simultaneously with reading. It is worth noting that the problem does not occur if you execute the flush command after writing, which flushes MemStore to disk, or if the data is loaded via BulkLoad. The table below shows that requests from MemStore for data of a larger volume (but the same count) result in a slowdown, whereas increasing the chunksize returns the processing time to normal.
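One way to stay under the chunk limit on the client side is to cap the estimated byte volume of each batch before sending it. The sketch below illustrates that idea (the helper name and sizing logic are my own for illustration, not part of the HBase API):

```python
def split_into_batches(keys, record_size, chunk_bytes):
    """Split a list of keys into batches whose estimated payload
    (record count * record size) stays within chunk_bytes."""
    per_batch = max(1, chunk_bytes // record_size)
    return [keys[i:i + per_batch] for i in range(0, len(keys), per_batch)]

# Example: 10 000 keys of 1000-byte records with a 2 MB chunk budget
batches = split_into_batches(list(range(10_000)), 1000, 2 * 1024 * 1024)
# each batch holds at most 2097 keys
```

Each sub-batch can then be sent as a separate batched request, keeping every request within the MSLAB cell volume.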

Besides increasing the chunksize, it helps to spread the data across regions, i.e. to split the table. This results in fewer requests arriving at each region, and if they fit into a cell, the response remains good.

6. Strategy of partitioning tables into regions (splitting)

Since HBase is a key-value store and partitioning is carried out by key, it is extremely important to distribute the data evenly across all regions. For example, partitioning such a table into three parts will result in the data being divided into three regions:

It happens that this leads to a sharp slowdown if the data loaded later looks like long values, most of which start with the same digit, for example:
Since keys are stored as a byte array, they all start the same way and belong to the same region #1, which stores this key range. There are several partitioning strategies:
HexStringSplit - turns the key into a hexadecimal-encoded string in the range "00000000" => "FFFFFFFF", padded with zeros.
UniformSplit - turns the key into a byte array with hexadecimal encoding in the range "00" => "FF", padded with zeros on the right.
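To make the idea concrete, here is a sketch of how evenly spaced split points over the byte-key space can be computed (illustrative only; in practice you would specify SPLITALGO => 'UniformSplit' when creating the table in the shell):

```python
def uniform_split_points(num_regions):
    """Return num_regions - 1 single-byte boundary keys that divide
    the 0x00..0xFF key space into roughly equal ranges."""
    return [bytes([i * 256 // num_regions]) for i in range(1, num_regions)]

# Dividing the key space into 4 regions:
print(uniform_split_points(4))  # [b'@', b'\x80', b'\xc0']
```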
In addition, you can specify any range or set of keys for partitioning and configure auto-splitting. However, one of the simplest and most effective approaches is UniformSplit combined with hash concatenation: prefix the key with, for example, the most significant bytes of CRC32(rowkey):
hash + rowkey
Then all data will be distributed evenly across regions. When reading, the first two bytes are simply discarded and the original key remains. The RS also keeps track of the amount of data and keys in a region and, if the limits are exceeded, automatically splits it into parts.
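A minimal sketch of such salting (the function names are mine; the two-byte prefix width matches the read-side stripping described above):

```python
import struct
import zlib

def salt(rowkey: bytes) -> bytes:
    """Prefix the key with the two most significant bytes of CRC32(rowkey)."""
    h = zlib.crc32(rowkey) & 0xFFFFFFFF
    return struct.pack(">H", h >> 16) + rowkey

def unsalt(salted: bytes) -> bytes:
    """Drop the 2-byte hash prefix to recover the original key."""
    return salted[2:]

key = b"50000123"
assert unsalt(salt(key)) == key          # round trip preserves the key
assert len(salt(key)) == len(key) + 2    # fixed 2-byte prefix
```

Because CRC32 spreads the prefix values uniformly, consecutive original keys land in different regions.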

7. Fault tolerance and localization of data

Since each set of keys is served by exactly one region, the solution to problems associated with an RS crash or decommissioning is to store all the necessary data in HDFS. When an RS goes down, the master detects this via the missing heartbeat on the ZooKeeper node. It then assigns the regions being served to another RS, and since HFiles are stored in a distributed file system, the new host reads them and continues serving the data. However, since some of the data may still be in MemStore without having reached the HFiles, the WAL, which is also stored in HDFS, is used to restore the operation history. After the changes are replayed, the RS is able to respond to requests; the move, however, means that some of the data and the processes serving it end up on different nodes, i.e. locality decreases.
The solution to this problem is major compaction: this procedure moves files to the nodes that are responsible for them, which dramatically increases the load on the network and disks while it runs. Afterwards, however, access to the data is significantly faster. In addition, major compaction merges all HFiles into one file per region and also cleans up data depending on the table settings. For example, you can specify the number of versions of an object to keep, or its lifetime, after which the object is physically deleted.
This procedure can have a very positive effect on the operation of HBase. The picture below shows how performance degraded as a result of active writing. Here 40 threads were writing into one table while 40 threads were simultaneously reading the data. The writing threads produce more and more HFiles, which are read by the other threads. As a result, more and more data has to be pulled from disk and eventually the GC kicks in, practically paralyzing all work. Launching major compaction cleared the resulting backlog and restored throughput.

The test was performed on 3 DataNodes and 4 RSs (CPU Xeon E5-2680 v4 @ ???GHz × 64 threads). HBase version ???-cdh???.
It should be noted that major compaction was launched on a "live" table into which data was being actively written and read. There were claims online that this could lead to incorrect answers when reading data. To verify this, a process was started that generated new data and wrote it into a table, then immediately read it back and checked whether the value received matched what had been written. While this process was running, major compaction was launched about 200 times and not a single failure was detected. Perhaps the problem shows up only rarely and under high load, so it is still safer to stop the read and write processes as planned and perform the cleanup, avoiding such GC drawdowns.
Also, major compaction does not affect the state of MemStore; to flush it to disk and compact it you need to use flush (connection.getAdmin().flush(TableName.valueOf(tblName))).

8. Settings and performance

As already mentioned, HBase is most successful where it does not need to do anything, i.e. when executing BulkLoad. That said, this applies to most systems and people. However, this tool is better suited for mass loading of data in large blocks, whereas if the process requires many competing read and write requests, the Get and Put commands described above are used. To determine the optimal parameters, runs were made with various combinations of table parameters and settings:
10 threads started simultaneously 3 times in a row (call it a block of threads).
The running time of all threads in a block was averaged and constituted the final result of the block.
All threads worked with the same table.
Before each thread block started, a major compaction was performed.
Each block performed only one of the following operations:
- Put
- Get
- Get + Put
Each block performed 5?000 repetitions of its operation.
The record size in a block was 100, 1000 or 1?000 bytes (random).
Blocks were started with different numbers of requested keys (either 1 key or 10).
Blocks were run with different table settings. Parameters have changed:
- BlockCache = enabled or disabled
- BlockSize = 65 KB or 16 KB
- Partitions = 1, 5 or 30
- MSLAB = enabled or disabled
So the block looks like this:
a. MSLAB mode was turned on /off.
b. A table was created with the following parameters: BlockCache = true/none, BlockSize = 65/16 KB, Partitions = 1/5/30.
c. GZ compression was enabled.
d. 10 threads were started simultaneously, doing 1/10 put/get/get+put operations on this table with records of 100/1000/10000 bytes, executing 5?000 requests in a row (random keys).
e. Steps b-d were repeated three times.
f. The running time of all threads was averaged.
All possible combinations were tested. Predictably, the speed falls as the record size grows, and disabling caching slows things down. However, the goal was to understand the degree and significance of each parameter's influence, so the collected data was fed into a linear regression, which makes it possible to assess significance using t-statistics. Below are the results for blocks performing Put operations. The full set of combinations is 2 × 2 × 2 × 3 × 3 × 2 = 144 options, plus 72 because some were performed twice, for a total of 216 runs:

The test was performed on a mini-cluster of 3 DataNodes and 4 RSs (CPU Xeon E5-2680 v4 @ ???GHz × 64 threads). HBase version ???-cdh???.
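As an illustration of the method (not the actual data), here is a minimal single-predictor version of the same idea: fit a least-squares line and compute the t-statistic of the slope, i.e. how many standard errors the coefficient is away from zero. The data below is synthetic; the actual analysis was a multiple regression over the parameter combinations.

```python
def slope_t_statistic(xs, ys):
    """Least-squares slope of y on x and its t-statistic (slope / SE)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope
    a = my - b * mx                    # intercept
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    se = (sse / (n - 2) / sxx) ** 0.5  # standard error of the slope
    return b, b / se

# Synthetic data with a strong linear dependence:
b, t = slope_t_statistic([1, 2, 3, 4, 5], [2, 4, 6, 8, 11])
# b is close to 2.2 and t is large, i.e. the predictor is significant
```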
The highest insertion speed of 3.7 seconds was obtained with MSLAB disabled, on a table with one partition, with BlockCache enabled, BlockSize = 1?, with records of 100 bytes, 10 per batch.
The lowest insertion speed of 82.8 seconds was obtained with MSLAB enabled, on a table with one partition, with BlockCache enabled, BlockSize = 1?, with records of 1?000 bytes, one at a time.
Now let's look at the model. The model quality by R² is good, but it is quite clear that extrapolation is contraindicated here: the real behavior of the system when the parameters change will not be linear, and the model is needed not for forecasts but for understanding what happened within the given parameter range. For example, by the Student criterion we see that the BlockSize and BlockCache parameters do not matter for the Put operation (which on the whole is quite predictable):

But the fact that increasing the number of partitions reduces performance is somewhat unexpected (we already saw the positive effect of more partitions with BulkLoad), though understandable. First, processing requires sending requests to 30 regions instead of one, and the data volume is not large enough for that to pay off. Second, the total running time is determined by the slowest RS, and since the number of DataNodes is smaller than the number of RSs, some regions have zero locality. Well, let's look at the top five:

Now let's evaluate the results of execution of the Get blocks:

The number of partitions has lost significance, which is probably explained by the fact that in half of the cases the data is served from the cache, and the read cache is the most significant (statistically) parameter. Naturally, increasing the number of records per request is also very beneficial for performance. The best results:

Well, finally, let's look at the model for the blocks that first performed get and then put:

Here all the parameters are significant. And the results of the leaders:


9. Load testing

Well, finally, let's run a more or less serious load; it is always more interesting when there is something to compare against. The website of DataStax, the key developer of Cassandra, has results of load testing a number of NoSQL stores, including HBase version ???-1. The load was applied in 40 threads, the data size was 100 bytes, on SSD disks. Testing Read-Modify-Write operations showed the following results.

As far as I understood, the reads were done in blocks of 100 records, and for 16 HBase nodes the DataStax test showed a throughput of 10 thousand operations per second.
Fortunately, our cluster also has 16 nodes, though it is not quite "fair" that each has 64 cores (threads) while the DataStax test had only 4. On the other hand, they have SSDs where we have HDDs, plus a newer version of HBase; besides, CPU utilization during the load did not increase significantly (visually, by 5-10 percent). Nevertheless, let's try to start with this configuration. Default table settings; reads are performed over the key range from 0 to 50 million, randomly (that is, essentially a new key each time). The table contains 50 million records, divided into 64 partitions. The keys are salted with crc32. MSLAB is enabled. 40 threads are started; each thread reads a set of 100 random keys and immediately writes the generated 100 bytes back to these keys.

Test setup: 16 DataNodes and 16 RSs (CPU Xeon E5-2680 v4 @ ???GHz × 64 threads). HBase version ???-cdh???.
The average result is closer to 40 thousand operations per second, which is significantly better than the DataStax result. However, for the sake of experiment, the conditions can be changed somewhat. It is quite unlikely that all work would be conducted with one table only, and only with unique keys. Suppose there is a certain "hot" set of keys that generates the main load. Therefore, let's try to create a load with larger records (10 KB), also in batches of 10?, in 4 different tables, limiting the range of requested keys to 50 thousand. The graph below shows the launch of 40 threads; each thread reads a set of 100 keys and immediately writes random 10 KB back to these keys.

Test setup: 16 DataNodes and 16 RSs (CPU Xeon E5-2680 v4 @ ???GHz × 64 threads). HBase version ???-cdh???.
During the load, major compaction was launched several times; as shown above, without this procedure performance gradually degrades, although it also creates extra load while it runs. The drawdowns have various causes: sometimes the threads finished and there was a pause while they restarted, sometimes third-party applications created load on the cluster.
Reading and immediately writing back is one of the hardest scenarios for HBase. If you only issue small put requests, e.g. of 100 bytes, batching them into packs of 10-50 thousand, you can get hundreds of thousands of operations per second, and the situation is similar with read-only requests. It should be noted that the results are radically better than those obtained by DataStax, mostly thanks to requests in blocks of 50 thousand.
Test setup: 16 DataNodes and 16 RSs (CPU Xeon E5-2680 v4 @ ???GHz × 64 threads). HBase version ???-cdh???.

10. Conclusions

The system is quite flexible to configure, but the influence of a large number of parameters still remains unknown. Some of them were tested but not included in the resulting test set. For example, preliminary experiments showed little significance for a parameter such as DATA_BLOCK_ENCODING, which encodes information using values from neighboring cells, which is understandable for randomly generated data; with a large number of repetitive objects, the gain could be significant. Overall, HBase gives the impression of a fairly serious and well-thought-out database that can be quite productive when working with large blocks of data, especially if it is possible to separate the read and write processes in time.
If, in your opinion, something has not been covered in enough detail, I am ready to tell more. We invite you to share your experience or to discuss if you disagree with something.