Fault Injection: your system is unreliable if you have not tried to break it

Hi, Habr! My name is Pavel Lipsky. I am an engineer, I work in the company Sberbank-Technology. My specialization is testing the fault tolerance and performance of backends of large distributed systems. Simply put, I break other people's programs. In this post I will talk about fault injection - a testing method that allows you to find problems in the system by creating artificial failures. To begin with, I came to this method, then we will talk about the method itself and how we use it.
 
 
Fault Injection: your system is unreliable if you have not tried to break it 3r311. 3r3997.
 
The article will be examples in Java. If you don’t program in Java, that's okay, it’s enough to understand the approach and the basic principles. Apache Ignite is used as a database, but the same approaches are applicable to any other DBMS. All examples can be downloaded from my GitHub .
 
 
3r33891. Why do we need all this?
 
I'll start with the story. In 200? I worked at Rambler. By that time, the number of Rambler users was rapidly growing, and our two-tier architecture “server - database - server - applications” could no longer cope. We were thinking about how to solve performance problems, and paid attention to the memcached technology.
 
 
 
 
What is memcached? Memcached is a hash table in RAM with the ability to access stored objects by key. For example, you need to get a user profile. The application is addressed to memcached. If there is an object in it, then it is immediately returned to the user. If there is no object, then the database is accessed, the object is formed and put into memcached. Then, during the next call, we no longer have to make a resource-intensive call to the database — we will get the finished object from the RAM - memcached.
 
 
Due to memcached, we noticeably unloaded the database, and our applications started working much faster. But, as it turned out, it was too early to rejoice. Along with the increase in productivity, we have new problems.
 
 
3r3342.
 
 
When you need to change data, the application first makes a correction to the database (2), creates a new object, and then tries to put it into memcached (3). That is, the old object must be replaced by a new one. Imagine that at that moment a terrible thing happens - the connection between the application and memcached is broken, the memcached server or even the application itself crashes. This means that the application could not update the data in memcached. As a result, the user goes to the site page (for example, his profile), sees the old data and does not understand why this happened.
 
 
Was it possible to detect this bug during functional testing or performance testing? I think that, most likely, we would not have found it. To search for such bugs there is a special type of testing - fault injection.
 
 
Usually during fault injection testing, there are bugs, which are popularly called 3r33352. floating 3r33333. . They appear under load, when more than one user is working in the system, when abnormal situations occur - equipment fails, electricity is cut off, the network fails, etc.
 
 
3r33891. The new IT system of Sberbank
 
A few years ago, Sberbank began building a new IT system. What for? Here are the statistics from the site of the Central Bank:
 
 
 
 
The green part of the column is the number of cash withdrawals at ATMs, the blue part is the number of transactions for payment for goods and services. We see that the number of non-cash transactions is growing from year to year. After a few years, we will need to be able to handle the growing workload and continue to offer new services to customers. This is one of the reasons for creating a new IT system for Sberbank. In addition, we would like to reduce our dependence on Western technologies and expensive mainframes, which cost millions of dollars, and switch to open source technologies and a low-end server.
 
 
Initially, we laid the foundation for Apache Ignite technology at the heart of Sberbank. More precisely, we use the paid Gridgain plugin. The technology has a fairly rich functionality: combines the properties of a relational database (there is support for SQL queries), NoSQL, distributed processing and data storage in RAM. Moreover, when you restart the data that was in RAM, will not be lost. Starting with version 2.? Apache Ignite introduced Apache Ignite Persistent Data Store distributed disk storage with SQL support.
 
 
I will list some features of this technology:
 
 
3r3993.  
3r3995. Storage and processing of data in RAM 3r3-31003.  
3r3-3000.  
3r3995. Disk storage
 
3r3-3000.  
3r3995. SQL support
 
3r3-3000.  
3r3995. Distributed task execution 3r3-31003.  
3r3-3000.  
3r3995. Horizontal scaling
 
3r3-3000.  
 
The technology is relatively new, and therefore requires special attention.
 
 
The new IT system of Sberbank physically consists of many relatively small servers assembled into one cluster-cloud. All nodes are identical in structure, equal to each other, perform the function of storing and processing data.
 
 
Inside the cluster is divided into so-called cells. One cell is 8 nodes. Each data center has 4 nodes.
 
 
3r3128.
 
 
Since we use Apache Ignite, in-memory data grid, then, accordingly, all this is stored in server-distributed caches. And the caches, in turn, are divided into identical pieces - partitions. On the servers they are presented as files. Partitions of the same cache can be stored on different servers. For each partition in the cluster there are primary (primary node) and backup nodes (backup node).
 
 
The main nodes store the main partitions and process requests for them, replicate data to the backup nodes (backup node), where the backup partitions are stored.
 
 
While designing a new Sberbank architecture, we came to the conclusion that the system components can and will fail. Say, if you have a cluster of 1000 iron low-end servers, then from time to time you will have hardware failures. RAM strips, network cards and hard drives, etc. will fail. We will consider this behavior as completely normal behavior of the system. Such situations should be handled correctly and our clients should not notice them.
 
 
But it is not enough to design the system's stability to failure, it is necessary to test the systems during these failures. As the well-known distributed systems researcher Caitie McCaffrey from Microsoft Research says: "You will never know how the system behaves in an emergency situation until you reproduce this emergency situation."
 
 
3r33891. Lost updates
 
Let us examine a simple example, a banking application that simulates money transfers. The application will consist of two parts: the Apache Ignite server and the Apache Ignite client. The server part is a data storage.
 
 
3r3615. 3r31616. CacheConfiguration
cfg = new CacheConfiguration <>(CACHE_NAME);
cfg.setAtomicityMode (CacheAtomicityMode.ATOMIC);
try (IgniteCache
cache = ignite.getOrCreateCache (cfg)) {
for (int i = 1; i <= ENTRIES_COUNT; i++)
cache.put (i, new Account (i, 100));
3r-331011. System.out.println ("Accounts before transfers");
printAccounts (cache);
PachTotalBalance (cache); 3r3-31011.
3r3-31011. 3r3-31011. 3r3-31011. Private static void transferMoney (IgniteCache 3r3-33617. Cache, int fromAccountId, int toAccountId) {
Account fromAccount = cache.get (fromAcountId); ipur-? ? 311011. Account fromAccount = cache.get (fromAcountId); 3r3-31011. 3r3-31011. Int amount = getRandomAmount (fromAccount.balance); 3r3-331011. If (amount < 1) {
Return; 3r3-31011.}
3r-31011. FromAccount.withdraw (an) a; a; a; a; a; i will be used by; it (amount);
cache.put (fromAccountId, fromAccount);
cache.put (toAccountId, toAccount);
}
3r3622.
 
The client application connects to the Apache Ignite server. Creates a cache, where the key is the account ID, and the value is the account object. A total of ten such objects will be stored in the cache. In this case, initially for each account we put $ 100 (so that there was something to transfer). Accordingly, the total balance of all accounts will be equal to $ ?000.
 
 
Next, the client application connects to the server node and performs 100 random money transfers between these 10 accounts. For example, from account A to another account B, $ 50 is transferred. Schematically, this process can be depicted as follows:
 
 
3r3208.
 
 
The system is closed, transfers are made only inside, i.e. The total balance must remain at $ 1000.
 
 
 
 
Run the application.
 
 
3r33630. 3r3633.

 
We received the expected total balance value of $ 1000. Now let's complicate our application a bit - let's make it multi-tasking. In reality, several client applications can work simultaneously with the same account. We will launch two tasks that will simultaneously make money transfers between ten accounts.
 
 
3r3615. 3r31616. CacheConfiguration
cfg = new CacheConfiguration <>(CACHE_NAME);
cfg.setAtomicityMode (CacheAtomicityMode.ATOMIC);
cfg.setCacheMode (CacheMode.PARTITIONED);
cfg.setIndexedTypes (Integer.class, Account.class);
try (IgniteCache
cache = ignite.getOrCreateCache (cfg)) {
//Initializing the cache.
for (int i = 1; i <= ENTRIES_COUNT; i++)
cache.put (i, new Account (i, 100));
3r3-31011. System.out.println ("Accounts before transfers");
System.out.println (
PrintAccounts (cache);
PrintTotalBalance (cache); 3r3-31011.
IgniteRunnable run1 = new MyIgniteRunnable (cache, ignite, 1);
List
Arr = Arrays.asList (run? run2);
Ignite.compute () .run (arr); 3r3-31011.}
3rr31011.
Account fromAccount = cache.get (fromAccountId); 3r3011011. Account toAccount = cache.get (toAccountId);
Int. .} 3r3-31011. 3r3-31011. Int fromAccountBalance BeforeTransfer = fromAccount.balance;
int toAccountBalanceBeforeTransfer = toAccount.balance;
fromAccount.withdraw (amount);
toAccount.deposit (amount);
cache.put (fromAccountId, fromAccount);
cache.put (toAccountId, toAccount);
}
3r3622.
 
 
3r33630. 3r3633.
+ 0 -

Add comment