How we killed ourselves in one click by placing our site and billing on a geocluster, or let's talk about redundancy once again
Yes, I am an idiot too. But I did not expect this from myself. I am hardly new to this. I have read plenty of smart articles on fault tolerance, redundancy and so on, and once even wrote something sensible on the subject here myself. For over 10 years I have been the CEO of a hosting provider operating under the brand ua-hosting.company and offering hosting and server rental in the Netherlands, the US and, as of literally a week ago, the UK (do not ask why the name is "ua"; the answer is in our autobiographical article). We provide customers with solutions of varying complexity, sometimes so complex that the customers themselves struggle to understand what they have built.
But damn, today I outdid myself. We completely wiped out our own site and billing, with all the transactions, customer service data and everything else, and I was the one to blame: I myself said "delete". Some of you have already noticed. It happened today, Friday, at 11:20 a.m. US Eastern Time (EST). And our site and billing were not on a single server, and not even in the cloud; we left the data center's cloud 2 months ago in favor of our own solution. Everything ran on a fault-tolerant geo-cluster of two virtual servers, our new product, VPS (KVM) with dedicated drives: INDEPENDENT VPS located on two continents, in Europe and in the United States. One in Amsterdam, the other in Manassas, near Washington, D.C. In two reliable data centers. Their content is continuously duplicated in real time, and fault tolerance is based on an ordinary DNS cluster: requests could arrive at either server, either one could act as MASTER, and if one became unavailable, the other took over its tasks.
I thought that only a meteorite could kill it, or something similarly global that could take out two data centers at once. But everything turned out to be simpler. The setup itself is described in our advertising article.
One server is in the Netherlands, the other in the United States. Yes, besides our site and billing, these nodes also host 2 real clients, who could affect our site in theory but cannot in practice. Why is explained in the advertising article, so I will not go into the details a second time here. That is not the point now. In general, the solution is no worse than entry-level dedicated servers and can handle a very heavy load.
Among other things, it is fault tolerant: the data is continuously replicated in real time, and if one server becomes unavailable, the second takes over the MASTER role. Ideally, you can also arrange for traffic from the Americas to be handled by the American server, and traffic from Europe, Russia and Asia by the server in the Netherlands.
We linked the servers to our own account in our WHMCS billing, a commercially licensed product, adapted for us, that is used by a great many hosting providers around the world, including us, since writing our own accounting system would be plain idiocy (in our case). Especially when the function you need can be implemented by writing your own module for the existing billing, which increases your fault tolerance because it reduces the risk of critical vulnerabilities. After all, alone or even with a small team you cannot write a more reliable system than an existing one that has been written over the years by a huge crowd of developers, in which thousands of bugs have already been fixed, and for which the developers now ask only $30/month per license and earn millions of dollars a year, which can be spent, among other things, on further improvements.
By the way, about critical vulnerabilities: recently our programmer made a mistake while writing one of the service modules, which had read-only access to the billing database. An independent pentester discovered it and asked us to pay $550 for the bug, since it was an SQL injection vulnerability:
SQL injection is in the OWASP Top 10. I quoted you the amount of $550; this is the minimum amount, because the database is affected, which compromises user data.
But some rewards reach $ 1?000, as in the case of vk.com, for example.
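To make the class of bug concrete, here is a minimal sketch of an SQL injection and its usual fix. It is not our module's code: the `services` table, the function names and the use of Python with `sqlite3` are all assumptions chosen for illustration. It only shows why gluing user input into the query text is dangerous even when the database access is read-only, and how a parameterized query avoids it.

```python
import sqlite3


def find_services_unsafe(conn, owner):
    # Vulnerable: user input is concatenated directly into the SQL text,
    # so a crafted value can change the meaning of the query.
    query = "SELECT name FROM services WHERE owner = '" + owner + "'"
    return [row[0] for row in conn.execute(query)]


def find_services_safe(conn, owner):
    # Safe: the value is passed separately from the SQL text as a parameter.
    query = "SELECT name FROM services WHERE owner = ?"
    return [row[0] for row in conn.execute(query, (owner,))]


def make_demo_db():
    # Hypothetical in-memory table, used only for this demonstration.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE services (name TEXT, owner TEXT)")
    conn.executemany(
        "INSERT INTO services VALUES (?, ?)",
        [("vps-1", "alice"), ("vps-2", "bob")],
    )
    return conn


if __name__ == "__main__":
    conn = make_demo_db()
    payload = "nobody' OR '1'='1"
    print("unsafe:", find_services_unsafe(conn, payload))
    print("safe:  ", find_services_safe(conn, payload))
```

With the crafted value, the unsafe version returns other customers' rows, while the parameterized version treats the whole string as an owner name and finds nothing.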
Of course, we supported this initiative and paid the fee without question, since our programmer studied the data provided and confirmed both the problem and the pentester's reasoning. After all, we do not yet have our own pentester on staff, and this work requires considerable knowledge and time, as it includes a whole series of checks:
A security audit of the entire resource, which is a check against the following categories, with our report on completion of the audit:
• A1 Injection
• A2 Broken authentication and session management
• A3 Cross-site scripting
• A4 Broken access control
• A5 Security misconfiguration
• A6 Sensitive data exposure
• A7 Insufficient attack protection
• A8 Cross-site request forgery
• A9 Using components with known vulnerabilities
• A10 Insufficient logging and monitoring
So yes, the decision was made unambiguously and quickly. Moreover, as the pentester noted, such research makes the web as a whole more secure:
This is my hobby. If every developer talked to bug hunters the way you do, the Internet would be 80% safer.
So overall we paid very little, especially if you divide the sum by the number of months during which we did not keep a penetration-testing employee on staff. Many thanks to the pentester for the bug he found and for giving us his time; we really appreciate it. If anyone needs his services, please contact us, and we will share his contacts with his permission.
But this time it was not a vulnerability that killed us. It was ourselves, together with a peculiarity of how WHMCS works. On each node we have a convenient product for managing virtual containers, VM Manager. WHMCS has access to it to create, suspend and delete containers, and clients have access to manage the container created for them.
Every day we receive tens or even hundreds of orders in WHMCS that need to be accepted, deleted, or marked as Fraud if the client tries to pay with a stolen credit card. Sometimes there is a surge of such orders, and we cannot decide at once which status to assign, because we run an internal check or require the user to verify their identity properly if the order looks suspicious to us, and such users, of course, do not always respond or pass verification. As a result, a thousand or two non-activated orders, or orders with unknown status, accumulate from time to time, and they are easier to remove than to process. Whoever really needs the service will order again.
Two months ago we decided to abandon the data center's cloud product completely, since we had started to offer our own solution with VM Manager, which lets you install a system in one click or even from your own image.
We even offered it on NVMe PCIe SSD drives, which are 10 times faster than regular SSDs for reading and up to 3 times faster for writing. Like the cloud solution, it can be upgraded: servers cost from $15 and include the convenient VM Manager control panel and, on request, ISP Manager 5 for free, with upgrades in minimum steps of 5GB DDR4 RAM, 60GB NVMe PCIe SSD and 3 E5-2650 v4 cores up to a larger plan, in Amsterdam, Manassas and London:
VPS (KVM) - E5-2650 v4 (3 Cores) / 5GB DDR4 / 60GB NVMe SSD / 1Gbps 5TB - $15/month
VPS (KVM) - E5-2650 v4 (6 Cores) / 10GB DDR4 / 120GB NVMe SSD / 1Gbps 10TB - $30/month
VPS (KVM) - E5-2650 v4 (9 Cores) / 15GB DDR4 / 180GB NVMe SSD / 1Gbps 15TB - $45/month
VPS (KVM) - E5-2650 v4 (24 Cores) / 40GB DDR4 / 480GB NVMe SSD / 1Gbps 40TB - $120/month
VPS (KVM) - E5-2650 v4 (24 Cores) / 65GB DDR4 / 780GB NVMe SSD / 1Gbps 65TB - $195/month
VPS (KVM) - E5-2650 v4 (24 Cores) / 70GB DDR4 / 840GB NVMe SSD / 1Gbps 70TB - $210/month
VPS (KVM) - E5-2650 v4 (24 Cores) / 75GB DDR4 / 900GB NVMe SSD / 1Gbps 75TB - $225/month
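The plans above follow simple arithmetic: every step adds 5GB of RAM, 60GB of NVMe, 5TB of traffic and $15, while cores grow by 3 per step until they reach 24. The sketch below reproduces the list from that rule. The `Plan` type and `plan_for` function are hypothetical names, and both the 24-core cap and the 15-step limit are inferred from the list itself rather than stated anywhere as a rule.

```python
from dataclasses import dataclass

# Per-step increments taken from the smallest plan in the list.
STEP_CORES = 3
STEP_RAM_GB = 5
STEP_SSD_GB = 60
STEP_TRAFFIC_TB = 5
STEP_PRICE_USD = 15

# Assumptions inferred from the list: cores stop growing at 24,
# and the largest listed plan corresponds to 15 steps.
MAX_CORES = 24
MAX_STEPS = 15


@dataclass(frozen=True)
class Plan:
    cores: int
    ram_gb: int
    ssd_gb: int
    traffic_tb: int
    price_usd: int


def plan_for(steps: int) -> Plan:
    """Return the plan made of `steps` minimum upgrade steps."""
    if not 1 <= steps <= MAX_STEPS:
        raise ValueError(f"steps must be between 1 and {MAX_STEPS}")
    return Plan(
        cores=min(steps * STEP_CORES, MAX_CORES),
        ram_gb=steps * STEP_RAM_GB,
        ssd_gb=steps * STEP_SSD_GB,
        traffic_tb=steps * STEP_TRAFFIC_TB,
        price_usd=steps * STEP_PRICE_USD,
    )


if __name__ == "__main__":
    for steps in (1, 2, 3, 8, 13, 14, 15):
        print(steps, plan_for(steps))
```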
So it no longer made sense for us to rent a huge part of the data center's cloud and offer customers the old E3-1230 processors, even if from $ ??? per month. We believe customers should get maximum quality and maximum performance at the lowest price. Yes, we cannot offer a product for $ ???, and perhaps we do not cover the needs of some developers for whom minimal resources and any level of performance are enough, but a node costs more than 7000 euros, and we cannot afford, at least for now, to place more than 15 customers on it, because we want to be able to guarantee quality. And quality means not only stability but also the best performance/price ratio, that is, cost-effectiveness.
Overjoyed, we canceled the entire cloud infrastructure (thousands of VPS), ordered 2 independent virtual servers (yes, we pay for our own servers), deployed the site and billing on the new solution 2 months ago, as described above, and added them to the protected group so that the system would not suspend them on its own if we suddenly forgot to pay on time. It seemed we had done everything.
And today, 2 months later, we decided to "Cancel" (not delete; there is such a button too, but we try never to delete anything, so that the history is always preserved) 1000+ pending orders that had not yet been assigned a status in the WHMCS billing. Have you guessed? Yes, that's it. I was asked: can we cancel? I confirmed the "delete".
Sometimes, despite ample resources, WHMCS returns a 504 error because the data set is large and some process does not fit within the allotted time limit, while everything keeps running and the billing keeps working. But this time everything became unavailable: neither the billing nor the site responded. We did not understand the reason immediately, but then it dawned on us. The order for our 2 VPS had never been accepted (yes, we had not accepted our own order!), so the system marked it "Cancelled", which triggered the module and deleted the two containers, supposedly never created but in fact created, via our beloved VM Manager. Logging in to one of the nodes, our administrators saw the expected "Goodbye" picture.
Whether this is a flaw on the part of the WHMCS developers, which causes unaccepted orders whose VPS were actually created, with their IDs, to be deleted on cancellation, or our own stupidity (the sales department's), no longer matters. The result was the same: "Farewell, site and billing". The panel simply wiped them. And the administrators had only one question for us (sales):
Why the hell create a service with your main site and billing on it
and then kill it to hell?
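A check of the following kind before a bulk cancellation would have caught our case. This is only a sketch under stated assumptions: the `Order` type, the `split_cancellable` function and the idea of passing in sets of live and protected VM IDs are hypothetical and are not part of the WHMCS or VM Manager APIs. The point is simply that any pending order already tied to an existing container is pulled out for manual review instead of being cancelled with the rest.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set, Tuple


@dataclass
class Order:
    order_id: int
    status: str                   # for example "Pending"
    vm_id: Optional[str] = None   # ID of an already provisioned VM, if any


def split_cancellable(
    orders: Iterable[Order],
    live_vm_ids: Set[str],
    protected_vm_ids: Set[str],
) -> Tuple[List[Order], List[Order]]:
    """Split orders into those safe to cancel and those needing review.

    An order needs review when it references a VM that actually exists
    or that is on the protected list, because cancelling it could
    trigger deletion of a running container.
    """
    safe: List[Order] = []
    needs_review: List[Order] = []
    for order in orders:
        vm = order.vm_id
        if vm is not None and (vm in live_vm_ids or vm in protected_vm_ids):
            needs_review.append(order)
        else:
            safe.append(order)
    return safe, needs_review


if __name__ == "__main__":
    pending = [
        Order(1, "Pending"),
        Order(2, "Pending", vm_id="vm-ams-1"),   # hypothetical ID
        Order(3, "Pending", vm_id="vm-gone"),
    ]
    safe, review = split_cancellable(pending, {"vm-ams-1"}, {"vm-ams-1"})
    print("cancel:", [o.order_id for o in safe])
    print("review:", [o.order_id for o in review])
```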
And although we had backups, also in two geographically separate regions, I felt uneasy. I was not sure how fresh the backups were. I was not sure our administrators had done everything as originally written in the technical specification: that the database really was backed up every hour or even more often, that the data was being updated and several previous versions of the files were kept. That the backups had not stopped altogether because of some software error (after all, I did not check this personally, so why should I be sure our administrators care about our data if I myself neglected to check?). A lot of negative thoughts. May the universe spare you from living through this!
I was already thinking that at least 1 hour of transactions, or even more, would be missing, and that we would have to restore customer payments manually, reconcile them against earlier transactions, and write to the account holders that we had re-created their invoice and marked it paid, showing ourselves in a poor light with a notification that we are fools who allowed such a software failure. And if there were no fresh backup, it would be a complete disaster: restoring everything would be very long and tedious.
For this case we have an internal table in which much of the basic data is manually duplicated and updated by us, which rules out software failure and the overwriting of correct data with incorrect data. Despite having backups, we still use this method. After all, nobody has ruled out the possibility of a global catastrophe.
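The uncertainty about backup freshness described above is something a small periodic check can remove. The sketch below is not how our backups are monitored; the `stale_backups` function and the region names are hypothetical. It takes the timestamp of the newest backup per location and reports every location whose backup is missing or older than the hourly interval from our specification.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


def stale_backups(
    last_backup_times: Dict[str, Optional[datetime]],
    now: datetime,
    max_age: timedelta = timedelta(hours=1),
) -> List[str]:
    """Return the names of locations whose newest backup is missing
    or older than `max_age`, sorted alphabetically."""
    stale = []
    for name, timestamp in last_backup_times.items():
        if timestamp is None or now - timestamp > max_age:
            stale.append(name)
    return sorted(stale)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Hypothetical locations and timestamps, for illustration only.
    report = stale_backups(
        {
            "region-a": now - timedelta(minutes=20),
            "region-b": now - timedelta(hours=5),
            "offsite": None,
        },
        now,
    )
    print("stale or missing backups:", report)
```

Running such a check from outside the system being backed up, and alerting on a non-empty result, means a silently stopped backup job is noticed within hours rather than on the day of the disaster.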
Fortunately, everything turned out not so bad, even for the technical specialist who had to solve the problem and who announced at the outset:
The evening was a success, thank you all.
I'm off to bring it back up.
The evening really was a success in the end. Since the solution used LVM from the start and a new virtual container had not yet been created, it was possible to recover the current data, although not without some dancing with a tambourine:
Everything was done with the LVM utilities: with their commands we restored the volume group, then the logical volume, then activated the partition, mounted it to a temporary folder, created the server, and copied the data into it. Other approaches were possible, but in our case this was the fastest, also given the specifics of our virtual server setup, where each server has its own RAID.
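For readers who have never done this, the steps above roughly correspond to a command sequence like the one generated below. This is a sketch, not a record of what our administrators typed: the function name, volume group, logical volume, archive file and mount point are all hypothetical, and it assumes LVM's own metadata archive still holds the state from before the deletion and that the freed extents have not been reused. The function only builds the commands as text and executes nothing.

```python
import shlex
from typing import List


def lvm_restore_plan(
    vg: str, lv: str, archive_file: str, mount_point: str
) -> List[str]:
    """Build (but do not run) shell commands for restoring a deleted
    logical volume from an LVM metadata archive."""
    steps = [
        # List the archived metadata versions available for the volume group.
        ["vgcfgrestore", "--list", vg],
        # Restore the volume group metadata from the chosen archive file.
        ["vgcfgrestore", "-f", archive_file, vg],
        # Activate the logical volume that reappeared in the metadata.
        ["lvchange", "-ay", f"{vg}/{lv}"],
        # Mount it read-only in a temporary folder to copy the data out.
        ["mkdir", "-p", mount_point],
        ["mount", "-o", "ro", f"/dev/{vg}/{lv}", mount_point],
    ]
    return [shlex.join(step) for step in steps]


if __name__ == "__main__":
    # All names below are placeholders for illustration.
    for command in lvm_restore_plan(
        "vg0", "vm101", "/etc/lvm/archive/vg0_00042.vg", "/mnt/rescue"
    ):
        print(command)
```

If the logical volume holds a whole partitioned disk image rather than a single filesystem, an extra step to map its partitions would be needed before mounting.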
What conclusions have we drawn? Backup and redundancy planning should take vulnerabilities into account, along with the most stupid possible scenario, in which everything, even the backups, can be destroyed. We avoided heavy losses only because the data had not been completely erased. Had we needed to restore from backups, we would have lost up to an hour of transactions and a significant amount of working time. It seemed to us that with a geocluster the probability of ever needing backups was minimal. We were wrong. We had not considered that both servers could be deleted at once, and that the ones deleting them would be ourselves.
You should always have external storage that is independent of your system, with access preferably only via some code, which is itself backed up so that the data cannot be lost. Right now, despite having backups in our own infrastructure in two regions, I am seriously considering something like Amazon Glacier, although it is very expensive. According to our administrators, everything is fine there only in marketing terms; once you start using it, you find that the solution is quite expensive, because you pay for each request and each file, counted in a very interesting way by their aws-cli application, especially if the data needs to be restored. Recently a client from Britain asked us to set up backups there and gave it up after several months of use: it turned out to be very expensive. Still, we need to decide what costs more. If the budget for backups there does not exceed the possible damage from losing some data, we will definitely use it. Otherwise we will look for another solution at a better price, but still one independent of us, to provide additional reliability and confidence that the data will not be lost.
As for uptime, it is not that important: any losses from downtime can be recovered, especially if you offer a unique product. So do not fixate on excessive fault tolerance; it is better to add redundancy, in particular in how backups are stored, because after a data loss no downtime will seem very scary.
P.S. The events took place today, on Friday (published on Friday, EST). Sorry for the wall of text; I decided to write it all down while it is fresh in my memory. I hope my experience will be useful to someone and will save you from a similar disaster, so that on Friday you enjoy the evening before the weekend instead of writing an article about your mistakes, as I did. Although whatever happens is for the best; things could have been much worse. Feel free to share your own screw-ups in the comments. Have a pleasant weekend, whether it is coming or already here!
Advertisement.
VPS (KVM) E5-2650 v4 (6 Cores) 10GB DDR???GB SSD 1Gbps in the Netherlands free until December when paying for six months; you can order here. 30% discount on the first payment for VPS in the Netherlands, the USA and England here.
Dell R730xd 2 times cheaper? Only we have 2 x Intel Dodeca-Core Xeon E5-2650v???GB DDR4 6x480GB SSD 1Gbps 100 TB from $249 in the Netherlands and the USA! Read about How to build corporate-class infrastructure using Dell R730xd E5-2650 v4 servers worth 9000 euros for pennies?