The world of magic PostgreSQL: an interview with Nikolay Samokhvalov

Today we talk with Nikolay, a "fighter" for the promotion of new technologies in the database world, a member of our program committee and an active participant in various conferences. The main topics: self-managed DBMSs, DBA AI, clouds, NoSQL, built-in database monitoring mechanisms, talks at RIT++ and HighLoad++ Siberia, as well as lots of useful tips and examples that can come in handy in real work for both a developer and a DBA.
…#RuPostgres, represented at meetups and on YouTube. By the way, we are the second-largest Postgres community in the world, with more than 1600 members: ahead of the SF Bay Area group, though still behind New York.
Now I'm based in the Valley and visit Moscow regularly. I help companies get the most out of Postgres, more and more often in the Amazon and Google clouds.
- What is your focus now in the context of working with data?
One of the hottest topics in the database world today is self-managed DBMSs. Since last year, Oracle has been actively pushing the claim that its cloud DBMS is autonomous and does not need a "driver" (a DBA), and meetups with Andy Pavlo, the Carnegie Mellon University professor who coined the concept of self-driving databases and created the Peloton DBMS, gather 200+ people. By the way, he recently accepted our invitation and will come to HighLoad++ 2018.
Meanwhile Postgres, being, of course, the best open source DBMS, with insanely growing popularity, still requires a huge amount of knowledge and manual work in operation.
I am sure that in the coming years the situation will change for the better: more understandable and automated tools will appear to help both DBAs and developers, including tools that automate database tuning and query optimization.
- So a DBA now must have extensive knowledge in various areas? Does the real work go beyond knowledge of the database itself?
A DBA's work is far from trivial. In and around the database there is always a huge amount of auxiliary data: internal statistics, logs, monitoring metrics. To find a performance bottleneck, you need to be able to quickly make sense of this heap of numbers, and that requires a lot of knowledge and experience. When the work of a competent DBA is evaluated, I often hear phrases like "black magic" or "excellent intuition", although in fact such a DBA is first of all armed with a large body of knowledge and many years of experience.
And at the same time, a DBA's work today still consists of purely manual actions! For example: you have found the slowest query in pg_stat_statements. But in order to optimize it, you need a specific query with specific parameters, because both the execution plan and the query speed depend on them, and without them you cannot run EXPLAIN (the situation will improve somewhat in Postgres 12 if the patch from PostgresPro is accepted). So what does a DBA do now? One of two things: either dig specific query examples out of the logs, or "invent" some parameter values, ending up optimizing a spherical horse in a vacuum. The latter option works well only if you have years of experience behind you. If you have that "intuition".
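To make the starting point concrete, here is a sketch of the kind of pg_stat_statements lookup I mean. It assumes the extension is installed and preloaded; the column names are those of Postgres 12 and earlier, where the total time column is still called total_time.

```sql
-- Prerequisites: shared_preload_libraries = 'pg_stat_statements'
-- and CREATE EXTENSION pg_stat_statements; in the target database.
-- Top 5 queries by total execution time:
SELECT
    queryid,
    calls,
    round(total_time::numeric, 2)           AS total_ms,
    round((total_time / calls)::numeric, 2) AS avg_ms,
    left(query, 80)                         AS query_sample
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 5;
```

Note that the query text here is normalized: constants are replaced with $1, $2 and so on, which is exactly why you still have to fish real parameter values out of the logs before you can run EXPLAIN.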
And then, when a specific solution that eliminates the problem has been chosen, say, we need to add an index, an even more amusing thing happens. If the DBA is experienced and has access to production, they may debug it right "on the go" and even create the index manually. Bypassing git, yes. And if the project "really, really must" run DDL through git, then the index gets a name like i_tmp_killme, and the developers are asked to add it to the migration system and re-create the index under a saner name.
If the DBA's authority is beyond question, obedient developers will not ask questions. But in companies with a good culture, git flow, code review, devops and curious developers, you need to explain in advance, before any real action on production databases, why exactly this solution was chosen and how it speeds up the query. In the Valley, by the way, developers are often meticulous; everything has to be chewed through and justified. And this is where clouds help a lot. It's very convenient: in a couple of clicks you create a replica of the production DB in AWS RDS, run `explain (analyze, buffers)` on it in different variants, find the best solution, and present it to the developers with concrete estimates of the performance improvement and detailed diagnostics. RDS has automated database provisioning beautifully, but it does not free you from the mass of manual work of query optimization and database tuning: nobody (for now!) runs EXPLAIN for you and explains the plan to the developers.
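A minimal sketch of such a check on the replica (the table, column and index names here are hypothetical, invented for illustration):

```sql
-- On a clone of the production DB: measure the query before and after
-- a candidate index. BUFFERS shows how many 8 KB pages were touched.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42 AND status = 'pending';

CREATE INDEX i_orders_customer_status ON orders (customer_id, status);

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42 AND status = 'pending';
-- If "Buffers: shared hit/read" and the execution time drop sharply,
-- these two plans are the concrete justification to show the developers.
```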
As a result, the work of a Postgres DBA today looks like driving an excellent high-speed car with a manual gearbox. Do I like driving stick? Yes, but not every day.
- In fact, you have briefly described the beginning of an algorithm for finding problematic queries. Can this knowledge serve as the basis for creating new useful tools on top?
Right. That's why my area of interest now is DBA AI: artificial intelligence that helps "drive" Postgres quickly and effectively, without purely manual actions, at a large and growing scale (usually in the clouds). Nothing of the kind exists in the wild yet, but work in this direction is already underway (one of the interesting, though still purely research, projects is the already mentioned Peloton).
- At RIT++, in your talk "Database First! On common errors of using RDBMS", you said that the DBMS is the heart of an IT system and that ignoring the database's capabilities is foolish. What examples can you give in support of your words? First of all, cases where using standard database control mechanisms helped to avoid mistakes, or vice versa, where not using standard features led to disastrous results: for example, the lack of foreign keys and, perhaps, other mechanisms that seem ordinary at first glance.
In that same talk I argued that "lamentable results" are observed in most projects in which maintaining the integrity and "purity" of data is taken out of the DBMS and implemented on the application side, in PHP, Ruby, Python, or something else. As such a project accumulates data, "dirty data" piles up: "orphan rows" (no foreign keys were used; users were deleted, but the data attributed to them was not), duplicates (uniqueness checks not implemented, or implemented incorrectly, on the application side), invalid or erroneous values. Another question is how you feel about such problems. If it's a small blog, then perhaps there is no big trouble. But what if it's a billing system? Once you move checks out of the database, you admit the possibility that someone (a person or a program) will appear who bypasses that check. Not to mention that your implementation of the checks may itself be far from ideal.
So it is useful to know and apply the capabilities the DBMS provides for enforcing data integrity constraints. There are only a few of them: support for unique values (in practice implemented via a unique index), foreign keys, user-defined restrictions (achieved with the CHECK keyword), prohibition of NULL values, and, in Postgres, also the special exclusion constraints, with whose help it is convenient to ensure, for example, the disjointness of intervals.
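All of these constraint kinds fit into one short DDL sketch (the schema is invented for illustration; a users table is assumed to exist, and the exclusion constraint needs the btree_gist extension for plain equality on a text column):

```sql
CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE TABLE booking (
    id      bigserial PRIMARY KEY,
    user_id bigint    NOT NULL REFERENCES users (id),  -- foreign key: no orphan rows
    room    text      NOT NULL,                        -- NOT NULL: a value is required
    price   numeric   NOT NULL CHECK (price >= 0),     -- CHECK: a user-defined rule
    period  tstzrange NOT NULL,
    UNIQUE (user_id, period),                          -- uniqueness, backed by a unique index
    EXCLUDE USING gist (room WITH =, period WITH &&)   -- exclusion constraint:
                                                       -- no overlapping bookings per room
);
```

The EXCLUDE constraint on the last line is exactly the "disjointness of intervals" case: two bookings for the same room with overlapping periods are rejected by the database itself, no application code involved.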
Using suitable data types is also an important tool for ensuring data purity. A simple example of an obvious and very common mistake: storing e-mail addresses in a column of type text (or varchar) and hanging an ordinary unique index on that column. Of course, e-mail checks need to be case-insensitive, so the citext data type is better (it is not available out of the box, but the citext extension ships with most Postgres installations), or a functional index of the form `create index using btree (lower(email))`. By the way, in the latter case do not forget to rebuild statistics (`vacuum analyze`), so that Postgres understands the distribution of the values of the expression `lower(email)` in the table.
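As a sketch, both variants side by side (the table names are invented):

```sql
-- Variant 1: the citext extension gives a case-insensitive type,
-- so an ordinary unique constraint is enough.
CREATE EXTENSION IF NOT EXISTS citext;
CREATE TABLE account (
    email citext NOT NULL UNIQUE
);

-- Variant 2: keep text, but enforce uniqueness on the lowercased expression.
CREATE TABLE account2 (
    email text NOT NULL
);
CREATE UNIQUE INDEX i_account2_email ON account2 USING btree (lower(email));

-- Refresh statistics so the planner learns the distribution of lower(email):
VACUUM ANALYZE account2;
```

With variant 2, remember that queries must also search by `lower(email)`, otherwise the functional index will not be used.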
In addition to the correct use of data types and support for various kinds of integrity constraints, the DBMS makes it possible to implement complex data validation logic: for example, when modifying data in one table, you may need to run checks involving several tables. This is a task for triggers. Based on my experience, which includes very different projects and very different people, I am prepared to assert: dislike for triggers is directly proportional to a developer's DB ignorance. It's that simple a formula.
No one in their right mind will claim that PL/pgSQL or, say, PL/Python is super-fast. In plain arithmetic computations PL/pgSQL (as, indeed, plain SQL) is noticeably inferior to C and even PHP. It is so slow that only a madman would use it for such purposes at any significant scale (well, or someone using, for example, the MADlib library, which, by the way, I greatly respect and sometimes use with pleasure). But for working with a very large amount of data, when you need to find the right needles in haystacks, there is nothing better than sitting "next to the data", when all the indexes available in the database and the absence of network overhead play on your side, plus the help of one of the two most popular programming languages in the world: SQL. Not using these capabilities when they are clearly profitable is exactly the stupidity! A well-written trigger both works fast and is fairly easy to debug (the pg_stat_statements and auto_explain extensions help here, with the options `pg_stat_statements.track = all` and `auto_explain.log_triggers = on` enabled) and, most importantly, is a barrier that dirty data cannot get past.
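As a hedged illustration of such a cross-table check (a made-up schema, not a recipe from the talk): forbid adding items to an order that has already shipped.

```sql
-- Assumes tables orders(id, status) and order_items(order_id, ...) exist.
CREATE FUNCTION check_order_open() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    IF EXISTS (SELECT 1 FROM orders
               WHERE id = NEW.order_id AND status = 'shipped') THEN
        RAISE EXCEPTION 'order % is already shipped', NEW.order_id;
    END IF;
    RETURN NEW;
END;
$$;

CREATE TRIGGER order_items_check
BEFORE INSERT OR UPDATE ON order_items
FOR EACH ROW EXECUTE PROCEDURE check_order_open();

-- Debugging aids mentioned above (set in postgresql.conf):
--   pg_stat_statements.track = all   -- also track statements inside functions
--   auto_explain.log_triggers = on   -- include trigger timings in logged plans
```

Whichever entry point writes to order_items (website, API, admin script), this check cannot be bypassed.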
- Continuing the previous question: tell us why the built-in PostgreSQL capabilities for controlling and manipulating data are more optimal and advantageous than self-written constructions?
One obvious reason: the built-in capabilities were developed, and have been refined for many years, by smart people, the Postgres developers, such as Tom Lane.
Another, architectural, reason we have already discussed, but there is something to illustrate it with. How many entrances does your house have? One? Two? When you leave, how many doors do you close and control? Surely not ten? A modern project can simultaneously have a website, a back office, and various APIs for mobile applications, possibly including ones for external users. If you implement integrity support by means of the application, then your house has many doors through which visitors (in the case of a database, data) enter and exit. And if you are even "luckier" and the project is developed not by two or three programmers but by a large team, or even a group of teams, then that's it, we've arrived: your doors are built and maintained by different people and teams, which often even speak different languages. It's clear what I'm getting at, right? You either have to limit and restrain the development of the project ("you can't connect this ready-made GUI for working with the data: after all, a manager could write anything into the database; better build the admin panel ourselves!"), or somehow synchronize the development of one and the same data control mechanisms across different subsystems, often in different languages.
In fact, I am saying trivial things here, but my personal observations show how many stereotypes, such as the one-sided "stored procedures are evil" or "foreign keys are a big brake", are still alive, how much wheel-reinvention they lead to, and how dearly projects pay for it later.
- There are a lot of discussions and questions around PostgreSQL releases. You are probably often asked about the correct ways to migrate from versions 9.3 - 9.6 to 10. Should one always use the pg_upgrade tool? Is it justified? Have you met situations where this tool should be feared?

There was a very painful bug in one of the 9.x versions; many (including me) had sleepless nights "thanks" to it. Fortunately, that is already in the past. Of course, there is no 100% insurance against bugs, but today pg_upgrade is practically an industry standard, and its use is reasonable in most situations where a few minutes of downtime are permissible. Lucky are those who are already in the clouds with a managed database such as AWS RDS: there you do not think about it at all, you just plan a maintenance window and perform the upgrade. If you do have to think about it, it is worth experimenting as much as possible, running at least a few test upgrades on a clone of the database (at full volume and on the same infrastructure, of course, if data volumes and resources allow). Here, by the way, it is again tempting to use clouds, but at a lower level: plain EC2 machines in Amazon, for example. This is very cheap if you do not forget about the possibility of spot instances.
There is a fresh and detailed case study of how 1500 databases with a total volume of 55 TB across 6 data centers were upgraded in just 15 minutes. Note how many tests they ran before performing the operations "in production". The main formula of that article is universal: "Test, find problems, fix; lather, rinse, repeat." Here I am really tempted to start a conversation about the automation of experiments, but I have probably tired you already.
If such a short downtime is unacceptable, then you have to consider solutions that are more labor-intensive to implement, first of all those based on pglogical (there is a fresh post on this topic).
- At the May RIT++ you are announced with the talk "Continuous Database Administration". Judging by the description, the first part of the talk is devoted to the postgres_dba tool. Please tell us about it.
postgres_dba is a "semi-automatic" Swiss-army knife for working with Postgres. Any experienced DBA has a collection of useful scripts accumulated over the years, answering various questions: from "which tables in this database take up the most space" to evaluating bloat, finding unused or redundant indexes, and working with pg_stat_statements. I have shoveled through more than one such collection and built an interactive menu on top of them, so now, right in the native Postgres console, psql, you can "talk" with Postgres and get answers to your questions very quickly. It is very easy to add any of your own reports or interactive scripts to this set (for example, you could add something for adding and removing database users with password generation, to make it as secure as possible).
Of course, this tool does not make the DBA's work fully automated. But it already speeds it up noticeably. For myself, I have noticed that the resulting tool brings the DBA and the DB they work with closer together: reports are now just a couple of keystrokes away, a lot of time is saved, and a full "health check" of the DB is very fast, and therefore happens more often. The project is open source; I invite you to try it and take part in its development.
- The second part of the talk consists of several blocks devoted to automation, cloud solutions and AI-related issues. Which of these areas do you think will see active use in the near future?
There are several directions at once. First, Amazon, Google and Microsoft already provide so-called managed databases: the questions of database deployment, replication setup, switchover/failover, automatic backups and basic monitoring are solved for you. But such solutions, although based on open source products, are currently not made in the spirit of FLOSS. AWS RDS does not even allow you to download backups, to say nothing of replicating anywhere other than to another RDS instance. And Google Cloud SQL for Postgres, though it announced GA in April, is still extremely poor in terms of Postgres configuration: it does not even let you log queries, only errors.
But in general, these are all success stories of building proprietary solutions on top of open ones, especially when talking about RDS and Aurora. At the same time, I believe that in the near future open analogs will appear, ready to work anywhere (in the Amazon or Google cloud, in a private cloud, or on-premise), while being no less advanced, with different options for building HA solutions and backup policies, and with full access to the OS and FS with superuser rights. According to my observations, this niche definitely exists and is not occupied yet. By the way, the building blocks for this have already been created and battle-tested in numerous companies: Patroni and Stolon for bringing up a master and replicas with auto-failover support, the k8s operators that have recently begun to be developed by Zalando and by Crunchy, and the backup solutions WAL-E, WAL-G and pgBackRest.
By the way, I strongly recommend that engineers from Russia, all of them, not just DBAs, move their consciousness to the clouds as soon as possible, if they have not already done so. A strange story has emerged here. It is well known that in most cases Russia adopts new technologies two to three years later than the Valley. But in the case of clouds this lag is clearly larger. In the US, clouds today are practically a commodity (sorry, I do not know the Russian equivalent of the word). Yet in past years, "cloud" talks at RIT/HighLoad++ and other conferences could be counted on one hand. Now, it seems, a turning point has come. For various reasons, people are finally ripening. Channels to the European AWS regions have improved: RTT from Moscow, by my observations, has already dropped to ~50 ms. Yandex and other local players are growing their own solutions with might and main; we are very much looking forward to the moment when you can simply register, enter card details and start using them. In addition, Roskomnadzor here diligently teaches everyone flexibility and mobility.
The second story is the gradual realization that we need much more automation than just bringing up Postgres and configuring replication and backups. If you are over 35-40 and a programmer, you may have caught the era of work without version control systems (now it's Git everywhere; before it there were CVS, SVN and others). People used ftp; the layout designer sent HTML to the programmer. It was darkness, a stone age that is now hard to imagine. All of that has been automated; special thanks to Linus Torvalds, as well as the GitHub, Bitbucket and GitLab teams. The same story with tests: once there was only manual testing, now there are autotests and CI everywhere. Likewise with deployment: there are devops tools, no manual configuration of servers and services; everything is scripted and automated.
So in the DBA field it is now, probably, no longer the stone age (thanks to the cloud companies for at least the basic automation), but it is still far from the iron age. DBA labor is still manual; I have already given examples. There is still much to achieve, and new technologies will appear. In our company, an analog of Database Replay is already undergoing internal testing; it allows running many parallel experiments without manual actions, above all, of course, in the clouds. For example, if you want to pick an optimal parameter value, say, try different options for `shared_buffers`, `seq_page_cost` or `default_statistics_target`, you just ask our robot DBA Nancy. It will deploy several copies of your database (one with the settings "as they are now in production", the others with those you want to check), replay a slice of real load from your production database or the queries you specify manually, and collect the results into one table, where you can see how each option performed overall and how long it took, plus a per-cluster comparison of the most demanding queries. You can do the same when optimizing a set of indexes: you can ask the system to check different indexes.
This is a very important point: when optimizing one query manually, you often forget to check related queries, and they can easily get worse. A recent example: SELECTs were optimized by replacing part of a table's indexes with their lighter partial analogs (`create index … where …`), but it turned out that UPDATEs slowed down significantly. The reason: after the indexes were replaced, HOT updates stopped working (the case was analyzed in detail in my article). Automating experiments with our solution lets you see the whole picture, all the trade-offs. Only such an automated system makes it possible to guarantee the basic principle of optimization: the solution found should improve something concrete without worsening everything else.
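A sketch of that trap, with invented names: HOT updates are possible only when the update touches no column used by any index on the table, including columns referenced in a partial index's WHERE predicate.

```sql
-- Before: a plain index; UPDATEs that change only "status" keep the
-- HOT path, since "status" is not used by any index.
CREATE INDEX i_tasks_created ON tasks (created_at);

-- After: a lighter partial index makes SELECTs over pending tasks faster...
DROP INDEX i_tasks_created;
CREATE INDEX i_tasks_pending ON tasks (created_at) WHERE status = 'pending';

-- ...but UPDATE tasks SET status = 'done' WHERE id = $1 now modifies a
-- column referenced in an index predicate, so such updates go through the
-- regular (non-HOT) path: extra index maintenance and more work for VACUUM.
```

An automated experiment that replays both SELECT and UPDATE load would catch this regression; checking the SELECTs alone would not.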
And finally, the third direction is the story of DBA AI. Last year a solution appeared that uses ML to tune Postgres and MySQL, called OtterTune, made by the same team of Andy Pavlo. Before that, various solutions appeared for proprietary databases; Microsoft's AutoAdmin project stands out, started back in the 2000s, whose results later flowed into SQL Server itself. Today's all-cloudy and (ostensibly) autonomous Oracle, and also Peloton, which I have already mentioned, are further bells.
This direction is extremely interesting, and breakthroughs are definitely worth waiting for. Let me explain why now. Two aspects are very important. The first is the explosive development of machine learning methods that we have been seeing for the last five or six years: new techniques, new hardware. The second is the development of clouds and their falling prices. A small story on that note. My good friend Dima Pushkarev told in a 2013 talk how he reduced the cost of genome sequencing tenfold thanks to interesting proprietary algorithms for using spot instances in AWS; it was a big breakthrough. Then in 2015 Amazon bought his company and absorbed those developments, so now EC2 spots are a workhorse in a huge number of companies, including the largest of the Fortune 500, for example Disney. Spots come in very handy for us: with them, conducting experiments becomes economically viable, and without them the appearance of our robot DBA Nancy would have been unlikely.
The topic of creating artificial intelligence that learns on experiments with different databases and workloads (including yours) and helps engineers with database configuration, indexes, partitioning and query optimization is not new in academia or in the enterprise sector (Oracle, Microsoft and IBM have had various solutions for decades), but for us, those who mostly use open source, it is a whole new world. Some components already exist; we will discuss them at the forthcoming RIT++ festival. But, speaking generally, we are at the very beginning of a very interesting road.

- We have touched on conference topics, but not the trends of the industry as a whole. What do you think: has the hype around NoSQL solutions already ended? What forecasts can you give about the rivalry between MySQL and PostgreSQL? Or can it be considered a mutually beneficial opposition?
The hype has changed. The trend now is NewSQL rather than NoSQL, and different meanings are packed into this term; you can distinguish at least three. The first is the same NoSQL solutions adding support for their own SQL dialects. In this respect SQL is now the winning language: ugly, redundant, with an unbearable Talmud of a standard (to which, incidentally, an SQL/JSON section was recently added, and work on implementing it in Postgres is underway), but clearly the winning language nonetheless.
The second story is about traditional RDBMSs extended with support for data that violates the first normal form, especially JSON data, which ultimately leads us to a hybrid data model: we get relations with, for example, JSON documents inside. Postgres has long had arrays, hstore and XML, and now excellent JSON support with a bunch of functions and various indexing options. In this sense Postgres is also a NewSQL DBMS.
And finally, the third interpretation of "NewSQL": new RDBMSs oriented to the SQL language from the start. These are primarily the cloud Google Spanner, the open source CockroachDB, and ClickHouse. A lot of interesting things are being created; this is a great new challenge for Postgres. The hypes of previous decades (object DBMSs, then XML, then NoSQL) it went through very successfully, growing stronger along the way, but time will tell how things go further.
In general, it is not yet clear how the challenges of the NewSQL era will play out. There is a lot of work to do, and new solutions are already hitting Postgres's weak spots: for example, one of Citus's largest clients, CloudFlare, recently abandoned their solution (and hence Postgres) for its analytics in favor of ClickHouse. Unsurprisingly: at the moment, a columnar database with vectorized computation and built-in sharding turns out to be better at analytics tasks than Postgres, even a Postgres equipped with Citus sharding. If the community manages, in the foreseeable future, to add quality support for pluggable storage engines and to develop new engines (primarily a column store, as well as a row store of a new type: with "undo", replacing tuples "in place" and eliminating the need for frequent and expensive VACUUM operations; see EnterpriseDB's zheap development), and perhaps even to build a sharding solution into the Postgres core (there are already "side" solutions, for example Citus, but built-in sharding is so far only talk), then Postgres will be fine for the foreseeable future. Personally, I very much hope that in the next few years the Postgres community will successfully solve these tasks.
And with MySQL, yes, there is some kind of mutually beneficial... how to put it best... probably not a confrontation but already a kind of cooperation. Postgres conferences invite speakers from MySQL and Percona, and vice versa; joint benchmarks are made (see the joint talk by PostgresPro and Percona developers at HighLoad++ 2016), and so on. One of the most active participants of the Russian Postgres community is Alexey Kopytov, a man from the MySQL community and the author of sysbench, who actively advocates for MySQL and against Postgres, and for some reason posts very often in our Postgres Facebook group. I do not know what motivates him, but in the end he does a very good job: he does not let us forget about Postgres's shortcomings and pushes development forward.
- And the main question for today: when will the #RuPostgres meetup group become number one?

There was a personal, modest goal: to overtake the SF Bay Area group, where I now live. That goal has been achieved. The New York group is very active, and we are unlikely to catch up with it, but that does not bother me much. I do not plan to move there yet.
Given that both I and the co-host of our broadcasts, Ilya Kosmodemyansky, are mostly away from Moscow, we are now actively taking the group online. I think there is still a lot of growth potential here. Firstly, Russian-speaking Postgres people are scattered all over the world. Secondly, the means of running online meetups are becoming more accessible and more polished: for example, quite recently YouTube finally started saving the chat after a broadcast, which is very important for our online sessions. During broadcasts there are often interesting parallel discussions of the topics at hand, and now YouTube overlays all the chat messages on the recording in sync, so it becomes even more interesting to watch.
- Since you are responsible for the database talks at HighLoad++ Siberia in Novosibirsk, I cannot help asking what topics will be covered this year and which talks deserve attention first of all. What will be the specifics of the Siberian forum? Is it worth coming after RIT++?
Of course it is. HighLoad++ Siberia is a full-fledged, "adult" HighLoad; it should not be confused with last year's HighLoad++ Junior. A large conference is arriving in a new region to activate the exchange of experience among an even larger number of experts. The program is its own, unique, and is being formed now. We try to minimize the overlap of topics and to present new information.
Very interesting proposals have already been submitted. For example, a story about the adoption of ClickHouse at VK, and very interesting talks from specialists at Yandex, Odnoklassniki and Badoo: about databases, video streaming, compression algorithms.
Separately, I note that a whole block of proposals came from the enterprise sector. Local companies are ready to share their experience with Oracle, primarily in the financial sector. And it so happened that this time I supported most of these proposals, because, unlike in past experience (when people from the Russian office of Oracle came to us and could not even answer basic technical questions), new expectations have appeared and solid technical expertise is visible.
And of course there will be talks on Postgres. Andreas Scherbaum (Greenplum) and Alvaro Hernandez (OnGres), leaders of the German and Spanish Postgres communities, have promised to come.
I'm sure it will be interesting!