The developers remained unknown. A Yandex lecture
This report by Alexey Milovidov, head of the ClickHouse development team, is an overview of little-known databases. Some of them are obsolete; others have stopped developing and been abandoned. Alexey highlights interesting architectural decisions in these examples, examines their fate, and explains what requirements your own open-source project should meet.
— My report will be about databases. Let me ask you right away: which city's map is shown on this slide? All the lines go in one direction, toward Codd.
I will not talk about those. What could be more boring than talking about MySQL, PostgreSQL, or something like that? Instead, I will talk about handcrafted databases.
Hand-assembled systems that almost nobody knows about. Each was either designed by a single person or long abandoned.
The first example is EventQL. Please raise your hand if you have ever heard of this system. No one, except those who work at Yandex and have already heard my report. So it was not for nothing that I included this system in my overview.
It is a distributed column-oriented database engine designed for events and analytics. It executes SQL queries very quickly; it has been open source since July 2? 201?; it is written in C++; ZooKeeper is used for coordination, and there are no other dependencies. This reminds me of something. Our wonderful system, everyone already knows the name. EventQL is roughly like ClickHouse, only better. Distributed, massively parallel, column-oriented, scales to petabytes, fast range queries: all clear, we have all of that. Almost full support for SQL 200?, realtime inserts and updates, automatic distribution of data across the cluster, and even a ChartSQL language for describing charts. So cool! This is everything we promise and do not have.
Nevertheless, the last commit was almost a year ago, and the site no longer loads; you have to view it through web.archive.org.
Ask on GitHub what the development plans are, what happens next? No one answers.
The system has two developers: one for the backend, one for the frontend. I will not say which is which; maybe you can guess. It was made at a company called DeepCortex. The name sounds familiar, but there are many companies with the words Deep and Cortex in them. DeepCortex is an obscure company from Berlin. The system has been developed since 201?: it was developed internally for a long time, then released as open source and abandoned a year later.
It is as if it was thrown into the air in the hope that someone would notice it, or that it would fly off somewhere on its own. Unfortunately, no.
Another disadvantage is the AGPL license, which is relatively inconvenient. Even if it poses no serious restrictions on use at your company, people are often wary of it, and the legal department may object.
I started looking into what had happened and why development stopped. I looked at the developer's account: everything is fine in principle, the person is alive and keeps committing, though all the commits go to a private repository. It is unclear what happened.
Maybe the person moved to another company and lost interest in maintenance, maybe the company's priorities changed, or maybe it was personal circumstances. Perhaps the company itself was not doing badly, and the open-source release was made just in case. Or they simply got tired. I do not know the exact answer. If someone knows, please tell me.
But none of it was in vain. First of all, there is ChartSQL for the declarative description of charts; something similar is now used in Tabix, a data visualization system for ClickHouse. There is an EventQL blog; it is currently unavailable too, so you have to view it through web.archive.org, where the posts survive as .txt files. The system is implemented very competently, and if you are interested, you can read the code and find interesting architectural decisions.
Enough about that one. The next system beats everything else I will cover, because it has the best, most delicious name: the Alenka system.
I wanted to add a photo of the chocolate wrapper, but I am afraid there would be copyright problems. So what is Alenka?
It is an analytical DBMS that executes queries on graphics accelerators. Open source, Apache ? license, 1103 stars, written in CUDA with a little C++, by a single developer from Minsk. There is even a JDBC driver. It has been open source since 2012. However, since 2016 the system is for some reason no longer being developed.
It is a personal project, not the property of any company: truly a one-person project. It is a research prototype for exploring how quickly data can be processed on a GPU. There are interesting benchmarks by Mark Litwintschik; if you are interested, take a look at his blog. Many have probably already seen his benchmarks there showing that ClickHouse is faster than everything.
I have no answer as to why the system was abandoned, only guesses. The author now works at NVIDIA; probably just a coincidence.
This is a great example because it stirs interest and broadens horizons: you can look and understand how such a system can work on the GPU.
But if this topic interests you, there are plenty of other options. For example, the MapD system.
Who has heard of MapD? No one? That is a shame. It is a bold startup, also developing a GPU database. It was recently released as open source under the Apache 2 license. I do not know what that means, good or bad: whether the startup is so successful that it can afford to open-source its code, or, on the contrary, it will close soon.
There is PG-Strom. If you are into PostgreSQL, you should have heard of PG-Strom. Also open source, developed by one person. Among closed systems there are BrytlytDB, Kinetica, and the Russian company Polymatica, which makes a Business Intelligence system: analytics, visualization, and all that. It can also use graphics accelerators for data processing; it may be interesting to take a look.
Can you do something even cooler than a GPU? For example, there was a system that processed data on an FPGA: the company Kickfire. It shipped its solution as hardware bundled with software. True, the company closed long ago; the solution was quite expensive and could not compete with other such offerings, where a vendor delivers you a cabinet and everything magically works.
Next, there are processors with instructions for accelerating SQL: SQL in Silicon in the newer SPARC processor models. But do not think you can write a join in assembler; there is no such thing. There are simple instructions that perform decompression with some simple algorithms and a bit of filtering. In principle, they can accelerate more than just SQL. For example, Intel processors have the SSE 4.2 instruction set for string processing. When it appeared, somewhere around 200?, the Intel site had an article titled "Using new Intel processor instructions to speed up XML processing". It is the same story here: instructions useful for speeding up a database can be used for other things as well.
Another very interesting option is offloading part of the data filtering to the SSD. SSDs have become quite powerful; an SSD is a small computer with its own controller, and in principle you can load your own code onto it if you try hard enough. You read data from the SSD, but it is filtered on the device, and only the needed data is passed to your program. Very cool, but it is still at the research stage. There is a paper about it at VLDB; go read it.
Next up is ViyaDB.
It was open-sourced just a month ago. "An analytical database for unsorted data." Why "unsorted" made it into the tagline is unclear; why place such an emphasis? As if other databases can only work with sorted data?
Everything looks fine: source code on GitHub, Apache 2.0 license, written in the most modern C++. There is only one developer, but never mind.
Links from the slide: to the site and on Habr
What I liked most, and what you can take as an example, is the excellent preparation for launch. That is exactly why I am surprised that no one has heard of it. There is a wonderful site, there is documentation, there is an article on Habr, an article on Medium, LinkedIn, Hacker News. And what? Was it all in vain? You have not seen any of it. People say Habr is not what it used to be. Well, maybe, but it is still a great thing.
What is this system?
The data lives in RAM, and the system works with aggregated data: constant preaggregation is performed. It is a system for analytical queries. There is some initial SQL support, but it is only beginning to be developed; initially you had to write queries in a kind of JSON. Among the interesting features: you give it a query, and it writes C++ code for that query itself; the code is generated, compiled, dynamically loaded, and run over your data. The idea is that the query is processed as optimally as possible, with ideally specialized C++ code written for your exact query. There is scaling, and Consul is used for coordination. That is also a plus: as you know, it is cooler than ZooKeeper. Or not. I am not sure, but it seems so.
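The generate-compile-load cycle described above can be sketched in miniature. ViyaDB emits and compiles real C++; as a language-neutral illustration (with invented names and a toy query format), the same idea in Python looks like this:

```python
# Sketch of per-query code generation: the engine turns a query description
# into specialized source code, compiles it once, and runs the compiled
# function over the data. ViyaDB emits C++ for this step; Python stands in here.

def generate_filter(query):
    # Build source for a predicate specialized to this exact query.
    conditions = " and ".join(
        f"row[{col!r}] {op} {value!r}" for col, op, value in query
    )
    source = f"def specialized(row):\n    return {conditions}\n"
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["specialized"]

rows = [
    {"country": "DE", "clicks": 12},
    {"country": "US", "clicks": 7},
    {"country": "DE", "clicks": 3},
]
# Query: country == 'DE' AND clicks > 5
f = generate_filter([("country", "==", "DE"), ("clicks", ">", 5)])
print([r for r in rows if f(r)])  # → [{'country': 'DE', 'clicks': 12}]
```

The payoff is the same as in the real system: no per-row interpretation of the query tree, because the query has already been baked into code.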
Some of the premises this system is based on are somewhat contradictory. I am a big enthusiast of all kinds of technologies, and I do not want to criticize anyone. This is just my opinion, and maybe I am wrong.
The premise is that new data must be constantly written into the system, including backdated data: an event from an hour ago, a day ago, a week ago. And at the same time, analytical queries must run on this data immediately.
The author claims that for this the system must necessarily be in-memory. This is not true. If you want to know why, read the article "Evolution of data structures in Yandex.Metrica". One person in the hall has read it.
It is not necessary to keep the data in RAM. I will not say what to do and which system to install if you are interested in solving this problem.
What good can you take away? An interesting architectural decision is C++ code generation. If the topic interests you, look at the research project DBToaster, a research development from EPFL, available on GitHub under Apache 2.0. The code is in Scala: you give it a SQL query, and it generates C++ source code for you that reads data from somewhere and processes it in the most optimal way. Probably. I am not sure.
This is just one approach to code generation for query processing. There is an even more popular approach: code generation with LLVM. The point is that your program dynamically writes code, almost in assembler. Well, not really: in LLVM. MemSQL is an example. It is originally an OLTP database, but it is also good for analytics. Closed, proprietary; initially it used C++ for code generation, then switched to LLVM. Why? If you write C++ code, you have to compile it, and that takes five precious seconds. That is fine if your queries are more or less the same: you can generate the code once. But in analytics you have ad hoc queries, and it is possible that each one is not just different but has a different structure. With LLVM code generation it takes milliseconds or tens of milliseconds, depending; sometimes more.
Another example is Impala, which also uses LLVM. As for ClickHouse, it has code generation too, but it mainly relies on vectorized query processing: an interpreter that operates on arrays, which is why it is very fast, like kdb+ and similar systems.
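The difference between a row-at-a-time interpreter and a vectorized one can be shown with a toy plan executor. The names and plan format here are invented, and real engines run tight native, SIMD-friendly loops over columns rather than Python comprehensions:

```python
# A toy vectorized interpreter: the plan is a short list of whole-column
# operations, so interpreter dispatch happens once per operation rather than
# once per row, while the inner loops stay tight.

def op_greater(column, constant):
    # One "instruction" applied to the entire column at once.
    return [v > constant for v in column]

def op_filter(column, mask):
    # Keep values where the boolean mask is True.
    return [v for v, keep in zip(column, mask) if keep]

def run_plan(plan, columns):
    # Each plan step consumes and produces whole columns by name.
    state = dict(columns)
    for name, func, args in plan:
        state[name] = func(*(state.get(a, a) for a in args))
    return state

columns = {"clicks": [3, 8, 1, 9, 4]}
plan = [
    ("mask", op_greater, ("clicks", 4)),     # clicks > 4 over the whole column
    ("result", op_filter, ("clicks", "mask")),
]
print(run_plan(plan, columns)["result"])  # → [8, 9]
```

A row-at-a-time interpreter would walk the query tree for every single row; here the interpretation overhead is amortized across entire arrays, which is the essence of the ClickHouse/kdb+ style.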
Another interesting example, with the best logo in my overview.
"The first and only open-source relational database designed specifically for data warehousing and business intelligence." Available on GitHub under the Apache 2 license; it used to be GPL, but the license was changed, and rightly so. It is written in Java. The last commit was six years ago. Initially the system was developed by the non-profit organization Eigenbase, whose goal was to build a framework, a maximally extensible code base for databases, and not only OLTP ones: for example, one project for analytics, LucidDB itself, and another, StreamBase, for processing streaming data.
What was there six years ago? Good architecture, a well-extensible code base, more than one developer, excellent documentation. Nothing loads now, but you can look through the Web Archive. Excellent SQL support.
But something went wrong. The idea was good, but it was done by a non-profit organization living on donations, plus a couple of startups. For some reason it all fell apart: financing could not be found, there were no enthusiasts, and all those startups closed long ago.
But not everything is so simple. All this was not in vain.
There is a framework called Apache Calcite. It is like a frontend for a SQL database: it can parse queries, analyze them, perform all kinds of optimizing transformations, build a query execution plan, and provide a ready-made JDBC driver.
Imagine that you woke up one day in a good mood and decided to develop your own relational database. It happens. Now you can take Apache Calcite, and all you have left to add is data storage, data reading, query processing, replication, fault tolerance, sharding: everything is simple. Apache Calcite is based on the LucidDB code base, which was so advanced that the entire frontend was taken from it; in adapted form it is now used in almost all the Apache products: Hive, Drill, Samza, Storm, and even in MapD, which, despite being written in C++, somehow hooked this Java code in.
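To make that division of labor concrete, here is a deliberately tiny sketch, not the real Calcite API (Calcite is a Java framework and does far more, including cost-based optimization and JDBC): a reusable frontend that parses and plans a miniature SQL dialect, on top of a backend that only knows how to scan a table:

```python
import re

# Frontend: turn a tiny SELECT into a logical plan. In Calcite this role is
# played by the parser, validator, and planner; the dialect here is invented.
def parse(sql):
    m = re.fullmatch(r"SELECT (\w+) FROM (\w+) WHERE (\w+) > (\d+)", sql)
    col, table, fcol, n = m.groups()
    return {"project": col, "scan": table, "filter": (fcol, int(n))}

# Executor: the pluggable backend only implements scan(); filtering and
# projection are supplied by the shared frontend machinery.
def execute(plan, backend):
    rows = backend[plan["scan"]]()
    fcol, n = plan["filter"]
    return [row[plan["project"]] for row in rows if row[fcol] > n]

backend = {"events": lambda: [{"id": 1, "clicks": 10}, {"id": 2, "clicks": 2}]}
plan = parse("SELECT id FROM events WHERE clicks > 5")
print(execute(plan, backend))  # → [1]
```

This is why so many systems adopted Calcite: the expensive, generic part (SQL in, plan out) is shared, and each engine only writes the part that is actually unique to it.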
All these interesting systems use Apache Calcite.
The next system is InfiniDB. These names make your head spin.
There was a company called Calpont, and InfiniDB was originally its proprietary system; it went as far as sales managers contacting our company and trying to sell it to us. It was interesting to take part in that. They said: an analytical DBMS, wonderful, faster than Hadoop, column-oriented, so naturally all queries will run quickly. But at the time they had no cluster mode; the system did not scale. I said: no cluster means we cannot buy it. Fine; six months later InfiniDB 4.0 came out with Hadoop integration and scaling, everything fine.
Another six months passed, and the source code was released as open source. I thought then: why am I sitting here working on something of my own? I should have just taken it; there it was, ready.
We started looking at how to adapt and use it. A year later the company went bankrupt. But the source code remains available.
This is called posthumous open source. And it is a good thing: if a company starts doing badly, at least some legacy should remain so that others can use it.
None of this was in vain. Based on the InfiniDB sources, MariaDB now has a table engine called ColumnStore; in essence, it is InfiniDB. The company is gone, the people now work elsewhere, but the legacy remains, and that is wonderful. Everyone knows MariaDB. If you use it and need to bolt a fast analytical column-oriented engine onto it, you can take ColumnStore. Between us, it is not the best solution. If you need the best solution, you know whom to go to and what to use.
Another system with the word Infini in its name. It has a strange logo: the line seems to bend downward. And an odd font with no antialiasing for some reason, as if drawn in Paint. And all the letters are capitals, probably to frighten the competition.
Links from the slide: to the site, on GitHub
I am an enthusiast of all kinds of technologies and deeply respect all interesting solutions. I am not mocking anyone; do not think that.
What was this system? It is no longer analytical; it is OLTP: a system for processing transactions at extreme scale. There is a site, and one merit of this system is that the site actually loads; with all the others I have grown used to finding parked domains or worse. The sources are available. The license is now GPL; it used to be AGPL, but fortunately the author quickly changed it. It is written in C++, by more than one developer. It was open-sourced in November 201?, and by January 2014 it was already abandoned. A month and a half. Why? What was the point? Why do this at all?
An OLTP database with initial SQL support, a personal project with no company behind it. The author himself said on Hacker News that he open-sourced it in the hope of attracting enthusiasts who would work on the product.
That hope is almost always doomed. You have an idea, good for you, you are an enthusiast. That means you are also the one who has to carry the idea out. Hardly anyone else will be inspired by it, or you will have to try very hard to inspire someone. So it is hard to hope that out of nowhere, on the other side of the world, a person will appear and start contributing to someone else's code on GitHub.
Secondly, perhaps it was simply an underestimation of the complexity. Developing a DBMS is not a 20-minute adventure. It is complicated, long, and expensive.
The next case is very interesting, and many have heard of it: RethinkDB. This example is not analytical and not OLTP, but document-oriented.
This system changed its concept many times; it kept being rethought. For example, in 2011 the official site said it was an engine for MySQL that is a hundred times faster on SSDs. Then it was a system speaking the memcached protocol, also optimized for SSDs. After a while it became a database for real-time applications: you subscribe to data and receive updates directly in real time; think interactive chats and online games. They were searching for a niche. It is a document-oriented system with a JSON data model, so it is often compared to MongoDB, although that comparison is unfair. And what do well-behaved developers think about MongoDB? "MongoDB must die." Those are not my words; I wish no one ill; that is what Oleg from Postgres Professional said.
And in general, what do such developers think? That Mongo does everything wrong: they could not properly implement a consensus protocol, and the system does not even cope very well with the task of saving data. It seems newer versions are better in this respect; earlier ones were not.
So what is RethinkDB? Replication is done correctly: they implemented the Raft consensus protocol. The query language is wonderful because it is embedded in the client libraries, and writing queries is actually fun: not dumb JSON, but something like LINQ, or even more convenient. The query language is called ReQL. The system is written in C++, which is not surprising: Mongo is in C++ too.
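The flavor of such an embedded query language, a chain of host-language methods that assembles a query tree to send to the server, can be sketched like this (the names are invented, not the actual ReQL API):

```python
# Sketch of a ReQL-style chainable query builder: each method returns a new
# Query wrapping a term tree, and the finished tree would be serialized and
# sent to the server for execution.

class Query:
    def __init__(self, term):
        self.term = term

    def filter(self, **conditions):
        # Narrow the result set; the condition becomes part of the tree.
        return Query(["filter", self.term, conditions])

    def pluck(self, *fields):
        # Project out only the named fields.
        return Query(["pluck", self.term, list(fields)])

def table(name):
    return Query(["table", name])

q = table("users").filter(age=30).pluck("name")
print(q.term)
# → ['pluck', ['filter', ['table', 'users'], {'age': 30}], ['name']]
```

The appeal is that queries are ordinary expressions in your language: the IDE completes them, and composing subqueries is just passing values around, much like LINQ.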
Links from the slide: to the site, on GitHub
But that is not all. A really cool site. The system was developed for a long time by far more than one developer; the documentation is simply superb; and most importantly, there is community support. It is so good that this is the best example to learn from. 20,938 stars on GitHub: that is already somewhere beyond the clouds.
The system is still being developed, but if you look at the commit graph, you can see there was a period of active development, which has gradually died down. Why? What went wrong?
It is a startup: it received investment in 2009, and then came the search for a niche, for the best way to position the product. Unfortunately, the startup did not take off, and in 2016 the company closed. The developers went to work at another company, and that would seem to be the end, but no. And this is very good, because the developers had managed to build a wonderful community around their system, and thanks to donations it was possible to buy back the rights to the RethinkDB product, its name, its logo, and everything else, and transfer them to The Linux Foundation. At the same time the license was changed from AGPL to Apache ?, which is much less restrictive. Now the product is completely free: whoever wants to commit, commits.
Development continues, new releases come out, and if you are interested, I recommend taking a look at what it is. The system is wonderful; I really think so.
If you are wondering why the startup collapsed and why active development ceased, there is a story about the mistakes written by the company's founder. It is very valuable, because I am not retelling fragmentary information: it really describes what mistakes were made, in the founder's own opinion.
Let us move on. Sometimes it is not individual database management systems that stop being relevant, but entire directions. For example, the direction of XML databases was popular about 15 years ago.
If you open a news site from that time, the mid-2000s, you can find funny quotes. For example: "In the future, XML technology will play an increasingly important role in processing and storing data," says a top manager of a company that began developing these wonderful databases. That future has already come, and then passed by.
Links from the slide: to the site, on GitHub
Consider one example: the Sedna database management system. It is a native XML database developed at the Institute for System Programming of the Russian Academy of Sciences. Do not think that professors with punch cards walk its dark corridors. For example, one of the main developers of this system now works at Yandex. Of course, he no longer develops Sedna; that is all long forgotten; now he is making a super DBMS where everything is done right and even better. I hope he will tell you about it himself someday.
The last commit was in 201?, and the system is deliberately not being developed, because it is no longer relevant. XML databases used to be popular; now they are not, and nobody needs them.
I could not help adding a separate section to my report: DBMSs by Konstantin Knizhnik.
Konstantin Knizhnik is a person who should be entered in the Guinness Book of Records as the individual who single-handedly developed the largest number of databases. He has a personal site, garret.ru; judge for yourself what interesting DBMSs are there. I am fully confident that all these systems work reliably. I visited his personal site: everything is beautifully described, documentation, architecture. And his personal address and telephone number are there.
He continues to regularly write new DBMSs and engines. 2014: IMCS, an extension for PostgreSQL designed for storing and processing time series. It plugs into PostgreSQL; it is not that deeply integrated into SQL and is exposed as table functions, partly with a language of its own, but you can write something select-like, create a time series, and so on. Judging by the use case, it is for analyzing stock market data. I am sure it was designed not in the abstract but for real tasks and real customers. And it is very cool when one person can make a specialized system: he wrote it, it is done, and it works better than anything else, built exactly for what was ordered.
Why does it happen that some open-source products are abandoned? The reasons can be classified.
First and foremost: who does it all belong to, who made it? The first case is the simplest: the project is a personal one. Everything is simple: the author got tired, circumstances changed, interest was lost. And in the end, how much time can you personally spend on your own project? Or it is simply an underestimation of the effort.
With a startup it is even simpler. Funding was received, the startup does not grow, the company closes. They could not choose a niche in which to position the product and could not raise the next round of investment.
The most frequent case is a side product of a company. The company's core business is not this product at all; it does something else, and as a bonus: here, we have open-sourced it, do something with it.
There may have been several developers inside the company, or even just one, and he left.
Another case: the company simply thinks, why waste resources? Our developers sit and saw away at this code; better let them do something else.
Or, for example, code is open-sourced in connection with the company's bankruptcy. This case is positive: if the company no longer exists but the code does, you can use it; it is useful.
Another case is when something is open-sourced by some misunderstanding. It is not very comfortable for me to talk about this in St. Petersburg, but an example is KPHP.
The next case is an institute's development: they did the research, defended the thesis, and there is no need to develop further; the research is complete.
What does it take for an open-source project to be truly alive, so that everyone is happy: the developers, the company, and the people who use it? First of all, scaling the development. It must be developed by more than one person. I try very hard not to write the code myself. How do you scale? The same way as a database: sharding and replication.
Clear positioning. The system must solve a task that people really need solved. Better still, a task that no one else can solve, or that no one else can solve as well.
Focus on a specific niche. For example, if you go to a startup's site and it says: we have a multi-model DB, graph, analytical, OLTP, strictly consistent, document-oriented and, on top of all that, navigational and post-relational, maybe it is better to close that site. Such a system probably fails to solve even some of these tasks well enough. If you have a use case, it is most likely a specific task that must be solved well, and for that you need a system that solves exactly that task well.
Reliable support from the parent company: no comment needed. A non-restrictive license, so that legal departments at other companies are not scared off; those people are afraid of everything. The advantages of your system should rest on fundamental reasons. For example, a database for processing XML is somehow not great: maybe no one needs to store XML anymore. A document-oriented database is another matter: everyone needs to store documents of one kind or another. Finally, supporting community development is very important for good open source. That means not just merging pull requests; it means people should feel that you exist, that you answer questions, that the product is developing. That is what makes open source good and alive. That is all, thank you.