If you want to create something really cool, you need to dig deeper and understand how your code works in the system, on the hardware
Greetings, Habr! I often wonder how many programmers and developers have discovered data science or data engineering and are building a successful career in big data. Ilya Markin, software engineer at Directual, is one of the developers who made the switch to data engineering. He talked about his experience as a team lead, his favorite data engineering tool, conferences and interesting specialized channels for Java developers, about Directual from the user's side and the technical side, about computer games, and more.
- Ilya, thanks for taking the time to meet. Congratulations on your relatively recent move to a new company, and on the birth of your daughter; you have plenty of chores and worries right now. So, the first question: what were you so interested in doing at Directual that you left DCA?
- I should probably first explain what I did at DCA. I joined DCA (Data-Centric Alliance) after completing the "Specialist in Big Data" program. At that moment I was actively interested in big data and realized that this was the area I wanted to develop in. After all, where there is a lot of data, there are plenty of interesting engineering problems to solve. The program quickly immersed me in the big data ecosystem and gave me the initial knowledge I needed about Hadoop, YARN, the MapReduce paradigm, HBase, Spark, Flink, and much more, including how it all behaves under high load.
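The MapReduce paradigm mentioned here boils down to three steps: map each input to key-value pairs, shuffle pairs by key, and reduce each group. A toy single-machine sketch of the classic word-count example (just an illustration of the paradigm, not anything from the program or from DCA):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Toy single-machine illustration of MapReduce:
// "map" emits a word per occurrence, the grouping collector plays the role
// of "shuffle" (group by key), and Collectors.counting() is the "reduce".
public class WordCount {
    public static Map<String, Long> count(List<String> lines) {
        return lines.stream()
                // map: split each line into words
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                // shuffle + reduce: group identical words and count them
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(count(List.of("big data", "data engineering")));
    }
}
```

In Hadoop the same three steps run distributed across a cluster, with the shuffle moving data between machines; the logic per key stays exactly this simple.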
While I was taking the "Data Engineer" program, Artem Marinov and Vasya Safronov from Directual came to speak to us. Artem, by the way, had once interviewed me at DCA (networking pays off again), and now he invited me to talk. They needed a Scala developer, but they were ready to consider a Java developer who understands how the JVM works under the hood. That's how I ended up here.
- What did Directual offer you that was so interesting? What attracted you?
- Directual is an ambitious startup that delivers on everything it declares, that is, it does what it promises. I was pleased to become part of the team and take an active part in everything we build. It was also important to me that the company pays for itself by working with clients and does not live off investors' money.
I'll tell a little about the project, both from the user's side and from the back end.
Directual's slogan is "Let people create!". That is the main idea: to enable anyone without the knowledge and experience of writing code to program in our visual editor.
How it works: through the browser, a user of our platform can "stack cubes" (read: the functional nodes of a process), that is, assemble a scenario according to which incoming data will be processed. The data can be completely arbitrary. The processed output can take different forms, from a PDF report to notifications sent to several administrators. Simply put, any business process can be programmed in minutes without knowing how to write code. The company works in two directions: boxed solutions for corporate clients, and a cloud option for a wide range of users.
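The "cubes" described above can be thought of as composable processing nodes that a scenario chains together. A minimal sketch of that idea in Java (this is not Directual's actual engine; the `Scenario` class and the two example cubes are hypothetical):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// A scenario is an ordered list of nodes ("cubes"); each node takes a
// record and returns a transformed record. Chaining them reproduces the
// data flow the user assembles in the visual editor.
public class Scenario {
    private final List<UnaryOperator<Map<String, Object>>> nodes;

    public Scenario(List<UnaryOperator<Map<String, Object>>> nodes) {
        this.nodes = nodes;
    }

    public Map<String, Object> run(Map<String, Object> record) {
        Map<String, Object> current = record;
        for (UnaryOperator<Map<String, Object>> node : nodes) {
            current = node.apply(current); // each cube processes the record in turn
        }
        return current;
    }

    public static void main(String[] args) {
        // Two hypothetical cubes: normalize an email, then flag big orders.
        Scenario s = new Scenario(List.of(
                r -> { var out = new HashMap<>(r);
                       out.put("email", ((String) r.get("email")).toLowerCase());
                       return out; },
                r -> { var out = new HashMap<>(r);
                       out.put("bigOrder", ((Integer) r.get("amount")) > 100);
                       return out; }
        ));
        System.out.println(s.run(Map.<String, Object>of("email", "USER@X.COM", "amount", 250)));
    }
}
```

The point of the design is that each cube knows nothing about its neighbors, so the editor can rearrange, add, or remove nodes freely.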
To make it clearer how this works, I'll give a few examples.
Any online store has a number of functional stages ("cubes" in our case), from showing goods to the customer, to adding them to the cart, to delivering the order to the end user. Using the platform, we can collect and analyze data: purchase frequency, completion times, the user's path, and so on, which lets us interact more closely with customers (for example, by developing seasonal offers or individual discounts). However, this in no way means that our platform is a constructor for building online stores!
Directual copes equally well with automating logistics processes, with the HR operations of large companies, and with building any other technological solution, from a farm for growing greens to a smart home. On the platform, for example, you can create a Telegram bot in a few clicks; almost every employee who works on the core of the system has their own bot. One made a librarian's assistant, another a bot that helps you learn English words.
We partly "take away" the work of some programmers, because there is no longer any need to turn to them for help, prepare a technical spec, and verify the result. Now it is enough to know how your business should work and to understand the processes themselves; we do the rest.
- Listen, but software for a greens-growing farm, for example, has existed for a long time. How are you different?
- Yes, it's true, specialized solutions for greens-production farms exist. However, you don't develop that software yourself, you buy a ready-made solution. With our platform, you can tailor the software to yourself, your business, and your tasks, without needing to hire developers.
- And what exactly do you do?
- The company is divided into two parts: development of the core of our system, and the project office, which is, in effect, our customer zero, if you can put it that way. I develop the core of the system.
As I said, we want to let anyone work on our platform. For that, we are building our cloud, and there are many challenges there. What is the complexity? Say there are 1,000 users, each with several data-flow scenarios, and each flow has 10-20 cube branches. Imagine the load on the hardware. We need to be able to isolate everything cleanly, so that one client's processes do not interfere with or slow down another's. If one client has a problem we need to solve, fixing it should not hurt another client's work.
Since the user does not need to think about how all of this works under the hood, they are freed from choosing storage. We support different databases, both relational and NoSQL, and the system behaves the same way on top of any of them. The client does not need to think about this: when an account is created, the system helps make the optimal choice of storage depending on the tasks.
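Making relational and NoSQL backends interchangeable usually means hiding them behind a narrow interface. A minimal sketch of that pattern (the `RecordStore` interface and `InMemoryStore` stand-in are hypothetical, not Directual's API):

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Scenarios talk to a narrow storage interface, so a relational or NoSQL
// backend can be swapped in without the client code noticing.
interface RecordStore {
    void put(String key, Map<String, Object> record);
    Optional<Map<String, Object>> get(String key);
}

// Stand-in backend for the sketch; a real implementation would wrap
// a JDBC connection, an HBase table, etc.
class InMemoryStore implements RecordStore {
    private final Map<String, Map<String, Object>> data = new ConcurrentHashMap<>();

    @Override public void put(String key, Map<String, Object> record) {
        data.put(key, record);
    }

    @Override public Optional<Map<String, Object>> get(String key) {
        return Optional.ofNullable(data.get(key));
    }
}
```

The account-creation step described above then amounts to picking which `RecordStore` implementation to wire in for that client.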
Our platform is a good example of a highly loaded distributed system, and my task is to write good code so that it runs smoothly. As a result, I got what I wanted: I work with the tools that interest me.
- And how did you come to the field of data management?
- At my first job I mostly dealt with the same tasks in a rather narrow segment (read: I parsed XML :)), and it quickly stopped being rewarding. I started listening to podcasts and realized what a big world there was around me, with so many technologies everyone was talking about: Hadoop, Big Data, Kafka. I realized I had to study, so I took the "Specialist in Big Data" program. As it turned out, I wasn't wrong: the first module (processing and analysis of web logs: MapReduce, Hadoop, Machine Learning, DMP systems - author's note) was very useful to me and I wanted to study it, but the second module, on recommender systems, I simply didn't know where to apply and never touched. Then I went to work at DCA to do what interested me. There, a colleague told me that besides the data scientist there is also the data engineer in this field, and explained who that is and how useful they are to a company.
After that, when you announced the pilot launch of the "Data Engineer" program, of course I decided to go. Some of the products covered in the program I already knew, but for me it was a good overview of the tools; it structured everything in my head, and I finally understood what a data engineer should work with.
- But most companies do not separate these two positions, these two professional profiles; they try to find universal specialists who will collect the data and prepare it, build the model, and roll it out under highload. What do you think explains this, and how sensible is it?
- I really enjoyed Pavel Klemenkov's talk on the "Specialist in Big Data" program (back then he was still at Rambler & Co); he talked about the ML pipeline and mentioned programmer-mathematicians. He spoke precisely about such universal experts: they exist, there are few of them, and they are very expensive. That is why Rambler & Co tries to grow them in-house and look for strong people. Such specialists are hard to find.
I believe that if you really have a lot of data and need scrupulous work with it (and not just predicting a person's sex and age or nudging up a click probability, for example), then it should be two different people. The 20/80 rule applies here: a data scientist is 80% data science and 20% able to write something and get it to production, while a data engineer is 80% software engineer and 20% knowing what models exist, how to apply them, and how to compute them, without going deep into the math.
- Tell us about your most important discovery in data science and data engineering. Maybe some tool or algorithm radically changed your approach to solving problems?
- Probably that, given a sufficient amount of data, you can extract a lot of information useful for your future actions. Even if at first you don't know what this raw, anonymized data is, you can still do something with it: split it into groups, find distinctive features, or simply derive patterns from the numbers with mathematical methods. True, analysts could do this before as well, but the fact that growing hardware capacity has made it more accessible is cool! The barrier to entry into data science has dropped; you don't have to know much to try something out with a few tools.
- What was your biggest fail at work? What lesson did you learn from it?
- I'll probably disappoint you: I haven't had one yet; maybe it's still ahead of me. I honestly thought about it and tried to remember, but there was nothing of the sort, boringly enough. It's like with admins: if you haven't "dropped prod" or "lost the database," you're not a real admin. Well, I guess I'm not a real developer then.
- What data engineering tools do you use most often, and why? What's your favorite tool?
- I really like Apache Kafka. It's a cool tool both in terms of the functionality it provides and in terms of its engineering. A distinctive feature of Kafka is the close relationship between its code and the operating system it runs on, Linux (read: "it works fast and well"). It uses various native Linux facilities, which deliver excellent performance even on weak hardware. I believe that in our field it should be this way: it is not enough to just know a programming language and a couple of frameworks for it. If you want to create something really cool, something that not only you but others will enjoy using, you need to dig deeper and understand how your code works in the system, on the hardware.
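One concrete example of this OS closeness: Kafka serves log segments to consumers through the kernel's sendfile(2) mechanism (zero-copy), so bytes move from the page cache to the socket without ever entering user space. In Java that syscall is reached via `FileChannel.transferTo`. A minimal sketch, not Kafka's actual code:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// transferTo lets the kernel copy file bytes straight to the target
// channel (on Linux via sendfile), avoiding the usual
// read-into-user-buffer / write-back-out round trip.
public class ZeroCopy {
    public static long send(Path file, WritableByteChannel target) throws IOException {
        try (FileChannel src = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = src.size();
            long sent = 0;
            // transferTo may move fewer bytes than requested, so loop until done
            while (sent < size) {
                sent += src.transferTo(sent, size - sent, target);
            }
            return sent;
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("segment", ".log");
        Files.writeString(tmp, "hello, zero copy");
        // A real broker would pass a socket channel here; a file channel
        // behaves the same for demonstration purposes.
        Path out = Files.createTempFile("out", ".log");
        try (FileChannel dst = FileChannel.open(out, StandardOpenOption.WRITE)) {
            System.out.println(send(tmp, dst) + " bytes transferred");
        }
    }
}
```

This is the kind of detail you only notice if you look at how the code meets the operating system, which is exactly the point being made.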
- What conferences do you go to? What specialized blogs and channels do you read?
- As I said, it all started with podcasts, namely with "Debriefing," from the guys in the Java world.
There is also https://radio-t.com, a cool Russian-language podcast on high-tech and IT topics, one of the most popular (if I'm not mistaken) in our language.
I follow the news from JUG.ru; the guys put on cool hardcore conferences and arrange meetups. I try to go to the ones in Moscow; they hold them in St. Petersburg too. The top Java conference is JPoint in Moscow (its St. Petersburg counterpart is Joker); I always go to JPoint or watch online.
I watch what Confluent is doing, the company that provides commercial support for Kafka and whose people are its main committers. They also develop convenient open-source tools around Apache Kafka; I try to use their releases.
The Netflix technology blog on Medium is a cool resource about the solutions of one of the largest platforms delivering video content to users. Highload and distributed systems galore :)
Channels in Telegram: https://t.me/hadoopusers, a place where you can discuss data engineering topics in our language; and https://t.me/jvmchat, where Java people from around the world discuss Java, its problems, and more.
- Maybe something else, for the soul?
- I grew up on computer games; I used to play very actively, but now there isn't much time for it. At some point I thought: "Since I can't play games, what prevents me from studying this area?" When free time turns up, I take some framework in Java, C#, or C++ that games can be written in, and build something. It rarely gets to a finished product, but I enjoy it. So my podcast list also includes one about making games, "How Games Are Made," a good professional podcast not about how to "code your super-mega-top game," but about the production process: how a sound engineer works, what a game designer does, the work of 2D/3D artists, their processes and tools, how a game is developed and how it is promoted. This spring I attended a gaming conference for the first time, and it was very cool: not that I felt out of place, but it turned out to be a completely different world, and I liked it. I was glad to learn that the gaming world is also actively interested in big data. In conversations on those topics I felt very confident.
- Java or Python?
- Java, of course.
- Data Science or Data Engineering?
- Data Engineering.
- Individual contributor or manager?
- It depends, but so far, rather, an individual contributor.
- Family or career?
- Some time ago there was a career, and now the family.
- Cooking at home or going to a restaurant?
- I like to eat delicious food. I think I'm good at cooking, but it happens rarely. So, probably, go to a restaurant.