Understanding Nirvana, Yandex's universal computing platform

Machine learning has become a fashionable term, but for those working with large volumes of data it has been a vital necessity for many years. Yandex handles over 200 million search queries daily! There was a time when the Internet had so few sites that the best of them fit into a catalog; today, the relevance of pages in search results is determined by complex formulas that are continually retrained on new data. This work falls to so-called pipelines: regular processes that train and monitor these formulas.
Today we want to share with the Habr community our experience building the Nirvana computing platform, which, among other things, is used for machine learning tasks.
Nirvana is a non-specialized cloud platform for managing computing processes: applications run in the order the user specifies. Nirvana stores the process descriptions, the links between process blocks, and the data associated with them. Processes take the form of acyclic graphs.
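To make the idea concrete, such a process can be modeled as a directed acyclic graph in which blocks (operations) are connected by data dependencies, and a block becomes runnable once all of its inputs are ready. A minimal sketch, assuming nothing about Nirvana's real internals (the class and block names are our own):

```python
from collections import defaultdict, deque

class Graph:
    """A toy DAG of named operations connected by data dependencies."""

    def __init__(self):
        self.deps = defaultdict(set)       # block -> blocks it depends on
        self.consumers = defaultdict(set)  # block -> blocks that consume its output

    def connect(self, producer, consumer):
        self.deps[consumer].add(producer)
        self.consumers[producer].add(consumer)

    def execution_order(self):
        """Kahn's algorithm: a block runs once all its inputs are ready."""
        pending = {b: len(d) for b, d in self.deps.items()}
        ready = deque(b for b in self.consumers if b not in self.deps)
        order = []
        while ready:
            block = ready.popleft()
            order.append(block)
            for nxt in self.consumers[block]:
                pending[nxt] -= 1
                if pending[nxt] == 0:
                    ready.append(nxt)
        return order

g = Graph()
g.connect("fetch_data", "train_model")
g.connect("fetch_data", "build_features")
g.connect("build_features", "train_model")
g.connect("train_model", "evaluate")
print(g.execution_order())  # fetch_data runs first, evaluate last
```

A real platform would run independent ready blocks in parallel rather than in a single sequence, but the dependency bookkeeping is the same.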
Developers, analysts, and managers of various Yandex departments use Nirvana to solve computational problems, because not everything can be computed on a laptop (and for other reasons too, which we will get to at the end of the article when we turn to examples of how Nirvana is used).
We will describe the problems we ran into with the previous solution, walk through Nirvana's key components, and explain why we chose this name for the platform. Then we will look at a screenshot and move on to the tasks the platform is useful for.


How Nirvana appeared

Training ranking formulas is a constant, large-scale task. Yandex now works with two technologies, CatBoost and Matrixnet; in both cases, building ranking models requires considerable computational resources, and a clear interface.
The FML (Friendly Machine Learning) service was, in its time, a big step toward automation and simplification: it put machine learning work on an assembly line. FML provided easy access to tools for configuring training parameters, analyzing results, and managing hardware resources for distributed runs on a cluster.
But since users received FML as a ready-made tool, any interface improvements and new features fell on the team's shoulders. At first this seemed convenient: we added only the necessary features to FML, followed a release cycle, immersed ourselves in the users' subject areas, and built a genuinely friendly service.
Along with these advantages, however, came poor development scalability. The stream of requests to refine and improve FML exceeded all our expectations; to handle everything quickly, we would have had to expand the team without limit.
FML was created as an internal service for Search, but developers from other departments whose work also involved Matrixnet and machine learning quickly learned about it. It turned out that FML's capabilities extended far beyond search problems, and demand far outstripped our resources. We were at a dead end: how do you grow a popular service if it demands proportional growth of the team?
We found our answer in an open architecture. In developing Nirvana, we deliberately avoided tying it to any subject area. Hence the name: the platform is indifferent to the tasks you bring to it, just as a development environment is indifferent to what your program does, and a graphics editor does not care which image you happen to be editing.
What does matter to Nirvana? Executing an arbitrary process accurately and quickly. The process is configured as a graph whose vertices are blocks with operations, and the connections between blocks are built along the data.
Since Nirvana appeared in the company, developers, analysts, and managers of various Yandex departments have taken an interest in it, and not only those involved in machine learning (other examples are at the end of the article). In a week, Nirvana executes millions of blocks with operations. Some are started from scratch; others are served from the cache. If a process is put into production and its graph restarts frequently, it is likely that some deterministic blocks do not need to be re-run, and the result such a block already produced in another graph can be reused.
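Reuse of this kind can be implemented by keying the cache on everything that determines a deterministic block's output: the operation's identity and version, its parameters, and its inputs. A hypothetical sketch, not Nirvana's actual mechanism:

```python
import hashlib
import json

_cache = {}  # cache key -> previously computed result

def cache_key(operation_id, version, params, input_ids):
    """Hash everything that determines a deterministic block's output."""
    payload = json.dumps(
        {"op": operation_id, "ver": version,
         "params": params, "inputs": sorted(input_ids)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

def run_block(operation_id, version, params, input_ids, compute):
    """Return a cached result if this exact block ran before, else compute it."""
    key = cache_key(operation_id, version, params, input_ids)
    if key not in _cache:
        _cache[key] = compute()
    return _cache[key]
```

The important property is that changing any input, parameter, or the operation's version yields a different key, so stale results are never reused.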
Nirvana not only made machine learning more accessible; it became a meeting place. A manager creates a project and calls in a developer; the developer assembles the process and launches it; after many launches the manager reviews the results, and an analyst comes to interpret them. Nirvana lets you reuse operations (or entire graphs!) created and maintained by other users, so no one has to do the same work twice. Graphs vary widely, from a few blocks to several thousand operations and data objects. They can be assembled in a graphical interface (a screenshot appears at the end of the article) or via API services.

How Nirvana is structured

Nirvana has three big sections: Projects (large business tasks, or groups of people working on common tasks), Operations (a library of ready-made components, plus the ability to create new ones), and Data (a library of all objects uploaded to Nirvana, plus the ability to upload new ones).
Users assemble graphs in the Editor. You can clone someone else's successful process and edit it, or build your own from scratch by dragging operation or data blocks onto the canvas and connecting them with links (in Nirvana, connections between blocks follow the data).
First, let's talk about the system's architecture; we suspect that among our readers are backend colleagues curious to see our kitchen. This is also what we usually discuss in interviews, so that candidates are prepared for how Nirvana is built.
Then we will move on to a screenshot of the interface and real-life examples.
Users usually start with Nirvana's graphical interface (a single-page application); over time, many move their recurring processes to the API services. In general, Nirvana does not care which interface is used; graphs are launched the same way. But the more production processes move to Nirvana, the more noticeable it becomes that most graphs are launched via the API. The UI remains for experiments, initial configuration, and changes as needed.
On the backend side is Data Management: the model and storage of information about graphs, operations, and results, plus a layer of services that supports the frontend and the API.
A level lower sits the Workflow Processor, another important component. It drives the execution of graphs while knowing nothing about the operations they consist of: it initializes blocks, works with the operation cache, and tracks dependencies. Executing the operations themselves is not among the Workflow Processor's tasks; that is done by separate external components we call processors.
Processors bring domain-specific functionality into Nirvana; they are developed by the users themselves (although we maintain the base processors). Processors have access to our distributed storage, from which they read input data for operations and to which they write results.
From Nirvana's point of view, a processor is an external service implementing a given API, which is why you can write your own processor without changing Nirvana or any existing processor. There are three main methods: start a task, stop it, and get its status. Once all of an operation's incoming dependencies in the graph are ready, Nirvana (more precisely, the Workflow Processor) sends a launch request to the processor specified in the task, passing the configuration and links to the input data. It then periodically polls the execution status and, once the task is done, moves on along the dependencies.
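The three-method contract described above can be expressed as an interface. This is a hypothetical sketch of what such a contract might look like; Nirvana's actual processor API is internal, so all names here are our own:

```python
from abc import ABC, abstractmethod
from enum import Enum

class TaskStatus(Enum):
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"

class Processor(ABC):
    """Contract an external processor implements for the platform."""

    @abstractmethod
    def start(self, task_id, config, input_urls):
        """Begin executing the task with its configuration and input links."""

    @abstractmethod
    def stop(self, task_id):
        """Cancel a running task."""

    @abstractmethod
    def status(self, task_id):
        """Report the task's current state; the platform polls this."""

class EchoProcessor(Processor):
    """Trivial in-memory processor that 'completes' every task instantly."""

    def __init__(self):
        self.tasks = {}

    def start(self, task_id, config, input_urls):
        self.tasks[task_id] = TaskStatus.COMPLETED

    def stop(self, task_id):
        self.tasks[task_id] = TaskStatus.FAILED

    def status(self, task_id):
        return self.tasks[task_id]
```

Because the platform only ever calls these three methods, a new processor can be deployed as an independent service without touching the workflow engine.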
The main processor maintained by the Nirvana team is called the Job processor. It lets you run an arbitrary executable on a large Yandex cluster (via a scheduler and resource-management system). A distinguishing feature of this processor is that applications launch in full isolation, so parallel runs stay strictly within their allocated resources.
If necessary, an application can also be launched on several servers in distributed mode (this is how Matrixnet works). The user only has to upload the executable file, specify the launch command line, and request the amount of computing resources needed. The platform takes care of the rest.
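A job submission along these lines reduces to a small declarative description: what to run, how to start it, and how much of the cluster it needs. The field names below are purely illustrative, not the Job processor's real schema:

```python
from dataclasses import dataclass

@dataclass
class JobSpec:
    """Illustrative description of a job submitted to a job processor."""
    executable: str       # reference to an uploaded binary
    command_line: str     # how to start it on each server
    cpu_cores: int = 1
    ram_gb: int = 4
    instances: int = 1    # >1 requests a distributed launch

# A distributed training run, in the spirit of a Matrixnet launch:
job = JobSpec(
    executable="trainer-binary",
    command_line="./trainer --pool pool.tsv",
    cpu_cores=16,
    ram_gb=64,
    instances=8,
)
```

Everything beyond this description (scheduling, isolation, placement across servers) is the platform's job, which is exactly the division of labor the text describes.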
Another key component of Nirvana is its key-value store, which holds both the results of operations and uploaded executables and other resources. We built into Nirvana's architecture the ability to work with several storage locations and implementations, which improves the efficiency and structure of data storage and lets us carry out necessary migrations without interrupting user processes. Over the platform's lifetime we have lived with the CEPH file system and with YT, our MapReduce and data storage technology, and eventually moved to MDS, another internal storage system.
Any storage system has limits, first of all the maximum amount of stored data. With the number of Nirvana users and processes constantly growing, we risk filling even the largest storage. But we believe most data in the system is temporary, which means it can be deleted: since the structure of an experiment is known, any particular result can be obtained anew by restarting the corresponding graph. And if a user needs a data object forever, they can deliberately store it in Nirvana's storage with an infinite TTL to protect it from deletion. We also have a quota system that divides storage between different business tasks.
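TTL-based cleanup of this kind can be sketched with a toy model; the names and mechanics below are ours, not the real storage's:

```python
import math
import time

class TTLStore:
    """Toy key-value store whose entries expire unless given an infinite TTL."""

    def __init__(self):
        self._data = {}  # key -> (value, expiry timestamp)

    def put(self, key, value, ttl_seconds=3600.0):
        # Passing math.inf as the TTL makes the entry effectively permanent.
        self._data[key] = (value, time.monotonic() + ttl_seconds)

    def get(self, key):
        value, expiry = self._data[key]
        if time.monotonic() > expiry:
            # Lazy expiry: the entry is dropped when it is next touched.
            del self._data[key]
            raise KeyError(f"{key} expired")
        return value
```

A production system would also sweep expired entries in the background and charge each entry against a per-project quota, but the reader-facing behavior is the same: temporary results vanish, pinned ones survive.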

What Nirvana looks like and what it is useful for

So that you can picture what our service's interface looks like, we have attached an example of a graph that prepares a formula and starts evaluating its quality using CatBoost.
Why do Yandex's services and developers use Nirvana? Here are a few examples.
1. The ad-selection process for the Yandex Advertising Network is implemented with Matrixnet using Nirvana graphs. Machine learning makes it possible to refine the formula by adding new factors to it. Nirvana lets you visualize the learning process, reuse results, schedule regular training runs, and, when necessary, edit the process.
2. The Weather team uses Nirvana for ML tasks. Because the predicted values vary seasonally, the model must be constantly retrained, with the most recent data added to the training sample. Nirvana has a graph that automatically clones itself via the API and restarts on fresh data to recompute and regularly update the model.
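The self-cloning pattern reduces to three API calls: clone a reference graph, point the clone at fresh data, and start it. The sketch below runs against an in-memory stand-in for the platform's API; the client methods and graph names are invented for illustration:

```python
class GraphAPIClient:
    """Minimal in-memory stand-in for a workflow platform's API."""

    def __init__(self):
        self.graphs = {"weather-train-v1": {"data": "2017-09-01", "runs": 0}}
        self._counter = 0

    def clone(self, graph_id):
        self._counter += 1
        new_id = f"{graph_id}-clone{self._counter}"
        self.graphs[new_id] = dict(self.graphs[graph_id])
        return new_id

    def set_param(self, graph_id, key, value):
        self.graphs[graph_id][key] = value

    def start(self, graph_id):
        self.graphs[graph_id]["runs"] += 1

def retrain_on_fresh_data(client, base_graph, fresh_date):
    """Clone the reference graph, point it at fresh data, and launch it."""
    new_graph = client.clone(base_graph)
    client.set_param(new_graph, "data", fresh_date)
    client.start(new_graph)
    return new_graph
```

Because each retraining run is a fresh clone, the reference graph stays untouched and every past run remains inspectable and reproducible.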
The Weather team also assembles experiments in Nirvana to improve the current production solution, tests new features, and compares ML algorithms while tuning their settings. Nirvana guarantees the reproducibility of experiments, provides the capacity for large-scale computation, works with other internal and external products (YT, CatBoost, etc.), and removes the need to install frameworks locally.
3. The computer vision team uses Nirvana to search for neural network hyperparameters, running a hundred copies of a graph with different parameters and choosing the best ones. Thanks to Nirvana, a new classifier for any task can, if needed, be created "at the push of a button" without involving computer vision specialists.
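Launching many copies of a graph over a parameter grid looks roughly like this; the launcher function is a stand-in for a real "start graph with these parameters" API call:

```python
import itertools

def launch_sweep(start_graph, param_grid):
    """Launch one graph copy per combination in the grid; return run handles."""
    runs = []
    keys = sorted(param_grid)
    for values in itertools.product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        runs.append(start_graph(params))
    return runs

# 2 learning rates x 3 depths = 6 graph copies launched in parallel.
grid = {"learning_rate": [0.01, 0.1], "depth": [4, 6, 8]}
runs = launch_sweep(lambda p: p, grid)  # stand-in launcher just echoes params
```

After all copies finish, a final block can compare their metrics and keep the best model, which is what makes the "classifier by the button" workflow possible.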
4. The Directory team runs thousands of evaluations per day through Toloka and assessors, using Nirvana to automate this pipeline. For example, this is how photos of organizations are filtered, and how new ones are collected through the mobile Toloka app. Nirvana helps cluster organizations (finding duplicates and merging them). Most importantly, automated processes for entirely new kinds of evaluation can be built in a matter of hours.
5. All assessment processes involving assessors and Toloka are built on Nirvana, not only those important to the Directory. For example, Nirvana helps organize and configure all the work of Pedestrians who keep maps up to date, technical support workflows, and testing by assessors.


Special "Yandex from within" meetups are regularly held in our offices. At one of them we already talked a little about Nirvana (there is a video about Nirvana's design and its application in machine learning), and it aroused great interest. For now Nirvana is available only to Yandex employees, but we want to hear your opinion about the design we have described, and about problems of yours for which it would be useful. We would be grateful if you told us in the comments about systems similar to ours. Perhaps your companies already use such computing platforms; we would appreciate advice, feedback, and stories from your own experience.