JupyterHub, or how to manage hundreds of Python users. A lecture from Yandex

The Jupyter platform lets novice developers, data analysts, and students start programming in Python quickly. Suppose your team is growing: it now includes not only programmers but also managers, analysts, and researchers. Sooner or later the lack of a common working environment and the complexity of setup start to slow the work down. JupyterHub, a multi-user server that launches Jupyter for each user at the push of a button, helps solve this problem. It is a great fit both for people who teach Python and for analysts. The user only needs a browser: no problems with installing software on a laptop, with compatibility, or with packages. The Jupyter maintainers actively develop JupyterHub alongside JupyterLab and nteract.
 
 
My name is Andrey Petrin, and I head the growth analytics group at Yandex. In this talk at the Moscow Python Meetup I recalled the advantages of Jupyter and described the architecture and principles of JupyterHub, as well as our experience with these systems at Yandex. By the end, you will know how to bring up JupyterHub on any computer.
 
 

 
- I'll start with who analysts are at Yandex. One analogy is that an analyst is a many-armed Shiva who can do many different things at once and combines many roles.
 
 
Hello! My name is Andrey Petrin, and I head the growth analytics group at Yandex. I'll tell you about JupyterHub, a library that at one point greatly simplified life for Yandex's analytics teams; we literally felt the productivity boost across a large number of teams.
 
 
 
 
For example, analysts at Yandex are a bit like managers. Analysts always know the timing and the timeline of a process: what needs to be done at which point.
 
 
They are also a bit like developers: they are familiar with different ways of processing data. For example, the Python libraries in Shiva's hands on the slide are the ones that came to my mind; it is not a complete list, but it is what we use daily. Naturally, we develop in more than just Python, but I will mostly talk about Python.
 
 
Analysts are also a bit like mathematicians: we need to make balanced decisions, look at real data rather than at the managerial point of view, and search for some truth and understand it.
 
 
The Jupyter ecosystem helps us a lot with all of this.
 
 
 
 
Jupyter is a platform for interactive programming and for creating notebook-style reports. The key entity is the Jupyter notebook: a document with a large number of widgets and interactive elements that can be changed, and, at its core, small pieces of code that you write. You run them in a notebook right in the browser you use every day. The output can be pictures, interactive HTML elements, or simply printed values, all sorts of things.
 
 
The Jupyter system has been in development for a long time. It supports various programming languages and versions: Python 2 and 3, Go, and more. It lets us handle our everyday tasks much more effectively.
 
 
What do we do in analytics, and how does Jupyter help us?
 
 
 
 
The first task is website classification. For a company as large as Yandex, which knows about the entire Internet, looking at individual sites is quite laborious. There are so many sites, each with its own specifics, that we need to aggregate them into topics: reasonably small groups of sites that, on the whole, behave similarly.
 
 
For this task we build an adjacency graph of all Internet hosts: a graph of how similar any two sites are to each other. With manual markup of hosts we get a primary database of which sites exist on the Internet, and then we extrapolate the manual markup to the entire Internet. We use Jupyter in literally every one of these subtasks: it lets us constantly launch MapReduce operations to build such graphs and to do this kind of data analytics.
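The extrapolation step can be sketched as simple label propagation over the host similarity graph. This is a minimal illustration, not Yandex's actual pipeline; the graph, host names, and topic labels below are made up.

```python
from collections import Counter

def propagate_labels(graph, labels, iterations=3):
    """Spread topic labels over a host similarity graph.

    graph:  dict mapping each host to the hosts similar to it
    labels: dict mapping some hosts to a known topic (manual markup)
    """
    labels = dict(labels)
    for _ in range(iterations):
        updates = {}
        for host, neighbors in graph.items():
            if host in labels:
                continue  # keep manually marked hosts fixed
            votes = Counter(labels[n] for n in neighbors if n in labels)
            if votes:
                # Take the most common topic among labeled neighbors.
                updates[host] = votes.most_common(1)[0][0]
        labels.update(updates)
    return labels

# Toy similarity graph: three hosts that cluster around sports.
graph = {
    "match.example": ["sport.example", "news.example"],
    "sport.example": ["match.example"],
    "news.example": ["match.example"],
}
seed = {"sport.example": "sports"}  # one manually marked host
print(propagate_labels(graph, seed))
```

After a few iterations the one manually marked host pulls its whole cluster into the "sports" topic, which is the spirit of extrapolating markup across the graph.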
 
 
We automated the manual markup in Jupyter using widgets. For each host there is a suggested topic that is most likely correct; we guess the topic almost every time, but people are still needed for manual confirmation.
 
 
And we get all kinds of interesting pictures.
 
 
 
 
For example, here you can see sites in the sports topic and the relevant search queries that belong to it.
 
 
 
 
The encyclopedia topic. There are fewer sites and fewer unique queries overall, but more of the basic, high-frequency queries.
 
 
 
 
The homework topic: ready-made homework solutions. It is quite interesting, because inside it there are two independent clusters of sites that are similar to each other but not to the rest. This is a good example of a topic that we would like to split in two: one half of the sites clearly solves one task within homework help, the other half a different one.
 
 
 
 
A completely different task, and quite an interesting one to build, was the bid optimizer. Yandex buys installs for a number of mobile applications, including for money, and we can already predict a user's lifetime and how much we can earn from an app install for each user. Unfortunately, it turns out that this knowledge is hard to hand over to the marketer who will actually buy the traffic. There is always a budget and a lot of constraints, so you end up with a multidimensional optimization task that is interesting from the analytics point of view, but the tool has to be made for a manager.
 
 
 
 
Jupyter helps a lot here. This is the interface we built in Jupyter so that a manager who does not know Python could open it and get the result of our prediction. You can choose Android or iOS, the countries, the application. There are fairly complex controls and knobs you can adjust, for example progress bars, the size of the budget, some tolerance for risk. These tasks are solved with Jupyter, and we are very pleased that an analyst, being a many-armed Shiva, can solve them alone.
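A manager-facing panel like this can be sketched with ipywidgets. The prediction function and all the numbers below are invented placeholders; only the ipywidgets API itself is real.

```python
import ipywidgets as widgets

def predict_installs(platform, country, budget):
    # Placeholder for the real prediction model (made-up numbers).
    cost_per_install = {"Android": 1.2, "iOS": 2.5}[platform]
    return f"{country}: ~{int(budget / cost_per_install)} installs"

# interact() builds the form and re-runs the function on every change.
widgets.interact(
    predict_installs,
    platform=widgets.Dropdown(options=["Android", "iOS"], description="Platform"),
    country=widgets.Dropdown(options=["RU", "TR", "KZ"], description="Country"),
    budget=widgets.IntSlider(min=1000, max=100000, step=1000, value=10000,
                             description="Budget, $"),
)
```

In a notebook this renders two dropdowns and a slider, and the manager never sees the Python underneath.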
 
 
About five years ago we realized that the platform has some limitations and problems we wanted to fight. The first problem is that we have quite a lot of different analysts, each on a different version, operating system, and so on. Code that works for one person often does not run for another.
 
 
Another big problem is package versions. I probably do not need to explain how hard it is to maintain a consistent environment in which everything starts out of the box.
 
 
Overall, we came to understand that if we give a new analyst who has just joined the team a preconfigured environment, where everything is set up and all packages are installed at current versions and stored in a single place, it suits analytical work just as well as development. Configuring your own environment is a sensible habit for a developer, but it is not always applicable to an analyst, precisely because of the constant changes that happen in analytics.
 
 
This is where the JupyterHub library came to our aid.
 
 
 
 
It is a very simple application that consists of four cleanly separated components.
 
 
The first component is responsible for authorization: it checks the login and password and decides whether to let the person in.
 
 
The second launches Jupyter servers: for each user it starts the same Jupyter server that can run Jupyter notebooks. It is the same thing you have on your computer, only in the cloud, if this is a cloud deployment, or as separate processes spawned on a single machine.
 
 
The third is proxying. There is a single access point to the whole server, and JupyterHub determines which user goes to which port; for the user everything is absolutely transparent. And naturally, there is control of the database and of the whole system.
 
 
 
 
To describe roughly how JupyterHub works: the user's browser comes to the JupyterHub system, and if this user has not yet started a server or is not authorized, JupyterHub enters the game and starts asking questions, creating servers, and preparing the environment.
 
 
If the user is already logged in, they are proxied straight to their own server, and the Jupyter notebook actually communicates with the person directly, occasionally asking the hub about access: whether this user is allowed to access this notebook, and so on.
 
 
 
 
The interface is quite simple and convenient. By default, the deployment uses the usernames and passwords of the computer where it is deployed. If you have a server with several users, the login and password are the system login and password, and each user sees their directory under /home as the home directory. Very convenient: no need to think about managing users at all.
 
 
 
 
 
 
The rest of the interface is generally familiar to you. These are the standard Jupyter notebooks you have all seen. You can see the active notebooks.
 
 
 
 
This part you most likely have not seen. It is the JupyterHub control panel: you can stop your server, start it, or, for example, get a token for communicating with JupyterHub on your behalf, say, to run some microservices on top of JupyterHub.
 
 
 
 
Finally, for the administrator, every user can be managed: individual Jupyter servers started and stopped, users added and removed, all servers disabled, the hub shut down or brought up. All of this is done in the browser without any configuration, and it is convenient enough.
 
 
In general, the system is developing rapidly.
 
 
 
 
The picture shows a course at UC Berkeley that ended this December; it was the biggest data science course in the world. I think it was attended by ?200 students who could not program and came to learn. It ran on the JupyterHub platform: students did not need to install any Python on their computers, they could simply open that server in a browser.
 
 
Naturally, at later stages of the course the need to install packages appeared, but it is cool to solve the problem of first entry this way. When you teach Python to someone who does not know it at all, you often realize that the routine of installing packages, maintaining a system, and so on is a little superfluous. You want to inspire the person and show them what this world is like, without details they can master later.
 
 
Installation:
 
 
python3 -m pip install jupyterhub
sudo apt-get install npm nodejs-legacy
npm install -g configurable-http-proxy

 
Only Python 3 is supported: inside JupyterHub you can run cells in Python 2, but JupyterHub itself runs only on Python 3. The only non-Python dependency is configurable-http-proxy, a Node.js package that JupyterHub uses as its proxy for simplicity.
 
 
Configuration:
 
 
jupyterhub --generate-config
 
The first thing you want to do is generate the config. Everything works by itself even without any settings: by default a local server comes up on port 8000, with access for your system users by login and password, running under root; it works literally out of the box. But generate-config creates a JupyterHub config file where literally all of its settings can be read in the form of documentation. This is very convenient: you do not even need to open the documentation to understand which lines to enable; everything is commented out, you stay in control, and all the default settings are visible.
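For example, a few commonly changed settings in the generated jupyterhub_config.py might look like this; the values here are illustrative, while the option names are JupyterHub's actual configuration traits:

```python
# jupyterhub_config.py -- uncomment and edit lines from the generated file.

# Listen on all interfaces instead of localhost, on the default port 8000.
c.JupyterHub.bind_url = "http://0.0.0.0:8000"

# Restrict which system users may log in, and who gets admin rights.
c.Authenticator.allowed_users = {"alice", "bob"}
c.Authenticator.admin_users = {"alice"}
```

Every trait in the generated file follows this same c.ClassName.option pattern, so reading the file really is reading the documentation.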
 
 
 
 
I want to pause for a caveat. By default, when you deploy this on your own server, unless you make an extra effort, namely set up HTTPS, the server will come up over plain HTTP, and the passwords and logins that your users enter will travel in the clear when communicating with JupyterHub. This is a very dangerous situation, and an incredible number of problems can come from it. So do not ignore HTTPS. If you do not have your own HTTPS certificate, you can create one, or use the wonderful letsencrypt.org service, which issues certificates for free, and you can run on your own domain without problems and without money. It is convenient enough; do not skip it.
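With a certificate in hand, enabling HTTPS is two lines in jupyterhub_config.py; the paths below are placeholders pointing at a typical Let's Encrypt layout:

```python
# Serve the hub over HTTPS; replace the paths with your own certificate files.
c.JupyterHub.ssl_cert = "/etc/letsencrypt/live/example.com/fullchain.pem"
c.JupyterHub.ssl_key = "/etc/letsencrypt/live/example.com/privkey.pem"
```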
 
 
By default the hub runs as root; obviously, it spawns each user's own notebook server from under that specific user. This can be changed, but that is the default. And all users are local: each particular user gets their own home directory. I'll tell you in more detail what else you can do.
 
 
The great thing about JupyterHub is that it is a constructor. Into literally every element of the diagram I showed, you can plug your own components that simplify the work. For example, suppose you do not want your users to type in a system login and password; that is not very secure, or simply inconvenient. You want a different login system. This can be done with OAuth, for example via GitHub.
 
 
 
 
Instead of forcing the user to type a login and password, you simply enable authorization with two lines of code via GitHub OAuth, and the user will automatically log in through GitHub and be mapped locally to their GitHub username.
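With the oauthenticator package installed, those two lines (plus the OAuth app credentials) might look like this; the client ID, secret, and callback URL are placeholders for values from your own GitHub OAuth app:

```python
# Requires: pip install oauthenticator, and a registered GitHub OAuth app.
from oauthenticator.github import GitHubOAuthenticator

c.JupyterHub.authenticator_class = GitHubOAuthenticator
c.GitHubOAuthenticator.oauth_callback_url = "https://hub.example.com/hub/oauth_callback"
c.GitHubOAuthenticator.client_id = "<your-client-id>"
c.GitHubOAuthenticator.client_secret = "<your-client-secret>"
```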
 
 
 
 
Other ways of authorizing users are supported out of the box. If you have LDAP, you can use it. You can use any OAuth provider, and there is a REMOTE_USER authenticator that lets a remote server check access to your local server. Anything your heart desires.
 
 
 
 
Suppose you have several kinds of tasks. For example, one uses a GPU, and for that you need one technology stack and a certain set of packages, and you want to separate it from a CPU scenario with a different usage pattern. To do this, you can write your own spawner: the component that creates user Jupyter servers. There is a tutorial with Docker: you can write a Dockerfile that is deployed for each user, and the user's server will run not locally but in its own container.
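The Docker route is packaged as dockerspawner; a minimal configuration might look like this, with the image name chosen here purely as an example:

```python
# Requires: pip install dockerspawner, and a running Docker daemon.
c.JupyterHub.spawner_class = "dockerspawner.DockerSpawner"

# Each user's server runs in a container built from this image (example image).
c.DockerSpawner.image = "jupyter/scipy-notebook:latest"

# User containers must be able to reach the hub over a shared Docker network.
c.DockerSpawner.network_name = "jupyterhub"
c.JupyterHub.hub_ip = "0.0.0.0"
```

A GPU and a CPU profile can then simply point at two different images.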
 
 
JupyterHub has a number of other convenient features, such as services.
 
 
 
 
Suppose you are running on a machine with a limited amount of memory, and some time after a user has left the system you want to shut their server down, because they are not using it while it occupies memory. Or, for example, you have a cloud deployment and can save money on virtual machines by disabling unused ones at night and turning them on only when needed.
 
 
There is a ready-made service, cull_idle_servers, which shuts down user servers after a period of inactivity. All the data is kept; resources are simply freed, so you can save a little.
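The culler is registered as a JupyterHub service in the config. This sketch assumes the cull_idle_servers.py script from the JupyterHub examples sits next to the config file (newer deployments use the jupyterhub-idle-culler package instead):

```python
import sys

# Shut down user servers that have been idle for more than an hour.
# The script path is an example; point it at your copy of cull_idle_servers.py.
c.JupyterHub.services = [
    {
        "name": "cull-idle",
        "admin": True,  # the service needs admin rights to stop servers
        "command": [sys.executable, "cull_idle_servers.py", "--timeout=3600"],
    }
]
```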
 
 
 
 
As I said, into literally every piece of this scheme you can plug something of your own. You can add something on top of the proxy, your own way of routing users. You can write your own authenticator, communicate with the database directly through services, or create your own spawners.
 
 
I want to recommend this project: a system on top of Kubernetes that lets you deploy everything I have just described in any cloud that supports Kubernetes, literally without any specific settings. This is very convenient if you do not want to bother with your own server, devops, and support. Everything works out of the box, with a very good, detailed guide.
 
 
 
 
You need JupyterHub when several people use Jupyter, and not necessarily for the same thing. It is a convenient system that lets these people come together and avoid problems later. And if they also work on the same task, they will most likely need a more or less consistent set of packages.
 
 
The same applies when you get complaints like: "I built a wonderful model, analyst Vasechkin is trying to reproduce it, and it doesn't work." At one time this was a constant problem for us. And of course, a consistent server state helps a lot here.
 
 
It is also very cool to use this for teaching Python. There is a service called nbgrader, which on top of JupyterHub lets you build convenient workflows for sending students homework. They fill in the solutions themselves, send them back, and an automatic test checks the Jupyter cells and lets you assign grades immediately. A very convenient system; I recommend it.
 
 
Imagine you come to a seminar where you want to show something in Python. You do not want to spend the first three hours while everyone installs everything from your how-to; you want to start doing something interesting right away.
 
 
You can bring up such a system on your server, give your users an Internet address where they can log in and start working, and not waste time on unnecessary routine. That's all, thank you.