How to become a data scientist if you are over 40 and not a programmer
There is a widespread opinion that you can become a data scientist only with a relevant higher education, and preferably an advanced degree.
However, the world is changing, and technologies are becoming accessible to mere mortals. This may surprise some, but today any business analyst can master machine learning and achieve results that compete with those of professional mathematicians, perhaps even the best of them.
So as not to make unfounded claims, I will tell you my story: how I, an economist, became a data analyst by getting the necessary knowledge through online courses and machine learning competitions.
I am now a lead analyst in the big data group at QIWI, but three years ago I was quite far from data science and heard about artificial intelligence only in the news. Then everything changed, thanks in part to Coursera and Kaggle.
So, first things first.
About me
I am an economist who worked for a long time as a business consultant. My specialization is developing budgeting and reporting methodology for subsequent automation. Put simply, it is about getting the process right first, so that automation actually delivers a result.
Three years ago, in my forties, I felt that consulting success was starting to make me complacent, and I began to think about the need for change, about my next career. I already knew what it was like to start a career from scratch (at 30 I had traded the quiet life of an economist for consulting), so the change did not frighten me.
It does not sink in right away, but when you think about it, it becomes obvious: although I had already worked for 20 years, about 25 more remained before retirement (it has long been clear that we should count on retiring at 70 or even later). In other words, the road ahead was longer than the one already traveled, and it would be good to walk it with an in-demand specialty. So it was worth studying. At the time I was freelancing, and for the sake of the future I cut the number of projects so that I could set aside enough time to study.
While I was deciding where to go next, I discovered Coursera. The Western approach to education, in which the meaning and general idea are explained first and the details only afterwards, turned out to suit me. Unlike the harsh Soviet education system, which assumes that only the worthy will stay afloat, it gives a chance to people like me, who have gaps in their basic education.
I started with business intelligence courses. They were extremely useful to me as a consultant. The same courses helped me better understand the role of AI technologies in business development and, most importantly, see my own role in it. As with other technologies, those who develop a new technology are not necessarily the best at applying it. For a technology to really help a business, you have to understand that business. Expertise in business processes is no less important than understanding the technologies of machine learning, big data processing, and so on.
So I plunged into courses on data science, statistics, and programming.
With breaks, I completed more than 30 courses on Coursera in a year and no longer felt like an outsider in the world of big data and machine learning.
Some courses recommended Kaggle as an excellent place to practice. Do not repeat my mistake: I went there only when I felt I had accumulated enough knowledge. I should have done it half a year earlier, as soon as I had a first understanding of what is what. I would be six months further ahead now. After all, Kaggle is not just one of many competition venues; it is currently the best platform for learning machine learning in practice, useful to beginners and super-gurus alike. You grow there, as they say, two days for every one; courses alone, without practice, will not give that effect.
My first competition was the Santander Bank competition on predicting customer satisfaction. I was a novice and wanted to test my knowledge in practice. I combined my experience as a bank client, my skills in analyzing business cases, and machine learning technology, and built a pretty good model that took me into the top 50 on the public leaderboard. That was far above my expectations for a first competition, given that more than 5 thousand people took part.
But it was not that simple, and there was no happy ending. There is a problem common among beginners called overfitting, and I ran into it in practice. My local validation was poorly organized, I focused too much on the public leaderboard, and as a result I dropped more than 500 positions on the private part of the test set. Of course I was upset, but the lesson was a good one: sound validation is the foundation of machine learning, and it has to be taken seriously. Today it is one of the strengths of my models.
Despite the weak first result, I came away confident that reaching the top was realistic; I just needed more practice and more knowledge.
For those who do not know what makes Kaggle so good: the community is ready to help newcomers get past sticking points, discusses ideas, and shares working examples. No less important, at the end of a competition you can study the leaders' solutions. Learning from other people's experience lets you progress quickly. You do not have to step on every rake yourself.
I cannot fail to mention OpenDataScience (ods.ai), the Russian-speaking data science community. The machine learning training sessions that ods organizes are another way to get to know the subject more deeply, and as a place to discuss any question it also helps a lot. If you are thinking about a future in data science and have not registered with ods yet, that is a serious mistake.
Since job postings for data scientists often mentioned strong Kaggle results as an expectation, I saw a chance for myself: besides gaining experience, I could fill an empty resume with more or less relevant experience. I began to treat Kaggle as a job in which a career start could be the bonus.
As soon as free time appeared, I built models on Kaggle, and with each competition the result became better.
I had something that most participants did not: the ability to analyze business cases and my consulting experience. This helped a lot in building models. Six months later I took 7th place in the next Santander Bank competition and earned my first gold medal.
If you persistently pursue a specific goal, you will achieve it: in June 201?, after about a year of battles on Kaggle, Agnis Lyukis, a developer from Latvia, and I won the Sberbank competition on predicting apartment prices in Moscow.
Our strengths were an understanding of the case (it is a complex problem that should not have been attacked head-on, as most participants did) and strong local validation. We finished second on the public leaderboard, but our model barely suffered from overfitting and lost little on the private data; in the final standings we were first by a huge margin.
This victory propelled me into the top 50 of the global Kaggle ranking, which resulted in job offers. After studying the options, I chose a bank, as a place with many problems on which to build my skills and to experience the full reality of developing models; competitions, after all, are more of a greenhouse environment.
I had ambitious plans for career growth, and the option of unhurriedly working for several years to grow to the next level was not on the table. I had to work hard at my job and, on the second shift, not forget about Kaggle. Not easy, but who has it easy these days? It paid off: 3 more gold medals earned me the Grandmaster title on Kaggle, and I secured a place in the global top (currently 23rd).
The cherry on top was 3rd place in a bank scoring competition, which is exactly what I had been doing professionally for the past year. And, as you can see, doing it well.
Alas, the reality of life in a bank also includes very conservative and slow decision-making. The rollout of my models was moving slowly. Restructuring the work of the entire bank was not in my plans, so it was easier, though with regret, to change jobs.
It turned out not to be difficult at all: thanks to my results on Kaggle, the search did not take long, and for several months now I have been digging through billion-row tables at QIWI. We have a lot of interesting problems, and I am sure that quite soon we will be able to turn our data into profit for the company; my background as an economist helps a lot with this. My Kaggle experience has also already come in handy here on several cases.
And now, how to succeed in competitions
The most important part is to understand the problem and find all the drivers that can affect the result. The better you understand the case, the better your chances of performing well. Anyone can generate hundreds or even thousands of statistical features; coming up with ones that are tailored to this specific problem and explain the target well is much harder. Invest in this and you will quickly find yourself near the top. Apply any relevant experience you have (business, everyday life, and so on); it helps a lot.
Then comes local validation. Your main enemy is overfitting, especially if you use a technology as powerful as gradient boosting. I know how psychologically difficult it is to stop focusing on the public leaderboard, but if you do not want to be disappointed, the right answer is to use cross-validation and say "no" to a single holdout sample. There are exceptions, of course, but even in time series problems you can set up cross-validation and greatly increase the reliability of the model. The local validation scheme will not always be simple, but it is worth spending time on, both in competitions and in real life. Your reward will be stable models.
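To make the idea of cross-validation concrete, here is a minimal sketch in plain Python. It is not the scheme I used in any particular competition: the toy data, the "predict the training mean" model, and all function names (k_fold_indices, cross_validate, fit_mean and so on) are made up for illustration. In practice you would plug in a real model and a ready-made splitter from a library.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, valid_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds roughly equal folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        valid_idx = folds[i]
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, valid_idx


def cross_validate(fit, predict, score, X, y, n_folds=5, seed=0):
    """Train on each training split and return the validation score of every fold."""
    scores = []
    for train_idx, valid_idx in k_fold_indices(len(X), n_folds, seed):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = predict(model, [X[i] for i in valid_idx])
        scores.append(score([y[i] for i in valid_idx], preds))
    return scores


# Hypothetical toy model: always predict the mean of the training target.
def fit_mean(X_train, y_train):
    return mean(y_train)


def predict_mean(model, X_valid):
    return [model] * len(X_valid)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return mean(abs(a - b) for a, b in zip(y_true, y_pred))


# Made-up data, for illustration only.
X = [[float(i)] for i in range(20)]
y = [2.0 * row[0] + 1.0 for row in X]

fold_scores = cross_validate(fit_mean, predict_mean, mae, X, y, n_folds=5)
print("MAE per fold:", fold_scores)
print("mean MAE:", mean(fold_scores))
```

The point is to look at the scores of all folds, not at a single number: if an idea improves the average but the folds disagree wildly, the gain is probably noise. For time series the random shuffle above would leak the future into the past, so the splits have to respect time order.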
Of course, you need to learn the basic tools well. Knowing the principles behind different technologies, you can sensibly choose the best tool for a specific problem. For tabular data the current leader is gradient boosting, and LightGBM in particular. But it is important to be able to use other methods too, from logistic regression to neural networks; it will not go to waste either in life or in competitions.
By the way, when everything is changing so rapidly, the best way to understand which technologies rule right now is to see which libraries the competition leaders use. In recent years, many worthwhile technologies have made their way into the world through Kaggle.
Hyperparameters. It is important to know the key hyperparameters of the tools you use. Usually only a few parameters need to be changed. My conviction is that you should not spend a lot of time on hyperparameter selection. Of course you need to find good hyperparameters, but you should not get stuck on it.
Usually, once the model has taken shape, I pick a more or less stable set of parameters and return to tuning them only towards the end, when other ideas have run out. Common sense suggests that time spent creating and testing new variables, libraries, and non-standard ideas can improve the model far more than moving from a good set of hyperparameters to the ideal one.
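One way to keep tuning from eating all your time, shown here only as a sketch, is to give it a fixed small budget of random trials and then move on. The objective below (toy_cv_error) is a made-up stand-in for a real cross-validated error, and the parameter names learning_rate and num_leaves are borrowed purely as illustrative labels; no real library is called.

```python
import random


def random_search(evaluate, search_space, n_trials=20, seed=0):
    """Try n_trials random parameter sets and keep the one with the lowest error."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in search_space.items()}
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Illustrative grid of candidate values (assumed, not a recommendation).
search_space = {
    "learning_rate": [0.01, 0.03, 0.1, 0.3],
    "num_leaves": [15, 31, 63, 127],
}


def toy_cv_error(params):
    """Hypothetical objective; in practice this would be a cross-validated error."""
    return (params["learning_rate"] - 0.1) ** 2 + ((params["num_leaves"] - 31) / 100) ** 2


best_params, best_score = random_search(toy_cv_error, search_space, n_trials=20)
print("best parameters:", best_params)
print("best error:", best_score)
```

The budget (n_trials) is the important knob here: once it is spent, you take the best set found and go back to features and ideas.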
If you are betting on Kaggle as something that will boost your resume, treat it as a job; you will not regret it. It helped me, and it will help you.
And once more about competition: it is very strong here, so winning alone is very difficult. Teamwork is very useful; the synergy of ideas lets you jump higher than you could on your own. Make use of it.
Summary
And a little motivation to finish. First of all, I proved to myself that I could become a data scientist at 44. The recipe turned out to be surprisingly simple: online education, business-oriented thinking, hard work, and determination.
Now I urge my friends in every way to take the same path. The new digital economy needs (and will go on needing) highly qualified specialists. Coursera plus Kaggle is simply an excellent way to start.
After all, Excel was once a new and incomprehensible tool too (I even remember how hard its first battles with the traditional calculator were). And today nobody doubts that a specialist who knows his business can squeeze far more real benefit out of Excel than the Excel developers themselves.
A little time will pass, and command of machine learning tools will become as mandatory as command of Excel, so why not prepare in advance and get ahead in the labor market right now?
Moreover, there is no need to fear the competition. The more people come to data science from business, the more money there will be. Introducing new technologies in traditional sectors of the economy can accelerate a business, and for that to happen, business has to start understanding the opportunities that new technologies open up today. In fact, any business analyst who completes several courses can end up at the forefront of progress and help his company outrun conservative competitors.
I hope my experience will help someone make an important decision.
If you have any questions about Kaggle, write them in the comments and I will be happy to answer.