6 ways to significantly speed up pandas with a couple of lines of code. Part 2
In the previous article we looked at some simple ways to speed up pandas through JIT compilation and the use of multiple cores with tools like Numba and Pandarallel. This time we will talk about more powerful tools, with which you can not only speed up pandas but also cluster it, allowing you to process big data.
Swifter is another small but pretty smart wrapper over pandas. Depending on the situation, it chooses the most effective optimization method out of those available: vectorization, parallelization, or plain pandas. Unlike pandarallel, it does not use bare multiprocessing to organize parallel computing but Dask, about which more a little later.
We will run two tests on the same news data as in the last part:
- take a vectorized function (these are themselves quite optimal in performance);
- then, on the contrary, something more complicated, and see how swifter adapts.
For the first test, I use a simple math operation:
def multiply(x):
    return x * 5

# df['publish_date'].apply(multiply)
# df['publish_date'].swifter.apply(multiply)
# df['publish_date'].parallel_apply(multiply)
# multiply(df['publish_date'])
To show which approach swifter chose, I included plain pandas processing, the vectorized approach, and pandarallel in the test.
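For reference, here is a minimal sketch of how such a comparison can be timed (the pandarallel.initialize() call and the timing loop are my own scaffolding; the actual benchmark and plotting code, which sweeps several dataframe sizes, is in the GitHub repo linked at the end):

import time
import pandas as pd
import swifter  # importing swifter registers the .swifter accessor on pandas objects
from pandarallel import pandarallel

pandarallel.initialize()  # start the worker pool used by parallel_apply

def multiply(x):
    return x * 5

df = pd.read_csv('abcnews-date-text.csv', header=0)

candidates = {
    'pandas apply': lambda: df['publish_date'].apply(multiply),
    'swifter': lambda: df['publish_date'].swifter.apply(multiply),
    'pandarallel': lambda: df['publish_date'].parallel_apply(multiply),
    'vectorized': lambda: multiply(df['publish_date']),
}

for name, run in candidates.items():
    start = time.perf_counter()
    run()
    print(f'{name}: {time.perf_counter() - start:.3f} s')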
The graph clearly shows that, apart from a small overhead swifter spends on choosing the best method, it is on par with the vectorized version, which is the most effective one. This means the optimization was carried out correctly.
Now let's see how it copes with a more complex, non-vectorized case. Let's take our word-processing function from the last part and add swifter to it:
# calculate the average word length in the title
def mean_word_len(line):
    # this loop just complicates the task
    for i in range(6):
        words = [len(i) for i in line.split()]
        res = sum(words) / len(words)
    return res

# to work with strings, we enable a special option: allow_dask_on_strings()
df['headline_text'].swifter.allow_dask_on_strings().apply(mean_word_len)
Let's compare the speed:
And here the picture is even more interesting. While the amount of data is small (up to 10,000 rows), swifter uses the methods of pandas itself, which is clearly visible. Then, when pandas is no longer so efficient, parallel processing kicks in and swifter starts working on several cores, matching pandarallel in speed.
- It can not only parallelize but also vectorize functions
- It automatically determines the best optimization strategy, so you don't have to think about where to use it and where not to
- Unfortunately, it cannot yet speed up apply on grouped data (groupby) - see the workaround sketch below
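As a workaround for that last point, grouped apply can be handed to pandarallel, which also patches groupby objects; a small sketch (the grouping key and the aggregation are just illustrative, not part of the original benchmark):

from pandarallel import pandarallel

pandarallel.initialize()

# swifter can't speed this up yet, but pandarallel can
result = df.groupby('publish_date').parallel_apply(
    lambda group: group['headline_text'].str.len().mean()
)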
Modin parallelizes computations using Dask or Ray as its engine and in essence is not that different from the previous projects. However, it is a pretty powerful tool, and I can't call it just a wrapper. Modin implements its own dataframe class (although pandas is still used under the hood), which currently covers ~80% of the original functionality, while the remaining 20% fall back to the pandas implementations, thus fully replicating its API.
So let's get started with the setup, for which you only need to set an env variable to the desired engine and import the dataframe class:
# The Dask engine is currently considered experimental, so I took Ray
%env MODIN_ENGINE=ray
import modin.pandas as mpd
An interesting modin feature is optimized file reading. For the test, let's create a 1.2 GB csv file:
df = mpd.read_csv('abcnews-date-text.csv', header=0)
df = mpd.concat([df] * 15)
# write the enlarged frame to disk for the read benchmark below
df.to_csv('big_csv.csv', index=False)
Now read it with modin and pandas:
In: %timeit mpd.read_csv('big_csv.csv', header=0)
??? s ± 176 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
In: %timeit pd.read_csv('big_csv.csv', header=0)
22.9 s ± ??? s per loop (mean ± std. dev. of 5 runs, 1 loop each)
We got a speedup of about 3x. Of course, reading a file is not the most common operation, but it's still nice. Let's see how modin behaves on our main case with text processing:
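The test itself is not shown above, so here is roughly what it looks like (reading the data with modin and calling the same mean_word_len function is my reconstruction of the setup, not the author's exact code):

import modin.pandas as mpd

mdf = mpd.read_csv('abcnews-date-text.csv', header=0)

# same word-length function as before, now running on a modin series
mdf['headline_text'].apply(mean_word_len)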
We see that apply is not modin's strongest side; most likely it would start to pay off on larger data, but I didn't have enough RAM to check. However, modin's arsenal does not end there, so let's check other operations:
# this numerical dataframe served as the data
df = pd.DataFrame(np.random.randint(0, 10, size=(10**7, 6)), columns=list('abcdef'))
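The operations compared below are nunique and prod(axis=1); here is a sketch of what the calls look like on the modin side at a single size (the real benchmark sweeps several dataframe sizes, see the GitHub repo):

import modin.pandas as mpd

# build an identical modin dataframe from the same data
mdf = mpd.DataFrame(df.to_numpy(), columns=list('abcdef'))

# number of unique values in each column
df.nunique()
mdf.nunique()

# product across each row
df.prod(axis=1)
mdf.prod(axis=1)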
What do we see? A very large overhead. In the case of nunique we only got a speedup once the dataframe grew to 10**7 rows; in the case of prod(axis=1) this did not happen, but the graphs clearly show that pandas' computation time grows faster, and at a size of 10**8 modin will already be more effective in all cases.
- The modin dataframe API is identical to pandas, so to adapt code for big data you only need to change one line - the import
- Very large overhead; it should not be used on small data. By my estimates, it becomes relevant on data larger than 1 GB
- Supports a large number of methods - currently more than 80% of methods have an optimized version
- Capable not only of parallel computing but also of clustering - you can configure a Ray/Dask cluster and modin will connect to it
- There is a very useful feature that allows using the disk if RAM is full
- Both Ray and Dask bring up a pretty useful dashboard in the browser. I used Ray:
Dask is the last and most powerful tool on my list. It has a huge number of features and deserves a separate article, or maybe several. Besides working with numpy and pandas, it can also do machine learning - dask has integrations with sklearn and xgboost, as well as many of its own models and tools. And all of this can work both on a cluster and on your local machine using all available cores. However, in this article I will focus on working with pandas.
All you need to do to set up dask is to start a cluster of workers:
from distributed import Client

# spin up a local cluster
client = Client(n_workers=8)
Dask, like modin, uses its own dataframe class that covers all the main pandas functionality:
import dask.dataframe as dd
Now you can start testing. Compare file read speed:
In: %timeit dd.read_csv('big_csv.csv', header=0)
??? s ± 798 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In: %timeit pd.read_csv('big_csv.csv', header=0)
19.8 s ± ??? s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Dask turned out to be about 3 times faster than pandas. Let's see how it copes with our main task - accelerating apply. For comparison, I added pandarallel and swifter here, which in my opinion also did a good job:
# compute() is needed because all calculations in dask are lazy and must be run explicitly
# dd.from_pandas is a convenient way to convert a pandas dataframe to its dask version
dd.from_pandas(df, npartitions=8)['headline_text'].apply(mean_word_len, meta=float).compute()
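A practical note from me (not part of the benchmark): if you plan to run several operations on the same data, it is cheaper to convert once and, with the distributed cluster running, pin the partitions in worker memory with persist():

import dask.dataframe as dd

ddf = dd.from_pandas(df, npartitions=8)
ddf = ddf.persist()  # materialize the partitions on the workers once

# subsequent lazy operations reuse the already-distributed data
ddf['headline_text'].apply(mean_word_len, meta=float).compute()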
We can safely say that dask showed the best result, working faster than everyone else starting from as few as 10**4 rows. Now let's check out some other useful operations:
# the same numerical dataframe as for modin
df = pd.DataFrame(np.random.randint(0, 10, size=(10**7, 6)), columns=list('abcdef'))
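A sketch of the kind of calls being measured on the dask side (the exact set of operations is in the benchmark code on GitHub; here I just show one per-column and one per-row example plus the quantile case discussed below):

import dask.dataframe as dd

ddf = dd.from_pandas(df, npartitions=8)

# axis=0: column-wise reductions
ddf.sum().compute()

# axis=1: row-wise operations
ddf.prod(axis=1).compute()
ddf.quantile(0.5, axis=1).compute()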
As with modin, we have a rather large overhead, but the results are mixed. For operations with axis=0 we did not get any speedup, although the growth rate on the graphs shows that at data sizes > 10**8 dask will prevail. For operations with axis=1 we can even say that pandas works faster (with the exception of quantile(axis=1)).
Despite the fact that pandas proved better in many of these operations, don't forget that dask is first of all a cluster solution that can work where pandas cannot cope, for example with data that does not fit into RAM.
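For example, a csv that does not fit into RAM can still be processed, because dask reads and aggregates it in chunks; a minimal sketch (the blocksize value is arbitrary):

import dask.dataframe as dd

# the file is split into ~64 MB partitions and is never loaded whole
ddf = dd.read_csv('big_csv.csv', blocksize='64MB')

# aggregations stream through the partitions, so peak memory stays low
print(ddf['headline_text'].str.len().mean().compute())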
- It copes well with accelerating apply
- Very large overhead; it is designed to manipulate large datasets that do not fit in memory
- A cluster solution, but it also works on a single machine
- The dask API copies pandas, but not completely, so adapting code for dask by replacing only the dataframe class may fail
- Supports a large number of methods
It should be understood that parallelization is not a solution to all problems, and you should always start by optimizing your code. Before parallelizing a function or reaching for cluster solutions like Dask, ask yourself: can this be vectorized? Is the data stored efficiently? Are the indexes configured correctly? And if after answering these questions your opinion has not changed, then you really do need the tools described, or you are just too lazy to optimize.
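To illustrate the first question with the running example of this article: mean_word_len can be expressed through pandas string methods with no apply at all (a sketch; it drops the artificial loop, so treat it as an illustration rather than an exact drop-in):

# total non-space characters per headline divided by the number of words
chars = df['headline_text'].str.replace(' ', '', regex=False).str.len()
words = df['headline_text'].str.split().str.len()
mean_word_length = chars / words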
Thank you all for your attention! Hope these tools help you!
P.S. Trust, but verify: I posted all the code used in the article (benchmarks and graphing) on GitHub.