I have run the same dataset in Dask in two different ways, and I found that one way is almost 10 times faster than the other. I have tried to find the reason, without success.
# First variant: read both files directly with Dask
import dask.dataframe as dd
from multiprocessing import cpu_count

# Count the number of cores
cores = cpu_count()

# Read the files and repartition each DataFrame by the number of cores
english = dd.read_csv('/home/alberto/Escritorio/pycharm/NLP/ignore_files/es-en/europarl-v7.es-en.en',
                      sep='\r', header=None, names=['ingles'], dtype={'ingles': str})
english = english.repartition(npartitions=cores)
spanish = dd.read_csv('/home/alberto/Escritorio/pycharm/NLP/ignore_files/es-en/europarl-v7.es-en.es',
                      sep='\r', header=None, names=['espanol'], dtype={'espanol': str})
spanish = spanish.repartition(npartitions=cores)

# Merge on the index and compute (timed with the IPython %time magic)
%time total_dd = dd.merge(english, spanish, left_index=True, right_index=True).compute()
Out: 9.77 s
# Second variant: read both files with pandas, then convert to Dask
import pandas as pd
import dask.dataframe as dd
from multiprocessing import cpu_count

# Count the number of cores
cores = cpu_count()

# Read the files with pandas and partition them by the number of cores
pd_english = pd.read_csv('/home/alberto/Escritorio/pycharm/NLP/ignore_files/es-en/europarl-v7.es-en.en',
                         sep='\r', header=None, names=['ingles'])
pd_spanish = pd.read_csv('/home/alberto/Escritorio/pycharm/NLP/ignore_files/es-en/europarl-v7.es-en.es',
                         sep='\r', header=None, names=['espanol'])
english_pd = dd.from_pandas(pd_english, npartitions=cores)
spanish_pd = dd.from_pandas(pd_spanish, npartitions=cores)

# Merge on the index and compute (timed with the IPython %time magic)
%time total_pd = dd.merge(english_pd, spanish_pd, left_index=True, right_index=True).compute()
Out: 1.31 s
Does anyone know why? Is there another way to do it even faster?
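Note that dd.read_csv is lazy: it only builds a task graph, and the files are actually read when compute() is called, whereas pd.read_csv reads the whole file immediately.
So in the first variant the timed operation includes reading both files, the repartition and the merge.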
In the second variant the situation is different as far as the timing is concerned. Both DataFrames have already been read beforehand, so the timed operation includes only the repartition and the merge.
Apparently the source DataFrames are big and reading them takes considerable time, not accounted for in the second variant.
Try another test: create a function which reads both files with pandas, converts them to Dask DataFrames with dd.from_pandas and merges them.
Then measure the execution time of this function.
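I suppose the execution time may be even longer than in the first variant, because the data is first read by pandas and then has to be converted to Dask DataFrames, which is an extra step.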