Dask returns the same result applying a function to a dataframe

Question

When I apply a function to each dask dataframe in a normal loop:

stats = [run((key, value)) for key, value in tqdm.tqdm(routes_to_process.items())]
stats = {key: value for key, value in stats}

stats

I get different results for each dataframe:

{'/R00_Y2011_0/Data_0001_Wings_H16000_F550.csv': {'total_time': Timedelta('13 days 06:00:00'),
  'wing_adjust_freq_mins': 15.0,
  'TotalTimeInMotion_parallel_mean': 1.2554810767556184,
  'TotalTimeInMotion_parallel_std': 1.938469269213107,
  'TotalTimeInMotion_parallel_95th': 4.919199116216657,
  'TotalTimeInMotion_parallel_max': 24.971773010775486,
  'TotalTimeInMotion_single_mean': 2.070414732378637,
  'TotalTimeInMotion_single_std': 3.110775531046329,
  'TotalTimeInMotion_single_95th': 7.7415319509222,
  'TotalTimeInMotion_single_max': 37.46848468358555},
 '/R00_Y2011_0/Data_0089_Wings_H16000_F550.csv': {'total_time': Timedelta('13 days 05:45:00'),
  'wing_adjust_freq_mins': 15.0,
  'TotalTimeInMotion_parallel_mean': 1.4529180621391111,
  'TotalTimeInMotion_parallel_std': 1.6725803825097267,
  'TotalTimeInMotion_parallel_95th': 4.844115032412909,
  'TotalTimeInMotion_parallel_max': 11.713955279708241,
  'TotalTimeInMotion_single_mean': 2.3589357740318952,
  'TotalTimeInMotion_single_std': 2.6615559537471416,
  'TotalTimeInMotion_single_95th': 7.704699349608058,
  'TotalTimeInMotion_single_max': 20.9864655817048},
 '/R00_Y2011_0/Data_0178_Wings_H16000_F550.csv': {'total_time': Timedelta('13 days 06:00:00'),
  'wing_adjust_freq_mins': 15.0,
  'TotalTimeInMotion_parallel_mean': 1.197378736624615,
  'TotalTimeInMotion_parallel_std': 1.7912324573518639,
  'TotalTimeInMotion_parallel_95th': 3.9983496355992054,
  'TotalTimeInMotion_parallel_max': 29.289081453498962,
  'TotalTimeInMotion_single_mean': 1.993188241820573,
  'TotalTimeInMotion_single_std': 2.885638510182584,
  'TotalTimeInMotion_single_95th': 6.9149671451086965,
  'TotalTimeInMotion_single_max': 46.73164328462062},
 '/R00_Y2011_0/Data_0266_Wings_H16000_F550.csv': {'total_time': Timedelta('13 days 06:15:00'),
  'wing_adjust_freq_mins': 15.0,
  'TotalTimeInMotion_parallel_mean': 0.8902764238240006,
  'TotalTimeInMotion_parallel_std': 1.0558873949196226,
  'TotalTimeInMotion_parallel_95th': 2.8579851928420603,
  'TotalTimeInMotion_parallel_max': 10.951810428266235,
  'TotalTimeInMotion_single_mean': 1.5333169994564122,
  'TotalTimeInMotion_single_std': 1.8294318193945236,
  'TotalTimeInMotion_single_95th': 4.79554414876004,
  'TotalTimeInMotion_single_max': 19.68630214945556},
 '/R00_Y2011_0/Data_0355_Wings_H16000_F550.csv': {'total_time': Timedelta('13 days 05:45:00'),
  'wing_adjust_freq_mins': 15.0,
  'TotalTimeInMotion_parallel_mean': 1.178448380566181,
  'TotalTimeInMotion_parallel_std': 2.3450293462261245,
  'TotalTimeInMotion_parallel_95th': 4.4953148097942774,
  'TotalTimeInMotion_parallel_max': 30.048903002576534,
  'TotalTimeInMotion_single_mean': 1.9654074975262181,
  'TotalTimeInMotion_single_std': 3.7519423792380233,
  'TotalTimeInMotion_single_95th': 6.873441025809852,
  'TotalTimeInMotion_single_max': 43.04718729546469}}

When I use dask:

L = client.map(run, routes_to_process.items())
res = client.gather(L)

I get the same result for each:

[('/R00_Y2011_0/Data_0001_Wings_H16000_F550.csv',
  {'total_time': Timedelta('13 days 06:00:00'),
   'wing_adjust_freq_mins': 15.0,
   'TotalTimeInMotion_parallel_mean': 1.2554810767556184,
   'TotalTimeInMotion_parallel_std': 1.938469269213107,
   'TotalTimeInMotion_parallel_95th': 4.919199116216657,
   'TotalTimeInMotion_parallel_max': 24.971773010775486,
   'TotalTimeInMotion_single_mean': 2.070414732378637,
   'TotalTimeInMotion_single_std': 3.110775531046329,
   'TotalTimeInMotion_single_95th': 7.7415319509222,
   'TotalTimeInMotion_single_max': 37.46848468358555}),
 ('/R00_Y2011_0/Data_0089_Wings_H16000_F550.csv',
  {'total_time': Timedelta('13 days 06:00:00'),
   'wing_adjust_freq_mins': 15.0,
   'TotalTimeInMotion_parallel_mean': 1.2554810767556184,
   'TotalTimeInMotion_parallel_std': 1.938469269213107,
   'TotalTimeInMotion_parallel_95th': 4.919199116216657,
   'TotalTimeInMotion_parallel_max': 24.971773010775486,
   'TotalTimeInMotion_single_mean': 2.070414732378637,
   'TotalTimeInMotion_single_std': 3.110775531046329,
   'TotalTimeInMotion_single_95th': 7.7415319509222,
   'TotalTimeInMotion_single_max': 37.46848468358555}),
 ('/R00_Y2011_0/Data_0178_Wings_H16000_F550.csv',
  {'total_time': Timedelta('13 days 06:00:00'),
   'wing_adjust_freq_mins': 15.0,
   'TotalTimeInMotion_parallel_mean': 1.2554810767556184,
   'TotalTimeInMotion_parallel_std': 1.938469269213107,
   'TotalTimeInMotion_parallel_95th': 4.919199116216657,
   'TotalTimeInMotion_parallel_max': 24.971773010775486,
   'TotalTimeInMotion_single_mean': 2.070414732378637,
   'TotalTimeInMotion_single_std': 3.110775531046329,
   'TotalTimeInMotion_single_95th': 7.7415319509222,
   'TotalTimeInMotion_single_max': 37.46848468358555}),
 ('/R00_Y2011_0/Data_0266_Wings_H16000_F550.csv',
  {'total_time': Timedelta('13 days 06:00:00'),
   'wing_adjust_freq_mins': 15.0,
   'TotalTimeInMotion_parallel_mean': 1.2554810767556184,
   'TotalTimeInMotion_parallel_std': 1.938469269213107,
   'TotalTimeInMotion_parallel_95th': 4.919199116216657,
   'TotalTimeInMotion_parallel_max': 24.971773010775486,
   'TotalTimeInMotion_single_mean': 2.070414732378637,
   'TotalTimeInMotion_single_std': 3.110775531046329,
   'TotalTimeInMotion_single_95th': 7.7415319509222,
   'TotalTimeInMotion_single_max': 37.46848468358555}),
 ('/R00_Y2011_0/Data_0355_Wings_H16000_F550.csv',
  {'total_time': Timedelta('13 days 06:00:00'),
   'wing_adjust_freq_mins': 15.0,
   'TotalTimeInMotion_parallel_mean': 1.2554810767556184,
   'TotalTimeInMotion_parallel_std': 1.938469269213107,
   'TotalTimeInMotion_parallel_95th': 4.919199116216657,
   'TotalTimeInMotion_parallel_max': 24.971773010775486,
   'TotalTimeInMotion_single_mean': 2.070414732378637,
   'TotalTimeInMotion_single_std': 3.110775531046329,
   'TotalTimeInMotion_single_95th': 7.7415319509222,
   'TotalTimeInMotion_single_max': 37.46848468358555})]

The results are identical even though the file names differ. How can I prevent this?

Answer 1

It's not clear from the question what routes_to_process contains, but one potential culprit is the pure option of client.map and client.submit. Its default is None, which, as far as I can tell, Dask treats as True for ordinary functions. That means the function is expected to return the same result for the same inputs, so a result can be reused for inputs that Dask considers identical (which is where the content of routes_to_process matters). One thing to try is to add pure=False:

L = client.map(run, routes_to_process.items(), pure=False)
res = client.gather(L)
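
To illustrate why pure matters, here is a toy model of key-based result reuse. It is not Dask's implementation: Route, toy_token and toy_map are hypothetical names made up for this sketch, and the assumption is only that a "pure" task gets a key derived from a hash of its inputs while an "impure" task gets a random key. If the inputs hash identically, every task collapses onto the first result.

```python
import hashlib
import uuid


class Route:
    """Hypothetical input object whose repr() ignores its contents."""

    def __init__(self, path):
        self.path = path

    def __repr__(self):
        return "Route()"  # identical text for every instance


def toy_token(obj):
    # Stand-in for a content hash; it only sees repr(obj).
    return hashlib.sha256(repr(obj).encode()).hexdigest()


def toy_map(func, items, pure=True):
    """Toy model of key-based result reuse (an assumption, not Dask's code)."""
    cache = {}
    results = []
    for item in items:
        if pure:
            # Same token -> same key -> the cached result is reused.
            key = f"{func.__name__}-{toy_token(item)}"
        else:
            # A random key per task, so nothing is ever reused.
            key = f"{func.__name__}-{uuid.uuid4().hex}"
        if key not in cache:
            cache[key] = func(item)
        results.append(cache[key])
    return results


def run(route):
    return route.path.upper()


routes = [Route("a.csv"), Route("b.csv"), Route("c.csv")]
same = toy_map(run, routes, pure=True)
distinct = toy_map(run, routes, pure=False)
print(same)
print(distinct)
```

In this sketch the pure=True call returns the first file's result three times, which matches the symptom in the question, while pure=False returns one result per file. Whether your real inputs hash identically depends on what routes_to_process holds.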

Another thing to check is how the arguments reach the function. In the sequential version you call run((key, value)), and the double parentheses mean that a single tuple is passed. client.map follows the built-in map, so with one iterable each (key, value) item should also arrive as a single tuple; run would receive two separate arguments only if the keys and values were passed as two iterables. If run is declared with *args or **kwargs, a mismatch in call shape can fail silently because the extra inputs are absorbed. If this is the problem, it's probably worth tightening the definition of the function.
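
A quick way to rule out a call-shape problem is to give the function a strict signature so that an unexpected call fails loudly. The sketch below is only an illustration: run_loose and run_strict are hypothetical names, the dictionary is stand-in data, and the built-in map is used as a local substitute for client.map rather than a real cluster.

```python
def run_loose(*args):
    # Accepts any call shape, so a mismatch goes unnoticed:
    # only the first positional argument is ever used.
    return args[0]


def run_strict(item):
    # Requires exactly one (key, value) pair and unpacks it explicitly.
    key, value = item
    return key, value


# Hypothetical stand-in for the real routes_to_process.
routes_to_process = {"a.csv": 1, "b.csv": 2}

# The loose version quietly returns different things for the two call shapes.
loose_tuple = run_loose(("a.csv", 1))
loose_split = run_loose("a.csv", 1)

# The strict version rejects the two-argument call outright.
try:
    run_strict("a.csv", 1)
    strict_error = None
except TypeError as exc:
    strict_error = exc

# Built-in map hands each dict item to the function as one tuple.
mapped = list(map(run_strict, routes_to_process.items()))
print(loose_tuple, loose_split, mapped)
```

If the strict version runs cleanly under client.map as well, the call shape is not the cause and the pure setting is the more likely suspect.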

solution1
1 2022-02-10 13:43:10
