Dask returns the same result applying a function to a dataframe
When I apply a function to each dask dataframe in a normal loop:
stats = [run((key, value)) for key, value in tqdm.tqdm(routes_to_process.items())]
stats = {key: value for key, value in stats}
stats_df
I get a different result for each dataframe:
{'/R00_Y2011_0/Data_0001_Wings_H16000_F550.csv': {'total_time': Timedelta('13 days 06:00:00'),
'wing_adjust_freq_mins': 15.0,
'TotalTimeInMotion_parallel_mean': 1.2554810767556184,
'TotalTimeInMotion_parallel_std': 1.938469269213107,
'TotalTimeInMotion_parallel_95th': 4.919199116216657,
'TotalTimeInMotion_parallel_max': 24.971773010775486,
'TotalTimeInMotion_single_mean': 2.070414732378637,
'TotalTimeInMotion_single_std': 3.110775531046329,
'TotalTimeInMotion_single_95th': 7.7415319509222,
'TotalTimeInMotion_single_max': 37.46848468358555},
'/R00_Y2011_0/Data_0089_Wings_H16000_F550.csv': {'total_time': Timedelta('13 days 05:45:00'),
'wing_adjust_freq_mins': 15.0,
'TotalTimeInMotion_parallel_mean': 1.4529180621391111,
'TotalTimeInMotion_parallel_std': 1.6725803825097267,
'TotalTimeInMotion_parallel_95th': 4.844115032412909,
'TotalTimeInMotion_parallel_max': 11.713955279708241,
'TotalTimeInMotion_single_mean': 2.3589357740318952,
'TotalTimeInMotion_single_std': 2.6615559537471416,
'TotalTimeInMotion_single_95th': 7.704699349608058,
'TotalTimeInMotion_single_max': 20.9864655817048},
'/R00_Y2011_0/Data_0178_Wings_H16000_F550.csv': {'total_time': Timedelta('13 days 06:00:00'),
'wing_adjust_freq_mins': 15.0,
'TotalTimeInMotion_parallel_mean': 1.197378736624615,
'TotalTimeInMotion_parallel_std': 1.7912324573518639,
'TotalTimeInMotion_parallel_95th': 3.9983496355992054,
'TotalTimeInMotion_parallel_max': 29.289081453498962,
'TotalTimeInMotion_single_mean': 1.993188241820573,
'TotalTimeInMotion_single_std': 2.885638510182584,
'TotalTimeInMotion_single_95th': 6.9149671451086965,
'TotalTimeInMotion_single_max': 46.73164328462062},
'/R00_Y2011_0/Data_0266_Wings_H16000_F550.csv': {'total_time': Timedelta('13 days 06:15:00'),
'wing_adjust_freq_mins': 15.0,
'TotalTimeInMotion_parallel_mean': 0.8902764238240006,
'TotalTimeInMotion_parallel_std': 1.0558873949196226,
'TotalTimeInMotion_parallel_95th': 2.8579851928420603,
'TotalTimeInMotion_parallel_max': 10.951810428266235,
'TotalTimeInMotion_single_mean': 1.5333169994564122,
'TotalTimeInMotion_single_std': 1.8294318193945236,
'TotalTimeInMotion_single_95th': 4.79554414876004,
'TotalTimeInMotion_single_max': 19.68630214945556},
'/R00_Y2011_0/Data_0355_Wings_H16000_F550.csv': {'total_time': Timedelta('13 days 05:45:00'),
'wing_adjust_freq_mins': 15.0,
'TotalTimeInMotion_parallel_mean': 1.178448380566181,
'TotalTimeInMotion_parallel_std': 2.3450293462261245,
'TotalTimeInMotion_parallel_95th': 4.4953148097942774,
'TotalTimeInMotion_parallel_max': 30.048903002576534,
'TotalTimeInMotion_single_mean': 1.9654074975262181,
'TotalTimeInMotion_single_std': 3.7519423792380233,
'TotalTimeInMotion_single_95th': 6.873441025809852,
'TotalTimeInMotion_single_max': 43.04718729546469}}
When I use dask:
L = client.map(run, routes_to_process.items())
res = client.gather(L)
I get the same result for every one of them:
[('/R00_Y2011_0/Data_0001_Wings_H16000_F550.csv',
{'total_time': Timedelta('13 days 06:00:00'),
'wing_adjust_freq_mins': 15.0,
'TotalTimeInMotion_parallel_mean': 1.2554810767556184,
'TotalTimeInMotion_parallel_std': 1.938469269213107,
'TotalTimeInMotion_parallel_95th': 4.919199116216657,
'TotalTimeInMotion_parallel_max': 24.971773010775486,
'TotalTimeInMotion_single_mean': 2.070414732378637,
'TotalTimeInMotion_single_std': 3.110775531046329,
'TotalTimeInMotion_single_95th': 7.7415319509222,
'TotalTimeInMotion_single_max': 37.46848468358555}),
('/R00_Y2011_0/Data_0089_Wings_H16000_F550.csv',
{'total_time': Timedelta('13 days 06:00:00'),
'wing_adjust_freq_mins': 15.0,
'TotalTimeInMotion_parallel_mean': 1.2554810767556184,
'TotalTimeInMotion_parallel_std': 1.938469269213107,
'TotalTimeInMotion_parallel_95th': 4.919199116216657,
'TotalTimeInMotion_parallel_max': 24.971773010775486,
'TotalTimeInMotion_single_mean': 2.070414732378637,
'TotalTimeInMotion_single_std': 3.110775531046329,
'TotalTimeInMotion_single_95th': 7.7415319509222,
'TotalTimeInMotion_single_max': 37.46848468358555}),
('/R00_Y2011_0/Data_0178_Wings_H16000_F550.csv',
{'total_time': Timedelta('13 days 06:00:00'),
'wing_adjust_freq_mins': 15.0,
'TotalTimeInMotion_parallel_mean': 1.2554810767556184,
'TotalTimeInMotion_parallel_std': 1.938469269213107,
'TotalTimeInMotion_parallel_95th': 4.919199116216657,
'TotalTimeInMotion_parallel_max': 24.971773010775486,
'TotalTimeInMotion_single_mean': 2.070414732378637,
'TotalTimeInMotion_single_std': 3.110775531046329,
'TotalTimeInMotion_single_95th': 7.7415319509222,
'TotalTimeInMotion_single_max': 37.46848468358555}),
('/R00_Y2011_0/Data_0266_Wings_H16000_F550.csv',
{'total_time': Timedelta('13 days 06:00:00'),
'wing_adjust_freq_mins': 15.0,
'TotalTimeInMotion_parallel_mean': 1.2554810767556184,
'TotalTimeInMotion_parallel_std': 1.938469269213107,
'TotalTimeInMotion_parallel_95th': 4.919199116216657,
'TotalTimeInMotion_parallel_max': 24.971773010775486,
'TotalTimeInMotion_single_mean': 2.070414732378637,
'TotalTimeInMotion_single_std': 3.110775531046329,
'TotalTimeInMotion_single_95th': 7.7415319509222,
'TotalTimeInMotion_single_max': 37.46848468358555}),
('/R00_Y2011_0/Data_0355_Wings_H16000_F550.csv',
{'total_time': Timedelta('13 days 06:00:00'),
'wing_adjust_freq_mins': 15.0,
'TotalTimeInMotion_parallel_mean': 1.2554810767556184,
'TotalTimeInMotion_parallel_std': 1.938469269213107,
'TotalTimeInMotion_parallel_95th': 4.919199116216657,
'TotalTimeInMotion_parallel_max': 24.971773010775486,
'TotalTimeInMotion_single_mean': 2.070414732378637,
'TotalTimeInMotion_single_std': 3.110775531046329,
'TotalTimeInMotion_single_95th': 7.7415319509222,
'TotalTimeInMotion_single_max': 37.46848468358555})]
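even though the file names are different. How can I prevent this from happening?
It is not clear from this information what routes_to_process contains, but one potential culprit is that the default value of pure in client.map and client.submit is None, in which case Dask treats the function as pure. This means the function is expected to return the same result for the same input, so Dask may reuse an earlier result instead of running the function again (this is where the contents of routes_to_process matter). One thing to try is adding pure=False: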
L = client.map(run, routes_to_process.items(), pure=False)
res = client.gather(L)
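To make the effect of pure more concrete, here is a minimal toy model written with the standard library only. It is not Dask's implementation: ToyScheduler, next_id and the key scheme are hypothetical names invented for illustration. It only sketches the general idea that a "pure" task gets a key derived from its inputs, so inputs that look identical share one cached result, while an "impure" task gets a fresh key on every submission.

```python
import hashlib
import uuid


class ToyScheduler:
    """Toy model only (hypothetical, not Dask): caches results by task key."""

    def __init__(self):
        self.results = {}

    def submit(self, func, arg, pure=True):
        if pure:
            # Deterministic key: same function name + same argument repr.
            token = hashlib.md5(repr(arg).encode()).hexdigest()
        else:
            # Random key: every submission is treated as a new task.
            token = uuid.uuid4().hex
        key = f"{func.__name__}-{token}"
        if key not in self.results:
            self.results[key] = func(arg)
        return self.results[key]


counter = {"n": 0}


def next_id(_label):
    # Deliberately impure: the result depends on hidden state.
    counter["n"] += 1
    return counter["n"]


sched = ToyScheduler()
pure_results = [sched.submit(next_id, "same-input") for _ in range(3)]
impure_results = [sched.submit(next_id, "same-input", pure=False) for _ in range(3)]

print(pure_results)    # [1, 1, 1]  the first result is reused
print(impure_results)  # [2, 3, 4]  the function runs every time
```

In this sketch the repeated results appear only because the inputs are identical; whether something similar happens in the question depends on how the values in routes_to_process are keyed, which the question does not show.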
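Another thing worth checking is how the arguments reach the function. In the sequential version you call run((key, value)), and the double parentheses mean that a single tuple is passed to the function. Make sure the client.map version calls it the same way: like the built-in map, client.map passes each element of a single iterable as one argument, and passes two arguments only when it is given two iterables. Depending on how your function is defined, a mismatch here could fail silently (because the extra input is absorbed by *args). If this is the problem, it might be worth correcting the definition of the function.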
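The difference between passing one tuple and passing two separate arguments can be seen with the built-in map, whose pairing of iterables with arguments client.map follows. The sketch below is an assumption-laden illustration: run_tolerant and the routes dictionary are hypothetical stand-ins, not the run function or the data from the question.

```python
def run_tolerant(item, *args):
    # Hypothetical function: any extra positional input lands in args
    # without raising an error.
    return item, args


# Hypothetical stand-in for routes_to_process.
routes = {"a.csv": 1, "b.csv": 2}

# One iterable of (key, value) tuples: each tuple arrives as a single argument.
one_arg = list(map(run_tolerant, routes.items()))

# Two iterables: key and value arrive as two separate arguments,
# and the value is silently absorbed by *args.
two_args = list(map(run_tolerant, routes.keys(), routes.values()))

print(one_arg)   # [(('a.csv', 1), ()), (('b.csv', 2), ())]
print(two_args)  # [('a.csv', (1,)), ('b.csv', (2,))]
```

A function that unpacks its first argument as key, value = item would behave very differently in the two cases, which is why an explicit signature without *args makes this kind of mismatch fail loudly instead.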