Why does Dask read a Parquet file a lot slower than Pandas reading the same Parquet file?
I am benchmarking Parquet read speed with Dask and Python, and I found that reading the same file with pandas is much faster than with Dask. I would like to understand why that is, and whether there is a way to get comparable performance.
Versions of all relevant packages:
print(dask.__version__)         # 2.6.0
print(pd.__version__)           # 0.25.2
print(pyarrow.__version__)      # 0.15.1
print(fastparquet.__version__)  # 0.3.2
import pandas as pd
import numpy as np
import dask.dataframe as dd
col = [str(i) for i in list(np.arange(40))]
df = pd.DataFrame(np.random.randint(0,100,size=(5000000, 4 * 10)), columns=col)
df.to_parquet('large1.parquet', engine='pyarrow')
# Wall time: 3.86 s
df.to_parquet('large2.parquet', engine='fastparquet')
# Wall time: 27.1 s
df = dd.read_parquet('large2.parquet', engine='fastparquet').compute()
# Wall time: 5.89 s
df = dd.read_parquet('large1.parquet', engine='pyarrow').compute()
# Wall time: 4.84 s
df = pd.read_parquet('large1.parquet', engine='pyarrow')
# Wall time: 503 ms
df = pd.read_parquet('large2.parquet', engine='fastparquet')
# Wall time: 4.12 s
The difference is even larger with a mixed-dtype dataframe.
dtypes: category(7), datetime64[ns](2), float64(1), int64(1), object(9)
memory usage: 973.2+ MB
# df.shape == (8575745, 20)
df.to_parquet('large1.parquet', engine='pyarrow')
# Wall time: 9.67 s
df.to_parquet('large2.parquet', engine='fastparquet')
# Wall time: 33.3 s
# read with Dask
df = dd.read_parquet('large1.parquet', engine='pyarrow').compute()
# Wall time: 34.5 s
df = dd.read_parquet('large2.parquet', engine='fastparquet').compute()
# Wall time: 1min 22s
# read with pandas
df = pd.read_parquet('large1.parquet', engine='pyarrow')
# Wall time: 8.67 s
df = pd.read_parquet('large2.parquet', engine='fastparquet')
# Wall time: 21.8 s
My first guess is that pandas saves the Parquet dataset into a single row group, which prevents a system like Dask from parallelising the read. That doesn't explain why it is slower, but it does explain why it isn't faster.
For more information, I recommend profiling. You may be interested in this document:
Disclaimer: technical posts on this site are licensed under CC BY-SA 4.0; if you repost, please credit this site or the original source. For any questions, contact yoyou2525@163.com.