Converting a DataFrame from Pandas to Dask

Question

I followed the documentation for dask.dataframe.from_pandas, which lists two optional parameters, npartitions and chunksize.

So I tried writing something like this:

import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame(...)
df = dd.from_pandas(data=df)

It raises this error: ValueError: Exactly one of npartitions and chunksize must be specified.

How can I fix this, and how should I work out npartitions or chunksize the way Dask does when dask.dataframe.read_csv is called?

Answer 1

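You need to choose either npartitions (the number of partitions) or chunksize (the number of rows in each partition) before building the Dask DataFrame. In other words, decide either how many pieces the Pandas DataFrame should be split into, or how large each piece should be. Ideally, base that decision on how much memory your system has and how many cores are available.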
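One way to turn that advice into a number is to size partitions from the DataFrame's in-memory footprint. The sketch below is only an illustration: the helper names choose_npartitions and to_dask are hypothetical, and the 100 MiB target per partition is an assumed rule of thumb that you should tune to your own memory and core count.

```python
import math


def choose_npartitions(total_bytes, target_partition_bytes=100 * 1024**2):
    """Return how many partitions keep each one near the target size.

    The 100 MiB default is an assumption, not a value taken from Dask.
    """
    if target_partition_bytes <= 0:
        raise ValueError("target_partition_bytes must be positive")
    # Always return at least one partition, even for an empty frame.
    return max(1, math.ceil(total_bytes / target_partition_bytes))


def to_dask(df, target_partition_bytes=100 * 1024**2):
    """Convert a Pandas DataFrame to Dask with a size-based npartitions."""
    # Imported here so choose_npartitions can be used without Dask installed.
    import dask.dataframe as dd

    # deep=True also counts the contents of object (e.g. string) columns.
    total_bytes = int(df.memory_usage(deep=True).sum())
    npartitions = choose_npartitions(total_bytes, target_partition_bytes)
    return dd.from_pandas(df, npartitions=npartitions)
```

With these assumptions, a 250 MiB DataFrame would be split into 3 partitions. Passing only npartitions keeps the call within the "exactly one" rule from the error message.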

Answer 2

This may be a small glitch in Dask, but the error itself indicates that we need to specify either npartitions (the number of partitions of the index to create) or chunksize (the number of rows per index partition to use).

This is the check that raises the error:

if (npartitions is None) == (chunksize is None):
    raise ValueError("Exactly one of npartitions and chunksize must be specified.")
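The condition compares two booleans, so it is true both when neither argument is given and when both are given. The standalone sketch below copies only that condition into hypothetical helpers (check_partition_args and is_valid are not Dask functions) to show which combinations pass.

```python
def check_partition_args(npartitions=None, chunksize=None):
    """Standalone copy of the quoted check, for illustration only."""
    if (npartitions is None) == (chunksize is None):
        raise ValueError("Exactly one of npartitions and chunksize must be specified.")


def is_valid(npartitions=None, chunksize=None):
    """Return True when the argument combination passes the check."""
    try:
        check_partition_args(npartitions, chunksize)
    except ValueError:
        return False
    return True


# Try all four combinations of "given" and "not given".
for args in [(None, None), (4, None), (None, 1000), (4, 1000)]:
    print(args, "->", "ok" if is_valid(*args) else "ValueError")
```

Only the two middle combinations, where exactly one argument is set, pass the check.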

There are best practices for choosing chunksize and npartitions in Dask DataFrames:

Reference 1, Reference 2

Answer 3

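I think you need to provide either npartitions or chunksize. In my case, I tried each of them on its own and both worked fine. But when I specified both parameters, it gave me the same error.

So specifying exactly one of the two clears the error.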

import dask.dataframe as dd
import pandas as pd

df = pd.read_csv(filepath)
dd_df = dd.from_pandas(df, npartitions=100)

Or:

dd_df = dd.from_pandas(df, chunksize=100)
