Renaming columns in a dask dataframe

Question

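I have two questions about dask. First: the dask documentation clearly states that you can rename columns with the same syntax as pandas. I am using dask 1.0.0. Is there any reason why I am getting the errors below?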

df = pd.DataFrame(dictionary)
df

# I am not sure how to choose values for divisions, meta, and name. I am also pretty unsure about what these really do.
ddf = dd.DataFrame(dictionary, divisions=[8], meta=pd.DataFrame(dictionary), name='ddf')    
ddf

cols = {'Key':'key', '0':'Datetime','1':'col1','2':'col2','3':'col3','4':'col4','5':'col5'}

ddf.rename(columns=cols, inplace=True)

TypeError: rename() got an unexpected keyword argument 'inplace'

OK, so I removed inplace=True and tried this:

ddf = ddf.rename(columns=cols)

ValueError: dictionary update sequence element #0 has length 6; 2 is required

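The pandas dataframe displays as a real dataframe, but when I call ddf.compute() I get an empty dataframe.

My second question: I am a bit confused about how to assign divisions, meta, and name. How do these help or hurt if I use dask to parallelize on a single machine versus on a cluster?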

Answer 1

Regarding the renaming, this is how I usually change feature (column) names when using dask; maybe it will work for you too:

new_columns = ['key', 'Datetime', 'col1', 'col2', 'col3', 'col4', 'col5']
df = df.rename(columns=dict(zip(df.columns, new_columns)))

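As for deciding the number of partitions, the documentation gives a good example that uses time series data to decide how to partition a dataframe: http://docs.dask.org/en/latest/dataframe-design.html#partitions .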

Answer 2

I could not get this line to work (because I was passing dictionary as a plain Python dictionary, which is not the right input)

ddf = dd.DataFrame(dictionary, divisions=[2], meta=pd.DataFrame(dictionary,
                                              index=list(range(2))), name='ddf')

print(ddf.compute())
() # this is the output of ddf.compute(); clearly something is not right

So I had to create some dummy data and use it in my own approach to creating the dataframe.

Generate dummy data in a dictionary

d = {0: [388]*2,
 1: [387]*2,
 2: [386]*2,
 3: [385]*2,
 5: [384]*2,
 '2012-06-13': [389]*2,
 '2012-06-14': [389]*2,}

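Create a Dask dataframe from the dictionary via a dask bag

This means you first have to use pandas to convert the dictionary into a pandas DataFrame, and then use .to_dict(..., orient='records') to get the sequence (a list of row-wise dictionaries) required to create a dask bag

So, this is how I created the required sequence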

d = pd.DataFrame(d, index=list(range(2))).to_dict('records')

print(d)
[{0: 388,
  1: 387,
  2: 386,
  3: 385,
  5: 384,
  '2012-06-13': 389,
  '2012-06-14': 389},
 {0: 388,
  1: 387,
  2: 386,
  3: 385,
  5: 384,
  '2012-06-13': 389,
  '2012-06-14': 389}]

Now I use the list of dictionaries to create a dask bag

dask_bag = db.from_sequence(d, npartitions=2)

print(dask_bag)
dask.bag<from_se..., npartitions=2>

Convert the dask bag to a dask dataframe

df = dask_bag.to_dataframe()

Rename the columns in the dataframe

cols = {0:'Datetime',1:'col1',2:'col2',3:'col3',5:'col5'}
df = df.rename(columns=cols)

print(df)
Dask DataFrame Structure:
              Datetime   col1   col2   col3   col5 2012-06-13 2012-06-14
npartitions=2                                                           
                 int64  int64  int64  int64  int64      int64      int64
                   ...    ...    ...    ...    ...        ...        ...
                   ...    ...    ...    ...    ...        ...        ...
Dask Name: rename, 6 tasks

Compute the dataframe (this time the output will not be ()!)

print(df.compute())
   Datetime  col1  col2  col3  col5  2012-06-13  2012-06-14
0       388   387   386   385   384         389         389
0       388   387   386   385   384         389         389

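Notes:

Also from the .rename documentation: inplace is not supported.
I think your renaming dictionary contains the strings '0', '1', etc. for column names that are integers. For your data (as is the case for the dummy data here), the dictionary keys may just need to be the integers 0, 1, etc.
Following the dask documentation, I used this approach based on a 1-1 renaming dictionary; column names not included in the renaming dictionary are left unchanged
- this means you do not need to pass in column names that do not need renaming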
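
The two points about key types and partial mappings can be checked with plain pandas, whose rename semantics the dask method is documented to follow. The frame below is hypothetical dummy data with integer column labels.

```python
import pandas as pd

# Hypothetical frame with integer column labels.
pdf = pd.DataFrame({0: [388, 388], 1: [387, 387]})

# String keys do not match integer labels, so no column is renamed.
unchanged = pdf.rename(columns={"0": "Datetime", "1": "col1"})
print(list(unchanged.columns))

# Integer keys match; a column left out of the mapping keeps its name.
renamed = pdf.rename(columns={0: "Datetime"})
print(list(renamed.columns))
```

Note that the mismatched mapping raises no error by default, which makes this kind of mistake easy to miss.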

Answer 3

If you just want to lowercase the names and replace spaces with underscores, you can do this:

data = dd.read_csv('*.csv').rename(columns=lambda x: x.lower().replace(' ', '_'))

Answer 4

You can build a dictionary like this:

columns = {0:'Datetime',1:'col1', ...}

After reading the data:

# you can use dask to read your data
import dask.dataframe as dd
df = dd.read_json(dictionary)
df = df.rename(columns=columns).compute()

Your problem is the keys, and the type of the original column names:

cols = {'Key':'key', '0':'Datetime','1':'col1','2':'col2','3':'col3','4':'col4','5':'col5'}

You should remove 'Key':'key' and use int numbers instead of str numbers
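
If the mapping comes from somewhere that only provides string keys, one way to apply this advice is to align the keys with the actual column labels before renaming. The helper below, align_keys, is hypothetical and not part of dask or pandas; it assumes that no two columns share the same string form.

```python
def align_keys(mapping, columns):
    """Return a copy of mapping whose keys are actual column labels.

    A key that is already a label is kept. Otherwise it is matched
    against the string form of each label. Unmatched keys are dropped.
    """
    by_text = {str(col): col for col in columns}
    aligned = {}
    for key, new_name in mapping.items():
        if key in columns:
            aligned[key] = new_name
        elif str(key) in by_text:
            aligned[by_text[str(key)]] = new_name
    return aligned


# String keys as in the question; integer labels as in the dummy data.
cols = {"Key": "key", "0": "Datetime", "1": "col1"}
columns = [0, 1, 2]

aligned = align_keys(cols, columns)
print(aligned)
```

The result can then be passed as df.rename(columns=aligned). For a real dataframe, list(df.columns) would take the place of the hard-coded columns list.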

4 answers

Answer 1: 10, 2018-12-17 10:59:04
Answer 2: 4, 2019-02-10 02:54:02
Answer 3: 0, 2020-01-30 19:26:47
Answer 4: 0, 2022-03-01 11:33:54
