Add a dask.array column to a dask.dataframe

I have a dask dataframe and a dask array with the same number of rows, in the same logical order. The dataframe rows are indexed by strings. I am trying to add one of the array's columns to the dataframe as a new column. I have tried several approaches, each of which failed with a different error.

df['col'] = da.col
# TypeError: Column assignment doesn't support type Array

df['col'] = da.to_frame(columns='col')
# TypeError: '<' not supported between instances of 'str' and 'int'

df['col'] = da.to_frame(columns=['col']).set_index(df.col).col
# TypeError: '<' not supported between instances of 'str' and 'int'

df = df.reset_index()
df['col'] = da.to_frame(columns='col')
# ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.

I also tried a few other variants without success.

What is the right way to add a dask array column to a dask dataframe when the structures are logically compatible?

Direct assignment does seem to work as of dask version 2021.4.0, and possibly earlier. Just make sure the number of dataframe partitions matches the number of array chunks, and that each chunk has as many rows as the partition it lines up with.

import dask.array as da
import dask.dataframe as dd
import numpy as np
import pandas as pd
ddf = dd.from_pandas(pd.DataFrame({'z': np.arange(100, 104)}),
                     npartitions=2)
ddf['a'] = da.arange(200,204, chunks=2)
print(ddf.compute())

Output:

     z    a
0  100  200
1  101  201
2  102  202
3  103  203

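Another approach is to pull the key column out of the original Dask dataframe as a plain pandas DataFrame, attach the computed Dask array column to it, and then merge it back into the Dask dataframe on that key column. This assumes the index has already been moved into a regular column named 'index' (for example with reset_index()), and it loads both the key column and the array column into memory.

# Select with a list so the result is a pandas DataFrame, not a Series
index_col = df[['index']].compute()
# da.col stands for the 1-D dask array column from the question
index_col['new_col'] = da.col.compute()
df = df.merge(index_col, how='left', on='index')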
