I have a dask dataframe and a dask array with the same number of rows in the same logical order. The dataframe rows are indexed by strings. I am trying to add one of the array columns to the dataframe. I have tried several approaches, each of which failed in its own way.
df['col'] = da.col
# TypeError: Column assignment doesn't support type Array
df['col'] = da.to_frame(columns='col')
# TypeError: '<' not supported between instances of 'str' and 'int'
df['col'] = da.to_frame(columns=['col']).set_index(df.col).col
# TypeError: '<' not supported between instances of 'str' and 'int'
df = df.reset_index()
df['col'] = da.to_frame(columns='col')
# ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.
and a few other variants.
What is the right way to add a dask array column to a dask dataframe when the structures are logically compatible?
This does seem to work as of dask version 2021.4.0, and possibly earlier. Just make sure the number of dataframe partitions matches the number of array chunks.
import dask.array as da
import dask.dataframe as dd
import numpy as np
import pandas as pd
ddf = dd.from_pandas(pd.DataFrame({'z': np.arange(100, 104)}),
                     npartitions=2)
ddf['a'] = da.arange(200, 204, chunks=2)
print(ddf.compute())
Output:
     z    a
0  100  200
1  101  201
2  102  202
3  103  203
The solution is to pull the index column of the original Dask dataframe out as a plain pandas dataframe, add the Dask array column to it, and then merge it back into the Dask dataframe on the index column. Note that this loads both the index column and the array column into memory.
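# Double brackets keep the result a pandas DataFrame (not a Series), so a column can be added
index_col = df[['index']].compute()
index_col['new_col'] = da.col.compute()
df = df.merge(index_col, how='left', on='index')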