
Renaming columns in dask dataframe

I have two questions about dask. First: The documentation for dask clearly states that you can rename columns with the same syntax as pandas. I am using dask 1.0.0. Any reason why I am getting these errors below?

df = pd.DataFrame(dictionary)
df


# I am not sure how to choose values for divisions, meta, and name. I am also pretty unsure about what these really do.
ddf = dd.DataFrame(dictionary, divisions=[8], meta=pd.DataFrame(dictionary), name='ddf')    
ddf


cols = {'Key':'key', '0':'Datetime','1':'col1','2':'col2','3':'col3','4':'col4','5':'col5'}

ddf.rename(columns=cols, inplace=True)

TypeError: rename() got an unexpected keyword argument 'inplace'

OK, so I removed inplace=True and tried this:

ddf = ddf.rename(columns=cols)

ValueError: dictionary update sequence element #0 has length 6; 2 is required

The pandas dataframe is showing a real dataframe, but when I call ddf.compute() I get an empty dataframe.


My second question is that I am slightly confused about how to assign divisions, meta, and name. How is this useful/hurtful if I use dask to parallelize on a single machine vs a cluster?

Regarding the renaming, this is how I usually change column names when I'm using dask. Perhaps it will work for you too:

new_columns = ['key', 'Datetime', 'col1', 'col2', 'col3', 'col4', 'col5']
df = df.rename(columns=dict(zip(df.columns, new_columns)))

As for determining the number of partitions, the documentation gives a pretty good example that uses time series data to decide how to divide the dataframe: http://docs.dask.org/en/latest/dataframe-design.html#partitions

I could not get this line to work, because I was passing dictionary as a plain Python dictionary, which is not the right input:

ddf = dd.DataFrame(dictionary, divisions=[2], meta=pd.DataFrame(dictionary,
                                              index=list(range(2))), name='ddf')

print(ddf.compute())
() # this is the output of ddf.compute(); clearly something is not right

So I had to create some dummy data and use that in my approach to creating a dask dataframe.

Generate dummy data in a dictionary

d = {0: [388]*2,
 1: [387]*2,
 2: [386]*2,
 3: [385]*2,
 5: [384]*2,
 '2012-06-13': [389]*2,
 '2012-06-14': [389]*2,}

Create a dask dataframe from the dictionary via a dask bag

  • this means you must first use pandas to convert the dictionary to a pandas DataFrame and then use .to_dict(..., orient='records') to get the sequence (list of row-wise dictionaries) you need to create a dask bag

So, here is how I created the required sequence

d = pd.DataFrame(d, index=list(range(2))).to_dict('records')

print(d)
[{0: 388,
  1: 387,
  2: 386,
  3: 385,
  5: 384,
  '2012-06-13': 389,
  '2012-06-14': 389},
 {0: 388,
  1: 387,
  2: 386,
  3: 385,
  5: 384,
  '2012-06-13': 389,
  '2012-06-14': 389}]

Now I use the list of dictionaries to create a dask bag

dask_bag = db.from_sequence(d, npartitions=2)

print(dask_bag)
dask.bag<from_se..., npartitions=2>

Convert dask bag to dask dataframe

df = dask_bag.to_dataframe()

Rename columns in dask dataframe

cols = {0:'Datetime',1:'col1',2:'col2',3:'col3',5:'col5'}
df = df.rename(columns=cols)

print(df)
Dask DataFrame Structure:
              Datetime   col1   col2   col3   col5 2012-06-13 2012-06-14
npartitions=2                                                           
                 int64  int64  int64  int64  int64      int64      int64
                   ...    ...    ...    ...    ...        ...        ...
                   ...    ...    ...    ...    ...        ...        ...
Dask Name: rename, 6 tasks

Compute the dask dataframe (this time the output is not ())

print(df.compute())
   Datetime  col1  col2  col3  col5  2012-06-13  2012-06-14
0       388   387   386   385   384         389         389
0       388   387   386   385   384         389         389

Notes:

  1. Also from the .rename documentation: inplace is not supported.
  2. I think your renaming dictionary contained the strings '0', '1', etc. for column names that were integers. It could be the case for your data (as it is with the dummy data here) that the dictionary keys should have been the integers 0, 1, etc.
  3. Per the dask docs, I used a 1-to-1 renaming dictionary. Column names that are not included in the renaming dict are left unchanged.
    • this means you don't need to pass in column names that do not need to be renamed

If you only want to lowercase the column names and replace spaces with underscores, you can do:

data = dd.read_csv('*.csv').rename(columns=lambda x: x.lower().replace(' ', '_'))

You can build a dict like this:

columns = {0:'Datetime',1:'col1', ...}

After you read your data:

# you can use dask to read your data
import dask.dataframe as dd
df = dd.read_json(dictionary)
df = df.rename(columns=columns).compute()

Your problem is the 'Key' entry and the type of the original column names:

cols = {'Key':'key', '0':'Datetime','1':'col1','2':'col2','3':'col3','4':'col4','5':'col5'}

You should delete 'Key':'key' and use integer keys (0, 1, ...) instead of string keys ('0', '1', ...).
