
Convert row values of a column into multiple columns by value count with Dask DataFrame

Using the pandas library, this operation is very quick to perform.

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame(columns=['name','country','pet'], 
                  data=[['paul', 'eua', 'cat'],
                        ['pedro', 'brazil', 'dog'],
                        ['paul', 'england', 'cat'],
                        ['paul', 'england', 'cat'],
                        ['paul', 'england', 'dog']])

def pre_transform(data):
    return (data
     .groupby(['name', 'country'])['pet']
     .value_counts()
     .unstack()
     .reset_index()
     .fillna(0)
     .rename_axis([None], axis=1)
    )

pre_transform(df)

output:

|   | name  | country | cat | dog |
|---|-------|---------|-----|-----|
| 0 | paul  | england | 2.0 | 1.0 |
| 1 | paul  | eua     | 1.0 | 0.0 |
| 2 | pedro | brazil  | 0.0 | 1.0 |

But to apply this operation to a dataset of hundreds of GBs, there is not enough RAM to do it with pandas alone.

A stopgap alternative would be to use pandas iteratively, reading the data with the chunksize parameter:

#aggregate each chunk separately, then merge the partial results
concat_df = pd.DataFrame()
for chunk in pd.read_csv(path_big_file, chunksize=1_000_000):
    concat_df = pd.concat([concat_df, pre_transform(chunk)])

#the same (name, country) pair can appear in several chunks, so sum the partial counts
merged_df = concat_df.reset_index(drop=True).groupby(['name', 'country']).sum().reset_index()
display(merged_df)

But in pursuit of more efficiency, I tried to replicate the same operation with the Dask library.

My efforts led me to the function below, which, despite producing the same result, is VERY inefficient in processing time.

Bad Dask approach:


def pivot_multi_index(ddf, index_columns, pivot_column):
    def get_serie_multi_index(data):
        #build a single surrogate key by joining the index columns row by row
        return (data
                .apply(lambda x: "_".join(x[index_columns].astype(str)),
                       axis=1, meta=('FK', 'object'))
                .astype('category').cat.as_known())

    return (dd
            .merge(
                #left side: the unique index-column combinations plus the surrogate key
                (ddf[index_columns]
                     .assign(FK=lambda x: get_serie_multi_index(x))
                     .drop_duplicates()),
                #right side: counts per pivot value, pivoted on the surrogate key
                (ddf
                     .assign(FK=lambda x: get_serie_multi_index(x))
                     .assign(**{pivot_column: lambda x: x[pivot_column].astype('category').cat.as_known(),
                                f'{pivot_column}2': lambda x: x[pivot_column]})
                     .pivot_table(index='FK', columns=pivot_column,
                                  values=f'{pivot_column}2', aggfunc='count')
                     .reset_index()),
                on='FK', how='left')
            .drop(['FK'], axis=1)
            )
             
ddf = dd.from_pandas(df, npartitions=3)
index_columns = ['name','country']
pivot_column = 'pet'

merged = pivot_multi_index(ddf, index_columns, pivot_column)
merged.compute()

output:

|   | name  | country | cat | dog |
|---|-------|---------|-----|-----|
| 0 | paul  | eua     | 1.0 | 0.0 |
| 1 | pedro | brazil  | 0.0 | 1.0 |
| 2 | paul  | england | 2.0 | 1.0 |

But applying the above function to a large dataset was much slower than iterating with pandas via chunksize.

The question remains:

Given the operation of converting row values of a column into multiple columns by value count, what would be the most efficient way to achieve this goal using the Dask library?

I've had a similar issue before, but my main concern was keeping the potential to scale up while also being able to work out of memory and not gum up my RAM during testing. In your case, the most straightforward approach may be to use dask to read in your data and cut it down to size, then use pandas to manipulate the smaller bites, dumping the results back into dask to free up memory and continue. You may be able to push the loop into a dask apply function that iterates on the groups (a minimal sketch of that variant is included after the output table below), and you'll still have the very convenient value_counts() and unstack() functions along the way.

import pandas as pd
from dask import dataframe as dd

df = pd.DataFrame(columns=['name','country','pet'], 
                  data=[['paul', 'eua', 'cat'],
                        ['pedro', 'brazil', 'dog'],
                        ['paul', 'england', 'cat'],
                        ['paul', 'england', 'cat'],
                        ['paul', 'england', 'dog']])

#obv read your big data into dask here instead of from_pandas
ddf = dd.from_pandas(df, chunksize=1)

#pull some minimal data in to build some grouper keys 
unique = ddf[['name','country']].drop_duplicates().compute()
group_keys = list(zip(unique.name, unique.country))

#out of memory groupby object
groups = ddf.groupby(['name','country'])

#init an empty dask dataframe for concat
ddf_all = dd.from_pandas(pd.DataFrame(), chunksize=1)

#loop each group, pull into memory to manipulate
for each in group_keys:
    group_df = groups.get_group(each).compute()
    group_df = group_df.value_counts().unstack().reset_index()

    #concat back out to release memory (new names so the original ddf is not shadowed)
    ddf_group = dd.from_pandas(group_df, chunksize=1)
    ddf_all = dd.concat([ddf_all, ddf_group])

#do some more manipulation if necessary, then compute
ddf_all.fillna(0).compute()

|    | name   | country   |   cat |   dog |
|---:|:-------|:----------|------:|------:|
|  0 | paul   | eua       |     1 |     0 |
|  0 | pedro  | brazil    |     0 |     1 |
|  0 | paul   | england   |     2 |     1 |
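
If you do want to push the loop into a dask groupby/apply as mentioned above, here is a minimal sketch of that variant. It assumes the distinct pet values are cheap enough to collect up front; pet_values, count_pets and the meta frame are names I'm introducing for illustration, and this hasn't been timed against a hundreds-of-GB dataset.

import pandas as pd
from dask import dataframe as dd

#assumed: the set of distinct pet values is small enough to collect first
pet_values = sorted(ddf['pet'].unique().compute())

def count_pets(group):
    #each group arrives here as a plain pandas DataFrame, so value_counts() still works;
    #reindex keeps the column set identical across groups
    return group['pet'].value_counts().reindex(pet_values, fill_value=0)

#meta tells dask what columns/dtypes the applied function returns
meta = pd.DataFrame({p: pd.Series(dtype='int64') for p in pet_values})

counts = ddf.groupby(['name', 'country']).apply(count_pets, meta=meta)
counts.compute().reset_index()

Whether this beats the chunked-pandas loop will depend on the data; the cost to watch is the shuffle dask performs so that every (name, country) group lands in a single partition before the function is applied.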
