Convert row values of a column into multiple columns by value count with Dask DataFrame

Using the pandas library, this operation is very quick to perform.

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame(columns=['name','country','pet'], 
                  data=[['paul', 'eua', 'cat'],
                        ['pedro', 'brazil', 'dog'],
                        ['paul', 'england', 'cat'],
                        ['paul', 'england', 'cat'],
                        ['paul', 'england', 'dog']])

def pre_transform(data):
    # count each pet per (name, country) pair and spread the pet values into columns
    return (data
     .groupby(['name', 'country'])['pet']
     .value_counts()
     .unstack()          # one column per pet value
     .reset_index()
     .fillna(0)
     .rename_axis([None], axis=1)
    )

pre_transform(df)

Output:

|   | name  | country | cat | dog |
|---|-------|---------|-----|-----|
| 0 | paul  | england | 2.0 | 1.0 |
| 1 | paul  | eua     | 1.0 | 0.0 |
| 2 | pedro | brazil  | 0.0 | 1.0 |

But to apply this operation to a dataset of hundreds of GBs, there is not enough RAM to do it with pandas.

A palliative alternative would be to use pandas iteratively, with the chunksize parameter, while reading the data.

# accumulate the per-chunk pivoted counts
concat_df = pd.DataFrame()
for chunk in pd.read_csv(path_big_file, chunksize=1_000_000):
    concat_df = pd.concat([concat_df, pre_transform(chunk)])

# counts are additive, so summing the partial results per group gives the final table
merged_df = concat_df.reset_index(drop=True).groupby(['name', 'country']).sum().reset_index()
display(merged_df)

But in pursuit of more efficiency, I tried to replicate the same operation with the Dask library.

My efforts led me to the function below which, despite producing the same result, is VERY inefficient in processing time.

Bad Dask approach:


def pivot_multi_index(ddf, index_columns, pivot_column):
    # build a single surrogate key (FK) by joining the index columns row-wise
    def get_serie_multi_index(data):
        return (data
                .apply(lambda x: "_".join(x[index_columns].astype(str)),
                       axis=1, meta=("str"))
                .astype('category')
                .cat.as_known())

    return (dd
            .merge(
                # distinct (index columns, FK) pairs, used to map the key back to its columns
                (ddf[index_columns]
                     .assign(FK=lambda x: get_serie_multi_index(x))
                     .drop_duplicates()),
                # count pivot_column values per surrogate key
                (ddf
                     .assign(FK=lambda x: get_serie_multi_index(x))
                     .assign(**{pivot_column: lambda x: x[pivot_column].astype('category').cat.as_known(),
                                f'{pivot_column}2': lambda x: x[pivot_column]})
                     .pivot_table(index='FK', columns=pivot_column,
                                  values=f'{pivot_column}2', aggfunc='count')
                     .reset_index()),
                on='FK', how='left')
            .drop(['FK'], axis=1)
            )
             
ddf = dd.from_pandas(df, npartitions=3)
index_columns = ['name','country']
pivot_column = 'pet'

merged = pivot_multi_index(ddf, index_columns, pivot_column)
merged.compute()

Output:

|   | name  | country | cat | dog |
|---|-------|---------|-----|-----|
| 0 | paul  | eua     | 1.0 | 0.0 |
| 1 | pedro | brazil  | 0.0 | 1.0 |
| 2 | paul  | england | 2.0 | 1.0 |

But when applying the above function to a large dataset, it ran much more slowly than iterating with pandas via chunksize.

The question remains:

Given the operation of converting the row values of a column into multiple columns by value count, what would be the most efficient way to achieve this with the Dask library?

I've had a similar issue before, but my main concern was keeping the potential to scale up while also being able to work out of memory and not gum up my RAM during testing. In your case, the most straightforward approach may be to use dask to read in your data and cut it down to size, then use pandas to manipulate the smaller pieces while dumping them back into dask to free up memory and continue. You may be able to push the loop into a dask apply function that iterates on the groups, but you'll still have the very convenient value_counts() and unstack() functions in the way.

import pandas as pd
from dask import dataframe as dd

df = pd.DataFrame(columns=['name','country','pet'], 
                  data=[['paul', 'eua', 'cat'],
                        ['pedro', 'brazil', 'dog'],
                        ['paul', 'england', 'cat'],
                        ['paul', 'england', 'cat'],
                        ['paul', 'england', 'dog']])

#obv read your big data into dask here instead of from_pandas
ddf = dd.from_pandas(df, chunksize=1)

#pull some minimal data in to build some grouper keys 
unique = ddf[['name','country']].drop_duplicates().compute()
group_keys = list(zip(unique.name, unique.country))

#out of memory groupby object
groups = ddf.groupby(['name','country'])

#init an empty dask dataframe for concat
ddf_all = dd.from_pandas(pd.DataFrame(), chunksize=1)

#loop each group, pull into memory to manipulate
for each in group_keys:
    df = groups.get_group(each).compute()
    df = df.value_counts().unstack().reset_index()

    #concat back out to release memory
    ddf = dd.from_pandas(df, chunksize=1)
    ddf_all = dd.concat([ddf_all, ddf])

#do some more manipulation if necessary, then compute
ddf_all.fillna(0).compute()

|    | name   | country   |   cat |   dog |
|---:|:-------|:----------|------:|------:|
|  0 | paul   | eua       |     1 |     0 |
|  0 | pedro  | brazil    |     0 |     1 |
|  0 | paul   | england   |     2 |     1 |
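
A minimal sketch of one more option (an editorial addition, not part of the answer above), assuming the per-group counts are small enough to fit in memory after aggregation: let Dask perform a single out-of-core groupby/size aggregation and finish the reshape in pandas on the small result. Here ddf is the Dask DataFrame built from the example data earlier.

import dask.dataframe as dd

# one out-of-core aggregation: number of rows per (name, country, pet)
counts = ddf.groupby(['name', 'country', 'pet']).size().compute()  # small pandas Series

# the aggregated result is tiny, so the pivot can safely run in plain pandas
result = (counts
          .unstack(fill_value=0)       # one column per pet value
          .reset_index()
          .rename_axis(None, axis=1))
print(result)

On the small example this should reproduce the same name/country/cat/dog table; whether it beats the chunked-pandas loop on hundreds of GBs depends mainly on the cardinality of the grouping columns and on the input format.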
