Factorize on Dask DataFrames

I'm trying to factorize a column in a pandas DataFrame with the factorize function, so that each unique value gets an integer code starting from 0. Is there a way to do the same on Dask DataFrames?

Factorization requires the list of unique values, which can be obtained with .unique(). Calling .compute() on that result returns a pandas Series, so pd.factorize can be applied to it. The resulting codes are then mapped back onto the Dask Series:

import pandas as pd
import dask.dataframe as dd

cat = pd.Series(['a', 'a', 'c'])
ddf_cat = dd.from_pandas(cat, npartitions=3)

# compute the unique values; the result is a small pandas Series
uniques_dask = ddf_cat.unique().compute()

# factorize the uniques with plain pandas
codes, uniques = pd.factorize(uniques_dask)

# map each unique value to its integer code
mapping = dict(zip(uniques, codes))

# apply the mapping to the Dask Series
ddf_cat.replace(mapping).compute()
