I'm trying to factorize a column in a pandas DataFrame using the factorize function, so that each distinct value gets a unique integer code starting from 0. Is there a way to replicate this on Dask DataFrames?
Factorization requires a list of the unique values, which can be obtained with .unique(). Converting that result to pandas with .compute() means that we can apply the factorize method to it, then use the resulting codes as a mapping on the Dask series:
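import pandas as pd
import dask.dataframe as dd

cat = pd.Series(['a', 'a', 'c'])
# build the Dask series once and reuse it
dcat = dd.from_pandas(cat, npartitions=3)

# calculate the unique values with Dask; .compute() returns a pandas Series
uniques_dask = dcat.unique().compute()

# factorize the (small) set of unique values with plain pandas
codes, uniques = pd.factorize(uniques_dask)

# create a mapping from each unique value to its integer code
mapping = dict(zip(uniques, codes))

# apply the mapping to the Dask series, then compute the result
dcat.replace(mapping).compute()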
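
To see why the unique-then-map approach works across partitions, here is a minimal sketch of the same two-pass idea in plain pandas, with a list of Series chunks standing in for Dask partitions. The helper name `factorize_chunks` and the way the data is split are hypothetical, chosen only for illustration; this is not how Dask works internally.

```python
import pandas as pd

def factorize_chunks(chunks):
    """Two-pass factorize over a list of pandas Series (stand-ins for partitions)."""
    # Pass 1: collect the unique values of every chunk, then deduplicate globally.
    uniques = pd.concat([pd.Series(c.unique()) for c in chunks]).unique()
    # Assign each unique value an integer code starting from 0.
    mapping = {value: code for code, value in enumerate(uniques)}
    # Pass 2: encode each chunk independently using the shared mapping.
    coded = [c.map(mapping) for c in chunks]
    return coded, uniques

cat = pd.Series(['a', 'a', 'c', 'b', 'c'])
# Hypothetical partitioning of the column into three chunks.
chunks = [cat.iloc[:2], cat.iloc[2:4], cat.iloc[4:]]

coded, uniques = factorize_chunks(chunks)
result = pd.concat(coded)
```

Because the chunks here are visited in their original order, `result` matches the codes from `pd.factorize(cat)`. With Dask, the order of the values returned by `.unique()` may not be the order of first appearance, so the codes can differ from what pandas would produce on the full column, while still being a consistent one-to-one encoding.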