I am trying to expand a dataframe by creating additional columns, one for each value of a categorical column. My dataframe looks like this:
Based on the value of the column cluster, I would like to create a new dataframe with columns like this:
var1_clus0 , var1_clus1, ... var3_clus2
I have a huge dataset, so I am trying to do this in a nested for loop. It works fine for the first value of the cluster column, but all other values produce NaN.
Below is my script:
data_trans = pd.DataFrame()
for i in np.arange(0, len(varlist), 1):
    for j in np.arange(0, 6, 1):
        data_trans[str(varlist[i]) + str("_clus_") + str(j)] = data[(data.segment_hc_print == j)][varlist[i]]
The code runs without any error and generates the columns as desired. However, only the column for the first value of the categorical column is filled in the new dataframe. For all other categorical values, the columns contain only NaN. What am I doing wrong, and how should I fix this?
For the example dataset above, this is the desired output: sample output
Since you have a 2D dataset and each varX / clusX combination may match multiple rows, you have to decide what you want to do with those matches. I assume you want to add them up. If so, you're looking at either a dataframe with a header row and a single data row, or just a Series whose index is your varX_clusX labels.
The following code will do it:
# Setup
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'var1': np.random.randint(0, 1000000, 1000000),
    'var2': np.random.randint(0, 1000000, 1000000),
    'var3': np.random.randint(0, 1000000, 1000000),
    'cluster': np.random.randint(0, 100, 1000000)
})
# Processing
# Setup the cluster column for string formatting.
df['cluster'] = 'clus' + df['cluster'].apply(str)
# Un-pivot the cluster column (I'm sure there's a better term)
df = df.set_index('cluster').stack().reset_index()
# Group by the unique combination of cluster / var and sum the values.
# This will generate a column named 0 - which I changed to 'values' just for readability.
df = df.groupby(['cluster','level_1']).sum().reset_index().rename(columns = {0 : 'values'})
# Create the formatted header you're looking for
df['piv'] = df['level_1'] + '_' + df['cluster']
# Final pivot to get the values to align with the new headers
df = df.pivot(columns = 'piv', values = 'values').sum()
Timed this on my machine - roughly 1s for a million records. Not sure how fast you need it.
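To make the "add them up" result concrete, here is a minimal sketch on a tiny, hypothetical frame (the name `small` and its values are made up for illustration). It swaps in a shorter route, `groupby` followed by `unstack`, rather than the stack/pivot steps above, but it should end up with the same kind of Series indexed by varX_clusX.

```python
import pandas as pd

# Tiny hypothetical frame; values are chosen so the sums are easy to check by hand.
small = pd.DataFrame({
    'var1': [1, 2, 3, 4],
    'var2': [10, 20, 30, 40],
    'cluster': [0, 1, 0, 1],
})

# Sum every variable within each cluster (rows: clusters, columns: variables).
sums = small.groupby('cluster').sum()

# unstack() on a single-level index returns a Series keyed by (variable, cluster).
totals = sums.unstack()

# Flatten the two index levels into 'var1_clus0'-style labels.
totals.index = [f'{var}_clus{clus}' for var, clus in totals.index]

print(totals)
```

Here `var1_clus0` is 1 + 3 = 4 and `var2_clus1` is 20 + 40 = 60. If a different aggregation is wanted (mean, max, ...), only the `.sum()` call would need to change.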
If you don't want to add all the values and there's an arbitrary index, you can simplify:
df['cluster'] = 'clus' + df['cluster'].apply(str)
df = df.set_index('cluster').stack().reset_index()
df['piv'] = df['level_1'] + '_' + df['cluster']
df = df.pivot(columns = 'piv', values = 0).fillna(0)
This will give you a dataframe with (length of your initial dataset × number of variables) rows, one column per varX_clusX combination, and a ton of zeroes.
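As for why the loop in the question fills only the first cluster: the likely cause is pandas index alignment. An empty `pd.DataFrame()` adopts the index of the first Series assigned to it, and every later Series is aligned to that index, so rows belonging to other clusters find no matching labels. The sketch below reproduces this on a tiny, hypothetical frame (the column name `cluster` and the names `broken` / `fixed` are assumptions for illustration) and shows one possible fix: creating the target frame with the full index up front.

```python
import pandas as pd

# Tiny hypothetical frame standing in for the real data.
data = pd.DataFrame({
    'var1': [1, 2, 3, 4],
    'cluster': [0, 1, 0, 1],
})

# Reproduces the symptom: the empty frame takes the index of the first
# assigned Series (rows 0 and 2), so cluster 1 (rows 1 and 3) aligns to nothing.
broken = pd.DataFrame()
for j in (0, 1):
    broken[f'var1_clus{j}'] = data.loc[data['cluster'] == j, 'var1']

# One possible fix: give the target frame the full index before assigning.
fixed = pd.DataFrame(index=data.index)
for j in (0, 1):
    fixed[f'var1_clus{j}'] = data.loc[data['cluster'] == j, 'var1']
fixed = fixed.fillna(0)

print(broken)
print(fixed)
```

In `broken`, the `var1_clus1` column is entirely NaN; in `fixed`, each row keeps its value in the column for its own cluster and zeroes elsewhere. For a large dataset the vectorized stack/pivot approach is probably still preferable to a Python-level loop.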