
Expand pandas dataframe by categorical column

I am trying to expand a dataframe by creating extra columns, one per value of a categorical column. My dataframe looks like this:

[sample data]

Based on the value of the column cluster, I would like to create a new dataframe that looks like this:

var1_clus0 , var1_clus1, ... var3_clus2

I have a huge dataset, so I am trying to do this in a nested for loop. It works fine for the first value of the cluster column, but all the other columns come out as NaN.

Below is my script:

data_trans = pd.DataFrame()

for i in np.arange(0, len(varlist),1):
    for j in np.arange(0,6,1):
        data_trans[str(varlist[i]) + str("_clus_") + str(j)] = data[(data.segment_hc_print == j)][varlist[i]]

The code runs without any error and generates the columns as desired, but it only fills in the columns for the first value of the categorical column; for every other value it produces NaN. What am I doing wrong, and how should I fix it?
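For what it's worth, the NaNs come from pandas index alignment: each filtered slice keeps its original row labels, so assigning later slices into data_trans only aligns against the rows kept by the first filter. A minimal sketch on made-up data (the column names here are hypothetical, not the asker's) showing the effect and one possible fix:

```python
import pandas as pd

# Toy data: two cluster values, two rows each.
data = pd.DataFrame({'var1': [10, 20, 30, 40],
                     'cluster': [0, 0, 1, 1]})

out = pd.DataFrame()
out['var1_clus_0'] = data[data.cluster == 0]['var1']  # index 0, 1
out['var1_clus_1'] = data[data.cluster == 1]['var1']  # index 2, 3 -> aligned to 0, 1 -> NaN

print(out['var1_clus_1'].isna().all())  # True: index misalignment

# Resetting the index on each slice sidesteps the alignment problem:
fixed = pd.DataFrame()
fixed['var1_clus_0'] = data[data.cluster == 0]['var1'].reset_index(drop=True)
fixed['var1_clus_1'] = data[data.cluster == 1]['var1'].reset_index(drop=True)

print(fixed['var1_clus_1'].tolist())  # [30, 40]
```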

[sample output]

Given the example dataset above, the following is the desired output: [sample output]

Since you have a 2D data set, and a given varX/clusX pair may have multiple matches, you have to decide what to do with those matches. I assume you want to add them up. If so, you're looking at either a dataframe with a header row and a single data row, or just a series whose index is your varX_clusX names.

The following code will do it:

# Setup
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'var1'      : np.random.randint(0, 1000000, 1000000),
    'var2'      : np.random.randint(0, 1000000, 1000000),
    'var3'      : np.random.randint(0, 1000000, 1000000),
    'cluster'   : np.random.randint(0, 100, 1000000) 
    })

# Processing

# Setup the cluster column for string formatting.
df['cluster'] = 'clus' + df['cluster'].apply(str)

# Un-pivot the cluster column (I'm sure there's a better term)
df = df.set_index('cluster').stack().reset_index()

# Group by the unique combination of cluster / var and sum the values.
# This will generate a column named 0 - which I changed to 'values' just for readability.
df = df.groupby(['cluster','level_1']).sum().reset_index().rename(columns = {0 : 'values'})

# Create the formatted header you're looking for
df['piv'] = df['level_1'] + '_' + df['cluster']

# Final pivot to get the values to align with the new headers
df = df.pivot(columns = 'piv', values = 'values').sum()

Timed this on my machine - roughly 1 s for a million records. Not sure how fast you need it.
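On a toy frame (made-up values, same pipeline as above) the result is a series keyed by the formatted names, which makes the shape easy to check:

```python
import pandas as pd

df = pd.DataFrame({'var1': [1, 2],
                   'var2': [3, 4],
                   'cluster': [0, 1]})

# Same steps as the full pipeline: format, un-pivot, group/sum, re-pivot.
df['cluster'] = 'clus' + df['cluster'].apply(str)
df = df.set_index('cluster').stack().reset_index()
df = df.groupby(['cluster', 'level_1']).sum().reset_index().rename(columns={0: 'values'})
df['piv'] = df['level_1'] + '_' + df['cluster']
result = df.pivot(columns='piv', values='values').sum()

print(result['var1_clus0'])  # 1.0
print(result['var2_clus1'])  # 4.0
```

Each var/cluster combination ends up as exactly one entry in the resulting series.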

If you don't want to add up the values and there's an arbitrary index, you can simplify:

df['cluster'] = 'clus' + df['cluster'].apply(str)

df = df.set_index('cluster').stack().reset_index()

df['piv'] = df['level_1'] + '_' + df['cluster']

df = df.pivot(columns = 'piv', values = 0).fillna(0)

This will give you a dataframe whose length is the length of your initial dataset times the number of variables, and a ton of zeroes.
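On the same toy frame as before (two rows, two variables, made-up values), the simplified version gives a 4 x 4 frame with exactly one non-zero per row, which is a quick sanity check of that shape claim:

```python
import pandas as pd

df = pd.DataFrame({'var1': [1, 2],
                   'var2': [3, 4],
                   'cluster': [0, 1]})

df['cluster'] = 'clus' + df['cluster'].apply(str)
df = df.set_index('cluster').stack().reset_index()
df['piv'] = df['level_1'] + '_' + df['cluster']
out = df.pivot(columns='piv', values=0).fillna(0)

# rows = len(data) * number of variables; columns = number of var/cluster pairs
print(out.shape)  # (4, 4)
```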

Disclaimer: the technical posts on this site follow the CC BY-SA 4.0 license; if you repost, please credit this site or the original source. For any issues contact: yoyou2525@163.com.

 