
Python Pandas: create a new column for each different value of a source column (with boolean output as column values)

I am trying to split a source column of a dataframe into several columns based on its content, and then fill these newly generated columns with a boolean 1 or 0, as follows:

Original dataframe:

ID   source_column
A    value 1
B    NaN
C    value 2
D    value 3
E    value 2

Generating the following output:

ID   source_column    value 1    value 2    value 3
A    value 1          1          0          0
B    NaN              0          0          0
C    value 2          0          1          0
D    value 3          0          0          1
E    value 2          0          1          0

I thought about manually creating each column and then filling it with a 1 or a 0 using a separate function per column with .apply, but this is highly inefficient.

Is there a quick and efficient way to do this?

You can try:

df = pd.get_dummies(df, columns=['source_column'])
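To make the effect of that call concrete, here is a minimal sketch on data mirroring the question's example. Note that with `columns=` the source column is replaced and the new names get a `source_column_` prefix; the `join` variant below is one way to keep the original column and the bare value names. The variable names `encoded` and `with_source` are only illustrative, and `dtype=int` is added on the assumption that 1/0 is wanted rather than True/False.

```python
import numpy as np
import pandas as pd

# Sample data mirroring the question's example.
df = pd.DataFrame({
    "ID": list("ABCDE"),
    "source_column": ["value 1", np.nan, "value 2", "value 3", "value 2"],
})

# Encoding via columns= replaces source_column with prefixed indicator columns.
encoded = pd.get_dummies(df, columns=["source_column"], dtype=int)
print(list(encoded.columns))

# Encoding the Series on its own and joining keeps the source column
# and uses the raw values as column names. NaN rows become all zeros.
with_source = df.join(pd.get_dummies(df["source_column"], dtype=int))
print(with_source)
```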

or, if you prefer sklearn:

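from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
# fit_transform expects 2-D input, so select the column as a one-column DataFrame
# the result is a sparse matrix; call .toarray() on it for a dense array
matrix = enc.fit_transform(df[['source_column']])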

You can use the pandas function get_dummies and add the result to df, as shown below:

In [1]: dummies = pd.get_dummies(df['source_column'], dtype=int)

In [2]: df = df.join(dummies)

In [3]: df
Out[3]: 
  ID source_column  value 1  value 2  value 3
0  A       value 1        1        0        0
1  B          NaN         0        0        0
2  C       value 2        0        1        0
3  D       value 3        0        0        1
4  E       value 2        0        1        0
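A point worth checking with this approach is column order: get_dummies returns its indicator columns in sorted order, whereas unique() returns values in order of first appearance. The two happen to agree in the question's data, but not in general. A minimal sketch with made-up, reordered data (not from the question) illustrates the difference and one way to keep labels and data aligned:

```python
import numpy as np
import pandas as pd

# Hypothetical data where order of appearance differs from sorted order.
df = pd.DataFrame({
    "ID": list("ABCD"),
    "source_column": ["value 3", np.nan, "value 1", "value 2"],
})

appearance_order = df["source_column"].dropna().unique().tolist()
dummies = pd.get_dummies(df["source_column"], dtype=int)

print(appearance_order)       # order of first appearance
print(list(dummies.columns))  # sorted order

# Joining the dummies frame directly uses its own column labels,
# so each label stays attached to the right data.
result = df.join(dummies)
print(result)
```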

There is also this possibility (a little bit hacky).

Reading the DataFrame from your example data:

In [4]: df = pd.read_clipboard().drop("ID", axis=1)

In [5]: df
Out[5]:
   source_column
A            1.0
B            NaN
C            2.0
D            3.0
E            2.0

After that, add a new column with df['foo'] = 1.

Then work with unstacking:

In [22]: df.reset_index().set_index(['index', 'source_column']).unstack().fillna(0).rename_axis([None]).astype(int)
Out[22]:
              foo
source_column NaN 1.0 2.0 3.0
A               0   1   0   0
B               1   0   0   0
C               0   0   1   0
D               0   0   0   1
E               0   0   1   0

You then of course have to rename your columns and drop the NaN column, but that should meet your needs as a first pass.
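The cleanup step could look like the following sketch. It is a variation rather than the exact code above: it drops the NaN rows before unstacking (so no NaN column is created) and then restores them with reindex. The "value N" labels and the name `wide` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Same shape as the frame read from the clipboard above.
df = pd.DataFrame(
    {"source_column": [1.0, np.nan, 2.0, 3.0, 2.0]},
    index=list("ABCDE"),
)
df["foo"] = 1

wide = (
    df.dropna(subset=["source_column"])           # leave NaN rows out for now
      .set_index("source_column", append=True)["foo"]
      .unstack(fill_value=0)                      # one column per distinct value
      .reindex(df.index, fill_value=0)            # bring the NaN rows back as zeros
)

# Assumed naming scheme: turn 1.0 into "value 1", and so on.
wide.columns = [f"value {int(c)}" for c in wide.columns]
print(wide)
```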

EDIT:

Another approach, which suppresses the NaN column, is groupby + value_counts (kind of hacky too):

In [30]: df.reset_index().groupby("index").source_column.value_counts().unstack().fillna(0).astype(int).rename_axis([None])
Out[30]:
source_column  1.0  2.0  3.0
A                1    0    0
C                0    1    0
D                0    0    1
E                0    1    0

This is the same idea (unstacking), but NaN values are excluded by default. You of course have to merge the result back onto the original dataframe if you want to keep the rows with NaN values. Overall, both approaches work fine; choose the one that best fits your needs.
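The merge back onto the original dataframe is not spelled out above; one possible sketch is a left merge on the index followed by filling the gaps with zeros. The "value N" column labels and the names `counts` and `merged` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"source_column": [1.0, np.nan, 2.0, 3.0, 2.0]},
    index=list("ABCDE"),
)

# value_counts skips NaN by default, so row B is absent from counts.
counts = (
    df.reset_index()
      .groupby("index")["source_column"]
      .value_counts()
      .unstack(fill_value=0)
)
counts.columns = [f"value {int(c)}" for c in counts.columns]  # assumed labels

# A left merge on the index keeps every original row, including B.
merged = df.merge(counts, left_index=True, right_index=True, how="left")

# Rows without a match get NaN in the indicator columns; turn those into 0.
merged[counts.columns] = merged[counts.columns].fillna(0).astype(int)
print(merged)
```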

pd.concat([df, pd.crosstab(df.index, df.source_column)], axis=1).fillna(0)

Out[1028]: 
  ID source_column  value1  value2  value3
0  A        value1     1.0     0.0     0.0
1  B             0     0.0     0.0     0.0
2  C        value2     0.0     1.0     0.0
3  D        value3     0.0     0.0     1.0
4  E        value2     0.0     1.0     0.0
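As the output shows, fillna(0) on the whole frame also overwrites the NaN in source_column and leaves the indicators as floats. If that is not wanted, a possible variant (a sketch, with `table` and `result` as illustrative names) is to reindex the crosstab before concatenating:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": list("ABCDE"),
    "source_column": ["value1", np.nan, "value2", "value3", "value2"],
})

# crosstab leaves out the row whose source_column is NaN;
# reindex restores it with zeros and keeps the integer dtype.
table = pd.crosstab(df.index, df["source_column"]).reindex(df.index, fill_value=0)

# Only the indicator columns were filled, so source_column keeps its NaN.
result = pd.concat([df, table], axis=1)
print(result)
```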


