Python Pandas: create a new column for each distinct value of a source column (with boolean output as column values)

Question

I am trying to split a source column of a dataframe into several columns based on its content, and then fill these newly generated columns with a boolean 1 or 0 in the following way:

Original dataframe:

ID   source_column
A    value 1
B    NaN
C    value 2
D    value 3
E    value 2

Generating the following output:

ID   source_column    value 1    value 2    value 3
A    value 1          1          0          0
B    NaN              0          0          0
C    value 2          0          1          0
D    value 3          0          0          1
E    value 2          0          1          0

I thought about manually creating each column, and then filling each new column with a 1 or a 0 using a separate function and .apply per column, but this is highly inefficient.

Is there a quick and efficient way to do this?

Answer 1

You can try:

df = pd.get_dummies(df, columns=['source_column'])

or, if you prefer sklearn:

from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
# OneHotEncoder expects 2-D input, so select the column with double brackets
matrix = enc.fit_transform(df[['source_column']])
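As a minimal sketch of what the get_dummies call above produces: when `columns=` is passed, the source column is replaced by prefixed indicator columns. The sample frame below is rebuilt from the question; `dtype=int`, `prefix=""` and `prefix_sep=""` are optional arguments added here as an assumption, to get 1/0 values and unprefixed names.

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the question
df = pd.DataFrame({
    "ID": list("ABCDE"),
    "source_column": ["value 1", np.nan, "value 2", "value 3", "value 2"],
})

# Default naming: each new column is "<source column>_<value>"
encoded = pd.get_dummies(df, columns=["source_column"], dtype=int)
print(encoded.columns.tolist())

# Empty prefix and separator give bare value names instead
bare = pd.get_dummies(
    df, columns=["source_column"], prefix="", prefix_sep="", dtype=int
)
print(bare)
```

Note that in both variants source_column itself is no longer present in the result, and the row with NaN gets 0 in every indicator column.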

Answer 2

You can use the pandas function get_dummies and add the result to df, as shown below:

In [1]: col_names = df['source_column'].dropna().unique().tolist()

In [2]: df[col_names] = pd.get_dummies(df['source_column'])

In [3]: df
Out[3]: 
  ID source_column  value 1  value 2  value 3
0  A       value 1        1        0        0
1  B          NaN         0        0        0
2  C       value 2        0        1        0
3  D       value 3        0        0        1
4  E       value 2        0        1        0
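A variant sketch of the same idea, under two assumptions of mine: the assignment in In [2] appears to match columns by position, so it relies on unique() returning the values in the same order as the get_dummies columns, and recent pandas versions return boolean indicator columns unless a dtype is given. Joining on the index sidesteps the ordering question, and `dtype=int` keeps the 1/0 output.

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the question
df = pd.DataFrame({
    "ID": list("ABCDE"),
    "source_column": ["value 1", np.nan, "value 2", "value 3", "value 2"],
})

# One indicator column per distinct non-null value, as integers
dummies = pd.get_dummies(df["source_column"], dtype=int)

# join aligns on the row index, so column labels come straight from get_dummies
result = df.join(dummies)
print(result)
```

The source column is kept here, and the NaN row ends up with 0 in every new column.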

Answer 3

There is also this possibility (a little bit hacky).

Reading the DataFrame from your example data:

In [4]: df = pd.read_clipboard().drop("ID", axis=1)

In [5]: df
Out[5]:
   source_column
A            1.0
B            NaN
C            2.0
D            3.0
E            2.0

After that, add a new column with df['foo'] = 1.

Then work with unstacking:

In [22]: df.reset_index().set_index(['index', 'source_column']).unstack().fillna(0).rename_axis([None]).astype(int)
Out[22]:
              foo
source_column NaN 1.0 2.0 3.0
A               0   1   0   0
B               1   0   0   0
C               0   0   1   0
D               0   0   0   1
E               0   0   1   0

You then of course have to rename your columns and drop the NaN column, but that should fulfill your needs as a first pass.

EDIT:

Another approach, which suppresses the NaN column, is groupby + value_counts (also kind of hacky):

In [30]: df.reset_index().groupby("index").source_column.value_counts().unstack().fillna(0).astype(int).rename_axis([None])
Out[30]:
source_column  1.0  2.0  3.0
A                1    0    0
C                0    1    0
D                0    0    1
E                0    1    0

This is the same idea (unstacking), but NaN values are excluded by default. You of course have to merge the result back onto the original dataframe if you want to keep the rows with NaN values. All in all, both approaches work fine; you can choose the one that best fulfills your needs.
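The merge-back step mentioned above could look like the following sketch. It assumes the frame is indexed by the ID letters, as in the outputs shown; the names `counts`, `indicators` and `result` are illustrative only. Reindexing against the original index restores the rows that value_counts dropped and fills them with zeros.

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the question, indexed by ID
df = pd.DataFrame(
    {"source_column": ["value 1", np.nan, "value 2", "value 3", "value 2"]},
    index=list("ABCDE"),
)

# Count values per row label; value_counts skips NaN by default
counts = (
    df.groupby(level=0)["source_column"]
    .value_counts()
    .unstack(fill_value=0)
)

# Bring back the rows that had only NaN, as all-zero rows
indicators = counts.reindex(df.index, fill_value=0).astype(int)

# Attach the indicator columns to the original frame
result = df.join(indicators)
print(result)
```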

Answer 4

pd.concat([df, pd.crosstab(df.index, df.source_column)], axis=1).fillna(0)

Out[1028]: 
  ID source_column  value1  value2  value3
0  A        value1     1.0     0.0     0.0
1  B             0     0.0     0.0     0.0
2  C        value2     0.0     1.0     0.0
3  D        value3     0.0     0.0     1.0
4  E        value2     0.0     1.0     0.0
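In the output above, fillna(0) also turns the NaN in source_column into 0 and leaves the indicators as floats. A hedged variant, assuming you want to keep the original NaN and get integer indicators, is to fill only the crosstab result before joining; the names `ct` and `result` are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the question
df = pd.DataFrame({
    "ID": list("ABCDE"),
    "source_column": ["value 1", np.nan, "value 2", "value 3", "value 2"],
})

# crosstab drops the row whose source value is NaN
ct = pd.crosstab(df.index, df["source_column"])

# Restore the missing row as zeros, then join on the row index
ct = ct.reindex(df.index, fill_value=0)
result = df.join(ct)
print(result)
```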

Answer 1: 3, 2018-02-06 15:57:23
Answer 2: 3, 2018-02-06 16:07:40
Answer 3: 1, 2018-02-06 15:49:59
Answer 4: 1, 2018-02-06 16:09:40