Normalize a column in a pandas DataFrame based on the values in another column

Question

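I want to normalize the values in one column of a pandas DataFrame based on the values in another column. This is not normalization in the strict statistical sense. The second column is a type: I want to sum the first column's values for each type and then, in each row, divide the value by the total for that row's type. An example should make this clearer.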

import pandas as pd
df = pd.read_table(datafile, names=["A", "B", "value", "type"])

    A   B  value   type
0  A1  B1      1  type1
1  A2  B2      1  type1
2  A1  B1      1  type2
3  A1  B3      1  type3
4  A2  B2      1  type2
5  A2  B4      1  type3
6  A3  B4      1  type2
7  A3  B5      1  type3
8  A4  B6      1  type2
9  A4  B7      1  type3

I can then find the sum for each type like this:

types = df.groupby(["type"])["value"].sum()

type
type1    2
type2    4
type3    4
Name: value, dtype: int64

So how do I use this to normalize the value in each row?

I can compute the normalized values with a loop like this:

norms = []
for ix, row in df.iterrows():
    norms.append(row["value"]/types[row["type"]])

and then replace the column with a new one holding those values:

df["value"] = pd.Series(norms)

    A   B  value   type
0  A1  B1   0.50  type1
1  A2  B2   0.50  type1
2  A1  B1   0.25  type2
3  A1  B3   0.25  type3
4  A2  B2   0.25  type2
5  A2  B4   0.25  type3
6  A3  B4   0.25  type2
7  A3  B5   0.25  type3
8  A4  B6   0.25  type2
9  A4  B7   0.25  type3

But as I understand it, a loop like this is neither efficient nor idiomatic, and there is probably a standard pandas function that does this.

Thanks.

Answer 1

You can use transform, which performs an operation on each group and then expands the result back out to match the original index. For example:

>>> df["value"] /= df.groupby("type")["value"].transform("sum")
>>> df
    A   B  value   type
0  A1  B1   0.50  type1
1  A2  B2   0.50  type1
2  A1  B1   0.25  type2
3  A1  B3   0.25  type3
4  A2  B2   0.25  type2
5  A2  B4   0.25  type3
6  A3  B4   0.25  type2
7  A3  B5   0.25  type3
8  A4  B6   0.25  type2
9  A4  B7   0.25  type3

This works because, on the original data, transform returns each row's group total:

>>> df.groupby("type")["value"].transform("sum")
0    2
1    2
2    4
3    4
4    4
5    4
6    4
7    4
8    4
9    4
dtype: int64
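
For anyone who wants to try this without the original data file, here is a minimal self-contained sketch. It rebuilds the sample table from the question by hand, which is an assumption on my part since `datafile` itself is not shown. It also includes an alternative that reuses the per-type sums the question already computed, looking them up with `Series.map`. The names `totals`, `totals_via_map`, `normed` and `normed_via_map` are illustrative only.

```python
import pandas as pd

# Sample data reconstructed by hand from the table in the question.
df = pd.DataFrame({
    "A": ["A1", "A2", "A1", "A1", "A2", "A2", "A3", "A3", "A4", "A4"],
    "B": ["B1", "B2", "B1", "B3", "B2", "B4", "B4", "B5", "B6", "B7"],
    "value": [1] * 10,
    "type": ["type1", "type1", "type2", "type3", "type2",
             "type3", "type2", "type3", "type2", "type3"],
})

# transform("sum") returns one total per row, aligned to df's index.
totals = df.groupby("type")["value"].transform("sum")

# Alternative: compute one sum per type (as in the question),
# then look up each row's type in that Series.
types = df.groupby("type")["value"].sum()
totals_via_map = df["type"].map(types)

# Both routes give the same normalized column.
normed = df["value"] / totals
normed_via_map = df["value"] / totals_via_map
print(normed.tolist())
```

Both versions are vectorized, so neither needs the explicit `iterrows` loop from the question.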

Answer 2

I think the best way to achieve this is to use the .apply() method on the groupby object:

# Using backslashes for explicit line continuation, not seen
#   that often in Python but useful in pandas when you're
#   chaining a lot of methods one after the other
df['value_normed'] = df.groupby('type', group_keys=False)\
    .apply(lambda g: g['value'] / g['value'].sum())
df
Out[9]: 
    A   B  value   type  value_normed
0  A1  B1      1  type1          0.50
1  A2  B2      1  type1          0.50
2  A1  B1      1  type2          0.25
3  A1  B3      1  type3          0.25
4  A2  B2      1  type2          0.25
5  A2  B4      1  type3          0.25
6  A3  B4      1  type2          0.25
7  A3  B5      1  type3          0.25
8  A4  B6      1  type2          0.25
9  A4  B7      1  type3          0.25

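You need the group_keys=False argument so that the type does not become part of the index of each group's result, which would prevent you from easily matching the transformed values back to the original DataFrame.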
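
To see what `group_keys` changes, here is a small sketch on made-up data (the values and the helper name `share_of_group` are hypothetical, not from the question). Unlike the code above, it selects the `value` column before calling `.apply()`, which is a variation of mine rather than the answer's exact code. As far as I know, recent pandas versions add the group label as an extra index level when `group_keys=True`, while older versions may ignore the flag for results indexed like the input, so the sketch prints the number of index levels instead of assuming one.

```python
import pandas as pd

# Hypothetical data with unequal values so the shares differ within a group.
df = pd.DataFrame({
    "value": [1, 3, 2, 2, 6, 2],
    "type": ["type1", "type1", "type2", "type3", "type2", "type3"],
})

def share_of_group(s):
    """Divide each value by the total of its group."""
    return s / s.sum()

# group_keys=False: the result keeps df's original index,
# so it can be assigned straight back as a new column.
flat = df.groupby("type", group_keys=False)["value"].apply(share_of_group)
df["value_normed"] = flat

# group_keys=True: depending on the pandas version, the group label
# may be added as an extra index level.
keyed = df.groupby("type", group_keys=True)["value"].apply(share_of_group)

print(flat.index.nlevels, keyed.index.nlevels)
print(df)
```

If the printed level counts differ, the keyed result cannot be assigned back directly and would first need the extra level dropped.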

Answer 1: score 4, accepted, 2015-05-20 05:39:33
Answer 2: score 1, 2015-05-20 05:23:48
