Most efficient way to calculate the mean of a group of columns in a pandas DataFrame
I have a DataFrame with columns like this:
["A_1", "A_2", "A_3", "B_1", "B_2", "B_3"]
I'd like to "collapse" the various A and B columns into a single column each and calculate their mean value. In short, at the end of the operation I'd get:
["A", "B"]
where "A" is the row-by-row mean across all "A" columns and "B" the mean across all "B" columns.
As far as I understand, groupby is not suited for this task, or perhaps I'm using it incorrectly:
grouped = data.groupby([item for item in data if "A" not in item])
If I use axis=1, all I get is an empty DataFrame when calling mean(), and if I don't, I'm not getting the desired effect. I would like to avoid building a separate DataFrame to be filled with the means via iteration (e.g. by calculating the means separately and then adding them like new_df["A"] = mean_a). Is there an efficient solution for this?
You want to make use of the built-in mean() method, which accepts an axis argument to specify row-wise means. Since you know the column naming convention for the different means you want, you can use the example code below to do it very efficiently. Here I chose to just add two columns rather than destroy the existing data. I could also have put these new columns into a new data frame; it just depends on what your needs are and what's convenient for you. The same basic idea will work in either case.
In [1]: import pandas
In [2]: dfrm = pandas.DataFrame([[1,2,3,4,5,6],[7,8,9,10,11,12],[13,14,15,16,17,18]], columns = ['A_1', 'A_2', 'A_3', 'B_1', 'B_2', 'B_3'])
In [3]: dfrm
Out[3]:
   A_1  A_2  A_3  B_1  B_2  B_3
0    1    2    3    4    5    6
1    7    8    9   10   11   12
2   13   14   15   16   17   18
In [4]: dfrm["A_mean"] = dfrm[[elem for elem in dfrm.columns if elem[0]=='A']].mean(axis=1)
In [5]: dfrm
Out[5]:
   A_1  A_2  A_3  B_1  B_2  B_3  A_mean
0    1    2    3    4    5    6       2
1    7    8    9   10   11   12       8
2   13   14   15   16   17   18      14
In [6]: dfrm["B_mean"] = dfrm[[elem for elem in dfrm.columns if elem[0]=='B']].mean(axis=1)
In [7]: dfrm
Out[7]:
   A_1  A_2  A_3  B_1  B_2  B_3  A_mean  B_mean
0    1    2    3    4    5    6       2       5
1    7    8    9   10   11   12       8      11
2   13   14   15   16   17   18      14      17
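If the means should go into a new data frame rather than be appended to the original, the same idea can be written as a single dict comprehension. This is only a sketch: the names `prefixes` and `means` are illustrative, and it assumes every column follows the `<prefix>_<number>` naming used in the question.

```python
import pandas as pd

dfrm = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12], [13, 14, 15, 16, 17, 18]],
    columns=["A_1", "A_2", "A_3", "B_1", "B_2", "B_3"],
)

# Distinct prefixes ("A", "B") in order of first appearance (assumed naming scheme).
prefixes = list(dict.fromkeys(col.split("_")[0] for col in dfrm.columns))

# One row-wise mean per prefix; the original frame is left untouched.
means = pd.DataFrame(
    {p: dfrm.filter(regex=f"^{p}_").mean(axis=1) for p in prefixes}
)
print(means)
```

Each `mean(axis=1)` call is vectorized, so the only Python-level loop is over the handful of prefixes, not over rows.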
I don't know about efficient, but I might do something like this:
~/coding$ cat colgroup.dat
A_1,A_2,A_3,B_1,B_2,B_3
1,2,3,4,5,6
7,8,9,10,11,12
13,14,15,16,17,18
~/coding$ python
Python 2.7.3 (default, Apr 20 2012, 22:44:07)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> df = pandas.read_csv("colgroup.dat")
>>> df
   A_1  A_2  A_3  B_1  B_2  B_3
0    1    2    3    4    5    6
1    7    8    9   10   11   12
2   13   14   15   16   17   18
>>> grouped = df.groupby(lambda x: x[0], axis=1)
>>> for i, group in grouped:
...     print i, group
...
A    A_1  A_2  A_3
0    1    2    3
1    7    8    9
2   13   14   15
B    B_1  B_2  B_3
0    4    5    6
1   10   11   12
2   16   17   18
>>> grouped.mean()
key_0   A   B
0       2   5
1       8  11
2      14  17
I suppose lambda x: x.split('_')[0] would be a little more robust, since it groups on everything before the underscore rather than only the first character.
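The session above was recorded on Python 2.7 with an old pandas. Newer pandas releases deprecate the axis=1 argument to groupby, so on a current installation one possible equivalent is to transpose, group the index by prefix, and transpose back. This is a sketch under that assumption, not the exact code from the session; the name `collapsed` is illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12], [13, 14, 15, 16, 17, 18]],
    columns=["A_1", "A_2", "A_3", "B_1", "B_2", "B_3"],
)

# After transposing, the column labels become the index, so the callable
# passed to groupby receives each label and returns its prefix as the key.
collapsed = df.T.groupby(lambda name: name.split("_")[0]).mean().T
print(collapsed)
```

The result has one column per prefix ("A" and "B") and the same rows as the original frame.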