How to group a Python DataFrame by one column and get the sum of another column

Question

I want to create a new data frame with 2 columns: one holding each Striker_Id, and one holding the sum of 'Batsman_Scored' for that 'Striker_Id'.

For example:

Striker_ID  Batsman_Scored
1            0
2            8 
...

I tried ball.groupby(['Striker_Id'])['Batsman_Scored'].sum() but this is what I get:

Striker_Id
1      0000040141000010111000001000020000004001010001...
2      0000000446404106064011111011100012106110621402...
3      0000121111114060001000101001011010010001041011...
4      0114110102100100011010000000006010011001111101...
5      0140016010010040000101111100101000111410011000...
6      1100100000104141011141001004001211200001110111...

It doesn't sum the numbers; it only joins them together. What's the alternative?

Answer 1

For some reason, your columns were loaded as strings. When loading them from a CSV, try applying a converter:

df = pd.read_csv('file.csv', converters={'Batsman_Scored' : int})

Or,

df = pd.read_csv('file.csv', converters={'Batsman_Scored' : pd.to_numeric})
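
To try the converter approach without a file on disk, here is a minimal sketch that reads from an in-memory string. The CSV content is made up for illustration and stands in for 'file.csv'.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for 'file.csv'.
csv_text = "Striker_Id,Batsman_Scored\n1,0\n1,4\n2,6\n"

# The converter receives each raw field as a string and returns an int.
df = pd.read_csv(io.StringIO(csv_text), converters={"Batsman_Scored": int})

print(df.dtypes)
```

Note that int raises a ValueError on a field it cannot parse, so a single bad value makes the load fail rather than pass silently.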

If that doesn't work, convert to integer after loading:

df['Batsman_Scored'] = df['Batsman_Scored'].astype(int)

Or,

df['Batsman_Scored'] = pd.to_numeric(df['Batsman_Scored'], errors='coerce')

Now the groupby should work:

r = df.groupby('Striker_Id')['Batsman_Scored'].sum()
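
The question asks for a two-column data frame, while the line above returns a Series indexed by Striker_Id. A minimal sketch of getting the frame, assuming a small made-up `ball` frame whose scores were read as strings (the sample values are illustrative, not taken from the question's data):

```python
import pandas as pd

# Hypothetical sample data: scores stored as strings, as in the question.
ball = pd.DataFrame({
    "Striker_Id": [1, 1, 2, 2, 2],
    "Batsman_Scored": ["0", "4", "1", "6", "1"],
})

# Convert the strings to numbers before aggregating.
ball["Batsman_Scored"] = pd.to_numeric(ball["Batsman_Scored"], errors="coerce")

# reset_index() turns the grouped Series back into a two-column DataFrame.
totals = ball.groupby("Striker_Id")["Batsman_Scored"].sum().reset_index()

print(totals)
```

Passing as_index=False to groupby is an equivalent way to keep Striker_Id as a regular column.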

Without access to your data, I can only speculate. It seems that somewhere the column contains non-numeric values, which prevents pandas from converting it, so the column is kept as strings. It is hard to pinpoint the problematic data until you actually load it and run something like

df.col.str.isdigit().all()

That returns False if there are any non-numeric items. Note that it only works for integers; strings with a decimal point or a sign also fail isdigit, so float columns cannot be debugged like this.
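
To see which rows are the problem rather than just whether one exists, one option is to coerce the column and look at what failed to parse. This is a sketch on made-up data; the bad value "n/a" is an assumption about what such an entry might look like.

```python
import pandas as pd

# Hypothetical column with one unparseable entry and one float-like string.
df = pd.DataFrame({"Batsman_Scored": ["0", "4", "n/a", "1.5", "6"]})

# Values that cannot be parsed become NaN under errors='coerce'.
parsed = pd.to_numeric(df["Batsman_Scored"], errors="coerce")

# Rows where parsing failed even though the original value was present.
bad_rows = df[parsed.isna() & df["Batsman_Scored"].notna()]

print(bad_rows)
```

Unlike isdigit, this check also accepts float-like strings such as "1.5", so it can be used on float columns too.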

Another way to see which columns may hold corrupt data is to query dtypes:

df.dtypes

This lists every column and its data type. Use it to figure out which columns need parsing:

for c in df.columns[df.dtypes == object]:
    print(c)

You can then apply the methods outlined above to fix them.
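
Putting the last two steps together, here is a rough sketch that tries to convert every non-numeric column and keeps the result only when all values parse. The frame and its `Team` column are hypothetical, and the "convert only if everything parses" rule is one possible policy, not the only one.

```python
import pandas as pd

# Hypothetical frame in which every column was loaded as strings.
df = pd.DataFrame({
    "Striker_Id": ["1", "2", "3"],
    "Batsman_Scored": ["0", "4", "6"],
    "Team": ["A", "B", "C"],
})

for c in df.columns:
    # Skip columns that are already numeric.
    if pd.api.types.is_numeric_dtype(df[c]):
        continue
    converted = pd.to_numeric(df[c], errors="coerce")
    # Replace the column only when every non-missing value parsed cleanly.
    if converted.notna().sum() == df[c].notna().sum():
        df[c] = converted

print(df.dtypes)
```

Columns that hold genuine text, such as `Team` here, are left untouched.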

Solution 1
1 accepted 2017-12-29 11:56:34