Python Pandas Group Dataframe by Column / Sum Integer Column by String Column
I have been stuck all day, have been through numerous SO articles, and am still stuck on the final piece. I imported a CSV into a massive dataframe, then eventually got the smaller dataframe below. (Note: my df is currently indexed on 'Name', which is what I need to group or sum by.)
Name Classification Value 1 Value 2
Company 1 Classification Code 1 5000 8000
Company 1 Classification Code 1 6000 2000
Company 2 Classification Code 1 2000 3000
Company 2 Classification Code 1 1000 4500
Company 3 Classification Code 2 15000 10000
Company 3 Classification Code 2 20000 32000
Company 4 Classification Code 3 7500 10000
Company 4 Classification Code 3 7000 1500
What I am struggling with now is how to sum the two values by company. I have mainly been using groupby and sum(), but have been stuck for hours. I know there are a lot of SO articles about summing things in pandas, but I have had no luck. Any help would be greatly appreciated. Thanks so much.
Edit: The output I am looking for is the following:
Company 1 Classification Code 1 11,000 10,000
Company 2 Classification Code 1 3,000 7,500
Company 3 Classification Code 2 35,000 42,000
Company 4 Classification Code 3 14,500 11,500
Option 1
set_index then groupby
This assumes that the 'Classification' value is the same for every row of a given company.
df.set_index('Classification', append=True) \
.groupby(level=[0, 1]).sum().reset_index(1)
Classification Value 1 Value 2
Name
Company 1 Classification Code 1 11000 10000
Company 2 Classification Code 1 3000 7500
Company 3 Classification Code 2 35000 42000
Company 4 Classification Code 3 14500 11500
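For anyone who wants to reproduce Option 1 end to end, here is a minimal sketch. It rebuilds the dataframe from the values posted in the question (the construction via pd.DataFrame is my assumption, since the original was loaded from a CSV) and then applies the same set_index / groupby chain.

```python
import pandas as pd

# Sample data copied from the question; 'Name' is set as the index,
# as the question describes. Building it by hand is an assumption,
# the original came from a CSV.
df = pd.DataFrame({
    'Name': ['Company 1', 'Company 1', 'Company 2', 'Company 2',
             'Company 3', 'Company 3', 'Company 4', 'Company 4'],
    'Classification': ['Classification Code 1'] * 4
                      + ['Classification Code 2'] * 2
                      + ['Classification Code 3'] * 2,
    'Value 1': [5000, 6000, 2000, 1000, 15000, 20000, 7500, 7000],
    'Value 2': [8000, 2000, 3000, 4500, 10000, 32000, 10000, 1500],
}).set_index('Name')

# Option 1: append 'Classification' to the index, sum within each
# (Name, Classification) pair, then move it back out to a column.
result = (
    df.set_index('Classification', append=True)
      .groupby(level=[0, 1])
      .sum()
      .reset_index(1)
)
print(result)
```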
Option 2
groupby then agg
This doesn't make any assumptions about uniqueness of 'Classification' across 'Company', but will just grab the first 'Classification' per 'Company'.
df.groupby(level=0).agg(
{'Classification': 'first', 'Value 1': 'sum', 'Value 2': 'sum'})
Classification Value 1 Value 2
Name
Company 1 Classification Code 1 11000 10000
Company 2 Classification Code 1 3000 7500
Company 3 Classification Code 2 35000 42000
Company 4 Classification Code 3 14500 11500
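The difference between the two options only shows up when a company has more than one classification. The sketch below uses made-up data (not from the question) in which 'Company 1' appears under two codes, to illustrate how each option behaves in that case.

```python
import pandas as pd

# Hypothetical data: 'Company 1' appears under two different classifications.
df = pd.DataFrame({
    'Name': ['Company 1', 'Company 1', 'Company 2'],
    'Classification': ['Code 1', 'Code 2', 'Code 1'],
    'Value 1': [5000, 6000, 2000],
    'Value 2': [8000, 2000, 3000],
}).set_index('Name')

# Option 1 groups on (Name, Classification), so 'Company 1' stays
# split across two rows, one per classification.
by_pair = (
    df.set_index('Classification', append=True)
      .groupby(level=[0, 1])
      .sum()
      .reset_index(1)
)

# Option 2 groups on Name only and keeps the first classification
# it encounters for each company.
by_name = df.groupby(level=0).agg(
    {'Classification': 'first', 'Value 1': 'sum', 'Value 2': 'sum'})

print(by_pair)
print(by_name)
```

So if one row per company is required regardless of classification, Option 2 is the safer choice; if the classification split matters, Option 1 preserves it.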
Response to Comments
In regards to concatenation
Check dtypes with df.dtypes. If you see object instead of int then yes, you need to convert to numeric.
You can do this simply with
df.apply(pd.to_numeric, errors='ignore').groupby(level=0).agg(
{'Classification': 'first', 'Value 1': 'sum', 'Value 2': 'sum'})
Or more manually
df['Value 1'] = df['Value 1'].astype(int)
df['Value 2'] = df['Value 2'].astype(int)
Then proceed to prior suggestions.
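As a rough sketch of the conversion step, the example below uses hypothetical data in which the value columns were read in as strings, converts only those columns with pd.to_numeric, and then aggregates. It avoids errors='ignore' because, as far as I know, recent pandas releases deprecate that option; the value_cols list and the clean name are my own choices for illustration.

```python
import pandas as pd

# Hypothetical data where the numbers were read in as strings.
df = pd.DataFrame({
    'Name': ['Company 1', 'Company 1', 'Company 2'],
    'Classification': ['Code 1', 'Code 1', 'Code 1'],
    'Value 1': ['5000', '6000', '2000'],
    'Value 2': ['8000', '2000', '3000'],
}).set_index('Name')

value_cols = ['Value 1', 'Value 2']

# Convert only the value columns. With the default errors='raise',
# text that is not a number raises instead of being left untouched.
clean = df.assign(**{col: pd.to_numeric(df[col]) for col in value_cols})

summed = clean.groupby(level=0).agg(
    {'Classification': 'first', 'Value 1': 'sum', 'Value 2': 'sum'})
print(summed)
```

Converting the columns explicitly also leaves 'Classification' alone, so there is no need to rely on the conversion silently skipping it.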
In regards to placement of columns
You can always reorder your columns
d1 = df.apply(pd.to_numeric, errors='ignore').groupby(level=0).agg(
{'Classification': 'first', 'Value 1': 'sum', 'Value 2': 'sum'})
d1[df.columns]
Or
d1 = df.apply(pd.to_numeric, errors='ignore').groupby(level=0).agg(
{'Classification': 'first', 'Value 1': 'sum', 'Value 2': 'sum'})
d1.reindex(columns=df.columns)  # reindex_axis was removed in pandas 1.0