
How to sum columns from three different dataframes with a common key

I am reading in an Excel spreadsheet about schools, which has three sheets, as follows.

import sys
import pandas as pd

inputfile = sys.argv[1]
xl = pd.ExcelFile(inputfile)
print(xl.sheet_names)
df1 = xl.parse(xl.sheet_names[0], skiprows=14)
df2 = xl.parse(xl.sheet_names[1], skiprows=14)
df3 = xl.parse(xl.sheet_names[2], skiprows=14)
# Rename the columns to 'A', 'B', 'C', ... in all three dataframes
df1.columns = [chr(65 + i) for i in range(len(df1.columns))]
df2.columns = df1.columns
df3.columns = df1.columns

The unique id for each school is in column 'D' in each of the three dataframes. I would like to make a new dataframe which has two columns. The first is the sum of column 'G' from df1, df2, df3 and the second is the sum of column 'K' from df1, df2, df3. In other words, I think I need the following steps.

  1. Keep only the rows whose column 'D' id appears in all three dataframes. If a school doesn't appear in all three sheets, I discard it.
  2. For each remaining row (school), add up the values in column 'G' across the three dataframes.
  3. Do the same for column 'K'.

I am new to pandas. How should I do this? Somehow the unique ids have to be used in steps 2 and 3 to make sure the values that are added correspond to the same school.


Attempted solution

df1 = df1.set_index('D')
df2 = df2.set_index('D')
df3 = df3.set_index('D')
df1['SumK']= df1['K'] +  df2['K'] + df3['K']
df1['SumG']= df1['G'] +  df2['G'] + df3['G']
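
To see what the attempt above does, here is a minimal sketch on made-up data (the ids and values are hypothetical, not taken from the spreadsheet). Adding Series with `+` aligns them on the index, so a school missing from any one dataframe ends up as NaN, and dropping those rows leaves only the schools present in all three. This assumes each id occurs at most once per dataframe.

```python
import pandas as pd

# Hypothetical toy data standing in for the three sheets, indexed by school id 'D'.
df1 = pd.DataFrame({"D": [1, 2, 3], "G": [10, 20, 30], "K": [1, 2, 3]}).set_index("D")
df2 = pd.DataFrame({"D": [1, 2, 4], "G": [100, 200, 400], "K": [10, 20, 40]}).set_index("D")
df3 = pd.DataFrame({"D": [2, 1, 5], "G": [2000, 1000, 5000], "K": [200, 100, 500]}).set_index("D")

# Series addition aligns on the index labels, not on row position.
# An id missing from any of the three frames produces NaN in the sum.
sums = pd.DataFrame({
    "SumG": df1["G"] + df2["G"] + df3["G"],
    "SumK": df1["K"] + df2["K"] + df3["K"],
})

# Drop the NaN rows, i.e. the schools that are not in all three frames.
result = sums.dropna().astype(int)
print(result)  # only ids 1 and 2 remain
```

One caveat with this sketch: `dropna()` would also discard a school that is present in all three sheets but has a missing value in 'G' or 'K'.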

After concatenating the dataframes (with 'D' still a regular column rather than the index), you can use groupby and a per-group count to get the values of 'D' that exist in all three dataframes, since each id appears only once per dataframe. You can then use this to filter the concatenated dataframe and sum whichever columns you need, e.g.:

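# 'D' must still be an ordinary column here (set_index('D') has not been applied)
df = pd.concat([df1, df2, df3])
# Rows per school id; an id present in all three sheets occurs exactly 3 times
counts = df.groupby('D').size()
criteria = df['D'].isin(counts[counts == 3].index)
df[criteria].groupby('D')[['G', 'K']].sum()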
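
As a self-contained illustration of the concatenate-then-group approach, the sketch below runs it on made-up data (the ids and values are hypothetical). It counts rows per id with `size()` and keeps the ids whose count is exactly 3, which assumes each id appears at most once per dataframe.

```python
import pandas as pd

# Hypothetical toy data standing in for the three sheets; 'D' is a regular column.
df1 = pd.DataFrame({"D": [1, 2, 3], "G": [10, 20, 30], "K": [1, 2, 3]})
df2 = pd.DataFrame({"D": [1, 2, 4], "G": [100, 200, 400], "K": [10, 20, 40]})
df3 = pd.DataFrame({"D": [2, 1, 5], "G": [2000, 1000, 5000], "K": [200, 100, 500]})

# Stack the three frames on top of each other.
df = pd.concat([df1, df2, df3], ignore_index=True)

# Count how many rows each id has; ids found in all three frames have 3.
counts = df.groupby("D").size()
in_all_three = counts[counts == 3].index

# Keep only those ids, then sum 'G' and 'K' per school.
totals = df[df["D"].isin(in_all_three)].groupby("D")[["G", "K"]].sum()
print(totals)  # ids 1 and 2, with G = 1110, 2220 and K = 111, 222
```

If an id could repeat within a single sheet, the count-equals-3 test would no longer mean "present in all three", and a per-sheet marker column would be needed instead.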
