Sum two pandas DataFrame columns, keep non-common rows

I just asked a similar question, but then realized it wasn't the right question.

What I'm trying to accomplish is to combine two data frames that have the same columns, but may or may not have common rows (indices of a MultiIndex). I'd like to combine them by taking the sum of one of the columns while leaving the other columns unchanged.

According to the accepted answer, the approach may be something like:

import numpy as np
import pandas as pd

def mklbl(prefix, n):
    # Build labels such as 'A0', 'A1', ...; n is either a count or an iterable of suffixes.
    try:
        return ["%s%s" % (prefix, i) for i in range(n)]
    except TypeError:
        return ["%s%s" % (prefix, i) for i in n]

mi1 = pd.MultiIndex.from_product([mklbl('A', 4), mklbl('C', 2)])
mi2 = pd.MultiIndex.from_product([mklbl('A', [2, 3, 4]), mklbl('C', 2)])

# Note: df2 is built on mi1 (A0..A3) and df1 on mi2 (A2..A4).
df2 = pd.DataFrame({'a': np.arange(len(mi1)), 'b': np.arange(len(mi1)),
                    'c': np.arange(len(mi1)), 'd': np.arange(len(mi1))[::-1]},
                   index=mi1).sort_index().sort_index(axis=1)

df1 = pd.DataFrame({'a': np.arange(len(mi2)), 'b': np.arange(len(mi2)),
                    'c': np.arange(len(mi2)), 'd': np.arange(len(mi2))[::-1]},
                   index=mi2).sort_index().sort_index(axis=1)


df1 = df1.add(df2.pop('b'))

However, this fails because the indices of the two frames do not align.

The following comes close to what I'm trying to achieve, except that I lose the rows that are not common to the two dataframes:

df1['b'] = df1['b'].add(df2['b'], fill_value=0)

This gives me:

Out[197]: 
       a   b  c  d
A2 C0  0   4  0  5
   C1  1   6  1  4
A3 C0  2   8  2  3
   C1  3  10  3  2
A4 C0  4   4  4  1
   C1  5   5  5  0
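
The missing rows appear to be a consequence of index alignment on assignment: `Series.add` with `fill_value=0` returns a Series on the union of both indices, but assigning it to a column of `df1` reindexes it back to `df1`'s own index. A minimal sketch with small made-up frames (the names `left`, `right`, and `summed` are illustrative and not from the question):

```python
import pandas as pd

# Two tiny frames sharing only the row 'y'.
left = pd.DataFrame({'b': [10, 20]}, index=['x', 'y'])
right = pd.DataFrame({'b': [1, 2]}, index=['y', 'z'])

# The sum itself is computed on the union of the two indices.
summed = left['b'].add(right['b'], fill_value=0)
print(sorted(summed.index))   # ['x', 'y', 'z']

# Column assignment aligns to the frame's existing index, so 'z' is dropped.
left['b'] = summed
print(list(left.index))       # ['x', 'y']
```

So the addition is fine; it is the assignment back into the smaller frame that discards the non-common rows.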

What I want is:

In [197]: df1
Out[197]: 
       a   b  c  d
A0 C0  0   0  0  7
   C1  1   2  1  6
A1 C0  2   4  2  5
   C1  3   6  3  4
A2 C0  0   4  0  5
   C1  1   6  1  4
A3 C0  2   8  2  3
   C1  3  10  3  2
A4 C0  4   4  4  1
   C1  5   5  5  0

Note: in response to @RandyC's comment about the XY problem, the specific problem is this. I have a class which reads data and returns a dataframe of 1e9 rows. The columns of the data frame are latll, latur, lonll, lonur, concentration, elevation. The data frame is indexed by a MultiIndex (lat, lon, time), where time is a datetime. The rows of the two dataframes may or may not be the same (if they exist for a given date, the lat/lon will be the same, since they are grid cell centers). latll, latur, lonll, lonur are calculated from lat/lon. I want to sum the concentration column when I add two data frames, but leave the other columns unchanged.
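
One way to express this use case, as an alternative to the approaches shown in this post, is to stack the two frames and aggregate per index key. This is only a sketch under assumptions: the helper `make_frame`, the sample values, and the reduced column set (`latll`, `elevation`, `concentration`) are hypothetical, and it assumes the non-summed columns are identical wherever a (lat, lon, time) key appears in both frames.

```python
import pandas as pd

def make_frame(times, conc):
    # Hypothetical helper: one grid cell at (lat=10.0, lon=20.0) observed at several times.
    idx = pd.MultiIndex.from_product(
        [[10.0], [20.0], pd.to_datetime(times)], names=['lat', 'lon', 'time'])
    return pd.DataFrame(
        {'latll': 9.5, 'elevation': 100.0, 'concentration': conc}, index=idx)

df_a = make_frame(['2020-01-01', '2020-01-02'], [1.0, 2.0])
df_b = make_frame(['2020-01-02', '2020-01-03'], [5.0, 7.0])

# Sum concentration where a (lat, lon, time) key occurs in both frames;
# keep the first value seen for every other column.
how = {col: 'first' for col in df_a.columns}
how['concentration'] = 'sum'
combined = (pd.concat([df_a, df_b])
              .groupby(level=['lat', 'lon', 'time'])
              .agg(how))
print(combined)
```

Here the shared date 2020-01-02 ends up with a concentration of 7.0 (2.0 + 5.0), while the dates present in only one frame keep their original values. Note that `pd.concat` copies both inputs, which may matter at the scale of 1e9 rows, and that any duplicate keys within a single frame would be summed as well.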

Self-answering: there was an error in the comment above that caused a double addition. This is correct:

newdata = df2.pop('b')
result = df1.combine_first(df2)
result['b']= result['b'].add(newdata, fill_value=0)

This seems to provide the solution for my use case.
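
The same steps can be wrapped in a small function that leaves its inputs untouched (the snippet above removes `b` from `df2` via `pop`). This is a sketch only: the name `combine_and_sum` is made up for illustration, and the frames are cut down to the columns `a` and `b`.

```python
import numpy as np
import pandas as pd

def combine_and_sum(left, right, col):
    """Union the rows of left and right, summing only `col` where rows overlap."""
    # Take the column to add without mutating `right`.
    addend = right[col]
    # combine_first keeps left's values and fills in rows found only in right.
    result = left.combine_first(right.drop(columns=col))
    # Add right's column on top; fill_value=0 covers keys missing on one side.
    result[col] = result[col].add(addend, fill_value=0)
    return result

idx1 = pd.MultiIndex.from_product([['A2', 'A3', 'A4'], ['C0', 'C1']])
idx2 = pd.MultiIndex.from_product([['A0', 'A1', 'A2', 'A3'], ['C0', 'C1']])
df1 = pd.DataFrame({'a': np.arange(6), 'b': np.arange(6)}, index=idx1)
df2 = pd.DataFrame({'a': np.arange(8), 'b': np.arange(8)}, index=idx2)

result = combine_and_sum(df1, df2, 'b')
print(result)
```

With these inputs the result has all ten rows (A0 through A4). Rows present only in `df2` (A0 and A1) keep `df2`'s own `b` values 0, 1, 2, 3, whereas the "what I want" table in the question shows 0, 2, 4, 6 for those rows, which looks like the double addition mentioned in the answer. Overlapping rows get the sums 4, 6, 8, 10, and A4 keeps 4 and 5.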
