pandas DataFrame concat / update ("upsert")?

Question

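I am looking for an elegant way to append all the rows from one DataFrame to another DataFrame (both DataFrames have the same index and column structure), with the row from the second DataFrame taking precedence whenever the same index value appears in both.

So, for example, if I start with: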

df1:
                    A      B
    date
    '2015-10-01'  'A1'   'B1'
    '2015-10-02'  'A2'   'B2'
    '2015-10-03'  'A3'   'B3'

df2:
                    A      B
    date
    '2015-10-02'  'a1'   'b1'
    '2015-10-03'  'a2'   'b2'
    '2015-10-04'  'a3'   'b3'

I would like the result to be:

                    A      B
    date
    '2015-10-01'  'A1'   'B1'
    '2015-10-02'  'a1'   'b1'
    '2015-10-03'  'a2'   'b2'
    '2015-10-04'  'a3'   'b3'

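This is similar to what I believe some SQL systems call an "upsert": a combination of update and insert, in the sense that each row from df2 is either (a) used to update an existing row in df1, if the row key already exists in df1, or (b) inserted at the end of df1, if the row key does not exist yet.

I came up with the following: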

(
    pd.concat([df1, df2])     # concat the two DataFrames
    .reset_index()            # turn 'date' into a regular column
    .groupby('date')          # group rows by values in the 'date' column
    .tail(1)                  # take the last row in each group
    .set_index('date')        # restore 'date' as the index
)

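This seems to work, but it relies on the order of rows within each groupby group always matching the order in the original DataFrames, which I have not verified, and it seems unpleasantly convoluted.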
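One way to probe the ordering concern on the sample data is to compare the pipeline with a variant that keeps only the last occurrence of each index label via `Index.duplicated(keep='last')`. This is a minimal sketch: the frames below simply re-create the sample data, the variable names are illustrative, and agreement on one example is not a proof for every input.

```python
import pandas as pd

# Re-create the sample frames from the question.
df1 = pd.DataFrame(
    {"A": ["A1", "A2", "A3"], "B": ["B1", "B2", "B3"]},
    index=pd.Index(["2015-10-01", "2015-10-02", "2015-10-03"], name="date"),
)
df2 = pd.DataFrame(
    {"A": ["a1", "a2", "a3"], "B": ["b1", "b2", "b3"]},
    index=pd.Index(["2015-10-02", "2015-10-03", "2015-10-04"], name="date"),
)

# The groupby-based pipeline from the question.
via_groupby = (
    pd.concat([df1, df2])
    .reset_index()
    .groupby("date")
    .tail(1)
    .set_index("date")
)

# Variant: concat, then drop every index label that occurs again later,
# so the row coming from df2 survives when a date is present in both.
combined = pd.concat([df1, df2])
via_duplicated = combined[~combined.index.duplicated(keep="last")]

print(via_groupby)
print(via_duplicated)
```

The second variant states the "last one wins" rule explicitly instead of relying on the row order inside each group.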

Does anyone have any ideas for a more straightforward solution?

Answer 1

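One solution is to concatenate df1 with the new rows of df2 (i.e. the rows whose index does not match any index in df1), then update the values with those from df2.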

df = pd.concat([df1, df2[~df2.index.isin(df1.index)]])
df.update(df2)

>>> df
             A   B
2015-10-01  A1  B1
2015-10-02  a1  b1
2015-10-03  a2  b2
2015-10-04  a3  b3

EDIT: Per the suggestion of @chrisb, this can be simplified further as follows:

pd.concat([df1[~df1.index.isin(df2.index)], df2])

Thanks Chris!
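For reference, here is a self-contained sketch that runs both variants on the sample data from the question. The frame construction and the names `two_step` and `one_liner` are illustrative and not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"A": ["A1", "A2", "A3"], "B": ["B1", "B2", "B3"]},
    index=pd.Index(["2015-10-01", "2015-10-02", "2015-10-03"], name="date"),
)
df2 = pd.DataFrame(
    {"A": ["a1", "a2", "a3"], "B": ["b1", "b2", "b3"]},
    index=pd.Index(["2015-10-02", "2015-10-03", "2015-10-04"], name="date"),
)

# Variant 1: append only the rows of df2 that are new, then overwrite
# the matching rows in place with the values from df2.
two_step = pd.concat([df1, df2[~df2.index.isin(df1.index)]])
two_step.update(df2)

# Variant 2: drop the rows of df1 that df2 replaces, then append all of df2.
one_liner = pd.concat([df1[~df1.index.isin(df2.index)], df2])

print(two_step)
print(one_liner)
```

On this data both variants give the same rows. They can differ when df2 contains NaN or lacks some columns, because `DataFrame.update` only copies non-NA values by default, whereas the one-liner replaces matching rows wholesale.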

Answer 2

With pandas 1.0.3, the desired UPSERT behavior is available directly through combine_first (the method itself also exists in earlier releases):

combined = df2.combine_first(df1)

print(combined)
#               A   B
# 2015-10-01    A1  B1
# 2015-10-02    a1  b1
# 2015-10-03    a2  b2
# 2015-10-04    a3  b3

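To get this UPSERT behavior, the DataFrame whose data takes priority (the more recent one, df2 in this case) must be the one the method is called on.

It basically (1) aligns rows and columns, (2) prioritizes non-NaN data, and (3) where a data point is defined in both DataFrames, prioritizes the df2 value, which is basically what you want.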
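Point (2) has a consequence worth seeing once: a NaN in df2 does not overwrite an existing value from df1. The sketch below uses small made-up frames (the labels `x`, `y`, `z` are illustrative, not from the question) to show it.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"A": ["A1", "A2"], "B": ["B1", "B2"]}, index=["x", "y"])
# Row "y" exists in both frames; its "B" value is missing in df2.
df2 = pd.DataFrame({"A": ["a2", "a3"], "B": [np.nan, "b3"]}, index=["y", "z"])

combined = df2.combine_first(df1)

# "A" of row "y" comes from df2, while "B" falls back to the df1 value
# because the df2 entry is NaN.
print(combined)
```

If a NaN in the newer frame is meant to replace the old value, the concat-based one-liner from Answer 1 is the variant that replaces matching rows wholesale.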

Answer 3

In addition to the answers above, watch out for columns that are not present in both DataFrames:

    df1 = pd.DataFrame([['test',1, True], ['test2',2, True]]).set_index(0)
    df2 = pd.DataFrame([['test2',4], ['test3',3]]).set_index(0)

If you use the one-line concat solution above as is, you will get:

    >>>     1   2
    0       
    test    1   True
    test2   4   NaN
    test3   3   NaN

However, if you expect the following output:

    >>>     1   2
    0       
    test    1   True
    test2   4   True
    test3   3   NaN

simply change the statement to:

    df1 = pd.concat([df1, df2[~df2.index.isin(df1.index)]])
    df1.update(df2)
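Put together as a runnable sketch, the two behaviours can be compared side by side. The names `one_liner` and `two_step` are illustrative, and the frames are the ones defined at the top of this answer.

```python
import pandas as pd

df1 = pd.DataFrame([["test", 1, True], ["test2", 2, True]]).set_index(0)
df2 = pd.DataFrame([["test2", 4], ["test3", 3]]).set_index(0)

# One-liner: the "test2" row is taken entirely from df2, so column 2,
# which df2 does not have, becomes NaN for that row.
one_liner = pd.concat([df1[~df1.index.isin(df2.index)], df2])

# Two-step: only the values that df2 actually provides are overwritten,
# so the existing value in column 2 of "test2" is kept.
two_step = pd.concat([df1, df2[~df2.index.isin(df1.index)]])
two_step.update(df2)

print(one_liner)
print(two_step)
```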

Answer 4

This question was very helpful to me.

@Alexander's answer is already good enough.

I just want to explain a little how the two solutions differ:

Multi-line

def upsert(target, new):
  df = pd.concat([target, new[~new.index.isin(target.index)]], sort=False)
  df.update(new)
  return df

and

One-liner

def upsert(target, new):
  return pd.concat([target[~target.index.isin(new.index)], new], sort=False)

An example shows the difference:

>>> df = pd.DataFrame([('python', None, 4),
                       ('java',   '1992', 3),
                       ('javascript', None, 2)],
                      columns=['lang', 'year', 'rank'])
>>> df
    lang    year    rank
0   python  None    4
1   java    1992    3
2   javascript  None    2

>>> upsert_df = pd.DataFrame([('python', '1987'),
                              ('GOLANG', '2009')],
                             columns=['lang', 'year'])
>>> upsert_df
    lang    year
0   python  1987
1   GOLANG  2009

>>> target, new = df.set_index(['lang']), upsert_df.set_index(['lang'])
>>> upsert(target, new, full_update=True)
    year    rank
lang        
java    1992    3.0
javascript  None    2.0
python  1987    NaN
GOLANG  2009    NaN

>>> upsert(target, new, full_update=False)
    year    rank
lang        
python  1987    4.0
java    1992    3.0
javascript  None    2.0
GOLANG  2009    NaN

where the upsert function looks like this:

def upsert(target, new, *, full_update, sort=False) -> pd.DataFrame:
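  """Upsert the rows of `new` into `target`, matching on the index.

  :param target: DataFrame to upsert into (not modified; a new DataFrame is returned);
  :param new: DataFrame holding the rows to update or insert;
  :param full_update: bool: if True, rows that match are replaced entirely by the
      rows of `new` (columns missing from `new` become NaN); if False, only the
      non-NA values of `new` overwrite matching rows and other columns are kept;
  :param sort: bool: passed through to pd.concat; controls whether the columns
      are sorted when new.columns != target.columns.

  Reference: https://stackoverflow.com/a/33002097
  """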

  if full_update:
    return pd.concat([target[~target.index.isin(new.index)], new], sort=sort)
  else:
    df = pd.concat([target, new[~new.index.isin(target.index)]], sort=sort)
    df.update(new)
    return df

Answer 1: score 33, accepted, 2015-10-07 20:44:50
Answer 2: score 4, 2021-02-26 05:25:33
Answer 3: score 3, 2019-05-08 15:36:36
Answer 4: score -3, 2020-12-24 04:15:02
