
pandas: update a dataframe with values from another dataframe on the match of column values

I have a dataframe with stock ticker names and dates as two columns, and I would like to update this dataframe with the price value from another, larger dataframe by matching on those two columns.

E.g., df1:

ticker  Date
AAPL    2022-01-03
GE      2022-04-18

df2:

ticker   Date             Close
AAPL     2022-01-02       120
AAPL     2022-01-03       122
AAPL     2022-01-04       125
AAPL     2022-01-05       121
.
.
.
GE     2022-04-16       20
GE     2022-04-17       22
GE     2022-04-18       25
GE     2022-04-19       21

The output should be:

ticker  Date         Close
AAPL    2022-01-03   122
GE      2022-04-18   25

I can do a loop and update row by row, but I would like to check whether there is a pythonic way that operates on whole series/vectors.

TL;DR: If you can index your dataframes in advance, each individual join of two dataframes (one with 500 rows, one with 25 million) can run about 10,000 times faster than merge(). If you are doing only a single join of such dataframes, merge() is about as fast as the alternatives.


Your question says:

I have a dataframe with stock ticker names and dates as two columns, and I would like to update this dataframe with the price value from another, larger dataframe by matching on those two columns.

... and you ask:

I would like to check whether there is a pythonic way that operates on whole series/vectors.

If your question is asked in the context of just this single dataframe query, then you probably can't get better performance than merge().

However, if you have the option of initializing your dataframes to use ticker, Date as their index, or if you can at least set their indexes to ticker, Date before needing to run multiple queries of the kind described in your question, then you can beat merge().

Here is a benchmark of 6 different strategies for df1 with 500 rows and df2 with 25 million rows:
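As a minimal sketch of that idea, the example below indexes a small stand-in for the large price table once and then reuses it for two separate lookups. The frames query_a and query_b are hypothetical illustrations, not part of the question, and the data is a shortened version of the question's df2.

```python
import pandas as pd

# Small stand-in for the large price table from the question.
df2 = pd.DataFrame(
    {
        "ticker": ["AAPL", "AAPL", "AAPL", "GE", "GE", "GE"],
        "Date": ["2022-01-02", "2022-01-03", "2022-01-04",
                 "2022-04-17", "2022-04-18", "2022-04-19"],
        "Close": [120, 122, 125, 22, 25, 21],
    }
)

# Pay the indexing cost once, up front.
df2_indexed = df2.set_index(["ticker", "Date"])

# Two separate (hypothetical) queries, each indexed the same way.
query_a = pd.DataFrame(
    {"ticker": ["AAPL", "GE"], "Date": ["2022-01-03", "2022-04-18"]}
).set_index(["ticker", "Date"])
query_b = pd.DataFrame(
    {"ticker": ["AAPL"], "Date": ["2022-01-04"]}
).set_index(["ticker", "Date"])

# Each join reuses the already-indexed price table.
result_a = query_a.join(df2_indexed).reset_index()
result_b = query_b.join(df2_indexed).reset_index()
print(result_a)
print(result_b)
```

Only the set_index call on the large table is expensive, so the more queries share it, the more the up-front cost is amortized.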

Timeit results:
foo_1 (merge) ran in 23.043055499998445 seconds
foo_2 (indexed join) ran in 51.69773360000181 seconds
foo_3 (pre-indexed join) ran in 0.0027679000013449695 seconds
foo_4 (pre-indexed df1 join) ran in 24.431038499998976 seconds
foo_5 (merge right) ran in 22.99117219999971 seconds
foo_6 (pre-indexed assign) ran in 0.007970200000272598 seconds

Note that pre-indexed join is about 10,000x faster than merge (and pre-indexed assign is also quick, at about 3,000x faster), primarily because pre-indexed dataframes support hash-table lookups on the index, with O(1) key search time versus worst-case O(n) time for non-indexed keys. However, indexed join is more than twice as slow as merge, because indexed join includes the initial indexing effort (which needs to be done only once for multiple queries such as the one in your question, and which is excluded from pre-indexed join).

Explanation of the various strategies:

  • The merge strategy uses no indexing.
  • The indexed join strategy includes the time to index both dataframes.
  • The pre-indexed join strategy excludes the initial overhead of indexing both dataframes.
  • The pre-indexed df1 join strategy excludes the initial overhead of indexing df1 but works with an unindexed df2.
  • The merge right strategy swaps df1 and df2 as object and argument of merge().
  • The pre-indexed assign strategy doesn't use merge() or join(), instead doing an index-aligned assignment from df2 to a new column in df1.

Here's the code for each strategy:
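To illustrate what index-aligned assignment means in the last strategy, here is a small sketch on data shaped like the question's. The MSFT row is a hypothetical addition with no match in df2, included to show what happens to unmatched keys. The sketch assumes each (ticker, Date) pair appears at most once in df2.

```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "ticker": ["AAPL", "GE", "MSFT"],  # MSFT is hypothetical, absent from df2
        "Date": ["2022-01-03", "2022-04-18", "2022-01-03"],
    }
)
df2 = pd.DataFrame(
    {
        "ticker": ["AAPL", "AAPL", "GE", "GE"],
        "Date": ["2022-01-02", "2022-01-03", "2022-04-18", "2022-04-19"],
        "Close": [120, 122, 25, 21],
    }
)

# Index both frames on the matching columns.
df1_indexed = df1.set_index(["ticker", "Date"])
df2_indexed = df2.set_index(["ticker", "Date"])

# The Close series is aligned to df1's index labels: rows of df2 that df1
# does not mention are dropped, and df1 keys missing from df2 become NaN.
result = df1_indexed.assign(Close=df2_indexed["Close"]).reset_index()
print(result)
```

No explicit join is requested here; pandas matches rows by index label when the series is assigned as a new column.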

import pandas as pd

df1_orig = pd.DataFrame([('A'+str(i) , f'{2020+i//365}-{(i//28)%12 + 1}-{i%28 + 1}') for i in range(500)], columns=['ticker', 'Date'])
print(df1_orig)

df2_orig = pd.DataFrame([('A'+str(i) , f'{2020+(i%500)//365}-{((i%500)//28)%12 + 1}-{(i%500)%28 + 1}', (10 * i + 1) % 300) for i in range(25_000_000)], columns=['ticker', 'Date', 'Close'])
print(df2_orig)

df1_indexed_orig = df1_orig.set_index(['ticker', 'Date'])
df2_indexed_orig = df2_orig.set_index(['ticker', 'Date'])

# merge
def foo_1(df1, df2):
    df1 = df1.merge(df2, on = ['ticker', 'Date'], how = 'left')
    return df1

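# indexed join
def foo_2(df1, df2):
    # index both frames inside the timed call; set_index without inplace
    # leaves the caller's dataframes untouched for the other strategies
    df1 = df1.set_index(['ticker', 'Date'])
    df2 = df2.set_index(['ticker', 'Date'])
    df1 = df1.join(df2)
    return df1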

# pre-indexed join
def foo_3(df1, df2):
    # called with df1_indexed_orig and df2_indexed_orig
    df1 = df1.join(df2)
    return df1

# pre-indexed df1 join
def foo_4(df1, df2):
    # called with df1_indexed_orig
    df1 = df2.join(df1, on = ['ticker', 'Date'], how = 'right')
    return df1

# merge right
def foo_5(df1, df2):
    df1 = df2.merge(df1, on = ['ticker', 'Date'], how = 'right')
    return df1

# pre-indexed assign
def foo_6(df1, df2):
    # called with df1_indexed_orig and df2_indexed_orig
    df1 = df1.assign(Close=df2.Close)
    return df1
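
The timing harness behind the results is not shown in the answer. As a rough sketch only, a comparison of this kind might be run with the standard timeit module as below. The function names, the make_frames helper, and the much smaller frame sizes are all assumptions made for illustration, so absolute numbers will not match the ones reported here.

```python
import timeit

import pandas as pd

KEYS = ["ticker", "Date"]


def make_frames(n_small=50, n_large=5_000):
    """Build small stand-ins for df1 and df2 (sizes are illustrative only)."""
    large = pd.DataFrame(
        {
            "ticker": [f"A{i}" for i in range(n_large)],
            "Date": [f"2020-01-{i % 28 + 1:02d}" for i in range(n_large)],
            "Close": [(10 * i + 1) % 300 for i in range(n_large)],
        }
    )
    # The small frame holds the key columns of the first n_small rows.
    small = large[KEYS].head(n_small).copy()
    return small, large


def merge_strategy(df1, df2):
    return df1.merge(df2, on=KEYS, how="left")


def indexed_join_strategy(df1, df2):
    # Indexing cost is part of the timed call; set_index returns new
    # frames, so the inputs are not mutated between repetitions.
    return df1.set_index(KEYS).join(df2.set_index(KEYS))


def pre_indexed_join_strategy(df1_indexed, df2_indexed):
    return df1_indexed.join(df2_indexed)


df1, df2 = make_frames()
df1_indexed, df2_indexed = df1.set_index(KEYS), df2.set_index(KEYS)

runs = {
    "merge": lambda: merge_strategy(df1, df2),
    "indexed join": lambda: indexed_join_strategy(df1, df2),
    "pre-indexed join": lambda: pre_indexed_join_strategy(df1_indexed, df2_indexed),
}

# Time each strategy over a few repetitions and report total seconds.
timings = {name: timeit.timeit(fn, number=3) for name, fn in runs.items()}
for name, seconds in timings.items():
    print(f"{name} ran in {seconds} seconds")
```

At sizes this small the gaps between strategies are likely to be far less dramatic than with 25 million rows.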

Try merging on those two columns:

df1.merge(df2, on = ['ticker', 'Date'], how = 'left')
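
For reference, here is a self-contained sketch of that merge on a shortened version of the question's sample data. It assumes the Date columns in both frames have the same dtype (plain strings here); if one were datetime and the other string, the keys would not match as intended.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"ticker": ["AAPL", "GE"], "Date": ["2022-01-03", "2022-04-18"]}
)
df2 = pd.DataFrame(
    {
        "ticker": ["AAPL", "AAPL", "AAPL", "GE", "GE", "GE"],
        "Date": ["2022-01-02", "2022-01-03", "2022-01-04",
                 "2022-04-17", "2022-04-18", "2022-04-19"],
        "Close": [120, 122, 125, 22, 25, 21],
    }
)

# Left merge keeps every row of df1 and pulls in Close where both keys match.
result = df1.merge(df2, on=["ticker", "Date"], how="left")
print(result)
```

merge() returns a new dataframe rather than modifying df1, so assign the result back to df1 if you want to update it.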
