pandas: update a dataframe with values from another dataframe on the match of column values
I have a dataframe with stock ticker names and dates as two columns, and I would like to update this dataframe with the price value from another, larger dataframe by matching on those two columns.
E.g., df1:
ticker Date
AAPL 2022-01-03
GE 2022-04-18
df2:
ticker Date Close
AAPL 2022-01-02 120
AAPL 2022-01-03 122
AAPL 2022-01-04 125
AAPL 2022-01-05 121
.
.
.
GE 2022-04-16 20
GE 2022-04-17 22
GE 2022-04-18 25
GE 2022-04-19 21
The output should be:
ticker Date Close
AAPL 2022-01-03 122
GE 2022-04-18 25
I can write a loop and update row by row, but I would like to know whether there is a pythonic way that operates on whole series/vectors.
TL;DR: If you can index your dataframes in advance, each individual join of two dataframes (one with 500 rows, one with 25 million) can run about 10,000 times faster than merge(); if you are doing only a single join of such dataframes, merge() is about as fast as the alternatives.
Your question says:
"I have a dataframe with stock ticker names and dates as two columns, and i would like to update this dataframe with price value from another larger dataframe matching those 2 columns"
... and you ask:
"i would like to check if there is a pythonic way using the whole series/vectors"
If your question is asked in the context of just this single dataframe query, then you probably can't get better performance than merge().
However, if you have the option of initializing your dataframes with ticker, Date as their index, or if you can at least set their indexes to ticker, Date before you need to run multiple queries of the kind described in your question, then you can beat merge().
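As a rough illustration of "index once, query many times", here is a minimal sketch using the sample rows from the question. The names df1_indexed and df2_indexed are just illustrative, and the dates are kept as plain strings, which is an assumption about your data (both frames must use the same dtype for Date).

```python
import pandas as pd

# Sample rows from the question; Date is kept as a string here (assumption).
df1 = pd.DataFrame({"ticker": ["AAPL", "GE"],
                    "Date": ["2022-01-03", "2022-04-18"]})
df2 = pd.DataFrame({
    "ticker": ["AAPL"] * 4 + ["GE"] * 4,
    "Date": ["2022-01-02", "2022-01-03", "2022-01-04", "2022-01-05",
             "2022-04-16", "2022-04-17", "2022-04-18", "2022-04-19"],
    "Close": [120, 122, 125, 121, 20, 22, 25, 21],
})

# Pay the indexing cost once, up front.
df1_indexed = df1.set_index(["ticker", "Date"])
df2_indexed = df2.set_index(["ticker", "Date"])

# Every later query is an index-aligned left join with no re-indexing.
result = df1_indexed.join(df2_indexed).reset_index()
print(result)
```

The reset_index() call at the end only turns ticker and Date back into ordinary columns so the result has the shape asked for in the question; skip it if you want to keep working with the index.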
Here is a benchmark of 6 different strategies for df1 with 500 rows and df2 with 25 million rows:
Timeit results:
foo_1 (merge) ran in 23.043055499998445 seconds
foo_2 (indexed join) ran in 51.69773360000181 seconds
foo_3 (pre-indexed join) ran in 0.0027679000013449695 seconds
foo_4 (pre-indexed df1 join) ran in 24.431038499998976 seconds
foo_5 (merge right) ran in 22.99117219999971 seconds
foo_6 (pre-indexed assign) ran in 0.007970200000272598 seconds
Note that pre-indexed join is about 10,000x faster than merge (roughly 8,300x by the timings above), and pre-indexed assign is also quick at about 3,000x faster. This is primarily because a pre-indexed dataframe has hash table access by index, with a key search time of O(1), versus a worst case of O(n) for non-indexed keys. However, indexed join is more than twice as slow as merge, because indexed join includes the initial indexing effort. That effort can be done just once for multiple queries such as the one in your question, and it is excluded from the pre-indexed join timing.
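To make the difference between the two access paths concrete, here is a small sketch on the sample rows from the question. It only contrasts a lookup through the index with a boolean-mask scan over the columns; it is not a measurement, and the variable names are illustrative.

```python
import pandas as pd

df2 = pd.DataFrame({
    "ticker": ["AAPL"] * 4 + ["GE"] * 4,
    "Date": ["2022-01-02", "2022-01-03", "2022-01-04", "2022-01-05",
             "2022-04-16", "2022-04-17", "2022-04-18", "2022-04-19"],
    "Close": [120, 122, 125, 121, 20, 22, 25, 21],
})
df2_indexed = df2.set_index(["ticker", "Date"])

# Lookup through the (ticker, Date) index.
close_by_index = df2_indexed.loc[("AAPL", "2022-01-03"), "Close"]

# Same lookup without an index: the mask compares every row of df2.
mask = (df2["ticker"] == "AAPL") & (df2["Date"] == "2022-01-03")
close_by_scan = df2.loc[mask, "Close"].iloc[0]

print(close_by_index, close_by_scan)
```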
Explanation of the various strategies:
merge strategy uses no indexing.
indexed join strategy includes the time to index both dataframes.
pre-indexed join strategy excludes the initial overhead of indexing both dataframes.
pre-indexed df1 join strategy excludes the initial overhead of indexing df1 but works with an unindexed df2.
merge right strategy swaps df1 and df2 as object and argument of merge().
pre-indexed assign strategy doesn't use merge() or join(), instead doing an index-aligned assignment from df2 to a new column in df1.
Here's the code for each strategy:
import pandas as pd

# df1: 500 (ticker, Date) pairs to look up
df1_orig = pd.DataFrame(
    [('A'+str(i), f'{2020+i//365}-{(i//28)%12 + 1}-{i%28 + 1}') for i in range(500)],
    columns=['ticker', 'Date'])
print(df1_orig)
# df2: 25 million price rows (building this takes a lot of time and memory)
df2_orig = pd.DataFrame(
    [('A'+str(i), f'{2020+(i%500)//365}-{((i%500)//28)%12 + 1}-{(i%500)%28 + 1}', (10 * i + 1) % 300)
     for i in range(25_000_000)],
    columns=['ticker', 'Date', 'Close'])
print(df2_orig)
# indexed versions, built once, for the pre-indexed strategies
df1_indexed_orig = df1_orig.set_index(['ticker', 'Date'])
df2_indexed_orig = df2_orig.set_index(['ticker', 'Date'])

# merge
def foo_1(df1, df2):
    df1 = df1.merge(df2, on = ['ticker', 'Date'], how = 'left')
    return df1

# indexed join
def foo_2(df1, df2):
    # inplace=True mutates the arguments, so pass copies of the unindexed frames
    df1.set_index(['ticker', 'Date'], inplace=True)
    df2.set_index(['ticker', 'Date'], inplace=True)
    df1 = df1.join(df2)
    return df1

# pre-indexed join
def foo_3(df1, df2):
    # called with df1_indexed_orig and df2_indexed_orig
    df1 = df1.join(df2)
    return df1

# pre-indexed df1 join
def foo_4(df1, df2):
    # called with df1_indexed_orig
    df1 = df2.join(df1, on = ['ticker', 'Date'], how = 'right')
    return df1

# merge right
def foo_5(df1, df2):
    df1 = df2.merge(df1, on = ['ticker', 'Date'], how = 'right')
    return df1

# pre-indexed assign
def foo_6(df1, df2):
    # called with df1_indexed_orig and df2_indexed_orig
    df1 = df1.assign(Close=df2.Close)
    return df1
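The timing harness itself is not shown above, so the following is only a sketch of how such a comparison could be run with timeit. Everything in it is an assumption made for illustration: the much smaller sizes (500 and 50,000 rows, so it finishes quickly), the single shared date, and the function names. It covers four of the six strategies, and absolute numbers will differ from the results quoted earlier.

```python
import timeit
import pandas as pd

# Hypothetical scaled-down data so the sketch runs in a moment.
n_small, n_large = 500, 50_000
keys = ["ticker", "Date"]
df1 = pd.DataFrame({"ticker": [f"A{i}" for i in range(n_small)],
                    "Date": "2022-01-03"})
df2 = pd.DataFrame({"ticker": [f"A{i}" for i in range(n_large)],
                    "Date": "2022-01-03",
                    "Close": [(10 * i + 1) % 300 for i in range(n_large)]})

# Indexing done once, outside the timed functions.
df1_indexed = df1.set_index(keys)
df2_indexed = df2.set_index(keys)

def merge_strategy():
    return df1.merge(df2, on=keys, how="left")

def indexed_join():
    # set_index returns new frames here, so repeated runs are safe.
    return df1.set_index(keys).join(df2.set_index(keys))

def pre_indexed_join():
    return df1_indexed.join(df2_indexed)

def pre_indexed_assign():
    return df1_indexed.assign(Close=df2_indexed["Close"])

strategies = [merge_strategy, indexed_join, pre_indexed_join, pre_indexed_assign]

# Check that every strategy returns the same prices before timing anything.
closes = [fn()["Close"].tolist() for fn in strategies]
all_agree = all(c == closes[0] for c in closes)

timings = {fn.__name__: timeit.timeit(fn, number=5) for fn in strategies}
for name, seconds in timings.items():
    print(f"{name} ran in {seconds:.6f} seconds (5 runs)")
```

Checking that the strategies agree before timing them is worth the extra line, since a fast strategy that returns different rows is not a fair comparison.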
Try merging on these two columns:
df1.merge(df2, on = ['ticker', 'Date'], how = 'left')
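For completeness, here is that one-liner applied to the sample rows from the question, as a minimal sketch. The validate="many_to_one" argument is an optional addition of mine, not part of the answer above: it asks pandas to raise an error if df2 contains more than one row for the same (ticker, Date) pair, which would otherwise silently duplicate rows of df1. The dates are kept as strings here, which is an assumption; if your Date columns are datetimes, they need to be the same dtype in both frames.

```python
import pandas as pd

df1 = pd.DataFrame({"ticker": ["AAPL", "GE"],
                    "Date": ["2022-01-03", "2022-04-18"]})
df2 = pd.DataFrame({
    "ticker": ["AAPL"] * 4 + ["GE"] * 4,
    "Date": ["2022-01-02", "2022-01-03", "2022-01-04", "2022-01-05",
             "2022-04-16", "2022-04-17", "2022-04-18", "2022-04-19"],
    "Close": [120, 122, 125, 121, 20, 22, 25, 21],
})

# Left merge keeps every row of df1 and pulls in the matching Close.
merged = df1.merge(df2, on=["ticker", "Date"], how="left",
                   validate="many_to_one")
print(merged)
```

With how="left", a (ticker, Date) pair that has no match in df2 stays in the result with NaN in Close instead of being dropped.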