JOIN two DataFrames and replace Column values in Python

I have a DataFrame df1:

    Expenses        Calendar    Actual
0   xyz             2020-01-01  10
1   xyz             2020-02-01  99
2   txn vol(new)    2020-01-01  5
3   txn vol(new)    2020-02-01  20
4   txn vol(tenu)   2020-01-01  30
5   txn vol(tenu)   2020-02-01  40

Second DataFrame df2:

    Expenses        Calendar    Actual
0   txn vol(new)    2020-01-01  23
1   txn vol(new)    2020-02-01  32
2   txn vol(tenu)   2020-01-01  60

I want to keep every row of df1, join it to df2 on Expenses + Calendar, and replace the Actual value in df1 with the one from df2 wherever a match exists.

Expected output is:

    Expenses        Calendar    Actual
0   xyz             2020-01-01  10
1   xyz             2020-02-01  99
2   txn vol(new)    2020-01-01  23
3   txn vol(new)    2020-02-01  32
4   txn vol(tenu)   2020-01-01  60
5   txn vol(tenu)   2020-02-01  40

I am using the code below:

cols_to_replace = ['Actual']
keys = ['Calendar', 'Expenses']
idx1 = df1.set_index(keys).index
idx2 = df2.set_index(keys).index
df1.loc[idx1.isin(idx2), cols_to_replace] = df2.loc[idx2.isin(idx1), cols_to_replace].values

It works when df1 is small. With larger data (df1 has 10K records and df2 has 150), rows are updated with the wrong values. Could anyone please suggest how to resolve this?

Thank you

Here is one way to do it, using pd.merge:

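# Left-join df2 onto df1; df1's column becomes Actual_x, df2's stays Actual
df1 = df1.merge(df2,
        on=['Expenses', 'Calendar'],
        how='left',
        suffixes=('_x', None)).ffill(axis=1).drop(columns='Actual_x')
# ffill(axis=1) fills rows that have no match in df2 with df1's original value
df1['Actual'] = df1['Actual'].astype(int)
df1
    Expenses        Calendar    Actual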
0   xyz             2020-01-01      10
1   xyz             2020-02-01      99
2   txn vol(new)    2020-01-01      23
3   txn vol(new)    2020-02-01      32
4   txn vol(tenu)   2020-01-01      60
5   txn vol(tenu)   2020-02-01      40
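
For comparison, here is a minimal sketch of the same left merge that fills the gaps with fillna on the single Actual column instead of ffill(axis=1) across the whole row. The sample frames are rebuilt from the question; the "_new" suffix and the names keys, merged and result are just illustrative, and validate="m:1" is an optional extra that assumes each Expenses + Calendar pair appears at most once in df2.

```python
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({
    "Expenses": ["xyz", "xyz", "txn vol(new)", "txn vol(new)",
                 "txn vol(tenu)", "txn vol(tenu)"],
    "Calendar": ["2020-01-01", "2020-02-01"] * 3,
    "Actual": [10, 99, 5, 20, 30, 40],
})
df2 = pd.DataFrame({
    "Expenses": ["txn vol(new)", "txn vol(new)", "txn vol(tenu)"],
    "Calendar": ["2020-01-01", "2020-02-01", "2020-01-01"],
    "Actual": [23, 32, 60],
})

keys = ["Expenses", "Calendar"]

# Left merge keeps every df1 row; validate="m:1" raises if df2 has duplicate keys
merged = df1.merge(df2, on=keys, how="left",
                   suffixes=("", "_new"), validate="m:1")

# Prefer df2's value where a match exists, otherwise keep df1's value
merged["Actual"] = merged["Actual_new"].fillna(merged["Actual"]).astype(int)
result = merged.drop(columns="Actual_new")
print(result)
```

Because the match is done on the key columns, the row order of df2 does not matter here.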

If I understand your solution correctly, it seems to assume that (1) the Calendar - Expenses combinations are unique and (2) their occurrences in both dataframes are aligned (same order). I suspect that (2) isn't actually the case?
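
To illustrate point (2), here is a small sketch that runs the question's assignment on the sample data, but with df2's rows in a different order. The reordered df2 is an assumption made up for the demonstration; the real 10K-row data may be failing for a different reason, such as duplicate keys.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Expenses": ["xyz", "xyz", "txn vol(new)", "txn vol(new)",
                 "txn vol(tenu)", "txn vol(tenu)"],
    "Calendar": ["2020-01-01", "2020-02-01"] * 3,
    "Actual": [10, 99, 5, 20, 30, 40],
})
# Same rows as df2 in the question, but deliberately in a different order
df2 = pd.DataFrame({
    "Expenses": ["txn vol(tenu)", "txn vol(new)", "txn vol(new)"],
    "Calendar": ["2020-01-01", "2020-01-01", "2020-02-01"],
    "Actual": [60, 23, 32],
})

keys = ["Calendar", "Expenses"]
idx1 = df1.set_index(keys).index
idx2 = df2.set_index(keys).index

broken = df1.copy()
# .values drops the index, so the values are assigned purely by position
broken.loc[idx1.isin(idx2), ["Actual"]] = df2.loc[idx2.isin(idx1), ["Actual"]].values

print(broken["Actual"].tolist())  # [10, 99, 60, 23, 32, 40]
```

The matched rows are found correctly, but 60, 23 and 32 land on the wrong keys because nothing ties each value back to its Expenses + Calendar pair.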

Another option (.merge() is fine too) could be:

df1 = df1.set_index(["Expenses", "Calendar"])
df2 = df2.set_index(["Expenses", "Calendar"])
df1.loc[df2.index, "Actual"] = df2["Actual"]
df2 = df2.reset_index()  # If the original df2 is still needed
df1 = df1.reset_index()
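
One caveat with the index-based version: as far as I know, .loc raises a KeyError if df2 contains a key that is missing from df1. A hedged sketch that first restricts df2 to the keys present in df1 is shown below. The extra "abc" row and the names a, b and b_common are made up for illustration, and the approach still assumes the Expenses + Calendar pairs are unique in both frames.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Expenses": ["xyz", "xyz", "txn vol(new)", "txn vol(new)",
                 "txn vol(tenu)", "txn vol(tenu)"],
    "Calendar": ["2020-01-01", "2020-02-01"] * 3,
    "Actual": [10, 99, 5, 20, 30, 40],
})
# df2 from the question plus one hypothetical row whose key is not in df1
df2 = pd.DataFrame({
    "Expenses": ["txn vol(new)", "txn vol(new)", "txn vol(tenu)", "abc"],
    "Calendar": ["2020-01-01", "2020-02-01", "2020-01-01", "2020-01-01"],
    "Actual": [23, 32, 60, 1],
})

keys = ["Expenses", "Calendar"]
a = df1.set_index(keys)
b = df2.set_index(keys)

# Keep only the df2 rows whose key also exists in df1
b_common = b[b.index.isin(a.index)]

# Label-based assignment: each value is matched to its key, not its position
a.loc[b_common.index, "Actual"] = b_common["Actual"]

result = a.reset_index()
print(result)
```

Rows of df2 without a counterpart in df1 are simply ignored, which matches the "read all data from df1" requirement in the question.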
