JOIN two DataFrames and replace Column values in Python
I have a DataFrame df1:
        Expenses    Calendar  Actual
0            xyz  2020-01-01      10
1            xyz  2020-02-01      99
2   txn vol(new)  2020-01-01       5
3   txn vol(new)  2020-02-01      20
4  txn vol(tenu)  2020-01-01      30
5  txn vol(tenu)  2020-02-01      40
Second DataFrame df2:
        Expenses    Calendar  Actual
0   txn vol(new)  2020-01-01      23
1   txn vol(new)  2020-02-01      32
2  txn vol(tenu)  2020-01-01      60
Now I want to keep every row of df1, join it to df2 on Expenses + Calendar, and replace the Actual value in df1 with the one from df2 wherever a match exists.
Expected output is:
        Expenses    Calendar  Actual
0            xyz  2020-01-01      10
1            xyz  2020-02-01      99
2   txn vol(new)  2020-01-01      23
3   txn vol(new)  2020-02-01      32
4  txn vol(tenu)  2020-01-01      60
5  txn vol(tenu)  2020-02-01      40
I am using the code below:
cols_to_replace = ['Actual']
df1.loc[df1.set_index(['Calendar','Expenses']).index.isin(df2.set_index(['Calendar','Expenses']).index), cols_to_replace] = df2.loc[df2.set_index(['Calendar','Expenses']).index.isin(df1.set_index(['Calendar','Expenses']).index),cols_to_replace].values
It works when df1 is small. With larger data, rows are updated with the wrong values: df1 has 10K records and df2 has 150 records. Could anyone please suggest how to resolve this?
Thank you
Here is one way to do it, using pd.merge:
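# Left-join df2 onto df1: df1's column becomes Actual_x, df2's stays Actual
df = df1.merge(df2,
               on=['Expenses', 'Calendar'],
               how='left',
               suffixes=('_x', None)).ffill(axis=1).drop(columns='Actual_x')
# ffill(axis=1) copied Actual_x into Actual where df2 had no match; restore int dtype
df['Actual'] = df['Actual'].astype(int)
df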
        Expenses    Calendar  Actual
0            xyz  2020-01-01      10
1            xyz  2020-02-01      99
2   txn vol(new)  2020-01-01      23
3   txn vol(new)  2020-02-01      32
4  txn vol(tenu)  2020-01-01      60
5  txn vol(tenu)  2020-02-01      40
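A possible variation on the merge approach, offered as a sketch rather than a definitive fix: it avoids the row-wise ffill by filling the merged column directly. The names merged, result and the _new suffix are chosen here for illustration, and validate="many_to_one" assumes each Expenses + Calendar pair appears at most once in df2.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Expenses": ["xyz", "xyz", "txn vol(new)", "txn vol(new)",
                 "txn vol(tenu)", "txn vol(tenu)"],
    "Calendar": ["2020-01-01", "2020-02-01"] * 3,
    "Actual": [10, 99, 5, 20, 30, 40],
})
df2 = pd.DataFrame({
    "Expenses": ["txn vol(new)", "txn vol(new)", "txn vol(tenu)"],
    "Calendar": ["2020-01-01", "2020-02-01", "2020-01-01"],
    "Actual": [23, 32, 60],
})

merged = df1.merge(
    df2,
    on=["Expenses", "Calendar"],
    how="left",                # keep every row of df1
    suffixes=("", "_new"),     # df2's column arrives as Actual_new
    validate="many_to_one",    # assumption: keys are unique in df2
)
# Prefer df2's value; fall back to df1's where there was no match.
merged["Actual"] = merged["Actual_new"].fillna(merged["Actual"]).astype(int)
result = merged.drop(columns="Actual_new")
print(result)
```

If df2 did repeat a key, the merge would raise an error instead of silently duplicating rows of df1.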
If I understand your solution correctly, it seems to assume that (1) the Calendar-Expenses combinations are unique and (2) that their occurrences in both dataframes are aligned (same order)? I suspect that (2) isn't actually the case?
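The ordering concern can be illustrated with a small sketch. The data below is made up for the demonstration (df2 lists the same keys as df1 but in the opposite order), and broken is just an illustrative name. Because .values strips the index, the mask-based assignment from the question pairs rows purely by position.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Expenses": ["xyz", "txn vol(new)", "txn vol(tenu)"],
    "Calendar": ["2020-01-01", "2020-01-01", "2020-01-01"],
    "Actual": [10, 5, 30],
})
# Same keys as the last two rows of df1, listed in the opposite order.
df2 = pd.DataFrame({
    "Expenses": ["txn vol(tenu)", "txn vol(new)"],
    "Calendar": ["2020-01-01", "2020-01-01"],
    "Actual": [60, 23],
})

keys = ["Calendar", "Expenses"]
mask1 = df1.set_index(keys).index.isin(df2.set_index(keys).index)
mask2 = df2.set_index(keys).index.isin(df1.set_index(keys).index)

broken = df1.copy()
# .values discards df2's index, so the values land by position, not by key.
broken.loc[mask1, ["Actual"]] = df2.loc[mask2, ["Actual"]].values
print(broken)
```

Here txn vol(new) receives 60 and txn vol(tenu) receives 23, which is swapped relative to df2. This would be consistent with wrong values appearing once the two frames are no longer in the same key order.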
Another option (.merge() is fine too) could be:
df1 = df1.set_index(["Expenses", "Calendar"])
df2 = df2.set_index(["Expenses", "Calendar"])
# The assignment aligns on the (Expenses, Calendar) index, not on row position
df1.loc[df2.index, "Actual"] = df2["Actual"]
df2 = df2.reset_index() # If the original df2 is still needed
df1 = df1.reset_index()
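A hedged variation on the index-based approach, in case df2 may contain key pairs that do not exist in df1: restricting the assignment to the shared keys avoids a lookup failure on the missing ones. The extra ("abc", "2020-01-01") row and the names left, right, common and result are assumptions added for illustration, and the sketch assumes the key pairs are unique within each frame.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Expenses": ["xyz", "xyz", "txn vol(new)", "txn vol(new)",
                 "txn vol(tenu)", "txn vol(tenu)"],
    "Calendar": ["2020-01-01", "2020-02-01"] * 3,
    "Actual": [10, 99, 5, 20, 30, 40],
})
# Hypothetical df2 with one extra key ("abc") that df1 does not contain.
df2 = pd.DataFrame({
    "Expenses": ["txn vol(new)", "txn vol(new)", "txn vol(tenu)", "abc"],
    "Calendar": ["2020-01-01", "2020-02-01", "2020-01-01", "2020-01-01"],
    "Actual": [23, 32, 60, 7],
})

keys = ["Expenses", "Calendar"]
left = df1.set_index(keys)
right = df2.set_index(keys)

# Only update the key pairs present in both frames.
common = right.index.intersection(left.index)
left.loc[common, "Actual"] = right.loc[common, "Actual"]

result = left.reset_index()
print(result)
```

Rows of df2 whose keys are absent from df1 are simply ignored here; whether that is the desired behavior depends on the data.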