Compare and remove duplicates from both DataFrames
I have two DataFrames that I need to compare, removing any duplicates they share.
import pandas as pd
Daily = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})
Accumulated = pd.DataFrame({'col1': [4, 2, 5, 6], 'col2': [6, 3, 5, 6]})
Out[4]:
   col1  col2
0     1     2
1     2     3
2     3     4
   col1  col2
0     4     6
1     2     3
2     5     5
3     6     6
What I want is to remove the duplicates (if any) from both DataFrames and get a count of the entries remaining in the Daily DataFrame.
Expected output:
   col1  col2
0     1     2
2     3     4
   col1  col2
0     4     6
2     5     5
3     6     6
Count = 2
How can I do this? Either or both DataFrames can be empty, and Daily can have more entries than Monthly, or vice versa.
Why not combine the two with concat and then drop the duplicates entirely?
s = (pd.concat([Daily.assign(source="Daily"),
                Accumulated.assign(source="Accumulated")])
       .drop_duplicates(["col1", "col2"], keep=False))
print(s[s["source"].eq("Daily")])
   col1  col2 source
0     1     2  Daily
2     3     4  Daily
print(s[s["source"].eq("Accumulated")])
   col1  col2       source
0     4     6  Accumulated
2     5     5  Accumulated
3     6     6  Accumulated
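The concat approach above can be wrapped in a small function that also returns the Daily count the question asks for. This is only a sketch: the name drop_shared_rows and its cols parameter are made up for illustration, and it assumes neither input already has a column called source. Note that keep=False also removes rows that are repeated within a single frame, which may or may not be what you want.

```python
import pandas as pd


def drop_shared_rows(daily, accumulated, cols):
    """Hypothetical helper: drop rows whose `cols` values occur more than once."""
    tagged = pd.concat(
        [daily.assign(source="Daily"), accumulated.assign(source="Accumulated")]
    )
    # keep=False removes every copy of a duplicated row, including rows that
    # are repeated inside a single frame.
    unique = tagged.drop_duplicates(cols, keep=False)
    daily_left = unique.loc[unique["source"] == "Daily", cols]
    accumulated_left = unique.loc[unique["source"] == "Accumulated", cols]
    return daily_left, accumulated_left, len(daily_left)


Daily = pd.DataFrame({"col1": [1, 2, 3], "col2": [2, 3, 4]})
Accumulated = pd.DataFrame({"col1": [4, 2, 5, 6], "col2": [6, 3, 5, 6]})
daily_left, accumulated_left, count = drop_shared_rows(
    Daily, Accumulated, ["col1", "col2"]
)
print(count)  # 2
```

The original row labels are preserved, so daily_left keeps index 0 and 2 as in the expected output.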
You can try the code below:
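# For the 1st DataFrame: collect the labels of rows that also appear in df2,
# then drop them in one go (dropping inside the loop breaks positional indexing)
to_drop = []
for i in range(len(df1)):
    for j in range(len(df2)):
        if df1.iloc[i].to_list() == df2.iloc[j].to_list():
            to_drop.append(df1.index[i])
            break
df1_unique = df1.drop(index=to_drop)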
You can do the same for the second DataFrame, comparing it against the original, unmodified first one.
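The nested loops compare every pair of rows, which gets slow on larger tables. As a possible alternative (not part of the answer above), the same filtering can be sketched as an anti-join using merge with indicator=True. The helper name rows_not_in is hypothetical, and the sketch assumes left has no column named _merge.

```python
import pandas as pd


def rows_not_in(left, right, cols):
    """Hypothetical helper: rows of `left` whose `cols` values never occur in `right`."""
    # De-duplicate the keys so each left row matches at most once and the
    # merged result stays aligned, row for row, with `left`.
    keys = right[cols].drop_duplicates()
    merged = left.merge(keys, on=cols, how="left", indicator=True)
    mask = (merged["_merge"] == "left_only").to_numpy()
    return left.loc[mask]


df1 = pd.DataFrame({"col1": [1, 2, 3], "col2": [2, 3, 4]})
df2 = pd.DataFrame({"col1": [4, 2, 5, 6], "col2": [6, 3, 5, 6]})
cols = ["col1", "col2"]

df1_unique = rows_not_in(df1, df2, cols)  # compare against the original df2
df2_unique = rows_not_in(df2, df1, cols)  # compare against the original df1
print(len(df1_unique))  # 2
```

Both calls use the original inputs, so a row shared by the two tables is removed from each of them. An empty frame on either side simply yields an empty result or an unchanged one.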
I would do it like this:
import pandas as pd
daily = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
accumulated = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})
daily['isdaily'] = True
accumulated['isdaily'] = False
together = pd.concat([daily, accumulated])
without_dupes = together.drop_duplicates(['col1','col2'],keep=False)
daily_count = sum(without_dupes['isdaily'])
I added an isdaily column of True and False values to the DataFrames so that the remaining Daily rows can easily be counted with sum at the end.
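If you would rather not add a column to the input DataFrames, one possible variation is to let pd.concat record the origin through its keys argument and count on the outer index level. This is a sketch with the same sample data; the labels "daily" and "accumulated" are arbitrary choices for illustration.

```python
import pandas as pd

daily = pd.DataFrame({"col1": [1, 2, 3], "col2": [2, 3, 4]})
accumulated = pd.DataFrame({"col1": [4, 2, 5], "col2": [6, 3, 5]})

# keys= labels each row's origin in the outer index level, so the inputs
# are left unmodified.
together = pd.concat([daily, accumulated], keys=["daily", "accumulated"])
without_dupes = together.drop_duplicates(["col1", "col2"], keep=False)

# Count the surviving rows whose outer index label is "daily".
origin = without_dupes.index.get_level_values(0)
daily_count = int((origin == "daily").sum())
print(daily_count)  # 2
```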
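If I understand correctly, you need to keep the two tables separate.
You can concatenate them, keeping track of which table each row came from, and then recreate them: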
Daily = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
Daily["Table"] = "Daily"
Accumulated = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})
Accumulated["Table"] = "Accum"
df = pd.concat([Daily, Accumulated]).reset_index()
# keep=False drops every copy of a repeated row, so it disappears from both tables
not_dup = df[["col1", "col2"]].drop_duplicates(keep=False)
not_dup = df.loc[not_dup.index, :]
Daily = not_dup[not_dup["Table"] == "Daily"][["col1","col2"]]
Accumulated = not_dup[not_dup["Table"] == "Accum"][["col1","col2"]]
print(Daily)
print(Accumulated)
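One detail worth checking in this kind of code is the keep argument of drop_duplicates. By default (keep="first") the first copy of a repeated row survives, so a shared row would stay in whichever table comes first in the concat. Removing it from both tables needs keep=False. A small sketch on the stacked sample data shows the difference:

```python
import pandas as pd

# The two sample tables stacked: rows 0-2 are Daily, rows 3-5 are Accumulated.
df = pd.DataFrame({"col1": [1, 2, 3, 4, 2, 5], "col2": [2, 3, 4, 6, 3, 5]})

kept_first = df.drop_duplicates()             # keeps row 1, drops only row 4
dropped_all = df.drop_duplicates(keep=False)  # drops both row 1 and row 4

print(kept_first.index.tolist())   # [0, 1, 2, 3, 5]
print(dropped_all.index.tolist())  # [0, 2, 3, 5]
```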
Follow these steps:
Daily = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
Accumulated = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})
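df = pd.concat([Daily, Accumulated])  # step 1: stack both tables
df = df.drop_duplicates(keep=False)  # step 2: drop every row that occurs more than once
Daily = pd.merge(df, Daily, how='inner', on=['col1', 'col2'])  # step 3: keep the surviving Daily rows
Accumulated = pd.merge(df, Accumulated, how='inner', on=['col1', 'col2'])  # step 3: same for Accumulated
count = len(Daily)  # step 4: number of Daily rows left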
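Notice: Technical posts on this site follow the CC BY-SA 4.0 license. If you repost, please credit this site's URL or the original source. For any questions, contact: yoyou2525@163.com.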