
Compare and remove duplicates from both dataframes

I have 2 dataframes that need to be compared, with any duplicates removed.

import pandas as pd
Daily = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
Accumulated = pd.DataFrame({'col1':[4,2,5,6], 'col2':[6,3,5,6]})

Out[4]:
   col1  col2
0     1     2
1     2     3
2     3     4
   col1  col2
0     4     6
1     2     3
2     5     5
3     6     6

What I am trying to achieve is to remove any rows that appear in both dataframes from each of them, and then get the count of the entries remaining in the Daily dataframe.

Expected output:

   col1  col2
0     1     2
2     3     4
   col1  col2
0     4     6
2     5     5
3     6     6

Count = 2

How can I do it? Either or both dataframes can be empty, and Daily can have more entries than Monthly, and vice versa.

Why not just concat both into one dataframe and drop the duplicates completely?

# Tag each row with its source, then drop every row that occurs more than once
s = (pd.concat([Daily.assign(source="Daily"),
               Accumulated.assign(source="Accumulated")])
       .drop_duplicates(["col1","col2"], keep=False))

print(s[s["source"].eq("Daily")])

   col1  col2 source
0     1     2  Daily
2     3     4  Daily

print(s[s["source"].eq("Accumulated")])

   col1  col2       source
0     4     6  Accumulated
2     5     5  Accumulated
3     6     6  Accumulated
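The question also asks for a count and mentions that either dataframe may be empty. Below is a minimal sketch of a variant of the same concat idea that flags repeated rows with duplicated(keep=False) instead of adding a source column, so the original indexes are kept. The helper name remove_common_rows and its cols parameter are made up for illustration, and it assumes both dataframes always carry the compared columns, even when they have no rows.

```python
import pandas as pd


def remove_common_rows(daily, accumulated, cols=("col1", "col2")):
    """Drop every row whose `cols` values occur more than once across both frames."""
    cols = list(cols)
    combined = pd.concat([daily[cols], accumulated[cols]], ignore_index=True)
    # keep=False flags every copy of a repeated row, not only the later ones
    keep = ~combined.duplicated(keep=False).to_numpy()
    n = len(daily)
    # The first n flags belong to daily, the rest to accumulated
    return daily.loc[keep[:n]], accumulated.loc[keep[n:]]


Daily = pd.DataFrame({"col1": [1, 2, 3], "col2": [2, 3, 4]})
Accumulated = pd.DataFrame({"col1": [4, 2, 5, 6], "col2": [6, 3, 5, 6]})

daily_left, accumulated_left = remove_common_rows(Daily, Accumulated)
count = len(daily_left)
print(daily_left)
print(accumulated_left)
print("Count =", count)

# A frame with the same columns but no rows goes through the same code path
empty = Daily.iloc[0:0]
daily_only, empty_left = remove_common_rows(Daily, empty)
```

Note that, like drop_duplicates(keep=False) on the concatenated data, this also removes a row that is repeated inside a single dataframe, even if the other dataframe does not contain it.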

You can try the code below:

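# For the first dataframe: collect the labels of rows that also occur in df2
to_drop = []
for i in range(len(df1)):
    for j in range(len(df2)):
        if df1.iloc[i].to_list() == df2.iloc[j].to_list():
            to_drop.append(df1.index[i])
            break
# Drop after the loops so positions stay valid; df1 itself is left untouched
df1_clean = df1.drop(index=to_drop)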

You can do the same for the second dataframe; compare it against the original df1, before any rows were dropped.
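The nested loop compares every pair of rows, which gets slow as the dataframes grow. One possible vectorized alternative, not part of the answer above, is an anti-join built from merge with indicator=True. This is only a sketch: the helper name rows_not_in is hypothetical, and it assumes both dataframes have exactly the same columns so that merge joins on all of them.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 2, 3], "col2": [2, 3, 4]})
df2 = pd.DataFrame({"col1": [4, 2, 5, 6], "col2": [6, 3, 5, 6]})


def rows_not_in(left, right):
    """Return the rows of `left` that have no identical row in `right`."""
    # De-duplicating `right` stops the left join from multiplying rows of `left`
    marked = left.merge(right.drop_duplicates(), how="left", indicator=True)
    # "_merge" is the column added by indicator=True
    return marked.loc[marked["_merge"] == "left_only"].drop(columns="_merge")


# Both calls use the original frames, so neither result affects the other
df1_clean = rows_not_in(df1, df2)
df2_clean = rows_not_in(df2, df1)
print(df1_clean)
print(df2_clean)
```

Unlike the loop, merge builds a fresh index for its result, so the original index labels are not carried over.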

I would do it the following way:

import pandas as pd
daily = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
accumulated = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})
daily['isdaily'] = True
accumulated['isdaily'] = False
together = pd.concat([daily, accumulated])
without_dupes = together.drop_duplicates(['col1','col2'],keep=False)
daily_count = sum(without_dupes['isdaily'])

I added an isdaily column to the dataframes, set to True for daily and False for accumulated, so that the remaining daily rows can be counted with a simple sum at the end.

If I understood correctly, you need to keep the two tables separate.

You can concatenate them, recording which table each row comes from, and then recreate them:

Daily = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
Daily["Table"] = "Daily"
Accumulated = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})
Accumulated["Table"] = "Accum"

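df = pd.concat([Daily, Accumulated]).reset_index(drop=True)

# keep=False drops every copy of a repeated row; the default would keep the first copy
not_dup = df[["col1", "col2"]].drop_duplicates(keep=False)
not_dup = df.loc[not_dup.index,:]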

Daily = not_dup[not_dup["Table"] == "Daily"][["col1","col2"]]
Accumulated = not_dup[not_dup["Table"] == "Accum"][["col1","col2"]]

print(Daily)
print(Accumulated)

Follow these steps:

  1. Concatenate the 2 dataframes
  2. Drop every duplicated row
  3. For each dataframe, find its intersection with the concatenated dataframe
  4. Get the count with len

Daily = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
Accumulated = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})

df = pd.concat([Daily, Accumulated]) # step 1
df = df.drop_duplicates(keep=False) # step 2

Daily = pd.merge(df, Daily, how='inner', on=['col1','col2']) #step 3
Accumulated = pd.merge(df, Accumulated, how='inner', on=['col1','col2']) #step 3

count = len(Daily) #step 4
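One caveat with drop_duplicates(keep=False) on the concatenated data: a row repeated inside a single dataframe is removed as well, even when the other dataframe does not contain it. If that is not wanted, a possible alternative is to test membership against the other dataframe only. The sketch below does this with MultiIndex.from_frame and isin; the sample data is made up to show the difference and is not from the question.

```python
import pandas as pd

# Made-up data: (1, 2) is repeated inside Daily but never appears in Accumulated
Daily = pd.DataFrame({"col1": [1, 1, 2], "col2": [2, 2, 3]})
Accumulated = pd.DataFrame({"col1": [4, 2], "col2": [6, 3]})

cols = ["col1", "col2"]
daily_keys = pd.MultiIndex.from_frame(Daily[cols])
accum_keys = pd.MultiIndex.from_frame(Accumulated[cols])

# Only rows that occur in the *other* dataframe are dropped
daily_left = Daily[~daily_keys.isin(accum_keys)]
accum_left = Accumulated[~accum_keys.isin(daily_keys)]

# For comparison: the concat approach also loses both (1, 2) rows
concat_left = pd.concat([Daily, Accumulated]).drop_duplicates(keep=False)

print(daily_left)
print(accum_left)
print(concat_left)
```

Which behaviour is right depends on whether repeats within one dataframe should count as duplicates; the question does not say.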
