Compare two dataframes and remove duplicates from both

Question

I have two dataframes that need to be compared, with any duplicates removed:

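from pandas import DataFrame

Daily = DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
# Fourth row (6, 6) added so the definition matches the printed output below
Accumulated = DataFrame({'col1':[4,2,5,6], 'col2':[6,3,5,6]})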

Out[4]:
   col1  col2
0     1     2
1     2     3
2     3     4
   col1  col2
0     4     6
1     2     3
2     5     5
3     6     6

What I am trying to achieve is to remove the duplicates, if there are any, from both DFs and get the count of the remaining entries in the Daily DF.

Expected output:

   col1  col2
0     1     2
2     3     4
   col1  col2
0     4     6
2     5     5
3     6     6

Count = 2

How can I do it? Either or both DFs can be empty, and Daily can have more entries than Monthly, and vice versa.

Answer 1

Why not just concat both into one df and drop the duplicates completely?

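import pandas as pd

# Tag each row with its source, stack both frames, then drop every row
# whose (col1, col2) pair occurs more than once (keep=False keeps no copy)
s = (pd.concat([Daily.assign(source="Daily"),
               Accumulated.assign(source="Accumulated")])
       .drop_duplicates(["col1","col2"], keep=False))

print(s[s["source"].eq("Daily")])

   col1  col2 source
0     1     2  Daily
2     3     4  Daily

print(s[s["source"].eq("Accumulated")])

   col1  col2       source
0     4     6  Accumulated
2     5     5  Accumulated
3     6     6  Accumulated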
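
The question also asks for the count of remaining Daily entries. A minimal sketch of one way to read it off the combined frame, assuming the four-row Accumulated frame shown in the question's output (the variable name `count` is only illustrative):

```python
import pandas as pd

Daily = pd.DataFrame({"col1": [1, 2, 3], "col2": [2, 3, 4]})
Accumulated = pd.DataFrame({"col1": [4, 2, 5, 6], "col2": [6, 3, 5, 6]})

s = (pd.concat([Daily.assign(source="Daily"),
                Accumulated.assign(source="Accumulated")])
       .drop_duplicates(["col1", "col2"], keep=False))

# Number of Daily rows that survive the de-duplication
count = int(s["source"].eq("Daily").sum())
print(count)  # 2
```

Note that `keep=False` also removes rows that are duplicated within a single frame, not only rows shared between the two.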

Answer 2

You can try the code below:

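# For the 1st dataframe: collect the labels of rows that also appear in df2.
# Dropping inside the loop would shrink df1 while it is still being iterated.
to_drop = []
for i in range(len(df1)):
    for j in range(len(df2)):
        if df1.iloc[i].to_list() == df2.iloc[j].to_list():
            to_drop.append(df1.index[i])
            break
df1 = df1.drop(index=to_drop)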

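You can do the same for the second dataframe. Compare it against the original first dataframe, from before its duplicate rows were dropped, otherwise the matches will no longer be found.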
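
For larger frames the nested loop gets slow. A possible vectorized alternative (a different technique from the loop above, not this answer's method) is a left merge with `indicator=True`, which marks each row as present in one side or both. The helper name `rows_not_in` is hypothetical:

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 2, 3], "col2": [2, 3, 4]})
df2 = pd.DataFrame({"col1": [4, 2, 5], "col2": [6, 3, 5]})

def rows_not_in(left, right):
    """Return the rows of `left` that have no identical row in `right`."""
    # indicator=True adds a `_merge` column: "left_only", "right_only" or "both".
    # With no `on=` argument, merge joins on all columns the two frames share.
    merged = left.merge(right.drop_duplicates(), how="left", indicator=True)
    return merged.loc[merged["_merge"] == "left_only", left.columns]

# Both calls use the original frames, so neither result depends on the other
df1_only = rows_not_in(df1, df2)
df2_only = rows_not_in(df2, df1)

print(df1_only)
print(df2_only)
```

`merge` returns a freshly numbered index, so the original row labels are not guaranteed to be preserved.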

Answer 3

I would do it the following way:

import pandas as pd
daily = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
accumulated = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})
daily['isdaily'] = True
accumulated['isdaily'] = False
together = pd.concat([daily, accumulated])
without_dupes = together.drop_duplicates(['col1','col2'],keep=False)
daily_count = sum(without_dupes['isdaily'])

I added an isdaily column to the dataframes, filled with True and False respectively, so that it can simply be summed at the end.
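
Since the question mentions that either dataframe may be empty, here is a hedged sketch that wraps the same idea in a function and handles the empty cases explicitly. The function name `remaining_daily_count` and the `cols` parameter are assumptions made for illustration:

```python
import pandas as pd

def remaining_daily_count(daily, accumulated, cols=("col1", "col2")):
    """Count rows of `daily` left after dropping every row duplicated on `cols`."""
    cols = list(cols)
    if daily.empty:
        return 0  # nothing in daily, so nothing can remain
    if accumulated.empty:
        # Only duplicates inside daily itself can be removed
        return len(daily.drop_duplicates(cols, keep=False))
    # assign() returns copies, so the caller's frames are left unchanged
    together = pd.concat([daily.assign(isdaily=True),
                          accumulated.assign(isdaily=False)])
    without_dupes = together.drop_duplicates(cols, keep=False)
    return int(without_dupes["isdaily"].sum())

daily = pd.DataFrame({"col1": [1, 2, 3], "col2": [2, 3, 4]})
accumulated = pd.DataFrame({"col1": [4, 2, 5], "col2": [6, 3, 5]})
empty = pd.DataFrame(columns=["col1", "col2"])

print(remaining_daily_count(daily, accumulated))  # 2
print(remaining_daily_count(daily, empty))        # 3
print(remaining_daily_count(empty, accumulated))  # 0
```

Returning early for empty inputs avoids concatenating empty frames, whose dtype handling varies between pandas versions.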

Answer 4

If I understood correctly, you need to keep both tables separate.

You can concatenate them while recording which table each row came from, and then recreate the two tables:

import pandas as pd

Daily = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
Daily["Table"] = "Daily"
Accumulated = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})
Accumulated["Table"] = "Accum"

df = pd.concat([Daily, Accumulated]).reset_index()

# keep=False removes every copy of a duplicated row, so it disappears from both tables
not_dup = df[["col1", "col2"]].drop_duplicates(keep=False)
not_dup = df.loc[not_dup.index,:]

Daily = not_dup[not_dup["Table"] == "Daily"][["col1","col2"]]
Accumulated = not_dup[not_dup["Table"] == "Accum"][["col1","col2"]]

print(Daily)
print(Accumulated)

Answer 5

Follow these steps:

1. Concatenate the two dataframes.
2. Drop all duplicated rows.
3. For each dataframe, find its intersection with the concatenated dataframe.
4. Get the count with len.

import pandas as pd

Daily = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
Accumulated = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})

df = pd.concat([Daily, Accumulated]) # step 1
df = df.drop_duplicates(keep=False) # step 2

Daily = pd.merge(df, Daily, how='inner', on=['col1','col2']) #step 3
Accumulated = pd.merge(df, Accumulated, how='inner', on=['col1','col2']) #step 3

count = len(Daily) #step 4
