
Pandas: loop over 2 dataframes and drop duplicates

I have 2 csv files with some random numbers, as follows:

csv1.csv

0       906018
1       007559
2       910475
3       915104
4       600393
         ...  
5070    907525
5071    903079
5072    001910
5073    909735
5074    914861

length 5075

csv2.csv

0         5555
1         7859
2       501303
3       912414
4       913257
         ...  
7497    915031
7498    915030
7499    915033
7500    902060
7501    915038

length 7502

Some elements in csv1 are also present in csv2, but I don't know exactly which ones, and I would like to extract the unique values. So my idea was to merge the 2 dataframes together and then remove the duplicates.

So I wrote the following code:

import pandas as pd

unique_users = pd.read_csv('./csv1.csv')
unique_users['id']

identity = pd.read_csv('./csv2.csv')
identityNumber = identity['IDNumber']
identityNumber


df = pd.concat([identityNumber, unique_users])

Up to here everything is fine and the length of df is the sum of the 2 lengths, but then I reached the part where I got stuck.

concat did its job and concatenated based on the index, so now I have tons of NaN.

and when I use the code:

final_result = df.drop_duplicates(keep=False)

The dataframe does not drop any values because the df structure now looks like this:

IDNumber    id
5555        NaN

so I guess drop_duplicates is looking for exact duplicate rows, but as none exist it just keeps everything.
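This behaviour is easy to reproduce: a minimal sketch (with made-up numbers) showing that concatenating two frames with different column names produces rows that never match each other, so drop_duplicates(keep=False) removes nothing:

```python
import pandas as pd

# Two single-column frames with different column names that share
# the value 7559.
a = pd.DataFrame({'id': [906018, 7559]})
b = pd.DataFrame({'IDNumber': [7559, 5555]})

# concat aligns on column names, so every row gets a NaN in the
# column it does not have.
df = pd.concat([a, b])

# drop_duplicates compares whole rows: (7559, NaN) and (NaN, 7559)
# are different rows, so nothing is dropped and the length stays 4.
print(len(df.drop_duplicates(keep=False)))  # 4
```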

So what I would like to do is loop over both dataframes, and if a value in csv1 exists in csv2, I want it to be dropped.

Can anyone help with this please?

And please if you need more info just let me know.

UPDATE:

I think I found the reason why it is not working, but I am not sure how to solve it.

my csv1 looks like this:

id
906018,
007559,
910475,
915104,
600393,
007992,
502313,
004609,
910017,
007954,
006678,

In a Jupyter notebook, when I open the csv, it looks this way:


        id
906018  NaN
007559  NaN
910475  NaN
915104  NaN
600393  NaN
... ...
907525  NaN
903079  NaN
001910  NaN
909735  NaN
914861  NaN

and I do not understand why it is seeing the ids as NaN.

In fact, I tried to add a new column to csv2, passing the ids from csv1 as values, and I can confirm that they are all NaN. So I believe this is the source of the problem, which then reflects on everything else. Can anyone help me understand how to solve this issue?
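The trailing commas visible in csv1 are the likely culprit: each data row then has one more field than the header, and in that situation pandas uses the first field as the index, leaving the named column empty. A minimal reproduction, with index_col=False as one possible fix (dtype=str is added here on the assumption that the leading zeros in ids like 007559 should be preserved):

```python
import io
import pandas as pd

# A CSV whose data rows end with a trailing comma: each data row has
# two fields ("906018", "") while the header row has only one ("id").
raw = "id\n906018,\n007559,\n910475,\n"

# With default settings pandas sees the extra field and promotes the
# first column to the index, so the 'id' column is entirely NaN.
df_bad = pd.read_csv(io.StringIO(raw))
print(df_bad['id'].isna().all())  # True

# index_col=False tells pandas not to use the first column as the
# index (the documented workaround for trailing delimiters);
# dtype=str keeps leading zeros such as '007559'.
df_good = pd.read_csv(io.StringIO(raw), index_col=False, dtype=str)
print(df_good['id'].tolist())  # ['906018', '007559', '910475']
```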

You can achieve this using df.merge():

import pandas as pd

# Data samples
data_1 = {'col_a': [906018,7559,910475,915104,600393,907525,903079,1910,909735,914861]}
data_2 = {'col_b': [5555,7859,914861,912414,913257,915031,1910,915104,7559,915038]}

df1 = pd.DataFrame(data_1)
df2 = pd.DataFrame(data_2)

# First approach: inner merge to find the shared values, then isin() to filter them out
unique_vals = df1.merge(df2, right_on='col_b', left_on='col_a')['col_a']
new_df1 = df1[~df1.col_a.isin(unique_vals)]

# Second approach: left merge, then keep only the rows with no match in df2
new_df1 = df1[df1.merge(df2, right_on='col_b', left_on='col_a', how='left')['col_b'].isna()]

print(new_df1)
#    col_a
# 0  906018
# 2  910475
# 4  600393
# 5  907525
# 6  903079
# 8  909735

This will remove the duplicates between your two dataframes and keep all the distinct records in one dataframe df:

df = pd.concat([df1, df2]).drop_duplicates().reset_index(drop=True)

You are getting NaN because when you concatenate, pandas doesn't know what to do with the different column names of your two dataframes. One of them has an IDNumber column and the other has an id column. Pandas can't tell they hold the same kind of values, so it puts both columns into the resulting dataframe and fills the gaps with NaN.

Try this:

pd.concat([identity["IDNumber"], unique_users["id"]]).drop_duplicates().reset_index(drop=True)
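One caveat on the keep parameter: the default keeps one copy of each shared value, while keep=False drops both copies, leaving only the values unique to one file, which is what the question asks for. A sketch with made-up numbers:

```python
import pandas as pd

# 7559 appears in both files; the other values appear only once.
csv1_ids = pd.Series([906018, 7559, 910475], name='id')
csv2_ids = pd.Series([5555, 7559], name='IDNumber')

combined = pd.concat([csv1_ids, csv2_ids])

# keep=False removes every value that occurs more than once.
print(combined.drop_duplicates(keep=False).tolist())
# [906018, 910475, 5555]
```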
