
Compare two Pandas DataFrames to detect newly added rows based on a column

I am writing a parser that watches a pseudo-table in a web application for changes and pushes a notification whenever rows are added.

How the pseudo-table works: the table on the website is updated from time to time and gains new rows. The page is highly dynamic and sometimes changes existing rows as well. The pseudo-table assigns IDs automatically according to the sort order. To be precise, the sort is alphabetical, so a person named Adam gets ID 1, Bob gets 2, and Coul gets 3. But if a person named Caul is added, Caul gets ID 3 and Coul becomes 4. This breaks every method I have tried so far.

I am now trying to compare two Pandas DataFrames to detect added rows and return only the newly added ones. I do not want to return existing rows that were changed. I tried using concat and dropping duplicates, but that returns duplicate rows wherever there was any minor change in the data.
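To make the problem concrete, here is a minimal sketch of the failure described above. It assumes the attempt was a plain concat followed by drop_duplicates(keep=False) over all columns; the question does not show the exact code, and the frame names old and new are hypothetical.

```python
import pandas as pd

# Hypothetical snapshots: "Coul" keeps his name, but his ID shifts from 3 to 4
# because "Caul" was inserted before him in the alphabetical ordering.
old = pd.DataFrame({'#': [1, 2, 3], 'Name': ['Adam', 'Bob', 'Coul']})
new = pd.DataFrame({'#': [1, 2, 3, 4], 'Name': ['Adam', 'Bob', 'Caul', 'Coul']})

# Comparing on every column treats the renumbered "Coul" rows as two
# different rows, so both survive alongside the genuinely new "Caul".
naive = pd.concat([old, new]).drop_duplicates(keep=False)
print(naive)
```

Only Caul is new, yet Coul shows up twice in the result, once with each ID.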

TL;DR EXAMPLE

Input

import pandas as pd
d1 = {'#': [1, 2, 3], 'Name': ['James Bourne', 'Steve Johns', 'Steve Jobs']}
d2 = {'#': [1, 2, 3, 4], 'Name': ['James Bourne', 'Steve Jobs', 'Great Guy', 'Steve Johns']}
df_1 = pd.DataFrame(data=d1)
df_2 = pd.DataFrame(data=d2)
# ... code

Output should be

3     Great Guy

Merge the DataFrames with how='outer', then compare the Name column of the merged DataFrame against the original names in df_1:

>>> merged = pd.merge(df_1,df_2,on='Name', how = 'outer')
>>> [x for x in enumerate(merged.Name) if x[1] not in list(df_1.Name)]

Results in: [(3, 'Great Guy')]
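A variation on the same outer merge, offered as a sketch and not as part of the original answer: pandas can label where each row came from through the indicator parameter of merge, which avoids the membership test against a Python list. The variable name added is only illustrative.

```python
import pandas as pd

df_1 = pd.DataFrame({'#': [1, 2, 3],
                     'Name': ['James Bourne', 'Steve Johns', 'Steve Jobs']})
df_2 = pd.DataFrame({'#': [1, 2, 3, 4],
                     'Name': ['James Bourne', 'Steve Jobs', 'Great Guy', 'Steve Johns']})

# indicator=True adds a "_merge" column with the values
# "left_only", "right_only", or "both" for each merged row.
merged = pd.merge(df_1, df_2, on='Name', how='outer', indicator=True)

# Names that appear only in the newer frame are the added rows.
added = merged.loc[merged['_merge'] == 'right_only', 'Name'].tolist()
print(added)
```

This reports the name itself; the row position inside the merged frame depends on how the merge orders its rows, so it is safer not to rely on it.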

I discovered the subset parameter of drop_duplicates, which restricts the duplicate check to the Name column:

import pandas as pd
d1 = {'#': [1, 2, 3], 'Name': ['James Bourne', 'Steve Johns', 'Steve Jobs']}
d2 = {'#': [1, 2, 3, 4], 'Name': ['James Bourne', 'Steve Jobs', 'Great Guy', 'Steve Johns']}
df_1 = pd.DataFrame(data=d1)
df_2 = pd.DataFrame(data=d2)
df_1 = df_1.set_index('#')
df_2 = df_2.set_index('#')
df = pd.concat([df_1,df_2]).drop_duplicates(subset=['Name'], keep=False)
df

results in

    Name
#   
3   Great Guy

This solves my question.
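One caveat worth checking before relying on this: drop_duplicates with keep=False is symmetric, so a name that exists only in the older frame is returned too. The sketch below uses hypothetical frames old and new in which one row was removed and one was added.

```python
import pandas as pd

old = pd.DataFrame({'Name': ['James Bourne', 'Steve Johns', 'Steve Jobs']})
# "Steve Johns" has disappeared from the newer snapshot; "Great Guy" is new.
new = pd.DataFrame({'Name': ['James Bourne', 'Steve Jobs', 'Great Guy']})

# keep=False drops every name that occurs more than once in the
# concatenation, leaving names unique to EITHER frame.
diff = pd.concat([old, new]).drop_duplicates(subset=['Name'], keep=False)
print(diff['Name'].tolist())
```

If rows can also disappear from the table, the result mixes removed and added names, so it would need a further filter against the newer frame.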

You could try a simpler solution:

df_2[~df_2.Name.isin(df_1.Name)].dropna()

Output:

   #       Name
2  3  Great Guy
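The isin filter can be wrapped in a small helper so the key column is configurable. This is a sketch only: the function name find_new_rows and its key parameter are made up for illustration, and it assumes the key column identifies a row uniquely.

```python
import pandas as pd

def find_new_rows(old, new, key='Name'):
    """Return the rows of `new` whose `key` value does not occur in `old`."""
    return new[~new[key].isin(old[key])]

df_1 = pd.DataFrame({'#': [1, 2, 3],
                     'Name': ['James Bourne', 'Steve Johns', 'Steve Jobs']})
df_2 = pd.DataFrame({'#': [1, 2, 3, 4],
                     'Name': ['James Bourne', 'Steve Jobs', 'Great Guy', 'Steve Johns']})

added = find_new_rows(df_1, df_2)
print(added)
```

Because only the newer frame is filtered, rows that were renumbered or removed never appear in the result.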

