Compare each row in one dataframe to each row in another dataframe in Python

I have two different dataframes with the same features.

df1


   AGE   Country     Income
   -----------------------
   33    UK          3500
   24    Australia   1500


df2

   AGE   Country     Income
   -----------------------
   33    Brazil      1300
   54    Australia   2230

I would like to compare each row in df1 to each row in df2 and compute the number of feature values that differ.

In my example there are 2 dataframes with 2 rows each, so there are 4 pairwise comparisons.

For each comparison, I need to return the number of features that differ. For example, comparing the first row in df1 to the first row in df2 gives 2 differences (Country and Income).

Any idea how to implement that?

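If I understand correctly, one approach is to use np.where() to flag, for each feature individually, the rows where the two dataframes differ, and then sum these arrays. Note that this compares the rows position by position (row i of df_1 with row i of df_2), and that the column names must match the ones in your dataframes: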

arr = np.where(df_1['Age'] != df_2['Age'], 1, 0) + np.where(df_1['Country'] != df_2['Country'], 1, 0) + np.where(df_1['Income'] != df_2['Income'], 1, 0)

This returns an array with the number of feature differences for each row. In this case, the output would be:

[2,2]
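
The row-by-row count above can be reproduced on the data from the question. This is only a minimal sketch: it assumes the first column is named AGE as in the tables, keeps the df_1/df_2 names used in this answer, and swaps the chained np.where() calls for DataFrame.ne() followed by sum(axis=1), which is an alternative rather than the method described above.

```python
import pandas as pd

# Data copied from the question; the column name "AGE" is taken from its tables.
df_1 = pd.DataFrame({"AGE": [33, 24],
                     "Country": ["UK", "Australia"],
                     "Income": [3500, 1500]})
df_2 = pd.DataFrame({"AGE": [33, 54],
                     "Country": ["Brazil", "Australia"],
                     "Income": [1300, 2230]})

# Element-wise "not equal", then count the True values in each row.
# Both frames need the same index and the same column labels.
row_diffs = df_1.ne(df_2).sum(axis=1).to_numpy()
print(row_diffs)  # [2 2]
```

Like the np.where() version, this only compares rows that sit at the same position in both dataframes.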

If there are many columns, as in the example below, you can loop over them:

import numpy as np
import pandas as pd

df_1 = pd.DataFrame({'Age':[1,2,3,4],'Country':['Brazil','UK','Australia','China'],'Var_x':[7,5,7,7],'Var_y':[3,6,3,2],'Var_z':[20,32,31,34]})
df_2 = pd.DataFrame({'Age':[1,2,4,5],'Country':['Egypt','UK','India','China'],'Var_x':[7,4,3,7],'Var_y':[3,6,2,2],'Var_z':[20,32,4,32]})
# Running total of differing features, one entry per row
differences = np.zeros(len(df_1))
for col in df_1:  # iterating over a DataFrame yields its column names
    differences += np.where(df_1[col] != df_2[col], 1, 0)
print(differences)

Output:

[1. 1. 5. 2.]
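
The question asks for every row of df1 against every row of df2 (4 comparisons for 2 rows each), which the position-by-position code above does not cover. One possible way to get all pairs is NumPy broadcasting. The sketch below is an assumption-laden illustration, not part of the original answer: it uses the data and the AGE column name from the question, and the name pair_diffs is made up for this example.

```python
import pandas as pd

# Data copied from the question.
df1 = pd.DataFrame({"AGE": [33, 24],
                    "Country": ["UK", "Australia"],
                    "Income": [3500, 1500]})
df2 = pd.DataFrame({"AGE": [33, 54],
                    "Country": ["Brazil", "Australia"],
                    "Income": [1300, 2230]})

# Put df2's columns in df1's order so features line up by position.
a = df1.to_numpy()[:, None, :]               # shape (n1, 1, k)
b = df2[df1.columns].to_numpy()[None, :, :]  # shape (1, n2, k)

# Broadcasting compares every row of df1 with every row of df2;
# summing over the last axis counts the differing features per pair.
pair_diffs = (a != b).sum(axis=2)

# Rows are indexed by df1's rows, columns by df2's rows.
print(pd.DataFrame(pair_diffs, index=df1.index, columns=df2.index))
```

Entry [i, j] is the number of features that differ between row i of df1 and row j of df2; for the question's data the matrix is [[2, 3], [3, 2]]. The intermediate array holds n1 × n2 × k values, so for large dataframes it may be necessary to process the rows in chunks.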
