Calculate the matching percentage of 2 dataframes in Python

When joining df2 onto df1 using First_Name, Last_Name and Email, how do I calculate the percentage of df2 that can be matched to df1?

df1:

    First_Name  Last_Name   Email                     Value1
0   Aaron       Potter      aaronpotter@gmail.com     10
1   Bella       Granger     bellagranger@gmail.com    2
2   Ron         Black       black@hotmail.com         20
3   Harry       Weasley     harryweasley@hotmail.com  11

df2:

    First_Name  Last_Name   Email                     Value2
0   Aaron       Potter      aaronpotter@gmail.com     10
1   Ronald      Black       ronaldblack@hotmail.com   5
2   Bella       Granger     bellagranger@gmail.com    2
3   Harry       Weasley     tomriddle@hotmail.com     20

For example, in this case the matching percentage is 2 out of 4 (50%).

@anky has a good solution. I will add the indicator parameter to merge so that the matches can be seen directly.

df_out = df1.merge(df2, on = ['First_Name', 'Last_Name', 'Email'],
          indicator='Matched', how='outer')
df_out

Output:

  First_Name Last_Name                     Email  Value1  Value2     Matched
0      Aaron    Potter     aaronpotter@gmail.com    10.0    10.0        both
1      Bella   Granger    bellagranger@gmail.com     2.0     2.0        both
2        Ron     Black         black@hotmail.com    20.0     NaN   left_only
3      Harry   Weasley  harryweasley@hotmail.com    11.0     NaN   left_only
4     Ronald     Black   ronaldblack@hotmail.com     NaN     5.0  right_only
5      Harry   Weasley     tomriddle@hotmail.com     NaN    20.0  right_only

Or, with a left join:

df_out = df1.merge(df2, on = ['First_Name', 'Last_Name', 'Email'], 
          indicator='Matched', how='left')
print(df_out)

Output:

  First_Name Last_Name                     Email  Value1  Value2    Matched
0      Aaron    Potter     aaronpotter@gmail.com      10    10.0       both
1      Bella   Granger    bellagranger@gmail.com       2     2.0       both
2        Ron     Black         black@hotmail.com      20     NaN  left_only
3      Harry   Weasley  harryweasley@hotmail.com      11     NaN  left_only

And, using @anky's solution:

(df_out['Matched'] == 'both').sum()/df_out.shape[0]

Output:

0.5
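
For readers who want to run the steps above end to end, here is a minimal, self-contained sketch. The helper name match_fraction is hypothetical (it is not part of pandas or of the answers here), and dropping duplicate keys on the right-hand side is an added assumption so that one row is never counted twice.

```python
import pandas as pd

keys = ['First_Name', 'Last_Name', 'Email']

# Sample data copied from the question.
df1 = pd.DataFrame({
    'First_Name': ['Aaron', 'Bella', 'Ron', 'Harry'],
    'Last_Name': ['Potter', 'Granger', 'Black', 'Weasley'],
    'Email': ['aaronpotter@gmail.com', 'bellagranger@gmail.com',
              'black@hotmail.com', 'harryweasley@hotmail.com'],
    'Value1': [10, 2, 20, 11],
})
df2 = pd.DataFrame({
    'First_Name': ['Aaron', 'Ronald', 'Bella', 'Harry'],
    'Last_Name': ['Potter', 'Black', 'Granger', 'Weasley'],
    'Email': ['aaronpotter@gmail.com', 'ronaldblack@hotmail.com',
              'bellagranger@gmail.com', 'tomriddle@hotmail.com'],
    'Value2': [10, 5, 2, 20],
})


def match_fraction(left, right, keys):
    """Hypothetical helper: share of rows in `left` whose keys appear in `right`."""
    # Assumption: duplicate keys on the right should not inflate the row count.
    right_keys = right[keys].drop_duplicates()
    merged = left.merge(right_keys, on=keys, how='left', indicator='Matched')
    # The mean of a boolean Series is the share of True values.
    return (merged['Matched'] == 'both').mean()


left_share = match_fraction(df1, df2, keys)   # share of df1 found in df2
right_share = match_fraction(df2, df1, keys)  # share of df2 found in df1
print(f"{left_share:.0%} of df1 matched, {right_share:.0%} of df2 matched")
```

The denominator matters: with the left join it is the 4 rows of df1, whereas the same expression applied to the outer-join result would divide by all 6 rows and give roughly 0.33 instead of 0.5.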

@Scott Boston's answer is perfect. If you only have the "First_Name", "Last_Name" and "Email" columns, you can also use the following code.

df = pd.concat([df1[['First_Name','Last_Name','Email']],df2[['First_Name','Last_Name','Email']]])
df = df.reset_index(drop=True)
gb = df.groupby(list(df.columns))
idx = [x[0] for x in gb.groups.values() if len(x) == 2]
df.reindex(idx)

    First_Name  Last_Name   Email
0   Aaron       Potter      aaronpotter@gmail.com
1   Bella       Granger     bellagranger@gmail.com