
Searching values from one DataFrame in another DataFrame

Warning: you are about to see a very, very bad piece of code. I know it, I just don't know how to fix it. I have tried several alternatives, but I lack experience in Pandas (or NumPy; perhaps that is a better alternative here). You have been warned!

I have two dataframes and I need to find matching information from dataframe one in dataframe two. Let me show you:

import pandas as pd

# DataFrame 1
d1 = {'name': ['John Doe', 'Jane Doe'],
      'email': ['john@example.com', 'jane@example.com'],
      'phone': ['15181111111', '15182222222']}

df1 = pd.DataFrame(data=d1)
###
# DataFrame 2
d2 = {'name': ['Fred Flinstone', 'Barney Rubble', 'Betty Rubble'],
      'email': ['john@example.com', 'barney@example.com', 'betty@example.com'],
      'Mobile': ['15183333333', '15182222222', '15184444444'],
      'LandLine': ['15181111111', '15182222222', '15185555555']}

df2 = pd.DataFrame(data=d2)

So my objective is to find which rows in df2 match each piece of available data in df1 (email, phone), except the name. When a match is found I need to keep a record of all the data from both dataframes.

Now, start biting your nails, take a deep breath and behold the disgrace I have committed. It does work, but you will quickly see what the issues are:

# Empty dataframe to store matches
df_found = pd.DataFrame(columns=['df1 Name', 'df1 email', 'df1 phone', 'df2 name', 'df2 email', 'df2 mobile', 'df2 landline'])

# Search for matches
for row_df1 in df1.itertuples():
    tmp_df = df2[df2['email'].str.contains(row_df1.email, na=False, case=False)]
    if(len(tmp_df) > 0):
        for row_df2 in tmp_df.itertuples():
            df_found.loc[len(df_found)] = [row_df1.name, row_df1.email, row_df1.phone, row_df2.name, row_df2.email, row_df2.Mobile, row_df2.LandLine]
    
    tmp_df = df2[df2['Mobile'].str.contains(row_df1.phone, na=False, case=False)]
    if(len(tmp_df) > 0):
        for row_df2 in tmp_df.itertuples():
            df_found.loc[len(df_found)] = [row_df1.name, row_df1.email, row_df1.phone, row_df2.name, row_df2.email, row_df2.Mobile, row_df2.LandLine]
    
    tmp_df = df2[df2['LandLine'].str.contains(row_df1.phone, na=False, case=False)]
    if(len(tmp_df) > 0):
        for row_df2 in tmp_df.itertuples():
            df_found.loc[len(df_found)] = [row_df1.name, row_df1.email, row_df1.phone, row_df2.name, row_df2.email, row_df2.Mobile, row_df2.LandLine]

# Drop duplicates - yes, of course there are many
df_found.drop_duplicates(keep='first', inplace=True)

There you go: a series of loops inside a loop, each of them traversing the same data and fattening up a temporary dataframe and a match-holder dataframe.

At the end I get my result:

(screenshot of the resulting df_found omitted)

But the speed is horrible. My real dataframes have 29 columns in the first and 55 in the second, with around 100 thousand records in the first and around half a million in the second. Right now the process takes around four hours on my i7 (no GPU, 16 GB RAM).

If you are able to breathe again and have stopped banging your head against the wall, I'd appreciate some ideas on how to do this right.

Thank you very much!

O(n^2) operations

Adding a single row to a dataframe requires copying the entire dataframe - so building up a dataframe one row at a time is an O(n^2) operation, and very slow. Also, Series.str.contains requires checking every single string value for whether it's contained. Since you're comparing every row to every other row, that too is an O(n^2) operation.

In general, single-row operations in Pandas indicate very slow code.
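As a rough illustration (with made-up column names, not the asker's real schema), accumulating plain Python dicts in a list and constructing the DataFrame once at the end turns the quadratic row-append into a linear build:

```python
import pandas as pd

# Instead of df.loc[len(df)] = row (which copies the whole frame each time),
# collect plain dicts in a list and build the DataFrame once at the end.
rows = []
for i in range(1000):
    rows.append({"a": i, "b": i * 2})  # O(1) append per row

df = pd.DataFrame(rows)  # single O(n) construction
```

This only removes the copy-per-row cost; the merge approach below also removes the row-by-row comparison.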

Replace for loops with merge

You can do a SQL-style join to do what you're trying to do here.

email_merge = df1.merge(df2, on=["email"], suffixes=("", "_right"))
mobile_merge = df1.merge(df2, left_on=["phone"], right_on=["Mobile"], suffixes=("", "_right"))
landline_merge = df1.merge(df2, left_on=["phone"], right_on=["LandLine"], suffixes=("", "_right"))

The first line joins on the email fields. The second join targets the first kind of phone number, and the third the second kind. By the way, you're going to end up with quite a lot of duplicates this way.

You can then concatenate each of these dataframes together:

print(pd.concat([email_merge, landline_merge, mobile_merge], sort=True))

This gives me the following result:

      LandLine       Mobile             email         email_right      name      name_right        phone
0  15181111111  15183333333  john@example.com                 NaN  John Doe  Fred Flinstone  15181111111
0  15181111111  15183333333  john@example.com    john@example.com  John Doe  Fred Flinstone  15181111111
1  15182222222  15182222222  jane@example.com  barney@example.com  Jane Doe   Barney Rubble  15182222222
0  15182222222  15182222222  jane@example.com  barney@example.com  Jane Doe   Barney Rubble  15182222222
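Since the three merges can rediscover the same pairing (John/Fred matches on both email and landline here), one way to deduplicate the concatenated result is to keep one row per df1/df2 name pair. A sketch using the example frames from the question (the `subset` choice is an assumption; pick whichever columns identify a match in your real data):

```python
import pandas as pd

d1 = {'name': ['John Doe', 'Jane Doe'],
      'email': ['john@example.com', 'jane@example.com'],
      'phone': ['15181111111', '15182222222']}
df1 = pd.DataFrame(d1)

d2 = {'name': ['Fred Flinstone', 'Barney Rubble', 'Betty Rubble'],
      'email': ['john@example.com', 'barney@example.com', 'betty@example.com'],
      'Mobile': ['15183333333', '15182222222', '15184444444'],
      'LandLine': ['15181111111', '15182222222', '15185555555']}
df2 = pd.DataFrame(d2)

email_merge = df1.merge(df2, on=["email"], suffixes=("", "_right"))
mobile_merge = df1.merge(df2, left_on=["phone"], right_on=["Mobile"], suffixes=("", "_right"))
landline_merge = df1.merge(df2, left_on=["phone"], right_on=["LandLine"], suffixes=("", "_right"))

matches = pd.concat([email_merge, landline_merge, mobile_merge], sort=True)

# Drop rows that describe the same df1/df2 pairing found via different columns.
deduped = matches.drop_duplicates(subset=["name", "name_right"]).reset_index(drop=True)
```

On the example data this collapses the four concatenated rows down to two: John Doe / Fred Flinstone and Jane Doe / Barney Rubble.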

Another option is an inner merge on the columns the two frames share (here, only email), then merging df1 back in to recover the name:

df_found = pd.merge(df1.loc[:, df1.columns != 'name'], df2.loc[:, df2.columns != 'name'], how='inner')
df_found = df_found.merge(df1)
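Run against the example frames from the question, this variant looks like the sketch below. Note that because the frames only share the email column, it catches the email match but not the phone matches:

```python
import pandas as pd

d1 = {'name': ['John Doe', 'Jane Doe'],
      'email': ['john@example.com', 'jane@example.com'],
      'phone': ['15181111111', '15182222222']}
df1 = pd.DataFrame(d1)

d2 = {'name': ['Fred Flinstone', 'Barney Rubble', 'Betty Rubble'],
      'email': ['john@example.com', 'barney@example.com', 'betty@example.com'],
      'Mobile': ['15183333333', '15182222222', '15184444444'],
      'LandLine': ['15181111111', '15182222222', '15185555555']}
df2 = pd.DataFrame(d2)

# Inner merge on the shared columns (only 'email' once 'name' is excluded),
# then merge df1 back in to recover the 'name' column.
df_found = pd.merge(df1.loc[:, df1.columns != 'name'],
                    df2.loc[:, df2.columns != 'name'], how='inner')
df_found = df_found.merge(df1)
```

The result is a single row for John Doe's email match; df2's name column is dropped along the way, so the explicit three-merge approach above is closer to the full requirement.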
