Warning: you are about to see a very, very bad piece of code. I know it, I just don't know how to fix it. I have tried several alternatives but I lack experience in Pandas (or numpy, perhaps that is a better alternative here). You have been warned!
I have two dataframes and I need to find the rows of dataframe two that match information in dataframe one. Let me show you:
import pandas as pd

# DataFrame 1
d1 = {'name': ['John Doe', 'Jane Doe'],
      'email': ['john@example.com', 'jane@example.com'],
      'phone': ['15181111111', '15182222222']}
df1 = pd.DataFrame(data=d1)
###
# DataFrame 2
d2 = {'name': ['Fred Flinstone', 'Barney Rubble', 'Betty Rubble'],
      'email': ['john@example.com', 'barney@example.com', 'betty@example.com'],
      'Mobile': ['15183333333', '15182222222', '15184444444'],
      'LandLine': ['15181111111', '15182222222', '15185555555']}
df2 = pd.DataFrame(data=d2)
So my objective is to find which rows in df2 match any piece of the available data in df1 other than the name (that is, email or phone). When a match is found I need to keep a record of all the data from both dataframes.
Now, start biting your nails, take a deep breath and see the disgrace that I am doing. It does work but you will quickly realize what the issues are:
# Empty dataframe to store matches
df_found = pd.DataFrame(columns=['df1 Name', 'df1 email', 'df1 phone', 'df2 name', 'df2 email', 'df2 mobile', 'df2 landline'])
# Search for matches
for row_df1 in df1.itertuples():
    tmp_df = df2[df2['email'].str.contains(row_df1.email, na=False, case=False)]
    if(len(tmp_df) > 0):
        for row_df2 in tmp_df.itertuples():
            df_found.loc[len(df_found)] = [row_df1.name, row_df1.email, row_df1.phone, row_df2.name, row_df2.email, row_df2.Mobile, row_df2.LandLine]
    tmp_df = df2[df2['Mobile'].str.contains(row_df1.phone, na=False, case=False)]
    if(len(tmp_df) > 0):
        for row_df2 in tmp_df.itertuples():
            df_found.loc[len(df_found)] = [row_df1.name, row_df1.email, row_df1.phone, row_df2.name, row_df2.email, row_df2.Mobile, row_df2.LandLine]
    tmp_df = df2[df2['LandLine'].str.contains(row_df1.phone, na=False, case=False)]
    if(len(tmp_df) > 0):
        for row_df2 in tmp_df.itertuples():
            df_found.loc[len(df_found)] = [row_df1.name, row_df1.email, row_df1.phone, row_df2.name, row_df2.email, row_df2.Mobile, row_df2.LandLine]
# Drop duplicates - Yes of course there are many
df_found.drop_duplicates(keep='first', inplace=True)
There you go, I have a series of loops inside a loop, each one of them traversing the same data and fattening a temporary dataframe and a match holder dataframe.
At the end I get my result:
But the speed is horrible. My real dataframes have 29 columns (the first) and 55 columns (the second). There are around 100 thousand records in the first and around half a million in the second. Right now the process takes around four hours on my i7 with no GPU and 16GB RAM.
If you are able to breathe again and have stopped banging your head against the wall, I'd appreciate some ideas on how to do this right.
Thank you very much!
Adding a single row to a dataframe requires copying the entire dataframe, so building up a dataframe one row at a time is an O(n^2) operation, and very slow. Also, Series.str.contains has to scan every string in the column for the pattern. Since you call it once per row of df1, you end up comparing every row of df1 against every row of df2, which is also quadratic: roughly len(df1) * len(df2) string searches.
In general, single-row operations in Pandas indicate very slow code.
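As a small illustration of the first point, here is a minimal sketch of the usual workaround when a loop really is needed: collect the rows in a plain Python list and build the dataframe once at the end. The column names and the loop body are made up for the example.

```python
import pandas as pd

# Hypothetical example data: accumulate rows as dicts in a list.
# Appending to a list is cheap; nothing is copied on each iteration.
records = []
for i in range(5):
    records.append({"id": i, "square": i * i})

# Build the dataframe a single time, after the loop has finished.
df = pd.DataFrame(records)
print(df)
```

This only removes the cost of growing the dataframe row by row. It does not help with the repeated str.contains scans, which need a different approach.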
You can do a SQL-style join to do what you're trying to do here.
email_merge = df1.merge(df2, on=["email"], suffixes=("", "_right"))
mobile_merge = df1.merge(df2, left_on=["phone"], right_on=["Mobile"], suffixes=("", "_right"))
landline_merge = df1.merge(df2, left_on=["phone"], right_on=["LandLine"], suffixes=("", "_right"))
The first line joins on the email fields. The second joins df1's phone against df2's Mobile column, and the third joins it against LandLine. Note that a merge matches on exact equality, not on substrings the way str.contains does. You're going to end up with quite a lot of duplicates doing this, by the way.
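One difference from the loop in the question is worth a sketch: the loop used case=False, while a merge compares keys exactly. Assuming a case-insensitive, whitespace-tolerant match on whole email addresses is what is wanted (substring matching is not reproduced here), one option is to merge on a normalized key. The email_key column name and the sample values are illustrative only.

```python
import pandas as pd

# Illustrative data: the same address with different casing and stray whitespace.
df1 = pd.DataFrame({"name": ["John Doe"], "email": ["John@Example.com"]})
df2 = pd.DataFrame({"name": ["Fred Flinstone"], "email": ["john@example.com "]})

# Build a normalized join key on both sides (hypothetical column name).
df1["email_key"] = df1["email"].str.strip().str.lower()
df2["email_key"] = df2["email"].str.strip().str.lower()

# Exact match on the normalized key; the original email columns are kept
# as email and email_right.
merged = df1.merge(df2, on="email_key", suffixes=("", "_right"))
print(merged)
```

The same idea would apply to the phone columns if they can differ in formatting.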
You can then concatenate each of these dataframes together:
print(pd.concat([email_merge, landline_merge, mobile_merge], sort=True))
This gives me the following result:
      LandLine       Mobile             email         email_right      name      name_right        phone
0  15181111111  15183333333  john@example.com                 NaN  John Doe  Fred Flinstone  15181111111
0  15181111111  15183333333  john@example.com    john@example.com  John Doe  Fred Flinstone  15181111111
1  15182222222  15182222222  jane@example.com  barney@example.com  Jane Doe   Barney Rubble  15182222222
0  15182222222  15182222222  jane@example.com  barney@example.com  Jane Doe   Barney Rubble  15182222222
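In that output the email merge has no email_right value, because the shared email key is kept as a single column, so a plain drop_duplicates would not collapse the two John Doe rows. One possible way around that, sketched here with the sample data from the question, is to prefix df2's columns before merging so that all three merges produce the same columns. The df2_ prefix and the pairs list are illustrative choices, not part of the answer above.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['John Doe', 'Jane Doe'],
                    'email': ['john@example.com', 'jane@example.com'],
                    'phone': ['15181111111', '15182222222']})
df2 = pd.DataFrame({'name': ['Fred Flinstone', 'Barney Rubble', 'Betty Rubble'],
                    'email': ['john@example.com', 'barney@example.com', 'betty@example.com'],
                    'Mobile': ['15183333333', '15182222222', '15184444444'],
                    'LandLine': ['15181111111', '15182222222', '15185555555']})

# Prefix every df2 column so no names collide with df1.
df2_prefixed = df2.add_prefix('df2_')

# (df1 column, df2 column) pairs to match on, mirroring the three merges.
pairs = [('email', 'df2_email'),
         ('phone', 'df2_Mobile'),
         ('phone', 'df2_LandLine')]

# Each inner merge yields the same seven columns, so the results stack cleanly.
merges = [df1.merge(df2_prefixed, left_on=left, right_on=right)
          for left, right in pairs]

# Stack the matches and keep each df1/df2 pairing once.
result = (pd.concat(merges, ignore_index=True)
            .drop_duplicates()
            .reset_index(drop=True))
print(result)
```

With the sample data this leaves one row per matched pair (John Doe with Fred Flinstone, Jane Doe with Barney Rubble), regardless of how many of the three keys matched.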
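Try merging the dataframes. With no key given, merge joins on the columns both frames share, which here is only email once name is dropped; the second merge brings the df1 names back: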
df_found = pd.merge(df1.loc[:, df1.columns != 'name'], df2.loc[:, df2.columns != 'name'], how='inner')
df_found = df_found.merge(df1)