Using Pandas, I have two data sets stored in two separate dataframes. Each dataframe is composed of two series.
The first dataframe has a series called 'name', the second series is a list of strings. It looks something like this:
name attributes
0 John [ABC, DEF, GHI, JKL, MNO, PQR, STU]
1 Mike [EUD, DBS, QMD, ABC, GHI]
2 Jane [JKL, EJD, MDE, MNO, DEF, ABC]
3 Kevin [FHE, EUD, GHI, MNO, ABC, AUE, HSG, PEO]
4 Stefanie [STU, EJD, DUE]
The second dataframe is similar with the first series being
username attr
0 username_1 [DHD, EOA, AUE, CHE, ABC, PQR, QJF]
1 username_2 [ABC, EKR, ADT, GHI, JKL, EJD, MNO, MDE]
2 username_3 [DSB, AOD, DEF, MNO, DEF, ABC, TAE]
3 username_4 [DJH, EUD, GHI, MNO, ABC, FHE]
4 username_5 [CHQ, ELT, ABC, DEF, GHI]
What I'm trying to achieve is to compare the attributes (second series) of each dataframe to see which names and usernames share the most attributes.
For example, username_4 has 5 out of 6 attributes matching those of Kevin's.
I thought of looping one of the attributes series and see if there's a match in each row of the other series but couldn't loop effectively (maybe because my lists don't have quotation marks around the strings?).
I don't really know what possibilities exist to compare those two series and end up with a result as mentioned above (username_4 has 5 out of 6 attributes matching those of Kevin's).
What would be the possible approach(es) here?
You could try a method like below:
# Import pandas library
import pandas as pd
# Create our data frames
data1 = [['John', ['ABC', 'DEF', 'GHI', 'JKL', 'MNO', 'PQR', 'STU']], ['Mike', ['EUD', 'DBS', 'QMD', 'ABC', 'GHI']],
['Jane', ['JKL', 'EJD', 'MDE', 'MNO', 'DEF', 'ABC']], ['Kevin', ['FHE', 'EUD', 'GHI', 'MNO', 'ABC', 'AUE', 'HSG', 'PEO']],
['Stefanie', ['STU', 'EJD', 'DUE']]]
data2 = [['username_1', ['DHD', 'EOA', 'AUE', 'CHE', 'ABC', 'PQR', 'QJF']], ['username_2', ['ABC', 'EKR', 'ADT', 'GHI', 'JKL', 'EJD', 'MNO', 'MDE']],
['username_3', ['DSB', 'AOD', 'DEF', 'MNO', 'DEF', 'ABC', 'TAE']], ['username_4', ['DJH', 'EUD', 'GHI', 'MNO', 'ABC', 'FHE']],
['username_5', ['CHQ', 'ELT', 'ABC', 'DEF', 'GHI']]]
# Create the pandas DataFrames with column name is provided explicitly
df1 = pd.DataFrame(data1, columns=['name', 'attributes'])
df2 = pd.DataFrame(data2, columns=['username', 'attr'])
# Create helper function to compare our two data frames
def func(inputDataFrame2, inputDataFrame1):
outputDictionary = {} # Set a dictionary for our output
for i, r in inputDataFrame2.iterrows(): # Loop over items in second data frame
dictBuilder = {}
for index, row in inputDataFrame1.iterrows(): # Loop over items in first data frame
name = row['name']
dictBuilder[name] = len([w for w in r['attr'] if w in row['attributes']]) # Get count of items in both lists
maxKey = max(dictBuilder, key=dictBuilder.get) # Get the max value from the list of repeated items
outputDictionary[r['username']] = [maxKey, dictBuilder[maxKey]] # Add name and count of attribute matches to dictionary
print(outputDictionary) # Debug print statement
return outputDictionary # Return our output dictionary here for further processing
a = func(df2, df1)
That should yield an output like below:
{'username_1': ['John', 2], 'username_2': ['Jane', 5], 'username_3': ['John', 4], 'username_4': ['Kevin', 5], 'username_5': ['John', 3]}
Where each item in the dictionary returned from outputDictionary will have:
username
from the second data frameNote that this method could be optimized in how it loops over each row in the two data frames - The thread below describes a few different ways to process rows in data frames:
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.