Python: how to identify common elements in lists from two dataframes' series

Using Pandas, I have two data sets stored in two separate dataframes. Each dataframe is composed of two series.

The first dataframe has a series called 'name'; its second series, 'attributes', holds a list of strings in each row. It looks something like this:

                  name                           attributes
0                 John  [ABC, DEF, GHI, JKL, MNO, PQR, STU]
1                 Mike  [EUD, DBS, QMD, ABC, GHI]
2                 Jane  [JKL, EJD, MDE, MNO, DEF, ABC]
3                Kevin  [FHE, EUD, GHI, MNO, ABC, AUE, HSG, PEO]
4             Stefanie  [STU, EJD, DUE]

The second dataframe is similar, with the first series being 'username' and the second 'attr':

              username                                 attr
0           username_1  [DHD, EOA, AUE, CHE, ABC, PQR, QJF]
1           username_2  [ABC, EKR, ADT, GHI, JKL, EJD, MNO, MDE]
2           username_3  [DSB, AOD, DEF, MNO, DEF, ABC, TAE]
3           username_4  [DJH, EUD, GHI, MNO, ABC, FHE]
4           username_5  [CHQ, ELT, ABC, DEF, GHI]

What I'm trying to achieve is to compare the attributes (second series) of each dataframe to see which names and usernames share the most attributes.

For example, username_4 has 5 of its 6 attributes in common with Kevin.

I thought of looping over one of the attributes series to see if there's a match in each row of the other series, but couldn't get the loop to work (maybe because my lists don't have quotation marks around the strings?).

I don't really know what options exist for comparing those two series and ending up with a result like the one above (username_4 has 5 of its 6 attributes in common with Kevin).

What would be the possible approach(es) here?

You could try a method like the one below:

# Import pandas library
import pandas as pd

# Create our data frames
data1 = [
    ['John', ['ABC', 'DEF', 'GHI', 'JKL', 'MNO', 'PQR', 'STU']],
    ['Mike', ['EUD', 'DBS', 'QMD', 'ABC', 'GHI']],
    ['Jane', ['JKL', 'EJD', 'MDE', 'MNO', 'DEF', 'ABC']],
    ['Kevin', ['FHE', 'EUD', 'GHI', 'MNO', 'ABC', 'AUE', 'HSG', 'PEO']],
    ['Stefanie', ['STU', 'EJD', 'DUE']],
]

data2 = [
    ['username_1', ['DHD', 'EOA', 'AUE', 'CHE', 'ABC', 'PQR', 'QJF']],
    ['username_2', ['ABC', 'EKR', 'ADT', 'GHI', 'JKL', 'EJD', 'MNO', 'MDE']],
    ['username_3', ['DSB', 'AOD', 'DEF', 'MNO', 'DEF', 'ABC', 'TAE']],
    ['username_4', ['DJH', 'EUD', 'GHI', 'MNO', 'ABC', 'FHE']],
    ['username_5', ['CHQ', 'ELT', 'ABC', 'DEF', 'GHI']],
]

# Create the pandas DataFrames with explicit column names
df1 = pd.DataFrame(data1, columns=['name', 'attributes'])
df2 = pd.DataFrame(data2, columns=['username', 'attr'])

# Helper function that finds, for each username, the name sharing the most attributes
def func(inputDataFrame2, inputDataFrame1):
    outputDictionary = {} # Maps each username to [best-matching name, match count]
    for i, r in inputDataFrame2.iterrows(): # Loop over rows in second data frame
        dictBuilder = {} # Maps each name to its match count for this username
        for index, row in inputDataFrame1.iterrows(): # Loop over rows in first data frame
            name = row['name']
            dictBuilder[name] = len([w for w in r['attr'] if w in row['attributes']]) # Count items of 'attr' that also appear in 'attributes'
        maxKey = max(dictBuilder, key=dictBuilder.get) # Name with the highest count (the first one wins on ties)
        outputDictionary[r['username']] = [maxKey, dictBuilder[maxKey]] # Store name and count of attribute matches
    print(outputDictionary) # Debug print statement
    return outputDictionary # Return our output dictionary here for further processing


a = func(df2, df1)

That should yield an output like the one below:

{'username_1': ['John', 2], 'username_2': ['Jane', 5], 'username_3': ['John', 4], 'username_4': ['Kevin', 5], 'username_5': ['John', 3]}

Each item in the dictionary returned by func (outputDictionary) has:

  • A key equal to the username from the second data frame
  • A value equal to a list containing the name with the most matches in the first data frame, followed by the match count
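
One detail of the loop above: the list comprehension counts every element of 'attr', so the repeated DEF in username_3 is counted twice (hence the 4 for John). If distinct shared attributes are what matters, a set intersection is one option. The following is only a minimal sketch, not the method from the answer: best_match_by_set is a hypothetical helper name, and just a two-row subset of each data frame is used to keep it short.

```python
import pandas as pd

# Two-row subsets of the data frames from the question (assumed for brevity)
df1 = pd.DataFrame({
    "name": ["John", "Kevin"],
    "attributes": [
        ["ABC", "DEF", "GHI", "JKL", "MNO", "PQR", "STU"],
        ["FHE", "EUD", "GHI", "MNO", "ABC", "AUE", "HSG", "PEO"],
    ],
})
df2 = pd.DataFrame({
    "username": ["username_3", "username_4"],
    "attr": [
        ["DSB", "AOD", "DEF", "MNO", "DEF", "ABC", "TAE"],
        ["DJH", "EUD", "GHI", "MNO", "ABC", "FHE"],
    ],
})


def best_match_by_set(users, people):
    """Hypothetical helper: map each username to
    [best-matching name, distinct shared attributes, distinct attributes of the user]."""
    result = {}
    for username, attr in zip(users["username"], users["attr"]):
        attr_set = set(attr)  # duplicates such as the second 'DEF' collapse here
        counts = {
            name: len(attr_set & set(attributes))  # size of the intersection
            for name, attributes in zip(people["name"], people["attributes"])
        }
        best = max(counts, key=counts.get)  # first name wins on ties
        result[username] = [best, counts[best], len(attr_set)]
    return result


matches = best_match_by_set(df2, df1)
print(matches)
```

On this subset, username_4 comes out as Kevin with 5 of 6, which agrees with the example in the question, and username_3 drops from 4 to 3 shared attributes with John because DEF is counted once.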

Note that this method could be optimized in how it loops over each row in the two data frames. The thread below describes a few different ways to process rows in data frames:

How to iterate over rows in a DataFrame in Pandas
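
As one possible way to avoid the nested iterrows loops, the lists can be flattened with DataFrame.explode and joined on the attribute value. This is a sketch under assumptions rather than a drop-in replacement: the column name 'attribute' and the variable names are made up for illustration, duplicates within a list are dropped before counting, and a pair that shares nothing does not appear in the result at all.

```python
import pandas as pd

# Small subsets of the data frames from the question (assumed for brevity)
df1 = pd.DataFrame({
    "name": ["John", "Kevin", "Stefanie"],
    "attributes": [
        ["ABC", "DEF", "GHI", "JKL", "MNO", "PQR", "STU"],
        ["FHE", "EUD", "GHI", "MNO", "ABC", "AUE", "HSG", "PEO"],
        ["STU", "EJD", "DUE"],
    ],
})
df2 = pd.DataFrame({
    "username": ["username_3", "username_4"],
    "attr": [
        ["DSB", "AOD", "DEF", "MNO", "DEF", "ABC", "TAE"],
        ["DJH", "EUD", "GHI", "MNO", "ABC", "FHE"],
    ],
})

# One row per (name, attribute) and per (username, attribute); repeats removed
left = (
    df1.explode("attributes")
    .drop_duplicates()
    .rename(columns={"attributes": "attribute"})
)
right = (
    df2.explode("attr")
    .drop_duplicates()
    .rename(columns={"attr": "attribute"})
)

# An inner join on the attribute keeps only attributes present on both sides
pairs = left.merge(right, on="attribute")

# Number of shared attributes for every (username, name) pair that overlaps
overlap = (
    pairs.groupby(["username", "name"])
    .size()
    .rename("shared")
    .reset_index()
)

# Row with the highest count for each username
best = overlap.loc[overlap.groupby("username")["shared"].idxmax()]
print(best)
```

The intermediate overlap table holds every overlapping pair, so it can also be sorted or filtered if more than the single best match per username is needed.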
