Python: How to identify common elements in lists stored in series of two dataframes

Question

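Using Pandas, I have two datasets stored in two separate dataframes. Each dataframe consists of two series.

The first dataframe has a series called "name", and its second series holds a list of strings. It looks like this: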

                  name                           attributes
0                 John  [ABC, DEF, GHI, JKL, MNO, PQR, STU]
1                 Mike  [EUD, DBS, QMD, ABC, GHI]
2                 Jane  [JKL, EJD, MDE, MNO, DEF, ABC]
3                Kevin  [FHE, EUD, GHI, MNO, ABC, AUE, HSG, PEO]
4             Stefanie  [STU, EJD, DUE]

The second dataframe has a similar structure:

              username                                 attr
0           username_1  [DHD, EOA, AUE, CHE, ABC, PQR, QJF]
1           username_2  [ABC, EKR, ADT, GHI, JKL, EJD, MNO, MDE]
2           username_3  [DSB, AOD, DEF, MNO, DEF, ABC, TAE]
3           username_4  [DJH, EUD, GHI, MNO, ABC, FHE]
4           username_5  [CHQ, ELT, ABC, DEF, GHI]

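What I want to achieve is to compare the attributes (the second series) of each dataframe to see which names and usernames share the most attributes.

For example, 5 of username_4's 6 attributes match Kevin's.

I tried looping over one of the attribute series and checking each row of the other series for a match, but I couldn't get the loop to work (maybe because the strings in my lists have no quotes around them?).

I really don't know what options exist for comparing these two series and arriving at a result like the one described above (5 of username_4's 6 attributes match Kevin's).

What would be a possible approach here?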

Answer 1

You could try something like this:

# Import pandas library
import pandas as pd

# Create our data frames
data1 = [['John', ['ABC', 'DEF', 'GHI', 'JKL', 'MNO', 'PQR', 'STU']], ['Mike', ['EUD', 'DBS', 'QMD', 'ABC', 'GHI']],
['Jane', ['JKL', 'EJD', 'MDE', 'MNO', 'DEF', 'ABC']], ['Kevin', ['FHE', 'EUD', 'GHI', 'MNO', 'ABC', 'AUE', 'HSG', 'PEO']], 
['Stefanie', ['STU', 'EJD', 'DUE']]]

data2 = [['username_1', ['DHD', 'EOA', 'AUE', 'CHE', 'ABC', 'PQR', 'QJF']], ['username_2', ['ABC', 'EKR', 'ADT', 'GHI', 'JKL', 'EJD', 'MNO', 'MDE']],
['username_3', ['DSB', 'AOD', 'DEF', 'MNO', 'DEF', 'ABC', 'TAE']], ['username_4', ['DJH', 'EUD', 'GHI', 'MNO', 'ABC', 'FHE']], 
['username_5', ['CHQ', 'ELT', 'ABC', 'DEF', 'GHI']]]
  
# Create the pandas DataFrames with explicit column names
df1 = pd.DataFrame(data1, columns=['name', 'attributes'])
df2 = pd.DataFrame(data2, columns=['username', 'attr'])

# Create helper function to compare our two data frames
def func(inputDataFrame2, inputDataFrame1):
    outputDictionary = {} # Set a dictionary for our output
    for i, r in inputDataFrame2.iterrows(): # Loop over items in second data frame
        dictBuilder = {}
        for index, row in inputDataFrame1.iterrows(): # Loop over items in first data frame
            name = row['name']
            dictBuilder[name] = len([w for w in r['attr'] if w in row['attributes']]) # Get count of items in both lists
        maxKey = max(dictBuilder, key=dictBuilder.get) # Get the name with the highest match count (the first one wins on ties)
        outputDictionary[r['username']] = [maxKey, dictBuilder[maxKey]] # Add name and count of attribute matches to dictionary
    print(outputDictionary) # Debug print statement
    return outputDictionary # Return our output dictionary here for further processing


a = func(df2, df1)

This should produce output like the following:

{'username_1': ['John', 2], 'username_2': ['Jane', 5], 'username_3': ['John', 4], 'username_4': ['Kevin', 5], 'username_5': ['John', 3]}

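Each item in the dictionary returned as outputDictionary has:

A key equal to the username from the second data frame
A value that is a list containing the name from the first data frame with the most matches, together with the match count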
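
If a table is easier to work with than a dictionary, the result can be turned into a DataFrame. This is only a small sketch: the `result` literal repeats the output shown above, and the names `summary`, `best_match` and `shared_count` are made up for illustration.

```python
import pandas as pd

# Same shape as the dictionary returned by func(df2, df1)
result = {
    'username_1': ['John', 2],
    'username_2': ['Jane', 5],
    'username_3': ['John', 4],
    'username_4': ['Kevin', 5],
    'username_5': ['John', 3],
}

# orient='index' turns each dictionary key into a row label;
# the two list entries become the two columns named here
summary = pd.DataFrame.from_dict(
    result, orient='index', columns=['best_match', 'shared_count']
)
summary.index.name = 'username'

# Put the strongest matches first
summary = summary.sort_values('shared_count', ascending=False)
print(summary)
```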

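Note that the way this approach loops over every row of both data frames could be optimized. The thread below describes several different ways of processing rows in a data frame:

How to iterate over rows in a DataFrame in Pandas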
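
As one possible way to avoid the nested iterrows loops, the lists can be converted to sets once and then intersected pairwise. This is a sketch rather than a drop-in replacement: the names `sets1`, `sets2`, `overlap`, `best` and `count` are my own, and because sets drop duplicates, username_3's repeated DEF is counted once here (3 shared with John instead of 4).

```python
import pandas as pd

df1 = pd.DataFrame({
    'name': ['John', 'Mike', 'Jane', 'Kevin', 'Stefanie'],
    'attributes': [
        ['ABC', 'DEF', 'GHI', 'JKL', 'MNO', 'PQR', 'STU'],
        ['EUD', 'DBS', 'QMD', 'ABC', 'GHI'],
        ['JKL', 'EJD', 'MDE', 'MNO', 'DEF', 'ABC'],
        ['FHE', 'EUD', 'GHI', 'MNO', 'ABC', 'AUE', 'HSG', 'PEO'],
        ['STU', 'EJD', 'DUE'],
    ],
})
df2 = pd.DataFrame({
    'username': ['username_1', 'username_2', 'username_3', 'username_4', 'username_5'],
    'attr': [
        ['DHD', 'EOA', 'AUE', 'CHE', 'ABC', 'PQR', 'QJF'],
        ['ABC', 'EKR', 'ADT', 'GHI', 'JKL', 'EJD', 'MNO', 'MDE'],
        ['DSB', 'AOD', 'DEF', 'MNO', 'DEF', 'ABC', 'TAE'],
        ['DJH', 'EUD', 'GHI', 'MNO', 'ABC', 'FHE'],
        ['CHQ', 'ELT', 'ABC', 'DEF', 'GHI'],
    ],
})

# Build each set once instead of rescanning the lists for every pair
sets1 = dict(zip(df1['name'], df1['attributes'].map(set)))
sets2 = dict(zip(df2['username'], df2['attr'].map(set)))

# Rows are usernames, columns are names, cells are shared-attribute counts
overlap = pd.DataFrame({
    name: {user: len(s1 & s2) for user, s2 in sets2.items()}
    for name, s1 in sets1.items()
})

best = overlap.idxmax(axis=1)   # best-matching name per username
count = overlap.max(axis=1)     # how many attributes that pair shares
print(pd.DataFrame({'best_match': best, 'shared_count': count}))
```

The full `overlap` table is also handy if you want the second-best match or the reverse lookup (best username per name, via `overlap.idxmax(axis=0)`).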
