Python: How to identify common elements in lists held in Series of two DataFrames

Question

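Using Pandas, I have two data sets stored in two separate DataFrames. Each DataFrame consists of two Series.

The first DataFrame has a Series called "name", and its second Series holds a list of strings in each row. It looks like this: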

                  name                           attributes
0                 John  [ABC, DEF, GHI, JKL, MNO, PQR, STU]
1                 Mike  [EUD, DBS, QMD, ABC, GHI]
2                 Jane  [JKL, EJD, MDE, MNO, DEF, ABC]
3                Kevin  [FHE, EUD, GHI, MNO, ABC, AUE, HSG, PEO]
4             Stefanie  [STU, EJD, DUE]

The second DataFrame has a similar structure:

              username                                 attr
0           username_1  [DHD, EOA, AUE, CHE, ABC, PQR, QJF]
1           username_2  [ABC, EKR, ADT, GHI, JKL, EJD, MNO, MDE]
2           username_3  [DSB, AOD, DEF, MNO, DEF, ABC, TAE]
3           username_4  [DJH, EUD, GHI, MNO, ABC, FHE]
4           username_5  [CHQ, ELT, ABC, DEF, GHI]

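What I want to achieve is to compare the attributes (the second Series) of each DataFrame to see which names and usernames share the most attributes.

For example, 5 of username_4's 6 attributes match Kevin's attributes.

I thought of looping over one of the attribute Series and checking each row of the other Series for matches, but I could not get the loop to work efficiently (maybe because the strings in my lists have no quotes around them?).

I really don't know what options exist for comparing these two Series and arriving at a result like the one described above (5 of username_4's 6 attributes match Kevin's).

What would be a possible approach here?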

Answer 1

You could try something like the following:

# Import pandas library
import pandas as pd

# Create the sample data: one [name, list of attributes] pair per row
data1 = [
    ['John', ['ABC', 'DEF', 'GHI', 'JKL', 'MNO', 'PQR', 'STU']],
    ['Mike', ['EUD', 'DBS', 'QMD', 'ABC', 'GHI']],
    ['Jane', ['JKL', 'EJD', 'MDE', 'MNO', 'DEF', 'ABC']],
    ['Kevin', ['FHE', 'EUD', 'GHI', 'MNO', 'ABC', 'AUE', 'HSG', 'PEO']],
    ['Stefanie', ['STU', 'EJD', 'DUE']],
]

data2 = [
    ['username_1', ['DHD', 'EOA', 'AUE', 'CHE', 'ABC', 'PQR', 'QJF']],
    ['username_2', ['ABC', 'EKR', 'ADT', 'GHI', 'JKL', 'EJD', 'MNO', 'MDE']],
    ['username_3', ['DSB', 'AOD', 'DEF', 'MNO', 'DEF', 'ABC', 'TAE']],
    ['username_4', ['DJH', 'EUD', 'GHI', 'MNO', 'ABC', 'FHE']],
    ['username_5', ['CHQ', 'ELT', 'ABC', 'DEF', 'GHI']],
]

# Create the pandas DataFrames with explicit column names
df1 = pd.DataFrame(data1, columns=['name', 'attributes'])
df2 = pd.DataFrame(data2, columns=['username', 'attr'])

# Helper function: for each username in the second data frame, find the name
# in the first data frame that shares the most attributes
def func(inputDataFrame2, inputDataFrame1):
    outputDictionary = {}  # Maps each username to [best matching name, match count]
    for i, r in inputDataFrame2.iterrows():  # Loop over rows in the second data frame
        dictBuilder = {}  # Match count per name for the current username
        for index, row in inputDataFrame1.iterrows():  # Loop over rows in the first data frame
            name = row['name']
            # Count the items of r['attr'] that also appear in row['attributes']
            # (a value repeated in r['attr'] is counted each time it appears)
            dictBuilder[name] = len([w for w in r['attr'] if w in row['attributes']])
        maxKey = max(dictBuilder, key=dictBuilder.get)  # Name with the highest count (the first one wins on ties)
        outputDictionary[r['username']] = [maxKey, dictBuilder[maxKey]]  # Store the name and its match count
    print(outputDictionary)  # Debug print statement
    return outputDictionary  # Return the dictionary for further processing


a = func(df2, df1)

This should produce output like the following:

{'username_1': ['John', 2], 'username_2': ['Jane', 5], 'username_3': ['John', 4], 'username_4': ['Kevin', 5], 'username_5': ['John', 3]}
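On the quotes mentioned in the question: pandas displays strings inside a list cell without quotes, so the missing quotes in the printed frame are not a problem in themselves. If your own data gives unexpected counts, it may be worth checking what the cells actually hold. A minimal sketch, assuming a small made-up frame (`sample`, `cell_types` and `all_lists` are hypothetical names used only for illustration):

```python
import pandas as pd

# Hypothetical one-row frame with a real Python list of strings in the cell
sample = pd.DataFrame({'name': ['John'], 'attributes': [['ABC', 'DEF']]})

# The printed frame shows the list without quotes around the strings
print(sample)

# Inspect the actual Python type stored in each cell
cell_types = sample['attributes'].map(type).unique()
print(cell_types)

# True only if every cell is a real list
all_lists = sample['attributes'].map(lambda v: isinstance(v, list)).all()
print(all_lists)
```

If the check reports `str` instead of `list` (for example after loading the data from a CSV file), the cells would need to be converted to lists before the comparison above gives meaningful counts.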

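Each item in the dictionary returned as outputDictionary has:

A key equal to the username from the second data frame
A value that is a list containing the name from the first data frame with the most matches, followed by the match count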
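As an illustration of further processing, the dictionary can be turned into a small table and sorted by match count. This is only a sketch: the values are copied from the sample output above, and the names `ranking`, `best_match` and `match_count` are made up for this example.

```python
import pandas as pd

# Dictionary in the shape returned by func (values copied from the sample output)
a = {
    'username_1': ['John', 2],
    'username_2': ['Jane', 5],
    'username_3': ['John', 4],
    'username_4': ['Kevin', 5],
    'username_5': ['John', 3],
}

# Look up a single username directly
best_name, match_count = a['username_4']
print(best_name, match_count)

# Build a table with one row per username and sort by match count
ranking = pd.DataFrame.from_dict(a, orient='index', columns=['best_match', 'match_count'])
ranking.index.name = 'username'
ranking = ranking.sort_values('match_count', ascending=False)
print(ranking)
```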

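Note that the way this approach loops over every row of both data frames could be optimized. The thread below describes several different ways to process rows in a data frame:

How to iterate over rows in a DataFrame in Pandas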
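One possible variant that avoids the nested iterrows loops is sketched below. It is not the method from the answer above: it pairs every name with every username through a cross join (which assumes pandas 1.2 or newer for `how='cross'`) and counts shared attributes with set intersection. The names `pairs`, `shared` and `best` are made up for this example.

```python
import pandas as pd

# Sample data copied from the answer above
df1 = pd.DataFrame({
    'name': ['John', 'Mike', 'Jane', 'Kevin', 'Stefanie'],
    'attributes': [
        ['ABC', 'DEF', 'GHI', 'JKL', 'MNO', 'PQR', 'STU'],
        ['EUD', 'DBS', 'QMD', 'ABC', 'GHI'],
        ['JKL', 'EJD', 'MDE', 'MNO', 'DEF', 'ABC'],
        ['FHE', 'EUD', 'GHI', 'MNO', 'ABC', 'AUE', 'HSG', 'PEO'],
        ['STU', 'EJD', 'DUE'],
    ],
})
df2 = pd.DataFrame({
    'username': ['username_1', 'username_2', 'username_3', 'username_4', 'username_5'],
    'attr': [
        ['DHD', 'EOA', 'AUE', 'CHE', 'ABC', 'PQR', 'QJF'],
        ['ABC', 'EKR', 'ADT', 'GHI', 'JKL', 'EJD', 'MNO', 'MDE'],
        ['DSB', 'AOD', 'DEF', 'MNO', 'DEF', 'ABC', 'TAE'],
        ['DJH', 'EUD', 'GHI', 'MNO', 'ABC', 'FHE'],
        ['CHQ', 'ELT', 'ABC', 'DEF', 'GHI'],
    ],
})

# Pair every name with every username (assumes pandas >= 1.2)
pairs = df1.merge(df2, how='cross')

# Count the distinct attributes each pair has in common
pairs['shared'] = [
    len(set(a) & set(b)) for a, b in zip(pairs['attributes'], pairs['attr'])
]

# For each username, keep the row with the highest count
best_rows = pairs.groupby('username')['shared'].idxmax()
best = pairs.loc[best_rows, ['username', 'name', 'shared']].reset_index(drop=True)
print(best)
```

Because sets ignore duplicates, the repeated DEF in username_3's list is counted once here, so that username's best count is 3 rather than the 4 reported by the loop-based function.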
