How to pair the most similar columns of two DataFrames using a greedy approach in Python

Question

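I have two DataFrames of size 24x10 (the actual DataFrames are much larger). All columns need to be paired greedily: enumerate the columns of df1 and, for each one, find the most similar (not necessarily identical) column in df2. In the end, every column of df1 should be paired with a column of df2 that has not already been assigned. The DataFrames are as follows.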

import pandas as pd

df1 = pd.DataFrame([[1., 1., 1., 1., 1., 1., 0., 1., 1., 1.],
   [1., 1., 1., 1., 1., 1., 0., 1., 1., 1.],
   [1., 1., 1., 1., 1., 1., 0., 1., 1., 0.],
   [1., 1., 1., 1., 1., 1., 0., 1., 1., 1.],
   [1., 1., 2., 1., 1., 1., 0., 1., 1., 0.],
   [2., 1., 1., 3., 1., 1., 0., 1., 1., 1.],
   [2., 1., 1., 2., 1., 1., 1., 1., 1., 0.],
   [2., 1., 1., 3., 1., 1., 1., 1., 1., 0.],
   [2., 1., 1., 3., 1., 2., 1., 1., 1., 0.],
   [2., 1., 1., 4., 2., 2., 1., 1., 1., 1.],
   [2., 4., 1., 4., 3., 1., 1., 1., 1., 1.],
   [2., 4., 1., 4., 3., 1., 1., 1., 1., 1.],
   [2., 4., 1., 5., 2., 1., 0., 1., 1., 1.],
   [2., 4., 1., 6., 2., 1., 0., 1., 1., 1.],
   [2., 4., 1., 5., 2., 1., 1., 1., 1., 1.],
   [2., 4., 1., 5., 1., 1., 0., 1., 1., 1.],
   [2., 4., 1., 5., 3., 1., 1., 1., 1., 1.],
   [1., 4., 1., 4., 2., 1., 1., 1., 1., 1.],
   [1., 4., 2., 4., 2., 1., 1., 1., 1., 1.],
   [1., 1., 2., 3., 2., 1., 1., 1., 1., 1.],
   [1., 1., 2., 1., 2., 1., 1., 1., 1., 1.],
   [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
   [1., 1., 1., 1., 1., 1., 0., 1., 1., 1.],
   [1., 1., 1., 1., 1., 1., 0., 1., 1., 1.]])

df2 = pd.DataFrame([[0., 0., 1., 1., 1., 1., 1., 1., 1., 0.],
   [0., 1., 1., 1., 1., 1., 1., 1., 0., 0.],
   [0., 0., 1., 1., 1., 1., 1., 1., 1., 0.],
   [0., 0., 1., 1., 1., 1., 1., 1., 0., 1.],
   [0., 0., 1., 1., 1., 1., 1., 1., 0., 1.],
   [0., 0., 0., 1., 1., 1., 1., 1., 1., 1.],
   [0., 1., 1., 1., 1., 1., 0., 0., 1., 1.],
   [0., 0., 1., 1., 1., 1., 1., 0., 1., 1.],
   [0., 1., 1., 1., 1., 1., 0., 0., 1., 1.],
   [0., 0., 1., 1., 1., 1., 1., 1., 0., 1.],
   [0., 1., 1., 1., 1., 1., 1., 0., 0., 1.],
   [0., 1., 1., 1., 1., 1., 1., 0., 0., 1.],
   [1., 1., 1., 1., 1., 1., 0., 0., 0., 1.],
   [0., 1., 1., 1., 0., 0., 1., 1., 1., 1.],
   [0., 0., 1., 1., 1., 1., 0., 1., 1., 1.],
   [0., 1., 1., 1., 1., 1., 0., 0., 1., 1.],
   [1., 1., 1., 1., 0., 0., 0., 1., 1., 1.],
   [0., 1., 1., 1., 1., 1., 1., 0., 0., 1.],
   [1., 1., 1., 1., 1., 0., 0., 0., 1., 1.],
   [0., 1., 1., 1., 1., 1., 1., 0., 0., 1.],
   [1., 1., 1., 1., 1., 1., 0., 0., 0., 1.],
   [0., 0., 1., 1., 1., 1., 1., 1., 1., 0.],
   [0., 1., 1., 1., 1., 1., 1., 1., 0., 0.],
   [0., 0., 1., 1., 1., 1., 1., 1., 1., 0.]] )

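"Most similar" can be defined as the largest number of common elements between columns. Any help or pointers would be appreciated. I tried the following.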

for key1, value1 in df1.iteritems():
#print(value)
    for key2, value2 in df2.iteritems():
        common_elements = [e for e in list(value1) if e in list(value2)]
    l = len(common_elements)

Answer 1

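Here is one way to do the matching.

Assumptions:

#1: Once a column in df1 is matched to a column in df2, both columns are removed from further matching. For example, if columns 1 and 3 of df1 both match column 5 of df2 perfectly, only column 1 of df1 is paired with column 5 of df2. Column 3 of df1 has to look for a different match.

#2: df1 and df2 have the same number of rows to compare. For this example I also assume they have the same number of columns. A minor tweak to the code can handle a different number of columns, but the number of rows must match.

#3: Column comparison is position-wise and exact. In other words, if row 1 of column 1 in df1 holds the value 1, then row 1 of column 1 in df2 should also hold 1. If it does, that counts as a match on that row and column. The data is not rearranged to look for matches elsewhere in df1 or df2.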
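To make assumption #3 concrete, here is a minimal sketch of the position-wise match count for a single pair of columns. The two short Series are made-up illustration data, not columns from df1 or df2.

```python
import pandas as pd

# Two short hypothetical columns (not taken from df1/df2)
a = pd.Series([1., 1., 2., 0.])
b = pd.Series([1., 0., 2., 1.])

# Compare values row by row: True where the same row holds the same value
same_position = a.to_numpy() == b.to_numpy()

# Summing booleans counts the True entries
match_count = int(same_position.sum())
print(match_count)  # 2 (rows 0 and 2 agree)
```

Both columns contain a 0, but those zeros sit in different rows, so they do not count as a match under this definition.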

With the assumptions above, the code looks like this.

#create a list to store all the match counts
df_list = []

#iterate through df1 first
for cols1 in df1.columns:

    #convert df1 column value to a list
    x = df1[cols1].tolist()

    #iterate through df2 to match to df1 column data
    for cols2 in df2.columns:

        #convert df2 column value to a list
        y = df2[cols2].tolist()

        #iterate and compare each value in df1[col1] with df2[col2]
        #i==j will result in True or False
        #sum() will count all True values (i.e., all matched values)

        z = sum((i==j) for i,j in zip(x,y))

        #store match count, col 1, col 2 into the list
        df_list.append((z,cols1,cols2))

#once you have iterated through df2 for each df1
#sort the df_list by descending order of match count, ascending order of df1 column
#highest match will be first, then df1 column
df_list = sorted(df_list,key=lambda x:(-x[0],x[1]))

dfc1,dfc2,points = [],[],[]

#iterate thru df_list and pick only if df1 column and df2 column were not picked earlier
#dfc1, dfc2, points will store each matched pair

for p,c1,c2 in df_list:
    if (c1 not in dfc1) and (c2 not in dfc2):
        points.append(p)
        dfc1.append(c1)
        dfc2.append(c2)

#print the matched values

for i in range(len(dfc1)):
    print(f'{points[i]:2} rows of df1[{dfc1[i]}] matches with df2[{dfc2[i]}]')

For the input DataFrames df1 and df2, the output is:

24 rows of df1[7] matches with df2[3]
23 rows of df1[8] matches with df2[2]
20 rows of df1[5] matches with df2[4]
18 rows of df1[2] matches with df2[5]
17 rows of df1[6] matches with df2[9]
15 rows of df1[9] matches with df2[1]
12 rows of df1[1] matches with df2[6]
 9 rows of df1[4] matches with df2[7]
 5 rows of df1[0] matches with df2[8]
 1 rows of df1[3] matches with df2[0]
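For larger DataFrames, the nested Python loops may become slow. Below is a sketch of the same greedy idea that uses NumPy broadcasting to compute all match counts at once. The function name `greedy_column_pairs` and the small demo frames are my own illustration, and it assumes both frames have the same number of rows.

```python
import numpy as np
import pandas as pd

def greedy_column_pairs(left, right):
    """Greedily pair columns of `left` with columns of `right`
    by position-wise match count (hypothetical helper)."""
    a = left.to_numpy()
    b = right.to_numpy()
    # scores[i, j] = number of rows where left column i equals right column j
    scores = (a[:, :, None] == b[:, None, :]).sum(axis=0)
    # Candidate (score, left position, right position) triples: best score
    # first, ties broken by left column position, then right column position
    candidates = sorted(
        ((int(scores[i, j]), i, j)
         for i in range(scores.shape[0])
         for j in range(scores.shape[1])),
        key=lambda t: (-t[0], t[1], t[2]),
    )
    used_left, used_right, pairs = set(), set(), []
    for score, i, j in candidates:
        # Accept a pair only if neither column has been assigned yet
        if i not in used_left and j not in used_right:
            used_left.add(i)
            used_right.add(j)
            pairs.append((left.columns[i], right.columns[j], score))
    return pairs

# Small made-up frames for demonstration
small1 = pd.DataFrame({"a": [1, 1, 0], "b": [0, 0, 1]})
small2 = pd.DataFrame({"x": [0, 0, 1], "y": [1, 1, 1]})
pairs = greedy_column_pairs(small1, small2)
print(pairs)  # [('b', 'x', 3), ('a', 'y', 2)]
```

The broadcast comparison builds a temporary boolean array of shape (rows, left columns, right columns), so for very wide frames it may be worth computing the scores in chunks instead.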

You can decide on a cutoff (for example, keep only pairs whose match count is above 15). A filter can be added before the data is appended to the list.
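A minimal sketch of such a cutoff, assuming candidate triples in the same (match count, df1 column, df2 column) layout as df_list. The sample triples and the threshold value are illustrative only.

```python
# Hypothetical (match_count, df1_column, df2_column) triples standing in for df_list
df_list = [(24, 7, 3), (12, 1, 6), (20, 5, 4), (5, 0, 8)]

cutoff = 15  # assumed threshold; choose whatever suits your data

# Keep only candidates whose match count is above the cutoff
filtered = [(z, c1, c2) for z, c1, c2 in df_list if z > cutoff]
print(filtered)  # [(24, 7, 3), (20, 5, 4)]
```

In the answer's loop, the equivalent would be to wrap the append call in a check such as `if z > cutoff:`.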
