How to get the intersection item between two dataframe columns?

[Image example]

As shown in the picture above, how can I find the total number of items that appear in both the 'Actual' and 'Predict' columns for every userId? The type is pandas.core.frame.DataFrame.

The code to construct the example table is as follows:

import pandas as pd
import numpy as np

# Actual items per user. The rows have different lengths, so they are kept
# as plain Python lists: recent NumPy versions raise a ValueError when
# np.array is given ragged nested lists.
actual = [[32, 256, 5, 102, 74, 171, 270, 111, 209, 24],
          [1, 258, 257, 281, 10, 269, 14, 13, 272, 273],
          [258, 260, 264, 11, 271, 288, 294, 300, 301],
          [9, 10, 11, 12, 22, 28],
          [1, 514, 2, 516, 4, 13, 526, 527, 1037, 529, 256, 678],
          [1, 1028, 7, 9, 1033, 15, 1047, 25, 546, 1061],
          [258, 259, 514, 261, 131, 135, 520, 265, 1028, 50],
          [2, 11, 12, 526, 1044, 22, 23, 27, 541, 54, 88],
          [332, 168, 79, 343, 38, 1007, 9, 232, 381, 1079],
          [38, 168, 561, 542, 69, 20, 79, 385, 332, 480]]

# Predicted items per user (ten per row)
data2 = [[154, 248, 237, 223, 83, 283, 69, 32, 480, 325],
         [332, 168, 38, 9, 385, 258, 561, 41, 79, 542],
         [322, 258, 226, 232, 1007, 343, 332, 260, 561, 381],
         [237, 154, 196, 223, 523, 277, 226, 748, 323, 28],
         [168, 332, 38, 9, 83, 561, 232, 526, 1007, 20],
         [79, 38, 480, 168, 232, 561, 653, 9, 542, 996],
         [9, 232, 332, 523, 168, 322, 7, 1028, 41, 542],
         [83, 168, 232, 322, 385, 223, 154, 941, 283, 12],
         [69, 38, 196, 480, 83, 385, 20, 343, 283, 542],
         [480, 38, 69, 83, 385, 154, 542, 941, 283, 223]]

# One row per user; each cell holds a list of item ids
test_actual = pd.DataFrame(
    {'Actual': actual, 'Predict': data2},
    index=pd.Index([1, 2, 3, 5, 6, 8, 10, 12, 15, 18], name='userId'),
)
test_actual

Your opinion and help will be much appreciated! Thank you!

Without further details (e.g., how many classes there are or how long the dataset is), apply seems to be the only viable choice:

test_actual.apply(
    lambda x: set(x['Actual']).intersection(x['Predict']),
    axis=1,
)

Output:

userId
1                        {32}
2                       {258}
3                  {258, 260}
5                        {28}
6                       {526}
8                         {9}
10                     {1028}
12                       {12}
15                  {38, 343}
18    {480, 385, 69, 38, 542}
dtype: object
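
Since the question asks for the number of shared items rather than the items themselves, the sets returned above can be turned into counts with a second step. The following is only a minimal sketch: the three-row frame `df` and the names `common` and `n_common` are made up for illustration and are not part of the original post.

```python
import pandas as pd

# Hypothetical three-user frame with the same layout as the question's table.
df = pd.DataFrame(
    {
        "Actual": [[32, 256, 5], [258, 260, 264, 11], [9, 10]],
        "Predict": [[154, 32, 480], [322, 258, 260], [237, 154]],
    },
    index=pd.Index([1, 3, 5], name="userId"),
)

# Row-wise set intersection, as in the answer above.
common = df.apply(lambda row: set(row["Actual"]) & set(row["Predict"]), axis=1)

# The size of each set is the number of items present in both columns.
n_common = common.map(len)
print(n_common)
```

Users with no shared items get an empty set and therefore a count of 0.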

IIUC, you can use NumPy's intersect1d:

test_actual.apply(lambda x: len(np.intersect1d(x['Actual'],x['Predict'])), axis = 1)

userId
1     1
2     1
3     2
5     1
6     1
8     1
10    1
12    1
15    2
18    5
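
One detail worth keeping in mind when counting this way: np.intersect1d returns the sorted, unique values common to both inputs, so an item repeated within a list is counted once. A small sketch with made-up lists (not taken from the question's data):

```python
import numpy as np

# Hypothetical lists; the item 2 is repeated in both.
actual = [1, 514, 2, 2, 13, 526]
predict = [526, 2, 9, 2, 168]

common = np.intersect1d(actual, predict)

# Duplicates are collapsed and the result is sorted.
print(common.tolist())  # [2, 526]
print(len(common))      # 2
```

This matches the set-based approach, which also ignores duplicates.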

If you are interested in the values rather than the count, use:

test_actual.apply(lambda x: np.intersect1d(x['Actual'],x['Predict']), axis = 1)

userId
1                        [32]
2                       [258]
3                  [258, 260]
5                        [28]
6                       [526]
8                         [9]
10                     [1028]
12                       [12]
15                  [38, 343]
18    [38, 69, 385, 480, 542]
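
As a possible alternative to apply with axis=1, the same counts can be computed with a plain list comprehension over the two columns, which avoids building a Series for every row. This is a sketch under assumptions, not something from the original answers: the small frame `df` and the name `counts` are hypothetical.

```python
import pandas as pd

# Hypothetical three-user frame with the same layout as the question's table.
df = pd.DataFrame(
    {
        "Actual": [[32, 256, 5], [258, 260, 264, 11], [9, 10]],
        "Predict": [[154, 32, 480], [322, 258, 260], [237, 154]],
    },
    index=pd.Index([1, 3, 5], name="userId"),
)

# Pair up the two list columns row by row and count the shared items.
counts = pd.Series(
    [len(set(a) & set(p)) for a, p in zip(df["Actual"], df["Predict"])],
    index=df.index,
    name="n_common",
)
print(counts)
```

Whether this is noticeably faster depends on the size of the data, so it is worth timing both versions on the real table.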
