Python Pandas: What is the most efficient way to compare two lists in a loop?

Question

I have a ground truth dataset 'gt' (with 100 entries) which looks like this:

    org_o                   shh  group
    ArabsGate                 1      1
    ArabsGate Company         1      1
    AS EMT                  NaN      2
    AS EMT Mobile Internet    1      2
    DigitalEffex (MH)       NaN      3
    DigitalEffex              1      3
    Aruba S.p.A.              1      4
    Aruba S.p.                1      4

and I would like to compare it to a huge dataframe 'df' which looks like this:

        match           org_o 
        as emt        AS EMT                   
        as emt        AS EMT Mobile Internet    
        digitaleffex  DigitalEffex (MH)    
        digitaleffex  DigitalEffex
        digitaleffex  Digital

As a result of the comparison, I want to know whether the same group with the same org_o values exists in my df or not. So for each group I need both the count of members of the group and the actual org_o names. For instance, whether both 'Aruba S.p.A.' and 'Aruba S.p.' are in df, and whether they are matched to the same keyword (the 'match' column) in one group.

Here is what I did, but it is not really what I am looking for.

    gt.groupby('group').count()['org_o']
    df.merge(gt, on='org_o')

Eventually I would like to count false positives/negatives. This is the expected output:

        match           org_o                 tag
        as emt        AS EMT                   TP
        as emt        AS EMT Mobile Internet   TP   
        digitaleffex  DigitalEffex (MH)        TP
        digitaleffex  DigitalEffex             TP
        digitaleffex  Digital                  FP

Can anybody help with this?

Answer 1

It looks like what you need is a simple lookup:

    import numpy as np
    df['tag'] = np.where(df['org_o'].isin(gt['org_o']), 'TP', 'FP')

Here we are adding a new column tag to df. We use NumPy's where function to check whether each org_o in df is present in gt. If it is, TP is assigned as the value of tag for that row; otherwise FP is assigned.
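For anyone who wants to run this end to end, here is a minimal sketch that rebuilds small versions of gt and df from the sample rows in the question. The shh column is left out as an assumption, since the lookup does not use it, and the real df is of course far larger.

```python
import numpy as np
import pandas as pd

# Sample ground truth taken from the question (shh column omitted).
gt = pd.DataFrame({
    "org_o": [
        "ArabsGate", "ArabsGate Company", "AS EMT", "AS EMT Mobile Internet",
        "DigitalEffex (MH)", "DigitalEffex", "Aruba S.p.A.", "Aruba S.p.",
    ],
    "group": [1, 1, 2, 2, 3, 3, 4, 4],
})

# Sample of the large frame to be checked against the ground truth.
df = pd.DataFrame({
    "match": ["as emt", "as emt", "digitaleffex", "digitaleffex", "digitaleffex"],
    "org_o": ["AS EMT", "AS EMT Mobile Internet", "DigitalEffex (MH)",
              "DigitalEffex", "Digital"],
})

# Vectorised membership test: True where df's org_o also appears in gt.
is_known = df["org_o"].isin(gt["org_o"])

# Map the boolean mask to the two labels.
df["tag"] = np.where(is_known, "TP", "FP")
print(df)
```

Only the last row ('Digital') has no counterpart in gt, so it is the only one tagged FP.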

As far as efficiency is concerned, this "lookup" is fairly efficient: when using isin, pandas performs a hash-based membership test against the values to compare (in this case gt['org_o']), much like looking them up in a set. Each lookup therefore takes constant time on average, and the total cost is roughly O(n + m), with n rows in df and m rows in gt.
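To make the lookup idea concrete, the sketch below does the same tagging with a plain Python set and compares it with isin. It only illustrates the idea of building the lookup structure once and then testing each value against it; it is not a description of how pandas is implemented internally, and the variable names are made up for this example.

```python
import pandas as pd

gt_names = pd.Series(["AS EMT", "AS EMT Mobile Internet",
                      "DigitalEffex (MH)", "DigitalEffex"])
df_names = pd.Series(["AS EMT", "DigitalEffex", "Digital"])

# Build the lookup structure once, outside the loop.
known = set(gt_names)

# One hash lookup per name instead of scanning gt_names each time.
tags_via_set = ["TP" if name in known else "FP" for name in df_names]

# The vectorised equivalent.
tags_via_isin = df_names.isin(gt_names).map({True: "TP", False: "FP"}).tolist()

print(tags_via_set == tags_via_isin)
```

In practice isin is the better choice on a large frame, since it avoids the Python-level loop altogether.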

Answer 2

Here's one way to do it.

First, assign 'FP' to the tag column for every row:

    In [4]: df['tag'] = 'FP'

Then select the rows of df whose org_o value appears in gt['org_o'], using df['org_o'].isin(gt['org_o']),

and assign 'TP' to the tag column for those rows:

    In [5]: df.loc[df['org_o'].isin(gt['org_o']), 'tag'] = 'TP'

    In [6]: df
    Out[6]:
              match                   org_o tag
    0        as emt                  AS EMT  TP
    1        as emt  AS EMT Mobile Internet  TP
    2  digitaleffex       DigitalEffex (MH)  TP
    3  digitaleffex            DigitalEffex  TP
    4  digitaleffex                 Digital  FP
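Since the question also asks for counts of false positives/negatives and for a per-group view, here is one possible sketch built on the tagged frame. It makes two assumptions that go beyond the original answer: a "false negative" is taken to mean a gt name that never appears in df, and the per-group check uses a left merge from gt. The names tag_counts, missing, merged and group_summary are made up for this example.

```python
import numpy as np
import pandas as pd

gt = pd.DataFrame({
    "org_o": [
        "ArabsGate", "ArabsGate Company", "AS EMT", "AS EMT Mobile Internet",
        "DigitalEffex (MH)", "DigitalEffex", "Aruba S.p.A.", "Aruba S.p.",
    ],
    "group": [1, 1, 2, 2, 3, 3, 4, 4],
})
df = pd.DataFrame({
    "match": ["as emt", "as emt", "digitaleffex", "digitaleffex", "digitaleffex"],
    "org_o": ["AS EMT", "AS EMT Mobile Internet", "DigitalEffex (MH)",
              "DigitalEffex", "Digital"],
})
df["tag"] = np.where(df["org_o"].isin(gt["org_o"]), "TP", "FP")

# How many rows of df received each tag.
tag_counts = df["tag"].value_counts()

# Assumed definition of a false negative: a gt name missing from df.
missing = gt.loc[~gt["org_o"].isin(df["org_o"]), "org_o"].tolist()

# Attach the match keyword (NaN if absent) to every ground-truth row.
merged = gt.merge(df[["match", "org_o"]], on="org_o", how="left")

# Per group: members expected, members found in df, distinct keywords used.
group_summary = merged.groupby("group").agg(
    expected=("org_o", "size"),
    found=("match", "count"),
    keywords=("match", "nunique"),
)
print(tag_counts)
print(missing)
print(group_summary)
```

A group is fully recovered under a single keyword when found equals expected and keywords is 1; in this sample that holds for groups 2 and 3, while groups 1 and 4 are missing from df entirely.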

I find @Shashank's answer elegant. A minor addition: if gt['org_o'] has repeated values, you can pass the array of unique values instead.

    df['tag'] = np.where(df['org_o'].isin(gt['org_o'].unique()), 'TP', 'FP')
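A quick sketch, using made-up sample names, shows that de-duplicating does not change the tags; it only shrinks the set of values that isin has to work with.

```python
import numpy as np
import pandas as pd

# Ground-truth names with deliberate repeats.
gt_names = pd.Series(["AS EMT", "AS EMT", "DigitalEffex", "DigitalEffex"])
df_names = pd.Series(["AS EMT", "Digital"])

# Membership test against the raw values, repeats included.
with_dupes = np.where(df_names.isin(gt_names), "TP", "FP")

# Same test against the de-duplicated array.
deduped = np.where(df_names.isin(gt_names.unique()), "TP", "FP")

print(list(with_dupes), list(deduped))
```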


Answer 1: 2 (accepted), 2015-04-21 19:25:35

Answer 2: 1, 2015-04-21 19:24:13
