Python Pandas: What is the most efficient way to compare two lists in a loop?

I have a ground truth dataset 'gt' (with 100 entries) which looks like this:

    org_o                  shh group
    ArabsGate               1   1
    ArabsGate Company       1   1
    AS EMT                 NaN  2
    AS EMT Mobile Internet  1   2
    DigitalEffex (MH)      NaN  3
    DigitalEffex            1   3
    Aruba S.p.A.            1   4
    Aruba S.p.              1   4

and I would like to compare it to a huge dataframe 'df' which looks like this:

        match           org_o 
        as emt        AS EMT                   
        as emt        AS EMT Mobile Internet    
        digitaleffex  DigitalEffex (MH)    
        digitaleffex  DigitalEffex
        digitaleffex  Digital

As a result of the comparison, I want to know whether the same group with the same org_o exists in my df or not. So for each group I need both the count of the group's members and the actual org_o names. For instance, I want to know whether both 'Aruba S.p.A.' and 'Aruba S.p.' are present in df, and whether they are matched to the same keyword (the 'match' column) within one group.

Here is what I did, but it is not really what I am looking for.

gt.groupby('group').count()['org_o']
df.merge(gt, on='org_o')

Eventually I would like to count false positives and false negatives. This is the expected output:

        match           org_o                 tag
        as emt        AS EMT                   TP
        as emt        AS EMT Mobile Internet   TP   
        digitaleffex  DigitalEffex (MH)        TP
        digitaleffex  DigitalEffex             TP
        digitaleffex  Digital                  FP

Can anybody help with it?

Looks like what you need is a simple lookup:

import numpy as np
df['tag'] = np.where(df['org_o'].isin(gt['org_o']), 'TP', 'FP')

Here we are adding a new column, tag, to df. We use numpy's where function to check whether each org_o in df is present in gt. If it is, TP is assigned as the value of tag for that row; otherwise FP is assigned.
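To make the lookup concrete, here is a minimal, self-contained sketch that rebuilds the two sample frames from the question and applies the same line. The values are copied from the tables above; the intermediate name in_gt is only an illustrative helper, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Ground truth, copied from the sample in the question
gt = pd.DataFrame({
    "org_o": ["ArabsGate", "ArabsGate Company", "AS EMT", "AS EMT Mobile Internet",
              "DigitalEffex (MH)", "DigitalEffex", "Aruba S.p.A.", "Aruba S.p."],
    "shh": [1, 1, np.nan, 1, np.nan, 1, 1, 1],
    "group": [1, 1, 2, 2, 3, 3, 4, 4],
})

# The frame to be checked, copied from the sample in the question
df = pd.DataFrame({
    "match": ["as emt", "as emt", "digitaleffex", "digitaleffex", "digitaleffex"],
    "org_o": ["AS EMT", "AS EMT Mobile Internet", "DigitalEffex (MH)",
              "DigitalEffex", "Digital"],
})

# Boolean mask: True where the org_o of a df row also occurs in gt
in_gt = df["org_o"].isin(gt["org_o"])

# Map the mask to labels: True -> 'TP', False -> 'FP'
df["tag"] = np.where(in_gt, "TP", "FP")
print(df)
```

Note that the comparison is an exact string match, so 'Digital' is tagged FP even though it is a prefix of a ground-truth name.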

As far as efficiency is concerned, this lookup is fairly efficient: for a column of strings like this one, isin builds a hash table from the values to compare against (in this case gt['org_o']), so each membership test takes constant time on average. The total cost is therefore roughly O(n + m), where n is the number of rows in df and m is the number of values in gt['org_o'], rather than the O(n * m) of a nested loop.
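The difference between a per-row scan and a hashed lookup can be sketched in plain Python. The three functions below are hypothetical helpers written for this illustration (they are not from the answer); they should produce identical tags and differ only in how much work each membership test costs.

```python
import numpy as np
import pandas as pd

def tag_with_list_scan(df, gt):
    """Slow baseline: every row scans the whole ground-truth list, O(n * m)."""
    known = gt["org_o"].tolist()
    return ["TP" if name in known else "FP" for name in df["org_o"]]

def tag_with_set(df, gt):
    """Hash the ground truth once, then do average constant-time lookups."""
    known = set(gt["org_o"])
    return ["TP" if name in known else "FP" for name in df["org_o"]]

def tag_with_isin(df, gt):
    """Vectorised version from the answer above."""
    return np.where(df["org_o"].isin(gt["org_o"]), "TP", "FP").tolist()

# Small sample taken from the question's data
gt = pd.DataFrame({"org_o": ["AS EMT", "AS EMT Mobile Internet",
                             "DigitalEffex (MH)", "DigitalEffex"]})
df = pd.DataFrame({"org_o": ["AS EMT", "DigitalEffex", "Digital"]})

loop_tags = tag_with_list_scan(df, gt)
set_tags = tag_with_set(df, gt)
isin_tags = tag_with_isin(df, gt)
print(loop_tags, set_tags, isin_tags)
```

On a frame of this size the three are indistinguishable; the gap is only expected to matter once df is large, which is the situation described in the question.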

Here's one way to do it.

Initially assign 'FP' to the whole tag column:

In [4]: df['tag'] = 'FP'

Select the rows of df whose org_o value appears in gt['org_o'], using df['org_o'].isin(gt['org_o']),

and assign 'TP' to the tag column for those rows:

In [5]: df.loc[df['org_o'].isin(gt['org_o']), 'tag'] = 'TP'

In [6]: df
Out[6]:
          match                   org_o tag
0        as emt                  AS EMT  TP
1        as emt  AS EMT Mobile Internet  TP
2  digitaleffex       DigitalEffex (MH)  TP
3  digitaleffex            DigitalEffex  TP
4  digitaleffex                 Digital  FP
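Once the tag column exists, the counts asked for in the question can be derived from it. The sketch below is one possible reading, under the assumption that a false negative means a ground-truth org_o that never appears in df; the names tag_counts, missing, fn_counts and fn_names are illustrative only.

```python
import pandas as pd

# Sample data copied from the question
gt = pd.DataFrame({
    "org_o": ["ArabsGate", "ArabsGate Company", "AS EMT", "AS EMT Mobile Internet",
              "DigitalEffex (MH)", "DigitalEffex", "Aruba S.p.A.", "Aruba S.p."],
    "group": [1, 1, 2, 2, 3, 3, 4, 4],
})
df = pd.DataFrame({
    "match": ["as emt", "as emt", "digitaleffex", "digitaleffex", "digitaleffex"],
    "org_o": ["AS EMT", "AS EMT Mobile Internet", "DigitalEffex (MH)",
              "DigitalEffex", "Digital"],
})

# Tag exactly as in the answer above
df["tag"] = "FP"
df.loc[df["org_o"].isin(gt["org_o"]), "tag"] = "TP"

# Number of true and false positives in df
tag_counts = df["tag"].value_counts()

# Assumption: a false negative is a gt entry that is absent from df
missing = gt.loc[~gt["org_o"].isin(df["org_o"])]
fn_counts = missing.groupby("group")["org_o"].count()     # how many per group
fn_names = missing.groupby("group")["org_o"].apply(list)  # which names per group

print(tag_counts)
print(fn_counts)
print(fn_names)
```

With the sample data, groups 1 and 4 are entirely missing from df, so each contributes two false negatives under this definition.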

I find @Shashank's answer elegant. A minor addition: if gt['org_o'] has repeated values, you can pass the unique array instead.

df['tag'] = np.where(df['org_o'].isin(gt['org_o'].unique()), 'TP', 'FP')
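As a small check of how repeated values behave, the sketch below uses a made-up gt with one duplicated name. It suggests that isin returns the same tags with or without unique(), whereas a merge (the approach tried in the question) duplicates rows unless the ground truth is de-duplicated first. The merge variant with indicator=True is an alternative shown for comparison, not something proposed in the answers.

```python
import numpy as np
import pandas as pd

# Hypothetical ground truth in which 'AS EMT' appears twice
gt = pd.DataFrame({"org_o": ["AS EMT", "AS EMT", "DigitalEffex"]})
df = pd.DataFrame({"org_o": ["AS EMT", "DigitalEffex", "Digital"]})

# isin: duplicates in gt do not affect the resulting tags
tags_raw = np.where(df["org_o"].isin(gt["org_o"]), "TP", "FP")
tags_unique = np.where(df["org_o"].isin(gt["org_o"].unique()), "TP", "FP")

# merge: the duplicated gt row makes the 'AS EMT' row of df appear twice
merged_raw = df.merge(gt, on="org_o", how="left")

# De-duplicating first keeps one output row per df row;
# indicator=True adds a '_merge' column saying where each row was found
merged_dedup = df.merge(gt.drop_duplicates("org_o"), on="org_o",
                        how="left", indicator=True)
merged_dedup["tag"] = np.where(merged_dedup["_merge"] == "both", "TP", "FP")

print(list(tags_raw), list(tags_unique))
print(len(merged_raw), len(merged_dedup))
```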
