Python Pandas: What is the most efficient way to compare two lists in a loop?
I have a ground truth dataset 'gt' (with 100 entries) which looks like this:
org_o                   shh  group
ArabsGate               1    1
ArabsGate Company       1    1
AS EMT                  NaN  2
AS EMT Mobile Internet  1    2
DigitalEffex (MH)       NaN  3
DigitalEffex            1    3
Aruba S.p.A.            1    4
Aruba S.p.              1    4
and I would like to compare it to a huge dataframe 'df' which looks like this:
match         org_o
as emt        AS EMT
as emt        AS EMT Mobile Internet
digitaleffex  DigitalEffex (MH)
digitaleffex  DigitalEffex
digitaleffex  Digital
As a result of the comparison, I want to know whether the same group with the same org_o values exists in my df. So for each group I need both the count of members of the group and the actual org_o names. For instance, I want to see whether both 'Aruba S.p.A.' and 'Aruba S.p.' appear in df, and whether they are matched to the same keyword (the 'match' column) in one group.
Here is what I did, but it is not really what I am looking for.
gt.groupby('group').count()['org_o']
df.merge(gt, on = 'org_o')
Eventually I would like to count false positives/negatives. This is the expected output:
match         org_o                   tag
as emt        AS EMT                  TP
as emt        AS EMT Mobile Internet  TP
digitaleffex  DigitalEffex (MH)       TP
digitaleffex  DigitalEffex            TP
digitaleffex  Digital                 FP
Can anybody help with it?
Looks like what you need is a simple lookup:
import numpy as np
df['tag'] = np.where(df['org_o'].isin(gt['org_o']), 'TP', 'FP')
Here we add a new column, tag, to df. We use numpy's where function to check whether each org_o in df is present in gt. If it is, TP is assigned as the tag for that row; otherwise FP is assigned.
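To try this on the sample data from the question, a minimal self-contained sketch might look like the following. The two frames are retyped by hand from the tables in the question (the shh column is left out because the lookup does not use it), so treat the literal values as an assumption.

```python
import numpy as np
import pandas as pd

# Ground truth, retyped from the question; only org_o matters for the lookup.
gt = pd.DataFrame({
    "org_o": [
        "ArabsGate", "ArabsGate Company", "AS EMT", "AS EMT Mobile Internet",
        "DigitalEffex (MH)", "DigitalEffex", "Aruba S.p.A.", "Aruba S.p.",
    ],
    "group": [1, 1, 2, 2, 3, 3, 4, 4],
})

# The frame to be checked against the ground truth.
df = pd.DataFrame({
    "match": ["as emt", "as emt", "digitaleffex", "digitaleffex", "digitaleffex"],
    "org_o": ["AS EMT", "AS EMT Mobile Internet", "DigitalEffex (MH)",
              "DigitalEffex", "Digital"],
})

# isin gives a boolean mask; np.where maps True to 'TP' and False to 'FP'.
df["tag"] = np.where(df["org_o"].isin(gt["org_o"]), "TP", "FP")
print(df)
```

Only the last row ('Digital') has no exact match in gt, so it is the only one tagged FP.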
As far as efficiency is concerned, this lookup is fairly efficient, because isin does not compare every pair of values. In the typical case pandas builds a hash table from the values to compare against (in this case gt['org_o']), so each lookup takes constant time on average. For n rows in df and m values in gt, the total cost is therefore roughly O(n + m), rather than O(n * m) for a nested loop.
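As a rough illustration of the idea (not of pandas internals), the same membership test can be written with a plain Python set, which is also hash-based. The series names below are made up for the example.

```python
import pandas as pd

# Hypothetical small inputs standing in for gt['org_o'] and df['org_o'].
gt_names = pd.Series(["AS EMT", "AS EMT Mobile Internet", "DigitalEffex"])
df_names = pd.Series(["AS EMT", "Digital", "DigitalEffex"])

# Vectorized membership test.
mask_isin = df_names.isin(gt_names)

# Pure-Python equivalent: build the set once, then test each value against it.
known = set(gt_names)
mask_set = df_names.map(lambda name: name in known)

print(mask_isin.tolist())
print(mask_set.tolist())
```

Both masks agree; the vectorized version simply avoids the per-row Python function call.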
Here's one way to do it.
Initially assign 'FP' to the whole tag column:
In [4]: df['tag'] = 'FP'
Then select the rows of df whose org_o value appears in gt['org_o'], using df['org_o'].isin(gt['org_o']), and set the tag column to TP for those rows:
In [5]: df.loc[df['org_o'].isin(gt['org_o']), 'tag'] = 'TP'
In [6]: df
Out[6]:
   match         org_o                   tag
0  as emt        AS EMT                  TP
1  as emt        AS EMT Mobile Internet  TP
2  digitaleffex  DigitalEffex (MH)       TP
3  digitaleffex  DigitalEffex            TP
4  digitaleffex  Digital                 FP
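Since the question ultimately asks for counts, one possible follow-up is sketched below. Two things here are assumptions rather than part of the thread: the frames are retyped from the question's tables, and a false negative is taken to mean a gt name that never appears in df.

```python
import pandas as pd

# Sample data retyped from the question (assumption: exact string matching).
gt = pd.DataFrame({"org_o": [
    "ArabsGate", "ArabsGate Company", "AS EMT", "AS EMT Mobile Internet",
    "DigitalEffex (MH)", "DigitalEffex", "Aruba S.p.A.", "Aruba S.p.",
]})
df = pd.DataFrame({
    "match": ["as emt", "as emt", "digitaleffex", "digitaleffex", "digitaleffex"],
    "org_o": ["AS EMT", "AS EMT Mobile Internet", "DigitalEffex (MH)",
              "DigitalEffex", "Digital"],
})

# Tag rows exactly as in the answer above.
df["tag"] = "FP"
df.loc[df["org_o"].isin(gt["org_o"]), "tag"] = "TP"

# Number of TP and FP rows in df.
tag_counts = df["tag"].value_counts()

# Assumed definition of a false negative: a gt name missing from df.
false_negatives = gt.loc[~gt["org_o"].isin(df["org_o"]), "org_o"]

print(tag_counts)
print(false_negatives.tolist())
```

On this sample the ArabsGate and Aruba names are the ones reported as missing, because df contains no rows for them.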
I find @Shashank's answer elegant. A minor addition: if gt['org_o'] has repeated values, you can pass the unique array instead.
df['tag'] = np.where(df['org_o'].isin(gt['org_o'].unique()), 'TP', 'FP')
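A quick check on made-up data with a duplicated name suggests the tags come out the same either way, so unique() mainly shrinks the input handed to isin. The tiny frames below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical ground truth with a repeated name.
gt = pd.DataFrame({"org_o": ["AS EMT", "AS EMT", "DigitalEffex"]})
df = pd.DataFrame({"org_o": ["AS EMT", "Digital"]})

# Lookup against the raw column and against its unique values.
tag_plain = np.where(df["org_o"].isin(gt["org_o"]), "TP", "FP")
tag_unique = np.where(df["org_o"].isin(gt["org_o"].unique()), "TP", "FP")

print(tag_plain.tolist(), tag_unique.tolist())
```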