Efficient way to check whether every string in a list appears in another list of strings
I have two large lists of strings.
One holds chemical names (around 10K chemicals):
chemicals_list = ["chemical1", "chemical2", ..., "chemical100000"]
The other holds article abstracts (around 50M abstracts):
abstracts_list = ["abstract1 is very very very long", "abstract2 is very very very VERY long", ..., "abstract50000000 is pretty long as well"]
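I need to build a frequency dictionary that maps each chemical in chemicals_list to the number of abstracts it appears in.
Right now I have two nested for loops, but this takes forever: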
frequency_dict = {}
for c in chemicals_list:
    exact_entity = f' {c} '  # make sure it's the exact entity, since it can also appear as a substring (e.g., "pen" rather than "penicillin")
    for abstract_text in abstracts_list:
        if exact_entity in abstract_text:
            if c in frequency_dict.keys():
                frequency_dict[c] += 1
            else:
                frequency_dict[c] = 1
Is there a more efficient way to do this? I can use a GPU if that helps.
I optimized your code as follows:
chemicals_list = ["chemical1", "chemical2", ..., "chemical100000"]
abstracts_list = ["abstract1 is very very very long", "abstract2 is very very very VERY long", ..., "abstract50000000 is pretty long as well"]
frequency_dict = {}
text = ' '.join(abstracts_list)  # make one big string
for c in chemicals_list:
    # you might want to consider fuzzy word matching (see the fuzzywuzzy Python lib)
    # caveat: str.count counts every occurrence in the joined text, including repeats
    # inside one abstract and substring hits, not the number of abstracts
    frequency_dict[c] = text.count(c)
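I ran a test locally and saw a big speedup for my test case. If you want performance, it is best to avoid Python-level loops. There may even be a function that removes the need for the Python loop entirely, but I did not really search for one. You could also try numpy/scipy so that precompiled C functions do the work. Once you have tried all of that, you can start thinking about multithreading.
Also consider posting this on https://codereview.stackexchange.com/, which is better suited to requests for review and improvement.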
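As one possible way to drop the loop over chemicals entirely, here is a minimal sketch that splits each abstract into a set of tokens and intersects it with the set of chemical names. It assumes every chemical name is a single whitespace-delimited token (multi-word names would need a different approach), and the function name and sample data are hypothetical, not from the question. Unlike the space-padded check, it also matches a name at the very start or end of an abstract.

```python
from collections import Counter

def count_abstracts_per_chemical(chemicals, abstracts):
    """Map each chemical to the number of abstracts that contain it as a whole token."""
    wanted = set(chemicals)  # constant-time membership tests
    counts = Counter()
    for abstract in abstracts:
        # set() so a chemical repeated inside one abstract is counted only once
        tokens = set(abstract.split())
        counts.update(tokens & wanted)
    return counts

# Hypothetical sample data for illustration only
sample_chemicals = ["aspirin", "pen", "caffeine"]
sample_abstracts = [
    "aspirin reduces fever and aspirin thins blood",
    "penicillin is not the same as a pen",
    "caffeine and aspirin were compared",
]
result = count_abstracts_per_chemical(sample_chemicals, sample_abstracts)
print(result["aspirin"], result["pen"], result["caffeine"])  # 2 1 1
```

The work is then proportional to the total number of tokens across all abstracts rather than to the number of chemicals times the number of abstracts.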
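You can use collections.Counter and a generator expression.
class collections.Counter([iterable-or-mapping])
A Counter is a dict subclass for counting hashable objects. It is a collection where elements are stored as dictionary keys and their counts are stored as dictionary values. Counts are allowed to be any integer value, including zero or negative counts. The Counter class is similar to bags or multisets in other languages.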
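For readers who have not used it, a short illustration of the Counter behavior described above may help. The chemical names are made-up sample values.

```python
from collections import Counter

counts = Counter(["aspirin", "caffeine", "aspirin"])  # count the items of an iterable
counts.update(["ibuprofen"])                            # add further observations

print(counts["aspirin"])       # 2
print(counts["morphine"])      # 0: a missing key returns 0 instead of raising KeyError
print(counts.most_common(1))   # [('aspirin', 2)]
```

Because missing keys read as 0, the explicit "is the key already in the dict" branch from the question is not needed.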
from collections import Counter

# yield the chemical once for every abstract that contains it as a separate word,
# so the Counter maps chemical -> number of matching abstracts
matched_chemicals = (c
                     for c in chemicals_list
                     for abstract_text in abstracts_list
                     if f' {c} ' in abstract_text)
freq_counted = Counter(matched_chemicals)
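One caveat with the space-padded check: a name at the very start or end of an abstract has no space on one side and is missed. A minimal sketch of one workaround, assuming it is acceptable to pad each abstract with a space on both sides first (the helper name and sample data are hypothetical):

```python
from collections import Counter

# Hypothetical sample data for illustration only
sample_chemicals = ["aspirin", "pen"]
sample_abstracts = ["aspirin helps", "a pen and penicillin", "we tested aspirin"]

def count_padded(chemicals, abstracts):
    # pad once so that words at the edges behave like inner words
    padded = [f" {a} " for a in abstracts]
    return Counter(c for c in chemicals for a in padded if f" {c} " in a)

unpadded_counts = Counter(
    c for c in sample_chemicals for a in sample_abstracts if f" {c} " in a
)
padded_counts = count_padded(sample_chemicals, sample_abstracts)
print(unpadded_counts["aspirin"], padded_counts["aspirin"])  # 0 2
```

Note that this still performs one substring check per chemical per abstract, so it addresses correctness at the boundaries rather than speed, and the padded list costs extra memory.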