Fastest way to check whether a string contains any string from a list

Question

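What is the fastest way to check whether a string contains any of the strings in a list?

The following works, but it is too slow for me when the string is large and the list is long.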

test_string = "Hello! This is a test. I love to eat apples."

fruits = ['apples', 'oranges', 'bananas'] 

for fruit in fruits:
    if fruit in test_string:
        print(fruit+" contains in the string")

Answer 1

For this, I would suggest first tokenizing the string with RegexpTokenizer to strip all special characters, and then using sets to find the intersection:

from nltk.tokenize import RegexpTokenizer
test_string = "Hello! This is a test. I love to eat apples."

tokenizer = RegexpTokenizer(r'\w+')
test_set = set(tokenizer.tokenize(test_string))
# {'Hello', 'I', 'This', 'a', 'apples', 'eat', 'is', 'love', 'test', 'to'}

Once the string is tokenized and turned into a set, take the set.intersection:

set(['apples', 'oranges', 'bananas']) & test_set
# {'apples'}
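If pulling in NLTK only for tokenizing is not an option, the same idea can be sketched with the standard library alone. This assumes that re.findall(r'\w+', ...) is an acceptable stand-in for the tokenizer in this case; the names find_listed_words and needles are made up for illustration. Like the version above, it is case-sensitive and only finds whole words.

```python
import re

def find_listed_words(text, needles):
    """Return the needles that occur as whole words in text (case-sensitive)."""
    # Collect every run of word characters into a set.
    words_in_text = set(re.findall(r'\w+', text))
    # Set intersection keeps only the needles that are present.
    return set(needles) & words_in_text

test_string = "Hello! This is a test. I love to eat apples."
fruits = ['apples', 'oranges', 'bananas']
found = find_listed_words(test_string, fruits)
print(found)
```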

Answer 2

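Yes. You can reduce your iterations like this (the set is built once, outside the generator, so it is not rebuilt for every fruit):

words = frozenset(test_string.replace('.',' ').lower().split())
print(any(fruit in words for fruit in fruits))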

Answer 3

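When using the in operator, a set is probably your best option for speed.

To build a set containing only the words, we need to:

1) remove the punctuation from the string;

2) split the string on whitespace.

For removing punctuation, this answer probably has the fastest solution (using str.maketrans and string.punctuation).

Here is an example using your test case: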

import string

test_string = "Hello! This is a test. I love to eat apples."
test_string_no_punctuation = test_string.translate(str.maketrans('', '', string.punctuation))
word_set = set(test_string_no_punctuation.split())

fruits = ['apples', 'oranges', 'bananas'] 

for fruit in fruits:
    if fruit in word_set:
        print(fruit+" contains in the string")

You may want to wrap the two steps (removing punctuation and splitting the string) in a function:

import string

def word_set(input_string):
    return set(input_string.translate(str.maketrans('', '', string.punctuation)).split())
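One thing to keep in mind with the set approach: it matches whole words only, whereas the in test on the raw string matches substrings. A small sketch of the difference, with sample values chosen only for illustration:

```python
import string

def word_set(input_string):
    # Strip ASCII punctuation, then split on whitespace.
    table = str.maketrans('', '', string.punctuation)
    return set(input_string.translate(table).split())

text = "I love to eat apples."
words = word_set(text)

substring_hit = 'apple' in text      # substring search on the raw string
whole_word_hit = 'apple' in words    # membership test on the word set
print(substring_hit, whole_word_hit)
```

If entries such as 'apple' should also match 'apples', the set lookup alone is not enough and some form of substring search or stemming is still needed.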

Answer 4

The text is usually larger than the list of words you are searching for.


for fruit in fruits:
    if fruit in test_string:
        print(fruit+" contains in the string")

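This is indeed inefficient, because you are effectively scanning the whole text once for every fruit in the list. That may not be a problem for short sentences, but on long texts the process takes much longer.

A better approach is to go through the text once and capture, along the way, every word that is in the fruit list.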
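The answer gives no code for this, so here is a minimal sketch of one way it might look. It assumes words are separated by whitespace and that stripping ASCII punctuation from both ends of each token is good enough; the function name fruits_in_text is hypothetical. It also stops early once every fruit has been seen.

```python
import string

def fruits_in_text(text, fruits):
    """Walk over the words of text once and collect the listed fruits."""
    wanted = set(fruits)              # constant-time membership checks
    found = set()
    for token in text.split():
        # Drop leading/trailing punctuation such as the '.' in 'apples.'
        word = token.strip(string.punctuation)
        if word in wanted:
            found.add(word)
            if len(found) == len(wanted):
                break                 # nothing left to look for
    return found

test_string = "Hello! This is a test. I love to eat apples."
fruits = ['apples', 'oranges', 'bananas']
result = fruits_in_text(test_string, fruits)
print(result)
```

The cost now grows with the length of the text plus the length of the list, instead of with their product.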

Answer 5

If you are only interested in whether any of the words is present:

>>> words = set(test_string.replace('.',' ').lower().split())
>>> any(fruit in words for fruit in fruits)
True

You can of course iterate over each fruit to check which fruits can be found in the fruitcake. To do that, change if fruit in test_string in your loop to if fruit in words.
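As a sketch of that variant, the snippet below lists the fruits that were found instead of only reporting True or False. It keeps the same normalisation as the answer, so only '.' is replaced and other punctuation stays attached to its word; the name found is illustrative.

```python
test_string = "Hello! This is a test. I love to eat apples."
fruits = ['apples', 'oranges', 'bananas']

# Same normalisation as in the answer: replace '.', lowercase, split on whitespace.
words = set(test_string.replace('.', ' ').lower().split())

# Keep the fruits that appear as words, preserving the order of the list.
found = [fruit for fruit in fruits if fruit in words]
print(found)
```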

Answer 6

You could do it this way:

import re

fruits = ['apples', 'oranges', 'bananas']
test_string = "Hello! This is a test. I love to eat apples."

basket = set(fruits)
words = re.compile(r'\w+')  # raw string, so that \w is not treated as a string escape

for match in words.finditer(test_string):
    fruit = match.group()
    if fruit in basket:
        print(fruit + " contains in the string")

Output

apples contains in the string
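The approaches above match whole words. If the substring behaviour of the original loop has to be kept, for example for multi-word entries, one option is to combine the list into a single escaped alternation so the text is scanned by one compiled pattern. This is only a sketch: the names pattern and found are illustrative, and whether it is actually faster for a given list should be measured.

```python
import re

fruits = ['apples', 'oranges', 'bananas']
test_string = "Hello! This is a test. I love to eat apples."

# Escape each entry so it is matched literally, then join with '|'.
# Longer entries go first so they win over entries that are their prefixes.
ordered = sorted(fruits, key=len, reverse=True)
pattern = re.compile('|'.join(re.escape(fruit) for fruit in ordered))

# finditer reports non-overlapping matches from left to right.
found = {match.group() for match in pattern.finditer(test_string)}
for fruit in found:
    print(fruit + " contains in the string")
```

Because the matches do not overlap, an entry that only ever occurs inside a longer entry (say 'apple' inside 'apples') would not be reported separately.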

6 answers

Answer 1: 8, accepted, 2019-10-04 14:22:22
Answer 2: 2, 2019-10-04 15:31:21
Answer 3: 1, 2019-10-04 14:25:31
Answer 4: 1, 2019-10-04 14:30:39
Answer 5: 0, 2019-10-04 14:40:30
Answer 6: 0, 2019-10-04 15:19:35
