Check if string (or split string) contains any words from list

I have a Twitter bot that needs to ignore tweets that contain certain blacklisted words.

The filter below works, but only when a word in the tweet exactly matches an entry in the list of blacklisted words:

timeline = filter(lambda status: not any(word in status.text.split() for word in wordBlacklist), timeline)

I want to make sure that tweets can't bypass this by putting symbols or extra characters around a word, for example bypassing the blacklisted word "face" by appending "book" to it to get "facebook".

How do I do this in a way that fits within my filter's lambda?

You can use the re module here:

import re
timeline = filter(lambda status: not any(re.findall(r"[a-zA-Z0-9]*"+word+r"[a-zA-Z0-9]*",status.text) for word in wordBlacklist), timeline)

If a word can contain regex metacharacters, wrap it in re.escape() before building the pattern.

If you also expect symbols around the word, try:

timeline = filter(lambda status: not any(re.findall(r"\S*"+word+r"\S*",status.text) for word in wordBlacklist), timeline)
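
As a sketch of the escaped variant mentioned above, the snippet below wraps each blacklisted word in re.escape() and uses re.search(), which stops at the first match. The Status namedtuple, the is_blacklisted helper, and the sample tweets are hypothetical stand-ins for the bot's real status objects.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for the status objects returned by the Twitter client.
Status = namedtuple("Status", ["text"])

wordBlacklist = ["face", "c++"]  # "c++" contains regex metacharacters

timeline = [
    Status("I love facebook"),
    Status("learning c++ today"),
    Status("nothing to see here"),
]

def is_blacklisted(text):
    # re.escape() makes each word match literally, so "+" is not read as a quantifier.
    return any(re.search(re.escape(word), text) for word in wordBlacklist)

timeline = list(filter(lambda status: not is_blacklisted(status.text), timeline))
print([status.text for status in timeline])
```

Without re.escape(), the entry "c++" would not compile as a pattern at all, so escaping is worth doing whenever the blacklist is not under your full control.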

You can construct a single regular expression from the blacklist:

from itertools import filterfalse  # named ifilterfalse in Python 2
import re

wordBlacklist = ['face', 'hello']

# One alternation pattern, with every word escaped so it matches literally
r = re.compile('|'.join(map(re.escape, wordBlacklist)))

...
timeline = list(filterfalse(lambda status: r.search(status.text), timeline))
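
To see the compiled pattern at work, here is a small self-contained run. The Status namedtuple and the sample tweets are made up for illustration, and the re.IGNORECASE flag is an addition that the answer above does not use, on the assumption that "Facebook" should be caught as well as "facebook".

```python
import re
from collections import namedtuple
from itertools import filterfalse

# Hypothetical stand-in for the bot's status objects.
Status = namedtuple("Status", ["text"])

word_blacklist = ["face", "hello"]

# Assumption: matching should ignore case, hence re.IGNORECASE.
pattern = re.compile("|".join(map(re.escape, word_blacklist)), re.IGNORECASE)

sample_timeline = [
    Status("Check out Facebook!"),
    Status("HELLO world"),
    Status("good morning"),
]

# filterfalse keeps the items for which the predicate is falsy (no match found).
kept = list(filterfalse(lambda status: pattern.search(status.text), sample_timeline))
print([status.text for status in kept])
```

One caveat with this approach: an empty blacklist joins to an empty pattern, which matches every string, so that case is worth guarding against.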

Instead of filter, you can use a list comprehension, which expresses the same idea with slightly different syntax, and then use regular expressions for the matching, since your example goes beyond what plain string operations handle comfortably:

import re
blacklist = re.compile('face|friend|advertisement')
# status is assumed here to be the tweet text as a plain string
timeline = [word for word in status.split() if not blacklist.search(word)]
# filter version of this command (returns an iterator in Python 3):
timeline = filter(lambda word: not blacklist.search(word), status.split())

Now timeline holds only the words that contain no match for your blacklist: "facebook" is dropped because it contains "face", "friendly" is dropped because it contains "friend", and so on. However, you will need to get fancier for things like "f*acebook" or other tricks, which currently bypass the filter. Try out regex and get comfortable with them, and you can build quite sophisticated filters. Here is a good practice site for regex.
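
One possible way to handle obfuscations like "f*acebook" is to normalize the text before matching. The sketch below lowercases the text, removes every character that is not an ASCII letter or digit, and then searches the result. This is only one reading of what "fancier" could mean, not something the answer above prescribes, and the helper names normalize and is_blocked are hypothetical.

```python
import re

blacklist = re.compile("face|friend|advertisement")

def normalize(text):
    # Lowercase and drop everything except ASCII letters and digits,
    # so "F*ace-book" becomes "facebook".
    return re.sub(r"[^a-z0-9]", "", text.lower())

def is_blocked(text):
    # Search the normalized text, so inserted symbols no longer hide a word.
    return blacklist.search(normalize(text)) is not None

for sample in ["f*acebook", "my fr.iend", "good morning"]:
    print(sample, is_blocked(sample))
```

Because normalization also removes spaces, this can flag text where a blacklisted word only appears across a word boundary: "of ace" normalizes to "oface", which contains "face". Whether that trade-off is acceptable depends on how strict the bot should be.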
