How do I regex match a list reference in Python?
I have a list of strings from which I need to remove all elements that match a substring from another list. I am trying to do this with lists, nested loops, and regex.
The output from the following snippet is ["We don't", "need no", "education"] instead of the desired ["education"]. I'm new to Python and this is my first experiment with regex, and I'm stuck on the syntax.
import re
testfile = ["We don't", "need no", "education"]
stopwords = ["We", "no"]
dellist = []
for x in range(len(testfile)):
    for y in range(len(stopwords)):
        if re.match(r'\b' + stopwords[y] + '\b', testfile[x], re.I):
            dellist.append(testfile[x])
for x in range(len(dellist)):
    if dellist[x] in testfile:
        del testfile[testfile.index(dellist[x])]
print testfile
The line
    if re.match(r'\b' + stopwords[y] + '\b', testfile[x], re.I):
returns "None" for all iterations through the loop, so I'm guessing this is where my problem lies...
It's because re.match only tests for a match at the start of the string. Try re.search instead. Also, you're missing the r prefix on your second '\b', so Python reads it as a backspace character rather than a regex word boundary:
if re.search(r'\b' + stopwords[y] + r'\b', testfile[x], re.I):
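To see both fixes side by side, here is a small sketch using one of the strings from the question. The variable names (text, match_result and so on) are only for illustration and are not part of the original code.

```python
import re

text = "need no"

# Without the r prefix, '\b' is the backspace character (\x08),
# while r'\b' is the two-character regex token for a word boundary.
assert '\b' == '\x08'
assert r'\b' == '\\b'

# re.match only tries the pattern at the very start of the string.
match_result = re.match(r'\bno\b', text, re.I)    # None: "no" is not at index 0

# re.search scans the whole string for the first match.
search_result = re.search(r'\bno\b', text, re.I)  # finds "no" at index 5

# With the unprefixed '\b', even re.search fails, because the pattern
# then demands a literal backspace right after "no".
broken_result = re.search(r'\bno' + '\b', text, re.I)

print(match_result, search_result.group(0), broken_result)
```

Both changes are needed: switching to re.search alone would still leave the literal backspace in the pattern.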
Also, you could just use a list comprehension to build up dellist (you could probably use a list comprehension to build up the new testfile entirely, but it escapes me at the moment):
dellist = [w for w in testfile for test in stopwords if re.search(test,w,re.I)]
Another thought: since you're using the re module anyway, why don't you combine your stopwords into \b(We|no)\b, so that you can test testfile against just the one regex?
regex = r'\b(' + '|'.join(stopwords) + r')\b' # r'\b(We|no)\b'
Now you just have to look for the elements that don't match that regex:
newtestfile = [w for w in testfile if re.search(regex,w,re.I) is None]
# newtestfile is ['education']
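Put together as a runnable script, the combined-regex approach might look like the sketch below. Two things here are additions rather than part of the answer: the pattern is precompiled with re.compile, and each stopword is passed through re.escape, on the assumption that a stopword could contain characters that have a special meaning in a regex.

```python
import re

testfile = ["We don't", "need no", "education"]
stopwords = ["We", "no"]

# re.escape is an added safeguard (an assumption about the input): it stops
# characters such as "." or "+" in a stopword from being read as regex syntax.
pattern = re.compile(
    r'\b(?:' + '|'.join(re.escape(word) for word in stopwords) + r')\b',
    re.IGNORECASE,
)

# Keep only the elements in which no stopword appears as a whole word.
newtestfile = [line for line in testfile if pattern.search(line) is None]
print(newtestfile)
```

Building a new list also avoids deleting from testfile while working through it, which is what the dellist bookkeeping in the question was for.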
Why not just use the basic in operator? It should be considerably faster than the regex too.
for line in testfile:
    for word in stopwords:
        if word in line:
            pass  # do stuff with the matching line
Or, how about a nifty list comprehension ;)
[line for line in testfile if not [word for word in stopwords if word in line]]
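The same filter can also be written with the built-in any(), which avoids building a throwaway inner list. This is a sketch of an equivalent form, not the answer's own code. The string "nothing else" is a hypothetical extra input, added only to show that in tests for substrings rather than whole words.

```python
testfile = ["We don't", "need no", "education"]
stopwords = ["We", "no"]

# Keep a line only if none of the stopwords occurs in it as a substring.
kept = [line for line in testfile
        if not any(word in line for word in stopwords)]
print(kept)

# Hypothetical input: "no" is found inside "nothing", so this line
# would be filtered out even though "no" is not a separate word in it.
substring_hit = any(word in "nothing else" for word in stopwords)
print(substring_hit)
```

Note that in is also case-sensitive, unlike the re.I searches used in the regex versions.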
It is prettier with in instead of regexes, but the examples above would break if a stopword were contained within another word. This example only matches on complete words:
testfile = ["We don't", "need no", "education"]
stopwords = ["We", "no"]
output = []
for sentence in testfile:
    bad = False
    for word in sentence.split(' '):
        if word in stopwords:
            bad = True
            break
    if not bad:
        output.append(sentence)
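The whole-word loop above can be condensed with a set, as in the sketch below. Lowercasing both sides is an added assumption, meant to mirror the re.I flag used in the regex versions; the loop above compares case-sensitively. The name stopset is made up for this example.

```python
testfile = ["We don't", "need no", "education"]
stopwords = ["We", "no"]

# Assumption: matching should ignore case, as re.I does in the regex answers.
stopset = {word.lower() for word in stopwords}

# A sentence is kept when none of its whitespace-separated words is a stopword.
output = [sentence for sentence in testfile
          if stopset.isdisjoint(sentence.lower().split())]
print(output)
```

Like the loop, this only splits on whitespace, so a word followed by punctuation (for example "no,") would not be recognised as a stopword; the \b-based regex handles that case.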