How do I use regular expressions in Python to match against a list?

Question

I have a list of strings from which I need to remove all elements that match a substring from another list. I am trying to do this with lists, nested loops, and regex.

The following snippet produces ["We don't", "need no", "education"] instead of the desired ["education"]. I'm new to Python and this is my first experiment with regex, and I'm stuck on the syntax.

import re

testfile = ["We don't", "need no", "education"]
stopwords = ["We", "no"]
dellist = []

for x in range(len(testfile)):
    for y in range(len(stopwords)):
        if re.match(r'\b' + stopwords[y] + '\b', testfile[x], re.I):
            dellist.append(testfile[x])

for x in range(len(dellist)):
    if dellist[x] in testfile:
        del testfile[testfile.index(dellist[x])]

print(testfile)

The line

if re.match(r'\b' + stopwords[y] + '\b', testfile[x], re.I):

returns "None" for all iterations through the loop, so I'm guessing this is where my problem lies...

Answer 1

It's because re.match only tests for a match at the start of the string.

Try re.search instead. Also, you're missing the r prefix on your second '\b':

if re.search(r'\b' + stopwords[y] + r'\b', testfile[x], re.I):
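
To see both points in isolation, here is a minimal sketch using only the standard re module. The sample string comes from the question; the variable names are just for illustration.

```python
import re

text = "need no"

# re.match only tries the pattern at position 0, so "no" at the end is missed.
match_result = re.match(r'\bno\b', text, re.I)

# re.search scans the whole string and finds the word.
search_result = re.search(r'\bno\b', text, re.I)

# Without the r prefix, '\b' is a backspace character (\x08), not a word boundary,
# so the pattern demands a literal backspace after "no" and never matches here.
backspace_result = re.search(r'\bno' + '\b', text, re.I)

print(match_result)           # None
print(search_result.group())  # no
print(backspace_result)       # None
```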

Also, you could use a list comprehension to build up dellist (you could probably use a list comprehension to build the new testfile entirely, but it escapes me at the moment):

dellist = [w for w in testfile for test in stopwords if re.search(r'\b' + test + r'\b', w, re.I)]

Another thought: since you're using the re module anyway, why not combine your stopwords into \b(We|no)\b so that you can test testfile against a single regex?

regex = r'\b(' + '|'.join(stopwords) + r')\b'  # r'\b(We|no)\b'

Now you just have to look for words that don't match that regex:

newtestfile = [w for w in testfile if re.search(regex,w,re.I) is None]
# newtestfile is ['education']
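
If the stopwords might contain regex metacharacters (for example "." or "+"), one possible variation is to escape each word before joining and to compile the pattern once. This is a sketch, not part of the answer above: re.escape, re.compile and the non-capturing group (?:...) are my additions, and the name pattern is arbitrary.

```python
import re

testfile = ["We don't", "need no", "education"]
stopwords = ["We", "no"]

# Escape each stopword so it is treated as literal text (an added precaution),
# then join them into one alternation wrapped in word boundaries.
pattern = re.compile(
    r'\b(?:' + '|'.join(re.escape(word) for word in stopwords) + r')\b',
    re.I,
)

# Keep only the lines in which no stopword is found.
newtestfile = [line for line in testfile if pattern.search(line) is None]
print(newtestfile)  # ['education']
```

Note that \b only behaves as expected when the stopwords begin and end with word characters.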

Answer 2

Why not just use the basic in operator? It should be considerably faster than the regex too.

for line in testfile:
    for word in stopwords:
        if word in line:
            pass  # do stuff with the matching line here

Or, how about a nifty list comprehension ;)

[line for line in testfile if not [word for word in stopwords if word in line]]
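
The same filter can also be written with the built-in any(), which stops at the first stopword found instead of building an inner list. A small sketch on the question's data (the name kept is just for illustration):

```python
testfile = ["We don't", "need no", "education"]
stopwords = ["We", "no"]

# Keep a line only if no stopword occurs in it as a (case-sensitive) substring.
kept = [line for line in testfile if not any(word in line for word in stopwords)]
print(kept)  # ['education']
```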

Answer 3

It's prettier with in instead of regexes, but the examples above would break if the stopword was contained within another word. This example only matches on complete words:

testfile = ["We don't", "need no", "education"]
stopwords = ["We", "no"]
output = []

for sentence in testfile:
    bad = False

    for word in sentence.split(' '):
        if word in stopwords:
            bad = True
            break

    if not bad:
        output.append(sentence)
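
To illustrate the difference between substring and whole-word matching, here is a small sketch. The helper names and the sample sentence are hypothetical, chosen only to show a case where the two approaches disagree.

```python
stopwords = {"We", "no"}  # a set gives fast membership tests

def has_stopword_substring(sentence):
    # True if any stopword appears anywhere in the sentence, even inside a word.
    return any(word in sentence for word in stopwords)

def has_stopword_word(sentence):
    # True only if a whole whitespace-separated token equals a stopword.
    return any(token in stopwords for token in sentence.split())

sample = "Nobody knows"  # hypothetical input: "knows" contains "no"
print(has_stopword_substring(sample))  # True
print(has_stopword_word(sample))       # False
```

Both helpers are case-sensitive, and the split-based one treats a token such as "no," (with trailing punctuation) as different from "no", so the regex with \b may still be the better fit for free-form text.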
