How do I regex match a list reference in Python?

I have a list of strings from which I need to remove all elements that match a substring from another list. I am trying to do this with lists, nested loops, and regex.

The following snippet outputs ["We don't", "need no", "education"] instead of the desired ["education"]. I'm new to Python and this is my first experiment with regex, and I'm stuck on the syntax.

import re

testfile = ["We don't", "need no", "education"]
stopwords = ["We", "no"]
dellist = []

for x in range(len(testfile)):
    for y in range(len(stopwords)):
        if re.match(r'\b' + stopwords[y] + '\b', testfile[x], re.I):
            dellist.append(testfile[x])

for x in range(len(dellist)):
    if dellist[x] in testfile:
        del testfile[testfile.index(dellist[x])]

print testfile

The line

if re.match(r'\b' + stopwords[y] + '\b', testfile[x], re.I):

returns "None" for all iterations through the loop, so I'm guessing this is where my problem lies...

It's because re.match only tests for a match at the start of the string.

Try re.search instead. Also, you're missing the r prefix on your second '\b', so Python reads it as a backspace character instead of a word boundary:

if re.search(r'\b' + stopwords[y] + r'\b', testfile[x], re.I):
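To see the two problems in isolation, here is a minimal sketch (Python 3 syntax; the variable names are only for illustration and are not from the question). It checks that re.match only succeeds at position 0 while re.search scans the whole string, and that '\b' without the r prefix is a single backspace character.

```python
import re

text = "need no"

# re.match only tries at the start of the string, so "no" is not found.
match_result = re.match(r'\bno\b', text, re.I)

# re.search scans the whole string and finds "no" near the end.
search_result = re.search(r'\bno\b', text, re.I)

# Without the r prefix, '\b' is the backspace character (0x08);
# with it, the regex engine receives a backslash followed by "b".
plain = '\b'
raw = r'\b'

print(match_result)          # None
print(search_result.span())  # (5, 7)
print(len(plain), len(raw))  # 1 2
```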

Also, you could just use a list comprehension to build up dellist (you could probably use a list comprehension to build the new testfile entirely, but how escapes me at the moment):

dellist = [w for w in testfile for test in stopwords if re.search(test,w,re.I)]
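One thing to be aware of: with two for clauses, the comprehension above adds a string once for every stopword it contains, and without \b it also matches inside longer words. A possible variant, assuming Python 3 and a hypothetical extra test string that contains both stopwords, uses any() so each string appears at most once:

```python
import re

# "We need no rules" is a made-up extra entry that contains both stopwords.
testfile = ["We don't", "need no", "education", "We need no rules"]
stopwords = ["We", "no"]

# Two for clauses: one entry per (string, matching stopword) pair.
with_duplicates = [w for w in testfile for test in stopwords
                   if re.search(test, w, re.I)]

# any() stops at the first matching stopword, so each string is listed once.
# re.escape guards against stopwords that contain regex metacharacters.
dellist = [
    w for w in testfile
    if any(re.search(r'\b' + re.escape(s) + r'\b', w, re.I) for s in stopwords)
]

print(with_duplicates.count("We need no rules"))  # 2
print(dellist)  # ["We don't", 'need no', 'We need no rules']
```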

Another thought: since you're using the re module anyway, why not combine your stopwords into \b(We|no)\b and then test testfile against that one regex?

regex = r'\b(' + '|'.join(stopwords) + r')\b'  # r'\b(We|no)\b'

Now you just have to look for the strings that don't match that regex:

newtestfile = [w for w in testfile if re.search(regex,w,re.I) is None]
# newtestfile is ['education']
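If the stopword list could ever contain regex metacharacters, or the filter runs many times, a slightly more defensive version might look like the sketch below. It assumes Python 3; the name pattern and the use of re.escape, re.compile, and a non-capturing group are additions for illustration, not part of the answer above.

```python
import re

testfile = ["We don't", "need no", "education"]
stopwords = ["We", "no"]

# Escape each stopword, join them into one alternation, and compile once.
pattern = re.compile(
    r'\b(?:' + '|'.join(re.escape(s) for s in stopwords) + r')\b',
    re.IGNORECASE,
)

# Keep only the strings in which no stopword occurs as a whole word.
newtestfile = [w for w in testfile if pattern.search(w) is None]

print(pattern.pattern)  # \b(?:We|no)\b
print(newtestfile)      # ['education']
```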

Why not just use the basic in operator? It should be considerably faster than the regex too.

for line in testfile:
    for word in stopwords:
        if word in line:
            pass  # do stuff with the matching line here

Or, how about a nifty list comprehension ;)

[line for line in testfile if not [word for word in stopwords if word in line]]
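The inner list is built only to be tested for emptiness, so the same filter can be written with any(). This is a sketch assuming Python 3; "Weather report" is a hypothetical extra entry, added to show that in is a plain, case-sensitive substring test.

```python
# "Weather report" is a made-up entry: it contains "We" only as part of a word.
testfile = ["We don't", "need no", "education", "Weather report"]
stopwords = ["We", "no"]

# Keep a line only if none of the stopwords occurs anywhere inside it.
kept = [line for line in testfile
        if not any(word in line for word in stopwords)]

print(kept)  # ['education']
```

"Weather report" is dropped as well, because "We" occurs inside "Weather".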

Using in is prettier than regexes, but the examples above would break if a stopword was contained within another word. This example only matches on complete words:

testfile = ["We don't", "need no", "education"]
stopwords = ["We", "no"]
output = []

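for sentence in testfile:
    bad = False  # Python booleans are capitalized: True / False

    # Compare whole space-separated words against the stopword list
    for word in sentence.split(' '):
        if word in stopwords:
            bad = True
            break

    # Keep the sentence only if no stopword was found in it
    if not bad:
        output.append(sentence)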
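The loop above is case-sensitive, unlike the re.I versions. A compact, case-insensitive variant could use a set, as in this sketch (Python 3 assumed; stopset and the extra "I know" entry are hypothetical names and data for illustration). It still treats punctuation as part of a word, so "no," would not count as "no".

```python
# "I know" is a made-up entry: it contains "no" only inside another word.
testfile = ["We don't", "need no", "education", "I know"]
stopwords = ["We", "no"]

# Lower-case the stopwords once so the comparison ignores case.
stopset = {w.lower() for w in stopwords}

# Keep a sentence only if none of its whitespace-separated words is a stopword.
output = [s for s in testfile if stopset.isdisjoint(s.lower().split())]

print(output)  # ['education', 'I know']
```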
