Do not match word boundary between parentheses with Python regex

I currently have:

 regex = r'\bon the\b'

but I need my regex to match only if this keyword (here "on the") is not between parentheses in the text:

should match:

john is on the beach
let me put this on the fridge
he (my son) is on the beach
arnold is on the road (to home)

should not match:

(my son is )on the beach
john is at the beach
bob is at the pool (berkeley)
the spon (is on the table)
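
For reference, a quick check of the original pattern against these lines shows the problem: it matches wherever the phrase appears, parentheses or not. The list names below are only for illustration.

```python
import re

regex = r'\bon the\b'

should_match = [
    "john is on the beach",
    "let me put this on the fridge",
    "he (my son) is on the beach",
    "arnold is on the road (to home)",
]
should_not_match = [
    "(my son is )on the beach",
    "john is at the beach",
    "bob is at the pool (berkeley)",
    "the spon (is on the table)",
]

# Lines from the "should not match" group that the plain pattern still matches
false_positives = [s for s in should_not_match if re.search(regex, s)]
print(false_positives)
```

The two lines it prints are exactly the ones where "on the" sits inside, or right after, a parenthesized span.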

In UNIX, the grep utility with the following regular expressions is sufficient:

grep " on the " input_file_name | grep -v "(.* on the .*)"
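
A rough Python equivalent of that two-step filter, as a sketch: keep lines containing " on the " (with surrounding spaces), then drop those where it sits inside a parenthesized span. The function name grep_like is made up for illustration.

```python
import re

def grep_like(lines):
    """Keep lines with ' on the ' that is not inside a parenthesized span."""
    inside_parens = re.compile(r"\(.* on the .*\)")
    return [
        line for line in lines
        if " on the " in line and not inside_parens.search(line)
    ]

lines = [
    "john is on the beach",
    "let me put this on the fridge",
    "he (my son) is on the beach",
    "arnold is on the road (to home)",
    "(my son is )on the beach",
    "john is at the beach",
    "bob is at the pool (berkeley)",
    "the spon (is on the table)",
]

result = grep_like(lines)
for line in result:
    print(line)
```

Because the first filter requires a space before "on", the ")on the" line is dropped as a side effect of the search string rather than by any parenthesis logic.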

How about something like this: ^(.*)(?:\(.*\))(.*)$ (see it in action).

As you requested, it "matches only words that are not between parentheses in the text".

So, from:

some text (more text in parentheses) and some not in parentheses

Matches: some text + and some not in parentheses

More examples at the link above.
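
A minimal sketch of how that pattern behaves in Python, with the parentheses escaped by a single backslash as they would be in a raw string. The variable names are illustrative.

```python
import re

pattern = re.compile(r"^(.*)(?:\(.*\))(.*)$")

text = "some text (more text in parentheses) and some not in parentheses"
m = pattern.match(text)
# The two capture groups hold the text before and after the parenthesized part
outside = m.groups() if m else ()
print(outside)

# A line without any parentheses does not match at all
no_parens = pattern.match("john is on the beach")
print(no_parens)
```

Note that the pattern needs a parenthesized group to be present, so it captures the surrounding text rather than testing for a keyword.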


EDIT: changing the answer since the question was changed.

To capture all mentions not within parentheses I'd use some code instead of a huge regex.

Something like this will get you close:

import re

pattern = r"(on the)"

test_text = '''john is on the beach
let me put this on the fridge
he (my son) is on the beach
arnold is on the road (to home)
(my son is )on the beach
john is at the beach
bob is at the pool (berkeley)
the spon (is on the table)'''

for line in test_text.split('\n'):
    # remove everything between (), one parenthesized group at a time
    stripped = re.sub(r"\([^()]*\)", "", line)

    # search for the keyword in what is left
    matches = re.findall(pattern, stripped)
    print(line, "->", " ".join(matches))

Output:

john is on the beach -> on the
let me put this on the fridge -> on the
he (my son) is on the beach -> on the
arnold is on the road (to home) -> on the
(my son is )on the beach -> on the (this is the only one that doesn't work)
john is at the beach -> 
bob is at the pool (berkeley) -> 
the spon (is on the table) -> 
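
One possible way to also reject the ")on the" case, offered as a sketch rather than as part of the answer above: instead of deleting each parenthesized group, collapse it to an empty "()" marker, then refuse matches that touch a parenthesis. The helper name find_outside is hypothetical, and nested parentheses are not handled.

```python
import re

# One non-nested (...) group
GROUP = re.compile(r"\([^()]*\)")
# The keyword, not directly preceded or followed by a parenthesis
KEYWORD = re.compile(r"(?<![()])\bon the\b(?![()])")

def find_outside(line):
    """Return keyword matches that are outside, and not adjacent to, parentheses."""
    masked = GROUP.sub("()", line)  # keep a marker where the group was
    return KEYWORD.findall(masked)

samples = [
    "he (my son) is on the beach",
    "(my son is )on the beach",
    "the spon (is on the table)",
]
for line in samples:
    print(line, "->", find_outside(line))
```

The lookarounds here are a single character wide, so they stay within what Python's re module accepts.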

I don't think regex alone will help you here in the general case. For your examples, this regex handles most of them:

(?<=[^\(\)].{3})\bon the\b(?=.{3}[^\(\)])

description: 描述:

(?<=[^\(\)].{3}) Positive Lookbehind - Assert that the regex below 
                 can be matched
    [^\(\)] match a single character not present in the list below
        \( matches the character ( literally
        \) matches the character ) literally
    .{3} matches any character (except newline)
        Quantifier: Exactly 3 times
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
on the matches the characters on the literally (case sensitive)
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
(?=.{3}[^\(\)]) Positive Lookahead - Assert that the regex below 
                can be matched
    .{3} matches any character (except newline)
        Quantifier: Exactly 3 times
    [^\(\)] match a single character not present in the list below
        \( matches the character ( literally
        \) matches the character ) literally
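
To see what this pattern does on the sample lines, here is a small check (variable names are illustrative). The pattern is written with the lookbehind and lookahead exactly as in the breakdown above. Each lookaround only tests one character, the fourth one away from the keyword, so a parenthesis closer than that, as in ")on the", goes unnoticed.

```python
import re

pattern = re.compile(r"(?<=[^\(\)].{3})\bon the\b(?=.{3}[^\(\)])")

lines = [
    "john is on the beach",
    "let me put this on the fridge",
    "he (my son) is on the beach",
    "arnold is on the road (to home)",
    "(my son is )on the beach",
    "john is at the beach",
    "bob is at the pool (berkeley)",
    "the spon (is on the table)",
]

# Map each line to whether the pattern finds the keyword in it
results = {line: bool(pattern.search(line)) for line in lines}
for line, hit in results.items():
    print(f"{line!r}: {hit}")
```

The first five lines come out as matches, including "(my son is )on the beach", and the last three do not.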

If you want to generalize the problem to any string between the parentheses and the string you are searching for, this regex will not work. The issue is the length of the text between the parenthesis and your string: in Python's re module a lookbehind must match a fixed-width string, so open-ended quantifiers such as * or + are not allowed inside it.

In my regex I used a positive lookahead and a positive lookbehind. The same result could be achieved with negative ones, but the issue remains.
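
The fixed-width restriction is easy to confirm with Python's built-in re module. This sketch tries to compile a lookbehind containing an open-ended quantifier and records the resulting error; the variable name is illustrative.

```python
import re

try:
    # ".*" inside a lookbehind has no fixed width, so compilation fails
    re.compile(r"(?<!\(.*)\bon the\b")
    lookbehind_error = None
except re.error as exc:
    lookbehind_error = str(exc)

print(lookbehind_error)
```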

Suggestion: write a small piece of Python code that checks whether a whole line contains your text outside parentheses, as regex alone can't do the job.

Example:

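import re

mystr = 'on the'
# sample lines; replace with your own
mylist = ['john is on the beach', '(my son is )on the beach', 'the spon (is on the table)']
data = '\n'.join(mylist)

# the unwanted string series, which is easy to define with regex:
# mystr inside parentheses, or mystr right after a closing parenthesis
key = re.escape(mystr)
unWanted = re.findall(r'\(.*' + key + r'.*\)|\)' + key, data)

# drop every line that contains an unwanted string
# (build a new list instead of removing items while iterating)
kept = [line for line in mylist if not any(item in line for item in unWanted)]

# look for what you want
for line in kept:
    if mystr in line:
        print(line)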

where:

mylist: a list containing all the lines you want to search through.
mystr: the string you want to find.
data: the whole text to scan, for example the lines of mylist joined with newlines.
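
For the general case, here is a sketch of the "small Python code" idea that tracks parenthesis depth character by character, so nesting and arbitrary distances are handled. The function names and the adjacency rule (reject a hit that directly touches a parenthesis or a word character) are assumptions chosen to fit the examples in the question.

```python
def _blocks(ch):
    """True if ch glued to the keyword should disqualify the hit."""
    return ch.isalnum() or ch in "()"

def find_outside_parens(line, keyword="on the"):
    """Return start indexes of keyword occurrences at parenthesis depth 0."""
    # depths[i] is the nesting depth just before character i
    depths = []
    depth = 0
    for ch in line:
        depths.append(depth)
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)

    hits = []
    start = line.find(keyword)
    while start != -1:
        end = start + len(keyword)
        before = line[start - 1] if start > 0 else " "
        after = line[end] if end < len(line) else " "
        # keep hits outside all parentheses that stand as separate words
        if depths[start] == 0 and not _blocks(before) and not _blocks(after):
            hits.append(start)
        start = line.find(keyword, start + 1)
    return hits

print(find_outside_parens("he (my son) is on the beach"))
print(find_outside_parens("the spon (is on the table)"))
```

A line then "matches" when the returned list is non-empty. The alphanumeric test is only a rough stand-in for the regex \b.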

Hope this helped.
