
Do not match a word boundary between parentheses with a Python regex

I currently have:

 regex = r'\bon the\b'

But I need my regex to match only when this keyword (here, "on the") is not between parentheses in the text.

Should match:

john is on the beach
let me put this on the fridge
he (my son) is on the beach
arnold is on the road (to home)

Should not match:

(my son is )on the beach
john is at the beach
bob is at the pool (berkeley)
the spon (is on the table)

On UNIX, the grep utility with the following regular expressions is enough:

grep " on the " input_file_name | grep -v "(.* on the .*)"
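The same two-step filter can be sketched in Python. This is only an illustration of the idea, not part of the original answer; the function name `grep_like` and its `keyword` parameter are made up here.

```python
import re

def grep_like(lines, keyword=" on the "):
    """Mimic: grep KEYWORD | grep -v '(.*KEYWORD.*)'."""
    # Step 1 (grep): keep only the lines that contain the keyword.
    kept = [line for line in lines if keyword in line]
    # Step 2 (grep -v): drop lines where the keyword sits between ( and ).
    inside = re.compile(r"\(.*" + re.escape(keyword) + r".*\)")
    return [line for line in kept if not inside.search(line)]

sample = [
    "john is on the beach",
    "let me put this on the fridge",
    "he (my son) is on the beach",
    "arnold is on the road (to home)",
    "(my son is )on the beach",
    "john is at the beach",
    "bob is at the pool (berkeley)",
    "the spon (is on the table)",
]

for line in grep_like(sample):
    print(line)
```

Because the keyword includes the surrounding spaces, `)on the` is rejected in the first step, just as with grep.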

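How about something like this: ^(.*)(?:\\(.*\\))(.*)$ (see it in action)

As you asked, it "only matches words that are not between parentheses in the text".

So, from:

some text (more text in parentheses) and some not in parentheses

it matches: some text + and some not in parentheses

The link above has more examples.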
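As a rough Python sketch of how that pattern splits the sample sentence (the backslashes are written singly, as in a raw string, and the helper name `outside_parentheses` is made up for this illustration):

```python
import re

# The pattern from the answer, written as a Python raw string.
SPLIT = re.compile(r"^(.*)(?:\(.*\))(.*)$")

def outside_parentheses(text):
    """Return (text before, text after) the parenthesized part, or None."""
    m = SPLIT.match(text)
    if m is None:
        return None  # the pattern requires a ( ... ) pair in the text
    # (?:...) is non-capturing, so the two captures are groups 1 and 2.
    return m.group(1), m.group(2)

print(outside_parentheses(
    "some text (more text in parentheses) and some not in parentheses"))
```

Note that the pattern does not match at all when the text contains no parentheses, so it cannot be used on its own to search lines such as "john is on the beach".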


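Edit: changed the answer because the question changed.

To capture all mentions that are not inside parentheses, I would use some code rather than one huge regex.

Something like this will get you close: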

import re

pattern = r"(on the)"

test_text = '''john is on the beach
let me put this on the fridge
he (my son) is on the beach
arnold is on the road (to home)
(my son is )on the beach
john is at the beach
bob is at the pool (berkeley)
the spon (is on the table)'''

for line in test_text.split('\n'):
    # Remove everything between ( and ) before searching.
    # .* is greedy: it spans from the first ( to the last ) on the line.
    bracket_pattern = r"\(.*\)"
    stripped = re.sub(bracket_pattern, "", line)

    # Search what is left of the line for the keyword.
    matches = re.findall(pattern, stripped)
    print(line, "->", " ".join(matches))

Output:

john is on the beach -> on the
let me put this on the fridge -> on the
he (my son) is on the beach -> on the
arnold is on the road (to home) -> on the
(my son is )on the beach -> on the (this is the only one that doesn't work)
john is at the beach -> 
bob is at the pool (berkeley) -> 
the spon (is on the table) -> 
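One possible way to also reject the `)on the` case is to track the parenthesis depth instead of deleting text, and to require whitespace (or a line edge) around the keyword. The sketch below is not from the original answer: the whitespace rule is an assumption inferred from the question's examples, and the names `KEYWORD` and `matches_outside_parens` are made up here.

```python
import re

# Assumption: the keyword must be delimited by whitespace or the line edges.
# This is stricter than \b, so ")on the" is not accepted.
KEYWORD = re.compile(r"(?<!\S)on the(?!\S)")

def matches_outside_parens(line):
    """Return the keyword matches that start outside any parentheses."""
    depth = []   # depth[i] is the nesting depth at character i
    level = 0
    for ch in line:
        if ch == "(":
            level += 1
        depth.append(level)          # the parentheses themselves count as inside
        if ch == ")" and level > 0:
            level -= 1
    return [m.group(0) for m in KEYWORD.finditer(line)
            if depth[m.start()] == 0]

for text in ["he (my son) is on the beach",
             "(my son is )on the beach",
             "the spon (is on the table)"]:
    print(text, "->", matches_outside_parens(text))
```

Unlike the greedy `\(.*\)` removal, the depth counter treats each parenthesized group separately, so text between two groups on the same line is still searched.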

I don't think a regex will help you in the general case. For your examples, this regex will work the way you want:

(?<=[^\(\)].{3})\bon the\b(?=.{3}[^\(\)])

Explanation:

(?<=[^\(\)].{3}) Positive Lookbehind - Assert that the regex below 
                 can be matched
    [^\(\)] match a single character not present in the list below
        \( matches the character ( literally
        \) matches the character ) literally
    .{3} matches any character (except newline)
        Quantifier: Exactly 3 times
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
on the matches the characters on the literally (case sensitive)
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
(?=.{3}[^\(\)]) Positive Lookahead - Assert that the regex below 
                can be matched
    .{3} matches any character (except newline)
        Quantifier: Exactly 3 times
    [^\(\)] match a single character not present in the list below
        \( matches the character ( literally
        \) matches the character ) literally
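A quick way to check the regex is to run it over the eight lines from the question. This is only a sketch (the names `REGEX`, `LINES` and `matching_lines` are made up here, and the pattern is written with balanced parentheses). Traced by hand, it accepts the four wanted lines and rejects `the spon (is on the table)`, but it also appears to accept `(my son is )on the beach`, because the character four positions before `on` on that line is `i`, not a parenthesis.

```python
import re

# The regex described above, with balanced parentheses.
REGEX = re.compile(r"(?<=[^\(\)].{3})\bon the\b(?=.{3}[^\(\)])")

LINES = [
    "john is on the beach",
    "let me put this on the fridge",
    "he (my son) is on the beach",
    "arnold is on the road (to home)",
    "(my son is )on the beach",
    "john is at the beach",
    "bob is at the pool (berkeley)",
    "the spon (is on the table)",
]

def matching_lines(lines):
    """Return the lines in which the regex finds a match."""
    return [line for line in lines if REGEX.search(line)]

for line in LINES:
    print(line, "->", bool(REGEX.search(line)))
```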

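If you want to generalize the problem to any string between the parenthesis and the string you are searching for, this regex will not work. The problem is the length of the string between the parenthesis and your string: a lookbehind may not contain a quantifier of indefinite length (Python's re module requires lookbehind patterns to have a fixed width).

In my regex I use a positive lookahead and a positive lookbehind. The same result can be achieved with negative ones, but the problem remains.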
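The fixed-width restriction can be checked directly in Python. The sketch below assumes the standard `re` module; the helper name `compiles` and the two sample patterns are made up for this illustration.

```python
import re

def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Lookbehind of exactly four characters: accepted.
fixed_width = r"(?<=[^()].{3})\bon the\b"
# Lookbehind containing .*, so its length is not fixed: rejected.
variable_width = r"(?<=\(.*)\bon the\b"

print(compiles(fixed_width))
print(compiles(variable_width))
```

This is why the lookbehind above is limited to a fixed number of characters and cannot express "any amount of text after an opening parenthesis".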

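Suggestion: write a small piece of Python code that checks the whole line for whether the text is between parentheses, since a regex alone is not up to the job.

Example: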

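import re

mystr = 'on the'
# Unwanted cases, which are easy to define with a regex:
# mystr inside parentheses, or mystr directly after a closing parenthesis.
unwanted = re.compile(r'\(.*' + re.escape(mystr) + r'.*\)|\)' + re.escape(mystr))

# Drop the lines that contain an unwanted case.
# (Build a new list rather than removing items from mylist while iterating over it.)
wanted_lines = [line for line in mylist if not unwanted.search(line)]

# look for what you want
for line in wanted_lines:
    if mystr in line:
        print(line)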

Where:

mylist: a list containing all the lines you want to search through.
mystr: the string you want to find.

Hope this helps.

