
How to capture both lookahead and lookbehind in a Python regex

Here is a string:

str = "Academy \nADDITIONAL\nAwards and Recognition: Greek Man of the Year 2011 Stanford PanHellenic Community, American Delegate 2010 Global\nEngagement Summit, Honorary Speaker 2010 SELA Convention, Semi-Finalist 2010 Strauss Foundation Scholarship Program\nComputer Skills: Competency: MATLAB, MySQL/PHP, JavaScript, Objective-C, Git Proficiency: Adobe Creative Suite, Excel\n(highly advanced), PowerPoint, HTML5/CSS3\nLanguages: Fluent English, Advanced Spanish\n\x0c"

I'd like to capture from "ADDITIONAL" to "Languages", so I wrote this regex:

regex = r'(?<=\n(ADDITIONAL|Additional)\n)[\s\S]+?(?=\n(Languages|LANGUAGES)\n*)'

However, it only catches everything in between ([\s\S]+?). It does NOT catch ADDITIONAL and Languages. What am I missing here?

Your regex is

regex = r'(?<=\n(ADDITIONAL|Additional)\n)[\s\S]+?(?=\n(Languages|LANGUAGES)\n*)'

and your string is

Academy \nADDITIONAL\nAwards and Recognition: ... \nLanguages:
                     ^^                          ^^
                     ||                          ||
Match Position:-(?<=\n(ADDITIONAL|Additional)\n)(?=\n(Languages|LANGUAGES)\n*)

So [\s\S]+? will contain the contents in between these two positions, excluding ADDITIONAL and LANGUAGES.

You just have to find the starting position of ADDITIONAL and the ending position of LANGUAGES. This can be done by swapping the lookarounds, with a lookahead at the start and a lookbehind at the end, as in the following regex:

(?=\n(ADDITIONAL|Additional)\n)([\s\S]+?)(?<=\n(Languages|LANGUAGES)\b)

Further, if you want only the overall match, with no separate groups for the keywords, you can use non-capturing groups for Additional and Languages:

(?=\n(?:ADDITIONAL|Additional)\n)[\s\S]+?(?<=\n(?:Languages|LANGUAGES)\b)

Academy \nADDITIONAL\nAwards and Recognition: ... \nLanguages:
        ^^                                                  ^^
        ||                                                  ||
(?=\n(ADDITIONAL|Additional)\n)             (?<=\n(Languages|LANGUAGES))

Python Code

import re
p = re.compile(r'(?=\n(?:ADDITIONAL|Additional)\n)[\s\S]+?(?<=\n(?:Languages|LANGUAGES)\b)', re.MULTILINE)
test_str = "Academy \nADDITIONAL\nAwards and Recognition: Greek Man of the Year 2011 Stanford PanHellenic Community, American Delegate 2010 Global\nEngagement Summit, Honorary Speaker 2010 SELA Convention, Semi-Finalist 2010 Strauss Foundation Scholarship Program\nComputer Skills: Competency: MATLAB, MySQL/PHP, JavaScript, Objective-C, Git Proficiency: Adobe Creative Suite, Excel\n(highly advanced), PowerPoint, HTML5/CSS3\nLanguages: Fluent English, Advanced Spanish\n\x0c"
print(re.findall(p, test_str))

Ideone Demo

It is being captured, but it's not part of capture group 0, because group 0 contains only the consumed match, i.e. the text that moved the current position.

Assertions don't move the position, so whatever you capture inside an assertion does not become part of the overall match.

However, if the assertion were followed by a sub-expression that consumed the same text the assertion references, that text would become part of the overall match.
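The point about group 0 can be checked directly. The following is a minimal sketch, assuming a shortened, hypothetical sample string that contains real newline characters; it uses the regex from the question unchanged and prints group 0 next to the two groups captured inside the assertions.

```python
import re

# Hypothetical, shortened sample (not the original resume text);
# it contains real newline characters around the keywords.
text = "Academy \nADDITIONAL\nAwards: Man of the Year\nLanguages: English\n"

pattern = re.compile(
    r'(?<=\n(ADDITIONAL|Additional)\n)[\s\S]+?(?=\n(Languages|LANGUAGES)\n*)'
)
m = pattern.search(text)

print(repr(m.group(0)))  # only the consumed text between the two assertions
print(m.group(1))        # text captured inside the lookbehind
print(m.group(2))        # text captured inside the lookahead
```

With this sample, the keywords should be reachable through groups 1 and 2 even though group 0 leaves them out.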

If your string does not actually contain newline characters around the keywords, your current regex will not match it. In that case, remove the \n references from the pattern:

 (?<=
      ( ADDITIONAL | Additional )   # (1)
 )
 [\s\S]+? 
 (?=
      ( Languages | LANGUAGES )     # (2)
 )

If you want to include them in the match, don't put them in lookarounds, since the purpose of those is to test for surrounding text without including it in the match result. Use ordinary non-capturing groups if you just need alternation.

regex = r'\n(?:ADDITIONAL|Additional)\n[\s\S]+?\n(?:Languages|LANGUAGES)\n*'

BTW, your regexp requires a newline on both sides of ADDITIONAL and one before Languages, so it will not match if your actual string lacks them.
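As a rough illustration of the consuming version, here is a small sketch. The sample string is a shortened, hypothetical stand-in with real newlines, and the `strip()` call is an optional extra that is not part of the regex above.

```python
import re

# Hypothetical, shortened sample with real newline characters.
text = "Academy \nADDITIONAL\nAwards: Man of the Year\nLanguages: English\n"

regex = r'\n(?:ADDITIONAL|Additional)\n[\s\S]+?\n(?:Languages|LANGUAGES)\n*'
m = re.search(regex, text)

# The keywords are consumed, so they appear in group 0 together with
# the newlines that the pattern also matched.
print(repr(m.group(0)))

# Optional: drop the leading and trailing newlines from the result.
section = m.group(0).strip()
print(repr(section))
```

Because the keywords are consumed rather than asserted, no capture groups are needed to get at them.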

Try this:

(?<=ADDITIONAL\s).*?(?=\sLanguages)

Regex demo

Explanation:
(?<=…) : Positive lookbehind
\s : "whitespace character": space, tab, newline, carriage return, vertical tab, form feed
. : Any character except a line break
* : Zero or more times
? : After *, makes the quantifier lazy (match as few characters as possible)
(?=…) : Positive lookahead

Python:

import re
p = re.compile(r'(?<=ADDITIONAL\s).*?(?=\sLanguages)', re.IGNORECASE)
test_str = "the companys direction ADDITIONAL Awards: 2010 Global Engagement Summit, Languages: Fluent Japanese"

g = re.findall(p, test_str)
print(g)  # ['Awards: 2010 Global Engagement Summit,']
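One caveat worth sketching: since `.` does not match a line break by default, this pattern finds nothing when the section spans several lines, as it does in the question. The example below assumes a hypothetical multi-line sample and adds `re.DOTALL`, a standard flag of the `re` module that is not part of the regex as given above.

```python
import re

# Hypothetical multi-line sample, shaped like the string in the question.
text = "intro\nADDITIONAL\nAwards: 2010 Summit\nSkills: Git\nLanguages: English"

regex = r'(?<=ADDITIONAL\s).*?(?=\sLanguages)'
plain = re.compile(regex, re.IGNORECASE)
dotall = re.compile(regex, re.IGNORECASE | re.DOTALL)

# Without DOTALL, ".*?" cannot cross the newline after "Summit".
print(plain.findall(text))

# With DOTALL, "." also matches "\n", so the whole section is found.
print(dotall.findall(text))
```

If the input is known to be a single line, the flag makes no difference and the original pattern is enough.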

If you just need to capture the content inclusive of ADDITIONAL and LANGUAGES, use a simple regex like this:

\b(ADDITIONAL .* Languages)\b

Make sure you include the re.IGNORECASE flag when using this in your solution.
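A short sketch of how this might look in Python, reusing the single-line sample string from the example above. The variable names are only illustrative, and the pattern as written expects a literal space after ADDITIONAL and before Languages.

```python
import re

# Single-line sample from the example above. The pattern contains literal
# spaces and an unflagged ".", so it expects everything on one line.
test_str = ("the companys direction ADDITIONAL Awards: 2010 Global "
            "Engagement Summit, Languages: Fluent Japanese")

m = re.search(r'\b(ADDITIONAL .* Languages)\b', test_str, re.IGNORECASE)

# Guard against no match before reading the group.
inclusive = m.group(1) if m else None
print(inclusive)
```

Note that `.*` is greedy here, so if "Languages" occurred more than once the match would extend to the last occurrence.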

See demo at REGEX101

I guess you're complicating something easy, i.e.:

match = re.search("(ADDITIONAL.*?Languages)", subject, re.DOTALL)

Regex explanation:

(ADDITIONAL.*?Languages)

Match the regex below and capture its match into backreference number 1 «(ADDITIONAL.*?Languages)»
   Match the character string “ADDITIONAL” literally (case sensitive) «ADDITIONAL»
   Match any single character, including line breaks because re.DOTALL is set «.*?»
      Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
   Match the character string “Languages” literally (case sensitive) «Languages»

Regex101 Demo
