Regular Expressions: Match words between two given strings (no blank spaces or similar)

I am trying to write a regex that captures the words between two given strings, without the blank spaces around them. At the moment I have this one:

(?<=STR1)(?:\s*)(.*?)(?:\s*)(?=STR2)

I want to use it on the following input:

WORD0 STR1    WORD1 WORD2 WORD3  
WORD4 WORD5 STR2 WORD6

I want a regex that matches WORD1, WORD2, WORD3, WORD4 and WORD5.

PS: I am working with Python. Thank you.
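
For reference, here is a small check of what the pattern above returns with Python's built-in re module on the sample input. This is only a sketch; the variable names are illustrative. Without re.DOTALL the lazy .*? cannot cross the line break, and with it the whole span comes back as one string instead of separate words.

```python
import re

# Sample input from the question (note the two trailing spaces after WORD3).
text = "WORD0 STR1    WORD1 WORD2 WORD3  \nWORD4 WORD5 STR2 WORD6"
pattern = r"(?<=STR1)(?:\s*)(.*?)(?:\s*)(?=STR2)"

# Default flags: "." does not match "\n", so STR2 is never reached.
no_dotall = re.findall(pattern, text)

# With DOTALL there is exactly one match: the whole block between the markers.
with_dotall = re.findall(pattern, text, re.DOTALL)

print(no_dotall)
print(with_dotall)
```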

You cannot do that with re because 1) it does not support lookbehind patterns of unknown length and 2) it has no support for the \G operator, which can be used to match strings in between two strings.

So, what you can do is pip install regex, and then use:

import regex
text = "WORD0 STR1    WORD1 WORD2 WORD3  \nWORD4 WORD5 STR2 WORD6"
print( regex.findall(r"(?<=STR1.*)\w+(?=.*STR2)", text, regex.DOTALL) )
# => ['WORD1', 'WORD2', 'WORD3', 'WORD4', 'WORD5']

Details:

  • (?<=STR1.*) - a positive lookbehind matching STR1 and any zero or more chars immediately to the left of the current location
  • \w+ - one or more word chars
  • (?=.*STR2) - a positive lookahead matching any zero or more chars and STR2 immediately to the right of the current location.
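
If installing a third-party module is not an option, a two-step approach with the standard re module may be enough: first capture everything between the two markers, then split the captured text on whitespace. The sketch below assumes that only the first STR1 ... STR2 pair matters and that words are whitespace-separated; the helper name words_between is hypothetical and not part of any library.

```python
import re

text = "WORD0 STR1    WORD1 WORD2 WORD3  \nWORD4 WORD5 STR2 WORD6"

def words_between(text, start, end):
    """Return the whitespace-separated words between the first `start`
    and the first `end` that follows it (hypothetical helper)."""
    # re.escape guards against markers that contain regex metacharacters.
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    match = re.search(pattern, text, re.DOTALL)
    # str.split() with no arguments drops all runs of whitespace,
    # including the line break.
    return match.group(1).split() if match else []

result = words_between(text, "STR1", "STR2")
print(result)
```

Unlike the single-pattern version, this splits on whitespace rather than matching \w+, so punctuation attached to a word would stay attached.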

Assuming 'STR1' and 'STR2' are known to be present, you can write the following:

import re

# "text" is used instead of "str" to avoid shadowing the built-in type
text = "WORD0 STR1    WORD1 WORD2 WORD3\nWORD4 WORD5 STR2 WORD6"
rgx = r'\b(?!.*\bSTR1\b)\w+(?=.*\bSTR2\b)'
print(re.findall(rgx, text, re.S))
  #=> ['WORD1', 'WORD2', 'WORD3', 'WORD4', 'WORD5']

re.S (same as re.DOTALL) causes periods to match all characters, including line terminators.
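
To see why the flag matters here, the following sketch runs the same pattern with and without re.S on the two-line sample. Without the flag, each lookahead can only see the rest of the current line, so only the words that share a line with STR2 (and have no STR1 later on that line) are matched.

```python
import re

text = "WORD0 STR1    WORD1 WORD2 WORD3\nWORD4 WORD5 STR2 WORD6"
rgx = r'\b(?!.*\bSTR1\b)\w+(?=.*\bSTR2\b)'

# Without re.S, ".*" stops at the line break, so words on the first line
# never "see" STR2 and are dropped.
without_flag = re.findall(rgx, text)

# With re.S, ".*" spans both lines.
with_flag = re.findall(rgx, text, re.S)

print(without_flag)
print(with_flag)
```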

The regular expression can be broken down as follows.

\b          # match a word boundary
(?!         # begin a negative lookahead
  .*        # match zero or more characters
  \bSTR1\b  # match 'STR1' with word boundaries
)           # end negative lookahead
\w+         # match one or more word characters
(?=         # begin a positive lookahead
  .*        # match zero or more characters
  \bSTR2\b  # match 'STR2' with word boundaries
)           # end positive lookahead
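
A commented breakdown like this can also be kept in the code itself by compiling the pattern with re.VERBOSE (re.X), which ignores whitespace in the pattern and allows # comments. The following is a sketch of that idea using the same sample input; the variable names are illustrative.

```python
import re

text = "WORD0 STR1    WORD1 WORD2 WORD3\nWORD4 WORD5 STR2 WORD6"

# re.VERBOSE lets the pattern carry its own comments;
# re.DOTALL is still needed so that "." matches the line break.
verbose_rgx = re.compile(r"""
    \b            # match a word boundary
    (?!           # begin a negative lookahead
      .*          #   match zero or more characters
      \bSTR1\b    #   match 'STR1' with word boundaries
    )             # end negative lookahead
    \w+           # match one or more word characters
    (?=           # begin a positive lookahead
      .*          #   match zero or more characters
      \bSTR2\b    #   match 'STR2' with word boundaries
    )             # end positive lookahead
""", re.VERBOSE | re.DOTALL)

verbose_result = verbose_rgx.findall(text)
print(verbose_result)
```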

Note that the negative lookahead ensures that the matched word (\w+) is not followed by 'STR1' anywhere later in the string. Since 'STR1' is assumed to be present, such a word must be preceded by it.

Depending on requirements, \w+ might be replaced with [A-Z]+\d+ or something else.
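
As an illustration of such a substitution, the sketch below uses [A-Z]+\d+ (uppercase letters followed by digits) in place of \w+. The input string is made up for this example and contains a lowercase word that the stricter pattern skips.

```python
import re

# Hypothetical input with a lowercase word between the markers.
sample = "WORD0 STR1 WORD1 foo WORD2 STR2 WORD6"

loose = re.findall(r'\b(?!.*\bSTR1\b)\w+(?=.*\bSTR2\b)', sample, re.S)
strict = re.findall(r'\b(?!.*\bSTR1\b)[A-Z]+\d+(?=.*\bSTR2\b)', sample, re.S)

print(loose)   # every word between the markers
print(strict)  # only the uppercase-letters-plus-digits tokens
```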

Also note that the word boundary (\b) at the beginning of the expression is there to avoid matching 'TR1'.
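
The effect of the leading word boundary can be checked directly. In this sketch the second pattern omits it, so the match is allowed to start in the middle of 'STR1', where no further 'STR1' lies ahead.

```python
import re

text = "WORD0 STR1    WORD1 WORD2 WORD3\nWORD4 WORD5 STR2 WORD6"

with_boundary = re.findall(r'\b(?!.*\bSTR1\b)\w+(?=.*\bSTR2\b)', text, re.S)

# Same pattern without the leading \b: the tail of 'STR1' slips through.
without_boundary = re.findall(r'(?!.*\bSTR1\b)\w+(?=.*\bSTR2\b)', text, re.S)

print(with_boundary)
print(without_boundary)
```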
