Regex starts and ends with CAPITAL word in a line, several lines amid CAPITAL single-line words

I want to know the regexp for the following case:

The string contains an uppercase word on a line by itself, preceded by two newlines. After that come several lines of alphanumeric text (possibly non-ASCII UTF-8), which may include empty lines. I want to capture each whole portion that starts at an uppercase-word line and ends just before the next uppercase-word line. The single-line uppercase words may repeat.

I searched and experimented a lot but could not get it to work.

Example

ASDF
wqer rtre 34 $^&% fsfa
DDwrgd 43 er 1. ewrtfg
324rfegf 4gfgre

PIIPUU
gre tt HKH rre345 
sdrfetre
ewrewrqwr werfewrt34vds

ret
gre
wretretertettre

PIIPUU
asdf reb dsfdsg
dsafdfbh rt3456 rge grefgreg
reretr erfret34 ef

retretretr

QWE
pritoy Fbhfg 45345 )*9
tret 345 gret54
retre 56 gre ger
retgrh 546ttre

MMNNBMB
aserew Sfjlkjf
gdf
rerettyrdfv re HFGHFFHF er
ergre ret retre 
ret retretret 

reg regrtgh rertgre tret

I want to separate out all the portions that match this condition, like below:

ASDF
wqer rtre 34 $^&% fsfa
DDwrgd 43 er 1. ewrtfg
324rfegf 4gfgre
PIIPUU
gre tt HKH rre345 
sdrfetre
ewrewrqwr werfewrt34vds

ret
gre
wretretertettre
PIIPUU
asdf reb dsfdsg
dsafdfbh rt3456 rge grefgreg
reretr erfret34 ef

retretretr
QWE
pritoy Fbhfg 45345 )*9
tret 345 gret54
retre 56 gre ger
retgrh 546ttre
MMNNBMB
aserew Sfjlkjf
gdf
rerettyrdfv re HFGHFFHF er
ergre ret retre 
ret retretret 

reg regrtgh rertgre tret

Here is one approach using re.findall:

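import re
# `text` holds the multi-line input string (renamed from `input`, which shadows a built-in)
matches = re.findall(r'(?:^|\n\n)([A-Z]{3,}.*?)(?=\n\n[A-Z]{3,}\n|$)', text, flags=re.DOTALL)
print(matches)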

This prints:

['ASDF\nwqer rtre 34 $^&% fsfa\nDDwrgd 43 er 1. ewrtfg\n324rfegf 4gfgre',
 'QWE\npritoy Fbhfg 45345 )*9\ntret 345 gret54\nretre 56 gre ger\nretgrh 546ttre',
 'PIIPUU\ngre tt HKH rre345 \nsdrfetre\newrewrqwr werfewrt34vds\n\nret\ngre\nwretretertettre',
 'MMNNBMB\naserew Sfjlkjf\ngdf\nrerettyrdfv re HFGHFFHF er\nergre ret retre \nret retretret \n\nreg regrtgh rertgre tret']

Here is an explanation of the regex pattern being used:

(?:^|\n\n)      match either the start of the input or two consecutive newlines
([A-Z]{3,}.*?)  then match and capture three or more capital letters,
                followed by all content (including newlines) until seeing
(?=\n\n[A-Z]{3,}\n|$)  either two newlines and a capital term or the end of the input
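
To see the pattern above at work, here is a minimal self-contained sketch. The `text` sample is a hypothetical, shortened version of the question's input, and the names `pattern` and `blocks` are illustrative rather than taken from the answer.

```python
import re

# Hypothetical sample: three blocks, the second one containing an empty line.
text = (
    "ASDF\n"
    "wqer rtre 34\n"
    "\n"
    "PIIPUU\n"
    "gre tt HKH\n"
    "\n"
    "ret\n"
    "\n"
    "PIIPUU\n"
    "asdf reb"
)

# Same pattern as in the answer; DOTALL lets `.` cross newlines.
pattern = re.compile(
    r'(?:^|\n\n)([A-Z]{3,}.*?)(?=\n\n[A-Z]{3,}\n|$)',
    flags=re.DOTALL,
)

# findall returns the content of the single capture group for each match.
blocks = pattern.findall(text)
for block in blocks:
    print(repr(block))
```

Because of `{3,}`, a heading of only one or two capital letters would not be recognised as a block boundary; relax the quantifier if your headings can be that short.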

This expression is likely to extract our desired outputs:

(?=^[A-Z]+$)([\s\S]*?)(?=^[A-Z]+$)|([\s\S]*)

The expression is explained on the top right panel of this demo if you wish to explore/simplify/modify it.

Test

import re

regex = r"(?=^[A-Z]+$)([\s\S]*?)(?=^[A-Z]+$)|([\s\S]*)"

test_str = """

ASDF
wqer rtre 34 $^&% fsfa
DDwrgd 43 er 1. ewrtfg
324rfegf 4gfgre

QWE
pritoy Fbhfg 45345 )*9
tret 345 gret54
retre 56 gre ger
retgrh 546ttre

PIIPUU
gre tt HKH rre345 
sdrfetre
ewrewrqwr werfewrt34vds

ret
gre
wretretertettre

MMNNBMB
aserew Sfjlkjf
gdf
rerettyrdfv re HFGHFFHF er
ergre ret retre 
ret retretret 

reg regrtgh rertgre tret

"""

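# strip() removes the blank lines around the sample; otherwise the second alternative swallows the whole string in one match
print(re.findall(regex, test_str.strip(), re.MULTILINE))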

Output

[('', ''), ('ASDF\nwqer rtre 34 $^&% fsfa\nDDwrgd 43 er 1. ewrtfg\n324rfegf 4gfgre\n\n', ''), ('', ''), ('QWE\npritoy Fbhfg 45345 )*9\ntret 345 gret54\nretre 56 gre ger\nretgrh 546ttre\n\n', ''), ('', ''), ('PIIPUU\ngre tt HKH rre345 \nsdrfetre\newrewrqwr werfewrt34vds\n\nret\ngre\nwretretertettre\n\n', ''), ('', ''), ('', 'MMNNBMB\naserew Sfjlkjf\ngdf\nrerettyrdfv re HFGHFFHF er\nergre ret retre \nret retretret \n\nreg regrtgh rertgre tret'), ('', '')]
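
The result above is a list of two-element tuples (one slot per capture group), with empty matches mixed in. A possible clean-up step, sketched here on a hypothetical shortened sample, keeps whichever group is non-empty and drops the empty tuples. The names `raw` and `blocks` are illustrative, and the handling of empty matches assumes Python 3.7 or later.

```python
import re

regex = r"(?=^[A-Z]+$)([\s\S]*?)(?=^[A-Z]+$)|([\s\S]*)"

# Hypothetical shortened sample with no leading or trailing blank lines.
test_str = "ASDF\nline one\n\nQWE\nline two"

# Each element is a (group1, group2) tuple; many are ('', '').
raw = re.findall(regex, test_str, re.MULTILINE)

# Take whichever group matched, trim surrounding whitespace,
# and skip tuples where both groups are empty.
blocks = [(a or b).strip() for a, b in raw if (a or b).strip()]
print(blocks)
```

Note that `strip()` also removes trailing spaces on the last line of each block; use `strip("\n")` instead if those spaces matter.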

Try this:

regex = re.compile(r"^[A-Z]+\r?\n(?:(?!^\r?\n[A-Z]+\r?\n).)*", re.MULTILINE|re.DOTALL)

Explanation:

^                      # Start of line
[A-Z]+                 # Match uppercase ASCII keyword
\r?\n                  # Match newline
(?:                    # Start of non-capturing group
 (?!^\r?\n[A-Z]+\r?\n) # Make sure we're not (yet) at the start of another keyword
 .                     # If so, match any character including newline
)*                     # Repeat any number of times.
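
As a quick check of the pattern explained above, the following sketch runs it with `findall` on a hypothetical shortened sample. The `text` value and the `blocks` name are assumptions for illustration; only the regex itself comes from the answer.

```python
import re

regex = re.compile(
    r"^[A-Z]+\r?\n(?:(?!^\r?\n[A-Z]+\r?\n).)*",
    re.MULTILINE | re.DOTALL,
)

# Hypothetical sample: the second block contains an empty line,
# and the heading PIIPUU appears twice.
text = (
    "ASDF\n"
    "wqer rtre 34\n"
    "\n"
    "PIIPUU\n"
    "gre tt HKH\n"
    "\n"
    "ret\n"
    "\n"
    "PIIPUU\n"
    "asdf reb\n"
)

# The pattern has no capture groups, so findall returns whole matches.
# Each match keeps the newline that ends its last line; trim it here.
blocks = [m.rstrip("\r\n") for m in regex.findall(text)]
for block in blocks:
    print(repr(block))
```

Unlike a pattern that consumes the separating blank line, this one stops right before it, so each match ends with the newline of the block's last content line.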

Test it live on regex101.com.

