Find text between two strings

I have a large text like the following excerpt:

test = '''
Sra. Montero.- ¡No, no! No empecemos.   
Sr. Jefe de Gabinete de Ministros.- Respetuosamente se lo digo...   
Sra. Montero.- El senador Fernández
Sra. Montero.- ¡No, no! No empecemos.   
Sr. Jefe de Gabinete de Ministros.- Respetuosamente se lo digo...   
Sra. Montero.- El senador Fernández
Sra. Montero.- ¡No, no! No empecemos.   
Sr. Jefe de Gabinete de Ministros.- Respetuosamente se lo digo...   
Sra. Montero.- El senador Fernández
Sra. Montero.- ¡No, no! No empecemos.   
Sr. Jefe de Gabinete de Ministros.- Respetuosamente se lo digo...   
Sra. Montero.- El senador Fernández
'''

I'd like to get all the text between the string "Sr. Jefe de Gabinete de Ministros.-" and the string "Sr{{ random_text_here }}.-". So in this example, what I'd like to get is the following:

data = ['Respetuosamente se lo digo...', 'Respetuosamente se lo digo...', 'Respetuosamente se lo digo...']

I know the regex has to be non-greedy, and I already tested something like this:

bw_sr = re.compile('\.\-(.+?)Sr[.+]\.\-') #non greedy regexx              
data = bw_sr.findall(test)

But I end up getting an empty list. I tried several patterns, but I can't seem to find a solution.

Your regex was wrong: [.+] is enclosed in square brackets, which define a character class, so it matches a single literal "." or "+" instead of "one or more of any character". Among other issues, the pattern had no way to distinguish "Sr." from "Sra." (which seems to be what you wanted, judging by the expected output); I fixed that by using Sr\. .
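To make the character-class point concrete, here is a small sketch. The variable names and the probe strings are made up for illustration and are not part of the question.

```python
import re

# Inside square brackets, "." and "+" lose their special meaning:
# [.+] matches exactly one character, either a literal "." or a literal "+".
broken = re.compile(r"Sr[.+]\.-")

hits_dot = bool(broken.search("Sr..-"))    # True: "Sr", ".", ".", "-"
hits_plus = bool(broken.search("Sr+.-"))   # True: "Sr", "+", ".", "-"
hits_real = bool(broken.search("Sr. Jefe de Gabinete de Ministros.-"))  # False

# Outside brackets, "\." is a literal dot and ".+?" is a lazy run of characters.
fixed = re.compile(r"Sr\..+?\.-")
matched = fixed.search("Sr. Jefe de Gabinete de Ministros.-").group()

print(hits_dot, hits_plus, hits_real)
print(matched)
```

The original attempt also relied on "." crossing line breaks, which it does not do unless re.DOTALL is passed, so that is likely another reason it returned an empty list.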

I came up with the pattern below, which matches the formulas and also "El senador Fernández", etc.; there's no criterion to filter those out. I also added \s* before the capturing group to strip leading blanks:

import re

# raw string, so the backslashes reach the regex engine unchanged
bw_sr = re.compile(r'\.\-\s*(.+?)\nSr\..+?\.\-')
data = bw_sr.findall(test)

print(data)

Result:

['¡No, no! No empecemos.', '¡No, no! No empecemos.', '¡No, no! No empecemos.', '¡No, no! No empecemos.']
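The pattern above captures the line that precedes each "Sr." speaker. If the goal is instead the output listed in the question (the text that follows "Sr. Jefe de Gabinete de Ministros.-" up to the next speaker), one possible sketch is to spell out the opening delimiter and stop at a lookahead. This is an alternative, not the pattern from the answer; it assumes every speaker label starts a line with "Sr." or "Sra." and ends with ".-", and the two-exchange sample is a shortened stand-in for the real text.

```python
import re

# Shortened stand-in for the transcript in the question (an assumption).
sample = (
    "Sra. Montero.- ¡No, no! No empecemos.   \n"
    "Sr. Jefe de Gabinete de Ministros.- Respetuosamente se lo digo...   \n"
    "Sra. Montero.- El senador Fernández\n"
    "Sr. Jefe de Gabinete de Ministros.- Respetuosamente se lo digo...   \n"
    "Sra. Montero.- El senador Fernández\n"
)

pattern = re.compile(
    r"Sr\. Jefe de Gabinete de Ministros\.-\s*"  # opening delimiter
    r"(.+?)"                                     # lazy capture, may span lines
    r"\s*(?=\nSra?\.[^\n]*?\.-|\Z)",             # stop before next speaker or at end
    re.DOTALL,
)

between = pattern.findall(sample)
print(between)
# ['Respetuosamente se lo digo...', 'Respetuosamente se lo digo...']
```

Because the closing delimiter sits in a lookahead, it is not consumed, so a following "Sr." label can still open the next match.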

This works:

bw_sr = re.compile(r'\.\- (.*)')
data = bw_sr.findall(test)
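For reference, a quick check of what this shorter pattern returns. The three-line sample is a made-up reduction of the transcript, and the speaker-specific variant at the end is an assumption about what the question is after, not part of this answer.

```python
import re

# Reduced sample, without trailing spaces (an assumption for readability).
sample = (
    "Sra. Montero.- ¡No, no! No empecemos.\n"
    "Sr. Jefe de Gabinete de Ministros.- Respetuosamente se lo digo...\n"
    "Sra. Montero.- El senador Fernández\n"
)

# Captures the rest of the line after every ".- ", whoever is speaking.
all_turns = re.findall(r"\.- (.*)", sample)
print(all_turns)
# ['¡No, no! No empecemos.', 'Respetuosamente se lo digo...', 'El senador Fernández']

# Anchoring on one speaker's label keeps only that speaker's lines.
minister_only = re.findall(
    r"^Sr\. Jefe de Gabinete de Ministros\.- (.*)$", sample, flags=re.MULTILINE
)
print(minister_only)
# ['Respetuosamente se lo digo...']
```

Both variants read one line at a time, so a speech that continues over several lines would be cut at the first line break.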
