
Replace recurring substring with regex?

I am trying to remove the table descriptions from the following text so that only the non-table text remains. I have been playing with regex101.com but can't seem to find a pattern that actually does this (it always matches the whole section). What am I missing here?

TABLE 37-1 Text over multiple lines that describes the table (.pdf)

Non table text.

TABLE 37-2 Text over multiple lines that describes the table (.pdf)

import re
text = 'string of text in block quotes above'
processed_text = re.sub(r'(TABLE)(.|\n)*(\(\.pdf\))', r'', text)
print(processed_text)

Rather than replacing the unwanted text with the empty string, this extracts the wanted text with a capture group.

>>> import re
>>>
>>> text = '''TABLE 37-1 Text over multiple
... lines that describes the table (.pdf)
... Non table text line 1.
... Non table text line 2.
... TABLE 37-2 Text over multiple
... lines that describes the table (.pdf)'''
>>>
>>> re.match(r'TABLE.*?\(\.pdf\)\n(.*)TABLE.*?\(\.pdf\)$', text, re.DOTALL).group(1)
'Non table text line 1.\nNon table text line 2.\n'

This should also work if the non-table text itself contains "TABLE ... (.pdf)" strings.

>>> text = '''TABLE 37-1 Text over multiple
... lines that describes the table (.pdf)
... Non table text line 1.
... Non table text line 2.
... TABLE 37-2 non table text that
... starts with TABLE and ends with (.pdf)(.pdf)
... TABLE 37-2 Text over multiple
... lines that describes the table (.pdf)'''
>>>
>>> re.match(r'TABLE.*?\(\.pdf\)\n(.*)TABLE.*?\(\.pdf\)$', text, re.DOTALL).group(1)
'Non table text line 1.\nNon table text line 2.\nTABLE 37-2 non table text that\nstarts with TABLE and ends with (.pdf)(.pdf)\n'
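As for why the pattern in the question removes the whole section: `(.|\n)*` is greedy, so it runs from the first `TABLE` to the last `(.pdf)` in the text. A minimal sketch of the substitution approach with a lazy quantifier follows. The sample text and the variable names `greedy` and `lazy` are made up for illustration, and the optional trailing `\n?` is an assumption about how the leftover newline should be handled.

```python
import re

# Illustrative sample, modelled on the text in the question.
text = (
    "TABLE 37-1 Text over multiple\n"
    "lines that describes the table (.pdf)\n"
    "Non table text.\n"
    "TABLE 37-2 Text over multiple\n"
    "lines that describes the table (.pdf)"
)

# Greedy: `.*` extends to the LAST "(.pdf)", so the whole text is removed.
greedy = re.sub(r"TABLE.*\(\.pdf\)", "", text, flags=re.DOTALL)

# Lazy: `.*?` stops at the FIRST "(.pdf)" after each "TABLE".
# The optional "\n?" also consumes the newline that follows a description.
lazy = re.sub(r"TABLE.*?\(\.pdf\)\n?", "", text, flags=re.DOTALL)

print(repr(greedy))  # ''
print(repr(lazy))    # 'Non table text.\n'
```

Unlike the `re.match` approach above, the lazy substitution handles any number of table descriptions, but it would also strip a "TABLE ... (.pdf)" run that happens to occur inside the non-table text.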
