Replace a recurring substring using a regular expression?

Question

I am trying to remove the table descriptions from the following text so that only the non-table text remains. I have been experimenting on regex101.com but can't find a pattern that does this: every attempt matches the whole section. What am I missing here?

TABLE 37-1 Text over multiple lines that describes the table (.pdf)

Non table text.

TABLE 37-2 Text over multiple lines that describes the table (.pdf)

import re
text = 'string of text in block quotes above'
processed_text = re.sub(r'(TABLE)(.|\n)*(\(\.pdf\))', r'', text)
print(processed_text)

Answer 1 (2020-03-15 08:29:16)

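Rather than replacing the unwanted text with the empty string, this extracts the wanted text. Because re.match anchors at the start and the pattern ends with $, it assumes the text begins with one table caption and ends with another.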

>>> import re
>>>
>>> text = '''TABLE 37-1 Text over multiple
... lines that describes the table (.pdf)
... Non table text line 1.
... Non table text line 2.
... TABLE 37-2 Text over multiple
... lines that describes the table (.pdf)'''
>>>
>>> re.match(r'TABLE.*?\(\.pdf\)\n(.*)TABLE.*?\(\.pdf\)$', text, re.DOTALL).group(1)
'Non table text line 1.\nNon table text line 2.\n'

This also works if the non-table text itself contains "TABLE ... (.pdf)" strings, because the greedy (.*) group runs up to the last occurrence of TABLE.

>>> text = '''TABLE 37-1 Text over multiple
... lines that describes the table (.pdf)
... Non table text line 1.
... Non table text line 2.
... TABLE 37-2 non table text that
... starts with TABLE and ends with (.pdf)(.pdf)
... TABLE 37-2 Text over multiple
... lines that describes the table (.pdf)'''
>>>
>>> re.match(r'TABLE.*?\(\.pdf\)\n(.*)TABLE.*?\(\.pdf\)$', text, re.DOTALL).group(1)
'Non table text line 1.\nNon table text line 2.\nTABLE 37-2 non table text that\nstarts with TABLE and ends with (.pdf)(.pdf)\n'
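
For comparison, here is a minimal sketch of the substitution approach the question was attempting. The pattern in the question takes the whole section because (.|\n)* is greedy: it runs from the first TABLE to the last (.pdf). Making the quantifier lazy lets each match stop at the first (.pdf) that follows a TABLE. The name caption and the optional trailing \n are choices made for this sketch, not part of the original post.

```python
import re

text = """TABLE 37-1 Text over multiple
lines that describes the table (.pdf)
Non table text line 1.
Non table text line 2.
TABLE 37-2 Text over multiple
lines that describes the table (.pdf)"""

# Lazy .*? stops at the first "(.pdf)" after each "TABLE";
# re.DOTALL lets "." cross line breaks inside a caption.
# The optional \n also removes the line break that follows a caption.
caption = re.compile(r"TABLE.*?\(\.pdf\)\n?", re.DOTALL)

processed_text = caption.sub("", text)
print(processed_text)
```

Unlike the extraction pattern above, this handles any number of captions, but it would also delete non-table text that happens to contain a "TABLE ... (.pdf)" span, so which one fits depends on the input.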
