Replace a recurring substring with regex?
I am trying to remove the table descriptions from the following text so that only the non-table text remains. I have been playing with regex101.com but can't seem to find a pattern that actually does this (it always takes the whole section). What am I missing here?
TABLE 37-1 Text over multiple lines that describes the table (.pdf)
Non table text.
TABLE 37-2 Text over multiple lines that describes the table (.pdf)
import re
text = 'string of text in block quotes above'
processed_text = re.sub(r'(TABLE)(.|\n)*(\(\.pdf\))', r'', text)
print(processed_text)
Rather than replacing the unwanted text with the empty string, this extracts the wanted text.
>>> import re
>>>
>>> text = '''TABLE 37-1 Text over multiple
... lines that describes the table (.pdf)
... Non table text line 1.
... Non table text line 2.
... TABLE 37-2 Text over multiple
... lines that describes the table (.pdf)'''
>>>
>>> re.match(r'TABLE.*?\(\.pdf\)\n(.*)TABLE.*?\(\.pdf\)$', text, re.DOTALL).group(1)
'Non table text line 1.\nNon table text line 2.\n'
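The same extraction can be wrapped in a small helper for use outside the REPL. This is only a sketch: it assumes the text always begins and ends with a table block, and the names `PATTERN`, `extract_non_table`, and `sample` are made up for illustration. Because `re.match` returns `None` when the text does not have that shape, the helper checks for that instead of calling `.group(1)` directly.

```python
import re

# Pattern from the session above: a leading table block, the wanted text
# (group 1), then a trailing table block anchored to the end of the string.
PATTERN = re.compile(r'TABLE.*?\(\.pdf\)\n(.*)TABLE.*?\(\.pdf\)$', re.DOTALL)


def extract_non_table(text):
    """Return the text between the first and last table blocks, or None."""
    match = PATTERN.match(text)
    # re.match gives None if the text lacks the expected shape,
    # so guard before asking for the captured group.
    return match.group(1) if match else None


sample = (
    "TABLE 37-1 Text over multiple\n"
    "lines that describes the table (.pdf)\n"
    "Non table text line 1.\n"
    "Non table text line 2.\n"
    "TABLE 37-2 Text over multiple\n"
    "lines that describes the table (.pdf)"
)

print(repr(extract_non_table(sample)))       # 'Non table text line 1.\nNon table text line 2.\n'
print(extract_non_table("no tables here"))   # None
```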
This should also work if there are "TABLE ... (.pdf)" strings in the non-table text.
>>> text = '''TABLE 37-1 Text over multiple
... lines that describes the table (.pdf)
... Non table text line 1.
... Non table text line 2.
... TABLE 37-2 non table text that
... starts with TABLE and ends with (.pdf)(.pdf)
... TABLE 37-2 Text over multiple
... lines that describes the table (.pdf)'''
>>>
>>> re.match(r'TABLE.*?\(\.pdf\)\n(.*)TABLE.*?\(\.pdf\)$', text, re.DOTALL).group(1)
'Non table text line 1.\nNon table text line 2.\nTABLE 37-2 non table text that\nstarts with TABLE and ends with (.pdf)(.pdf)\n'
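As for why the `re.sub` attempt in the question takes the whole section: `(.|\n)*` is greedy, so it runs from the first `TABLE` to the last `(.pdf)` in the text. A lazy quantifier is one possible fix if you would rather keep the replace-with-empty-string approach. The sketch below uses a made-up sample string and assumes each table description ends at the first `(.pdf)` that follows its `TABLE`.

```python
import re

text = (
    "TABLE 37-1 Text over multiple\n"
    "lines that describes the table (.pdf)\n"
    "Non table text.\n"
    "TABLE 37-2 Text over multiple\n"
    "lines that describes the table (.pdf)"
)

# Greedy: (.|\n)* runs to the LAST "(.pdf)", so a single match covers everything.
greedy = re.sub(r'(TABLE)(.|\n)*(\(\.pdf\))', '', text)

# Lazy: .*? stops at the FIRST "(.pdf)" after each "TABLE". re.DOTALL lets "."
# cross newlines, and \n? also drops the line break that follows the block.
lazy = re.sub(r'TABLE.*?\(\.pdf\)\n?', '', text, flags=re.DOTALL)

print(repr(greedy))  # ''
print(repr(lazy))    # 'Non table text.\n'
```

Note that the lazy substitution would also strip any "TABLE ... (.pdf)" span that appears inside the non-table text, which the extraction approach above leaves in place.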