
Regex text between two strings

I am trying to extract data fields from PDF text using regex.

The text is:

"SAMPLE EXPERIAN CUSTOMER\\n2288150 - EXPERIAN SAMPLE REPORTS\\nData Dictionary Report\\nFiltered By:\\nCustom Selection\\nMarketing Element:\\nPage 1 of 284\\n2014-11-11 21:52:01 PM\\nExperian and the marks used herein are service marks or registered trademarks of Experian.\\n© Experian 2014 All rights reserved. Confidential and proprietary.\\n**Data Dictionary**\\nDate of Birth is acquired from public and proprietary files. These sources provide, at a minimum, the year of birth; the month is provided where available. Exact date of birth at various levels of detail is available for \\n\\n\\n\\n\\n\\nNOTE: Records coded with DOB are exclusive of Estimated Age (101E)\\n**Element Number**\\n0100\\nDescription\\nDate Of Birth / Exact Age\\n**Data Dictionary**\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\nFiller, three bytes\\n**Element Number**\\n0000\\n**Description**\\nEnhancement Mandatory Append\\n**Data Dictionary**\\n\\n\\nWhen there is insufficient data to match a customer's record to our enrichment master for estimated age, a median estimated age based on the ages of all other adult individuals in the same ZIP+4 area is provided. \\n\\n\\n\\n\\n\\n\\n00 = Unknown\\n**Element Number**\\n0101E\\n**Description**\\nEstimated Age\\n"

The field names are in bold. The text between two field names is the field value.

First, I tried to extract the 'Description' field using the following regex:

import re

pattern = re.compile('\nDescription\n(.*?)\nData Dictionary\n')
re.findall(pattern, text)

The results are correct:

['Date Of Birth / Exact Age', 'Enhancement Mandatory Append']

But using the same idea to extract the 'Data Dictionary' field gives an empty result:

pattern = re.compile('\nData Dictionary\n(.*?)\nElement Number\n')
re.findall(pattern,text)

Results:

[]

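Any idea why?

The . metacharacter doesn't match newlines by default. The 'Description' values fit on a single line, so the first pattern worked, but the 'Data Dictionary' values span several lines. Try: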

pattern = re.compile('\nData Dictionary\n(.*?)\nElement Number\n', flags=re.DOTALL)
re.findall(pattern,text)

Notice how I passed re.DOTALL as the flags argument to re.compile.
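As a rough check of this fix, here is a minimal sketch that runs the pattern with and without re.DOTALL. The text string is a shortened, made-up stand-in for the report text in the question, not the real PDF output, and the .strip() call is an optional extra that trims the blank lines around each value.

```python
import re

# Shortened, hypothetical stand-in for the PDF text (not the full report).
text = (
    "header\n"
    "Data Dictionary\n"
    "Date of Birth is acquired from public files.\n"
    "\n"
    "NOTE: Records coded with DOB are exclusive of Estimated Age\n"
    "Element Number\n"
    "0100\n"
    "Description\n"
    "Date Of Birth / Exact Age\n"
    "Data Dictionary\n"
    "\n"
    "Filler, three bytes\n"
    "Element Number\n"
    "0000\n"
    "Description\n"
    "Enhancement Mandatory Append\n"
)

regex = r"\nData Dictionary\n(.*?)\nElement Number\n"

# Without re.DOTALL, '.' stops at every newline, so multi-line values never match.
without_flag = re.findall(regex, text)

# With re.DOTALL, '.' also matches '\n', so the lazy group can span several lines.
pattern = re.compile(regex, flags=re.DOTALL)
values = [v.strip() for v in pattern.findall(text)]

print(without_flag)
print(values)
```

The lazy quantifier (.*?) still matters here: with re.DOTALL a greedy .* would run on to the last 'Element Number' in the text instead of stopping at the nearest one.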

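Another answer suggested the re.MULTILINE flag. Note that re.MULTILINE only changes where ^ and $ match; it does not make . match newlines, so the pattern below still cannot capture values that span several lines: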

pattern = re.compile('\nData Dictionary\n(.*?)\nElement Number\n', re.MULTILINE)
re.findall(pattern,text)
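For comparison, here is a small sketch of how the two flags behave on a multi-line value. In Python's re module, re.MULTILINE affects only the ^ and $ anchors, while re.DOTALL is the flag that lets . match newlines. The sample string is made up for illustration.

```python
import re

# Hypothetical two-line field value, used only to compare the flags.
text = "\nData Dictionary\nline one\nline two\nElement Number\n0100\n"
regex = r"\nData Dictionary\n(.*?)\nElement Number\n"

# re.MULTILINE leaves '.' unchanged, so the value spanning two lines is not matched.
with_multiline = re.findall(regex, text, flags=re.MULTILINE)

# re.DOTALL lets '.' cross newlines, so the whole value is captured.
with_dotall = re.findall(regex, text, flags=re.DOTALL)

# What re.MULTILINE does change: '^' now matches at the start of every line.
line_starts = re.findall(r"^line \w+", text, flags=re.MULTILINE)

print(with_multiline)
print(with_dotall)
print(line_starts)
```

The two flags can be combined as re.DOTALL | re.MULTILINE when a pattern needs both behaviours, but this pattern uses no anchors, so re.DOTALL alone is enough.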
