Regex text between two strings

Question

I am trying to extract data fields from PDF text using regex.

The text is:

"SAMPLE EXPERIAN CUSTOMER\\n2288150 - EXPERIAN SAMPLE REPORTS\\nData Dictionary Report\\nFiltered By:\\nCustom Selection\\nMarketing Element:\\nPage 1 of 284\\n2014-11-11 21:52:01 PM\\nExperian and the marks used herein are service marks or registered trademarks of Experian.\\n© Experian 2014 All rights reserved. Confidential and proprietary.\\n**Data Dictionary**\\nDate of Birth is acquired from public and proprietary files. These sources provide, at a minimum, the year of birth; the month is provided where available. Exact date of birth at various levels of detail is available for \\n\\n\\n\\n\\n\\nNOTE: Records coded with DOB are exclusive of Estimated Age (101E)\\n**Element Number**\\n0100\\nDescription\\nDate Of Birth / Exact Age\\n**Data Dictionary**\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\nFiller, three bytes\\n**Element Number**\\n0000\\n**Description**\\nEnhancement Mandatory Append\\n**Data Dictionary**\\n\\n\\nWhen there is insufficient data to match a customer's record to our enrichment master for estimated age, a median estimated age based on the ages of all other adult individuals in the same ZIP+4 area is provided. \\n\\n\\n\\n\\n\\n\\n00 = Unknown\\n**Element Number**\\n0101E\\n**Description**\\nEstimated Age\\n"

The field names are in bold. The text between two field names is the field value.

First I tried to extract the 'Description' field using the following regex:

pattern = re.compile('\nDescription\n(.*?)\nData Dictionary\n')
re.findall(pattern,text)

The results are correct:

['Date Of Birth / Exact Age', 'Enhancement Mandatory Append']

But using the same idea to extract the 'Data Dictionary' field gives an empty result:

pattern = re.compile('\nData Dictionary\n(.*?)\nElement Number\n')
re.findall(pattern,text)

Results:

[]

Any idea why?

Answer 1

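The dot (.) doesn't match newlines by default, so (.*?) cannot span the multi-line 'Data Dictionary' values. The 'Description' pattern worked only because those values fit on a single line. Try: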

import re
# re.DOTALL makes . match newline characters as well
pattern = re.compile('\nData Dictionary\n(.*?)\nElement Number\n', flags=re.DOTALL)
re.findall(pattern, text)

Notice how I passed re.DOTALL as the flags argument to re.compile.
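
To see the difference in isolation, here is a minimal sketch on a shortened stand-in for the PDF text. The `sample` string and the variable names are assumptions made for illustration, not the full report from the question.

```python
import re

# Shortened, hypothetical stand-in for the extracted PDF text.
sample = (
    "header\n"
    "Data Dictionary\n"
    "Date of Birth is acquired from public files.\n"
    "\n"
    "NOTE: Records coded with DOB are exclusive of Estimated Age\n"
    "Element Number\n"
    "0100\n"
    "Description\n"
    "Date Of Birth / Exact Age\n"
    "Data Dictionary\n"
    "\n"
    "Filler, three bytes\n"
    "Element Number\n"
    "0000\n"
)

regex = "\nData Dictionary\n(.*?)\nElement Number\n"

# Without DOTALL, . cannot cross the newlines inside each value.
without_dotall = re.findall(regex, sample)

# With DOTALL, . also matches "\n", so multi-line values are captured.
with_dotall = re.findall(regex, sample, flags=re.DOTALL)

print(without_dotall)
print(with_dotall)
```

The captured values keep any blank lines that surround them in the source text, so calling `.strip()` on each result may be a reasonable cleanup step.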

Answer 2

Try using the flag re.MULTILINE in your regex:

pattern = re.compile('\nData Dictionary\n(.*?)\nElement Number\n', re.MULTILINE)
re.findall(pattern,text)
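
One caveat worth checking before relying on this: as the `re` documentation describes it, re.MULTILINE only changes where ^ and $ can match and does not change what . matches. The sketch below uses a made-up two-line value (the `sample` string is hypothetical) to compare the two flags on the same pattern.

```python
import re

# Hypothetical two-line value between the two markers.
sample = "\nData Dictionary\nline one\nline two\nElement Number\n"
regex = "\nData Dictionary\n(.*?)\nElement Number\n"

# MULTILINE changes where ^ and $ may match; . still stops at "\n".
multiline_result = re.findall(regex, sample, flags=re.MULTILINE)

# DOTALL is the flag that lets . cross line boundaries.
dotall_result = re.findall(regex, sample, flags=re.DOTALL)

# What MULTILINE does do: ^ matches at the start of every line.
line_starts = re.findall(r"^line \w+", sample, flags=re.MULTILINE)

print(multiline_result)
print(dotall_result)
print(line_starts)
```

On this sample the MULTILINE call returns an empty list, while the DOTALL call returns the two-line value, so for values that span several lines re.DOTALL appears to be the flag that matters.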


Answer 1: score 2, accepted, 2015-08-07 19:37:38

Answer 2: score 1, 2015-08-07 19:43:17
