Regular expression to capture a group of words followed by a group of formatted quantities
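Given the contents of a text file (below), I want to extract two values from each line that has the following pattern - capture groups indicated with [#]:
The goal is to capture the values under the "Notes" and "2019" columns in the text and put them into a Python dictionary.
I tried using the following regular expressions:
(\w+)\s{1}(\w+)*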
(.*?)[ ]{2,}(.*?)[ ]{2,}(.*?)[ ]{2,}(.*)
Sample text file:
Micro-entity Balance Sheet as at 31 May 2019
Notes 2019 2018
£ £
Fixed Assets 2,046 1,369
Current Assets 53,790 24,799
Creditors: amounts falling due within one year (23,146) (6,106)
Net current assets (liabilities) 30,644 18,693
Total assets less current liabilities 32,690 20,062
Total net assets (liabilities) 32,690 20,062
Capital and reserves 32,690 20,062
For the year ending 31 May 2019 the company was entities to exemption under section 477 of the
Companies Act 2006 relating to small companies
® The members have not required the company to obtain an audit in accordance with section 476 of
the Companies Act 2006.
® The director acknowledge their responsibilities for complying with the requirements of the
Companies Act 2006 with respect to accounting records and the preparation of accounts.
® The accounts have been prepared in accordance with the micro-entity provisions and delivered in
accordance with the provisions applicable to companies subject to the small companies regime.
Approved by the Board on 20 December 2019
And signed on their behalf by:
Director
This document was delivered using electronic communications and authenticated in accordance with the
registrar's rules relating to electronic form, authentication and manner of delivery under section 1072 of
the Companies Act 2006.
Examples of valid matches:
"Fixed Assets", "2,046"
"Current Assets", "53,790"
"Creditors: amounts falling due within one year", "(23,146)"
"Net current assets (liabilities)", "30,644"
"Total assets less current liabilities", "32,690"
"Total net assets (liabilities)", "32,690"
"Capital and reserves", "32,690"
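Your first regex...
(\w+)\s{1}(\w+)*
...is insufficient because its two capture groups account for neither the spaces between words in the first group nor the formatting of the quantities in the second.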
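To make the shortfall concrete, here is a small sketch that runs the first regex against one data line. The four-space column gaps are an assumption, since the post does not preserve the original spacing.

```python
import re

# Assumed spacing: the real file's column gaps are not preserved in the post.
line = "Fixed Assets    2,046    1,369"

# \w+ stops at the first space, so each group can hold only a single word,
# and \w cannot match the comma inside a quantity such as "2,046".
m = re.search(r"(\w+)\s{1}(\w+)*", line)
print(m.groups())
```

The two groups come back as "Fixed" and "Assets": the label is split in half and the quantity is never reached.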
Your second regex...
(.*?)[ ]{2,}(.*?)[ ]{2,}(.*?)[ ]{2,}(.*)
...is better because it does capture the groups of words, albeit too eagerly.
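Notes:
- You don't need capture groups around the leading and trailing parts of the line.
- You don't need brackets around the space character. Brackets denote a set of characters, but you have only one character in that set.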
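As a quick check of the second note, this sketch splits one line with and without the brackets. The sample line and its spacing are assumptions for illustration.

```python
import re

# Assumed spacing: four spaces between columns.
line = "Fixed Assets    2,046    1,369"

# A character class holding only a space behaves like a bare space.
with_class = re.split(r"[ ]{2,}", line)
without_class = re.split(r" {2,}", line)
print(with_class == without_class)
```

Both calls split the line into the same three fields, so the brackets can be dropped without changing behavior.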
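If you modify it slightly by removing the unnecessary capture groups...
.*? {2,}(.*?) {2,}(.*?) {2,}.*
...you can see that it captures the values under "Notes" and "2019", but it also aggressively captures unwanted text.
You could parse these matches and discard the unwanted ones in Python code. You don't have to do this with regex, but the regex can be made more precise.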
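One way the post-filtering approach might look is sketched below. The indentation and column gaps in `text` are assumed, and `looks_like_quantity` is a hypothetical helper written for this example, not something from the question.

```python
import re

# Assumed layout: the post does not preserve the original column spacing,
# so the indentation and four-space gaps here are illustrative.
text = (
    "    Notes    2019    2018\n"
    "    Fixed Assets    2,046    1,369\n"
    "    Creditors: amounts falling due within one year    (23,146)    (6,106)\n"
)

loose = re.compile(r".*? {2,}(.*?) {2,}(.*?) {2,}.*")

def looks_like_quantity(value):
    # Hypothetical helper: a rough check for digits and commas,
    # optionally wrapped in parentheses, e.g. "(23,146)".
    return value.strip("()").replace(",", "").isdigit()

raw = loose.findall(text)  # includes the ("Notes", "2019") header pair

# Discard the header and anything whose value is not a quantity.
filtered = {
    label: value
    for label, value in raw
    if label != "Notes" and looks_like_quantity(value)
}
print(filtered)
```

The filter works, but it duplicates in Python what a tighter pattern can express directly.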
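Your regex captures unwanted data because it needlessly matches any character with .*? when you actually want to restrict the match to a group of words followed by a formatted quantity.
Only the lines you care about actually follow this pattern.
Consider this: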
^ *((?:\S+ )+) {2,}(\(?[0-9,]+\)?).*$
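The regex above improves the pattern matching by:
- Anchoring the match to the start of the line ^ and the end of the line $ to prevent matching across multiple lines.
- Using (?:\S+ )+ with \S matching to capture "words" and punctuation (e.g. :).
- Using \(?[0-9,]+\)? to match a quantity made of digits and commas, optionally wrapped in parentheses.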
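The following sketch applies that regex in Python. It assumes `re.MULTILINE` so that ^ and $ anchor at each line, and it assumes four-space column gaps; neither detail is stated in the question.

```python
import re

PATTERN = re.compile(r"^ *((?:\S+ )+) {2,}(\(?[0-9,]+\)?).*$", re.MULTILINE)

# Assumed layout: indentation and column gaps are illustrative.
text = (
    "    Notes    2019    2018\n"
    "    £    £\n"
    "Fixed Assets    2,046    1,369\n"
    "Creditors: amounts falling due within one year    (23,146)    (6,106)\n"
    "For the year ending 31 May 2019 the company was entitled to exemption\n"
)

# Group 1 keeps the space that follows the last word, so strip it.
pairs = [(label.strip(), value) for label, value in PATTERN.findall(text)]
for pair in pairs:
    print(pair)
```

The currency line and the prose line are skipped, while the header pair ("Notes", "2019") still gets through. Note that because each word in group 1 consumes one trailing space, this pattern needs at least three spaces between the label and the quantity.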
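But even this returns the unwanted column headers "Notes" and "2019". You can use a negative lookahead...
(?!Notes)
...to prevent matching lines whose label contains the word "Notes".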
Final solution:
^ *((?:(?!Notes)\S+ )+) {2,}(\(?[0-9,]+\)?).*$
View it @ Regex101.com
You might find it instructive to view it as a syntax diagram: View it @ RegExper.com
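To tie this back to the goal of building a Python dictionary, here is a minimal sketch using the pattern with the negative lookahead. The `re.MULTILINE` flag, the `extract` function name, and the column spacing in `sample` are all assumptions made for this example.

```python
import re

FINAL = re.compile(
    r"^ *((?:(?!Notes)\S+ )+) {2,}(\(?[0-9,]+\)?).*$",
    re.MULTILINE,
)

def extract(text):
    # Hypothetical helper: map each label to its first quantity column.
    # Group 1 ends with a trailing space, so strip it from the key.
    return {m.group(1).strip(): m.group(2) for m in FINAL.finditer(text)}

# Assumed layout: the original column spacing is not preserved in the post.
sample = (
    "Micro-entity Balance Sheet as at 31 May 2019\n"
    "    Notes    2019    2018\n"
    "    £    £\n"
    "Fixed Assets    2,046    1,369\n"
    "Current Assets    53,790    24,799\n"
    "Creditors: amounts falling due within one year    (23,146)    (6,106)\n"
    "Net current assets (liabilities)    30,644    18,693\n"
    "Total assets less current liabilities    32,690    20,062\n"
    "Total net assets (liabilities)    32,690    20,062\n"
    "Capital and reserves    32,690    20,062\n"
    "Approved by the Board on 20 December 2019\n"
)

values = extract(sample)
for label, amount in values.items():
    print(f"{label!r}: {amount!r}")
```

The values are kept as strings, exactly as they appear in the file; converting them to numbers (for example, treating parentheses as negatives) would be a separate step.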