Given the content of a text file (below), I want to extract two values from each line that has the following pattern — capture groups indicated with [#] :
The goal is to capture the values under the "Notes" and "2019" columns in the text and put them into a Python dictionary.
I tried using the following regular expressions:
(\w+)\s{1}(\w+)*
(.*?)[ ]{2,}(.*?)[ ]{2,}(.*?)[ ]{2,}(.*)
Example text file:
Micro-entity Balance Sheet as at 31 May 2019
Notes 2019 2018
£ £
Fixed Assets 2,046 1,369
Current Assets 53,790 24,799
Creditors: amounts falling due within one year (23,146) (6,106)
Net current assets (liabilities) 30,644 18,693
Total assets less current liabilities 32,690 20,062
Total net assets (liabilities) 32,690 20,062
Capital and reserves 32,690 20,062
For the year ending 31 May 2019 the company was entities to exemption under section 477 of the
Companies Act 2006 relating to small companies
® The members have not required the company to obtain an audit in accordance with section 476 of
the Companies Act 2006.
® The director acknowledge their responsibilities for complying with the requirements of the
Companies Act 2006 with respect to accounting records and the preparation of accounts.
® The accounts have been prepared in accordance with the micro-entity provisions and delivered in
accordance with the provisions applicable to companies subject to the small companies regime.
Approved by the Board on 20 December 2019
And signed on their behalf by:
Director
This document was delivered using electronic communications and authenticated in accordance with the
registrar's rules relating to electronic form, authentication and manner of delivery under section 1072 of
the Companies Act 2006.
Example valid matches:
"Fixed Assets", "2,046"
"Current Assets", "53,790"
"Creditors: amounts falling due within one year", "(23,146)"
"Net current assets (liabilities)", "30,644"
"Total assets less current liabilities", "32,690"
"Total net assets (liabilities)", "32,690"
"Capital and reserves", "32,690"
Your first regular expression…
(\w+)\s{1}(\w+)*
…is insufficient because the two capture groups do not take into account the spaces between words in the first case or the quantity formatting in the second case.
Your second regular expression…
(.*?)[ ]{2,}(.*?)[ ]{2,}(.*?)[ ]{2,}(.*)
…is better because it does capture groups of words, although it matches far more than you want.
Notes:
- You do not need capture groups around the leading and trailing whitespace.
- You do not need brackets around the space character. The bracket indicates a set of characters, but you only have one character in the set.
If you modify it slightly by removing the unnecessary capture groups…
.*? {2,}(.*?) {2,}(.*?) {2,}.*
…you can see that it captures the values under "Notes" and "2019", but it also aggressively captures unwanted text.
You could parse through these matches and discard unwanted ones with Python code. You don't need a regular expression, but you can be more precise with it.
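As a rough sketch of that post-filtering idea, the snippet below splits each line on runs of spaces and discards rows whose second field is not a quantity. The names `TEXT` and `parse_rows` are made up for illustration, and the sample spacing (columns separated by two or more spaces) is an assumption, because the spacing is collapsed in the text as posted.

```python
import re

# Hypothetical sample; the column gaps are assumed, not taken from the real file.
TEXT = "\n".join([
    "Notes            2019     2018",
    "                 £        £",
    "Fixed Assets     2,046    1,369",
    "Creditors: amounts falling due within one year   (23,146)   (6,106)",
    "the Companies Act 2006.",
])

def parse_rows(text):
    """Map each row label to its first quantity, dropping everything else."""
    rows = {}
    for line in text.splitlines():
        # Split the line into columns on runs of two or more spaces.
        fields = re.split(r" {2,}", line.strip())
        if len(fields) < 2:
            continue  # prose lines have no column gap
        label, value = fields[0], fields[1]
        # A quantity is digits and commas, optionally wrapped in parentheses.
        digits = value.strip("()").replace(",", "")
        if label == "Notes" or not digits.isdigit():
            continue  # skip the header row and the currency-symbol row
        rows[label] = value
    return rows

print(parse_rows(TEXT))
```

This keeps the matching loose and moves the precision into ordinary Python checks, which is easier to read but also easier to get subtly wrong than a single anchored pattern.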
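Your regular expression captures unwanted data because you're unnecessarily matching any character with .*?, when you actually want to limit the matches to a label made of words separated by single spaces, followed by a gap of spaces and a quantity (digits and commas, optionally wrapped in parentheses).
Only the lines you care about actually follow this pattern.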
Consider this:
^ *((?:\S+ )+) {2,}(\(?[0-9,]+\)?).*$
The above regular expression improves the pattern matching in the following ways:
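- Start of line ^ and end of line $ to prevent matching multiple lines.
- (?:\S+ )+ is a non-capturing group that uses \S to capture "words" and punctuation (eg :), each followed by a single space.
- \(?[0-9,]+\)? captures a quantity: digits and commas, optionally wrapped in parentheses.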
But even this returns the unwanted column headers "Notes" and "2019". You can use a negative lookahead…
(?!Notes)
…to prevent matching the line that contains "Notes".
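A small check of that claim, under stated assumptions: the two rows in `ROWS` are hypothetical (the column spacing is invented, since it is collapsed in the post), and the lookahead is placed right after the leading-space match, which is one reasonable position rather than the only one.

```python
import re

# Hypothetical rows; the runs of spaces between columns are an assumption.
ROWS = "\n".join([
    "Notes          2019      2018",
    "Fixed Assets   2,046     1,369",
])

WITHOUT_LOOKAHEAD = re.compile(
    r"^ *((?:\S+ )+) {2,}(\(?[0-9,]+\)?).*$", re.MULTILINE
)
# Assumed placement: the lookahead follows the optional leading spaces.
WITH_LOOKAHEAD = re.compile(
    r"^ *(?!Notes)((?:\S+ )+) {2,}(\(?[0-9,]+\)?).*$", re.MULTILINE
)

def labels(pattern, text):
    """Return the label (group 1, trailing space stripped) of every match."""
    return [m.group(1).strip() for m in pattern.finditer(text)]

print(labels(WITHOUT_LOOKAHEAD, ROWS))  # header row is included
print(labels(WITH_LOOKAHEAD, ROWS))     # header row is excluded
```

Note that group 1 always ends with one space, because every "word" in the group is matched together with the space that follows it, so the label needs a `strip()` afterwards.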
Final solution:
^ *(?!Notes)((?:\S+ )+) {2,}(\(?[0-9,]+\)?).*$
View @ Regex101.com
You may find it educational to view it as a syntax diagram: View @ RegExper.com
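To get from the pattern to the Python dictionary the question asks for, something like the following could work. It is a sketch, not a definitive implementation: `SAMPLE` and `extract_values` are illustrative names, the column spacing in `SAMPLE` is assumed, and the pattern is the earlier expression with the (?!Notes) lookahead inserted after the leading-space match.

```python
import re

# Assumed layout: each label is followed by at least three spaces before the
# 2019 value. The real file's spacing is not visible in the post.
SAMPLE = "\n".join([
    "Micro-entity Balance Sheet as at 31 May 2019",
    "   Notes                                           2019       2018",
    "                                                   £          £",
    "Fixed Assets                                       2,046      1,369",
    "Current Assets                                     53,790     24,799",
    "Creditors: amounts falling due within one year     (23,146)   (6,106)",
    "Net current assets (liabilities)                   30,644     18,693",
    "Total assets less current liabilities              32,690     20,062",
    "Total net assets (liabilities)                     32,690     20,062",
    "Capital and reserves                               32,690     20,062",
    "Companies Act 2006 relating to small companies",
])

# re.MULTILINE makes ^ and $ match at the start and end of every line.
PATTERN = re.compile(
    r"^ *(?!Notes)((?:\S+ )+) {2,}(\(?[0-9,]+\)?).*$", re.MULTILINE
)

def extract_values(text):
    """Map each row label to the value in the first numeric column."""
    # Group 1 ends with a single space, so strip it from the label.
    return {label.strip(): value for label, value in PATTERN.findall(text)}

balance = extract_values(SAMPLE)
for label, value in balance.items():
    print(f"{label!r}: {value!r}")
```

The values stay as strings, exactly as in the example matches in the question. Converting them to numbers (for instance treating parentheses as negative) would be a separate step.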