
Regular expression to capture a group of words followed by a group of formatted quantities

Given the content of a text file (below), I want to extract two values from each line that has the following pattern — capture groups indicated with [#] :

  1. An unknown amount of leading whitespace…
  2. [1] a group of words (each separated by a single space)…
  3. two or more spaces…
  4. [2] a quantity represented by a string of numbers that may contain commas and may be wrapped in parentheses…
  5. two or more spaces…
  6. a quantity following the same pattern as the former
  7. an unknown amount of trailing whitespace.

The goal is to capture the values under the "Notes" and "2019" columns in the text and put them into a Python dictionary.

I tried using the following regular expressions:

(\w+)\s{1}(\w+)*

(.*?)[ ]{2,}(.*?)[ ]{2,}(.*?)[ ]{2,}(.*)

Example text file:

                                                    Micro-entity Balance Sheet as at 31 May 2019                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                                                                                                                  Notes            2019           2018                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                                                                                                                                         £              £                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                         Fixed Assets                                                                                             2,046          1,369                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                         Current Assets                                                                                         53,790         24,799                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                         Creditors: amounts falling due within one year                                                        (23,146)        (6,106)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                         Net current assets (liabilities)                                                                       30,644          18,693                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                         Total assets less current liabilities                                                                  32,690         20,062                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                         Total net assets (liabilities)                                                                         32,690         20,062                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                         Capital and reserves                                                                                   32,690         20,062                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                 For the year ending 31 May 2019 the company was entities to exemption under section 477 of the                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                 Companies Act 2006 relating to small companies                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
           ®     The members have not required the company to obtain an audit in accordance with section 476 of                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                 the Companies Act 2006.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
           ®     The director acknowledge their responsibilities for complying with the requirements of the                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                 Companies Act 2006 with respect to accounting                               records and the preparation of accounts.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
           ®     The accounts         have been prepared in accordance with the micro-entity provisions and delivered in                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                 accordance with the provisions applicable to companies subject to the small companies regime.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
       Approved by the Board on 20 December 2019                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
       And signed on their behalf by:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
       Director                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
       This document was delivered using electronic communications and authenticated in accordance with the                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
       registrar's rules relating to electronic form, authentication and manner of delivery under section 1072 of                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
       the Companies Act 2006.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    

Example valid matches:

"Fixed Assets", "2,046"
"Current Assets", "53,790"
"Creditors: amounts falling due within one year", "(23,146)"
"Net current assets (liabilities)", "30,644"
"Total assets less current liabilities", "32,690"
"Total net assets (liabilities)", "32,690"
"Capital and reserves", "32,690"

You're so close, but so far. Why?

Your first regular expression…

(\w+)\s{1}(\w+)*

…is insufficient because the two capture groups do not take into account the spaces between words in the first case or the quantity formatting in the second case.
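To see this concretely, here is a minimal sketch that runs the first pattern against two lines. The sample strings are shortened, hypothetical versions of lines from the example file, and the variable names are only for illustration.

```python
import re

# The first attempted pattern from the question.
FIRST_ATTEMPT = re.compile(r"(\w+)\s{1}(\w+)*")

# Hypothetical sample lines, shortened from the example file.
fixed = "   Fixed Assets      2,046      1,369   "
creditors = "   Creditors: amounts falling due within one year      (23,146)      (6,106)   "

# \w+ stops at the first space, so only two single words are captured
# and the quantity is never reached.
fixed_groups = FIRST_ATTEMPT.search(fixed).groups()

# "Creditors" is followed by ":", which \s cannot match, so the search
# skips ahead and pairs up the next two plain words instead.
creditors_groups = FIRST_ATTEMPT.search(creditors).groups()

print(fixed_groups)      # ('Fixed', 'Assets')
print(creditors_groups)  # ('amounts', 'falling')
```

Neither result contains the full label or the quantity.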

Your second regular expression…

(.*?)[ ]{2,}(.*?)[ ]{2,}(.*?)[ ]{2,}(.*)

…is better because it does capture the groups of words, but far too permissively.

Notes:

  1. You do not need capture groups around the leading and trailing whitespace.
  2. You do not need brackets around the space character. The bracket indicates a set of characters, but you only have one character in the set.

If you modify it slightly by removing the unnecessary capture groups…

.*? {2,}(.*?) {2,}(.*?) {2,}.*

…you can see that it captures the values under "Notes" and "2019", but it also aggressively captures unwanted text.

You could loop over these matches and discard the unwanted ones in Python code. A stricter regular expression isn't required for that, but it lets you be more precise.
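As a rough sketch of that filtering approach, assuming the loose pattern without the outer capture groups, you might do something like the following. The helper names looks_like_quantity and extract are hypothetical, the sample lines are shortened from the example file, and skipping the "Notes" label by name is an assumption about which header you want to drop.

```python
import re

# The loose pattern with the unnecessary capture groups removed.
LOOSE = re.compile(r".*? {2,}(.*?) {2,}(.*?) {2,}.*")

def looks_like_quantity(text):
    # Digits with optional commas, optionally wrapped in parentheses.
    wrapped = text.startswith("(") and text.endswith(")")
    core = text[1:-1] if wrapped else text
    return core.replace(",", "").isdigit()

def extract(lines):
    result = {}
    for line in lines:
        m = LOOSE.match(line)
        if not m:
            continue
        label, value = m.groups()
        # Discard the column header and anything that is not a quantity.
        if label != "Notes" and looks_like_quantity(value):
            result[label] = value
    return result

# Hypothetical, shortened sample lines.
sample = [
    "      Notes      2019      2018      ",
    "                 £         £      ",
    "   Fixed Assets      2,046      1,369      ",
    "   Creditors: amounts falling due within one year      (23,146)      (6,106)      ",
    "   For the year ending 31 May 2019 the company was exempt      ",
]
print(extract(sample))
```

This works, but all of the real validation now lives in Python rather than in the pattern.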

Your regular expression captures unwanted data because you're unnecessarily matching any character with .*? , when you actually want to limit the matches to:

  1. a group of words (each separated by a single space)
  2. a quantity represented by a string of numbers that may contain commas and may be wrapped in parentheses

Only the lines you care about actually follow this pattern.

Consider this:

^ *((?:\S+ )+) {2,}(\(?[0-9,]+\)?).*$

View @ Regex101.com

The above regular expression improves the pattern matching in the following ways:

  1. Explicitly match beginning of line ^ and end of line $ to prevent matching multiple lines.
  2. Use a non-capturing group to match one or more words followed by a single space: (?:\S+ )+
  3. Match non-whitespace characters with \S to capture "words" and punctuation (e.g. : ).
  4. Selectively match only a combination of one or more digits and commas optionally wrapped in parentheses with \(?[0-9,]+\)?
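A minimal sketch of using this pattern from Python is below. It assumes you search the whole file content at once, so re.MULTILINE is needed for ^ and $ to anchor at each line. The sample text is a shortened, hypothetical version of the example file. Because (?:\S+ )+ always ends on a space, the first group carries one trailing space, which is stripped here.

```python
import re

# re.MULTILINE makes ^ and $ match at the start and end of every line.
PATTERN = re.compile(r"^ *((?:\S+ )+) {2,}(\(?[0-9,]+\)?).*$", re.MULTILINE)

# Hypothetical, shortened sample of the example file.
text = (
    "      Notes      2019      2018      \n"
    "                 £         £      \n"
    "   Fixed Assets      2,046      1,369      \n"
    "   Creditors: amounts falling due within one year      (23,146)      (6,106)      \n"
    "   Companies Act 2006 relating to small companies      \n"
)

# Group 1 ends with a single space, so strip it from each label.
pairs = [(label.strip(), value) for label, value in PATTERN.findall(text)]
for pair in pairs:
    print(pair)
```

On this sample the paragraph line and the "£" line are rejected, but the header line still comes through as ("Notes", "2019").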

But even this returns the unwanted column headers "Notes" and "2019". You can use a negative lookahead… (?!Notes) …to prevent matching the line that contains "Notes".

Final solution:

^ *(?!Notes)((?:\S+ )+) {2,}(\(?[0-9,]+\)?).*$
View @ Regex101.com
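To get from the pattern to the Python dictionary you asked for, something along these lines should work. Treat it as a sketch: placing (?!Notes) immediately after the leading-space match is one reasonable position for the lookahead and is an assumption here, and the function name balance_sheet_values and the shortened sample text are hypothetical.

```python
import re

# Negative lookahead rejects a line whose first word starts with "Notes".
FINAL = re.compile(
    r"^ *(?!Notes)((?:\S+ )+) {2,}(\(?[0-9,]+\)?).*$",
    re.MULTILINE,
)

def balance_sheet_values(text):
    # Map each label (trailing space stripped) to its first quantity.
    return {m.group(1).strip(): m.group(2) for m in FINAL.finditer(text)}

# Hypothetical, shortened sample of the example file.
text = (
    "      Notes      2019      2018      \n"
    "                 £         £      \n"
    "   Fixed Assets      2,046      1,369      \n"
    "   Creditors: amounts falling due within one year      (23,146)      (6,106)      \n"
    "   Approved by the Board on 20 December 2019      \n"
)

values = balance_sheet_values(text)
print(values)
```

Note that the lookahead would also reject any real row whose label begins with "Notes", so adjust it if your files contain such rows.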

You may find it educational to view it as a syntax diagram: [image: regular expression syntax diagram of the suggested final solution] View @ RegExper.com

