Regular expression to capture a group of words followed by a group of formatted quantities

Given the content of a text file (below), I want to extract two values from each line that has the following pattern (capture groups are indicated with [#]):

  1. An unknown amount of leading whitespace…
  2. [1] a group of words (each separated by a single space)…
  3. two or more spaces…
  4. [2] a quantity represented by a string of numbers that may contain commas and may be wrapped in parentheses…
  5. two or more spaces…
  6. a quantity following the same pattern as the previous one…
  7. an unknown amount of trailing whitespace.

The goal is to capture the values under the "Notes" and "2019" columns in the text and put them into a Python dictionary.

I tried using the following regular expressions:

(\w+)\s{1}(\w+)*

(.*?)[ ]{2,}(.*?)[ ]{2,}(.*?)[ ]{2,}(.*)

Example text file:

                                                    Micro-entity Balance Sheet as at 31 May 2019                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                                                                                                                  Notes            2019           2018                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                                                                                                                                         £              £                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                         Fixed Assets                                                                                             2,046          1,369                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                         Current Assets                                                                                         53,790         24,799                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                         Creditors: amounts falling due within one year                                                        (23,146)        (6,106)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                         Net current assets (liabilities)                                                                       30,644          18,693                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                         Total assets less current liabilities                                                                  32,690         20,062                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                         Total net assets (liabilities)                                                                         32,690         20,062                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                         Capital and reserves                                                                                   32,690         20,062                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                 For the year ending 31 May 2019 the company was entities to exemption under section 477 of the                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                 Companies Act 2006 relating to small companies                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
           ®     The members have not required the company to obtain an audit in accordance with section 476 of                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                 the Companies Act 2006.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
           ®     The director acknowledge their responsibilities for complying with the requirements of the                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                 Companies Act 2006 with respect to accounting                               records and the preparation of accounts.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
           ®     The accounts         have been prepared in accordance with the micro-entity provisions and delivered in                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                 accordance with the provisions applicable to companies subject to the small companies regime.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
       Approved by the Board on 20 December 2019                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
       And signed on their behalf by:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
       Director                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
       This document was delivered using electronic communications and authenticated in accordance with the                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
       registrar's rules relating to electronic form, authentication and manner of delivery under section 1072 of                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
       the Companies Act 2006.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    

Example valid matches:

"Fixed Assets", "2,046"
"Current Assets", "53,790"
"Creditors: amounts falling due within one year", "(23,146)"
"Net current assets (liabilities)", "30,644"
"Total assets less current liabilities", "32,690"
"Total net assets (liabilities)", "32,690"
"Capital and reserves", "32,690"

You're so close, but so far. Why?

Your first regular expression…

(\w+)\s{1}(\w+)*

…is insufficient because its two capture groups do not account for the spaces between words in the first group or for the quantity formatting (commas and parentheses) in the second.
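To see the problem concretely, here is a minimal sketch that runs that first pattern over a single row with Python's re module. The sample line is shortened from the balance sheet, and the variable names are only for illustration.

```python
import re

# One data row, shortened from the example text file (assumed spacing).
line = "   Fixed Assets        2,046        1,369   "

first_attempt = re.compile(r"(\w+)\s{1}(\w+)*")

# \w matches neither spaces nor commas, so the label is split word by word
# and each quantity is cut off at its comma.
matches = first_attempt.findall(line)
print(matches)  # [('Fixed', 'Assets'), ('046', ''), ('369', '')]
```

The label ends up spread across both groups and the quantities lose everything before the comma, so no amount of post-processing recovers the two values reliably.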

Your second regular expression…

(.*?)[ ]{2,}(.*?)[ ]{2,}(.*?)[ ]{2,}(.*)

…is better because it does capture the groups of words, but it is far too permissive.

Notes:

  1. You do not need capture groups around the leading and trailing whitespace.
  2. You do not need brackets around the space character. The brackets indicate a set of characters, but you only have one character in the set.

If you modify it slightly by removing the unnecessary capture groups…

.*? {2,}(.*?) {2,}(.*?) {2,}.*

…you can see that it captures the values under "Notes" and "2019", but it also captures a lot of unwanted text.

You could go through these matches and discard the unwanted ones with Python code. That works without changing the regular expression, but the pattern itself can also be made more precise.
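As a rough sketch of the "filter in Python" route, the loose pattern can be applied line by line and a match kept only when its second group looks like a quantity. The helper names (looks_like_quantity, extract), the skip_labels parameter, and the shortened sample lines are assumptions made for this illustration.

```python
import re

# The loose pattern, applied one line at a time.
LOOSE = re.compile(r".*? {2,}(.*?) {2,}(.*?) {2,}.*")

def looks_like_quantity(text):
    """True for digit strings that may contain commas and wrapping parentheses."""
    return text.strip("()").replace(",", "").isdigit()

def extract(lines, skip_labels=("Notes",)):
    """Keep only matches whose second group is a quantity (hypothetical helper)."""
    result = {}
    for line in lines:
        match = LOOSE.match(line)
        if match is None:
            continue
        label, quantity = match.groups()
        # The header row has to be skipped by name: "2019" passes the digit check.
        if label in skip_labels or not looks_like_quantity(quantity):
            continue
        result[label] = quantity
    return result

sample = [
    "      Micro-entity Balance Sheet as at 31 May 2019      ",
    "                                   Notes      2019      2018   ",
    "   Fixed Assets                              2,046     1,369   ",
    "   Creditors: amounts falling due within one year      (23,146)    (6,106)   ",
    "   ®     The accounts         have been prepared in accordance   ",
]

result = extract(sample)
print(result)
```

This works, but the rules about what counts as a valid row now live in two places, the pattern and the filter, which is the main argument for tightening the pattern instead.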

Your regular expression captures unwanted data because you're matching any character with .*? when you actually want to limit the matches to:

  1. a group of words (each separated by a single space)
  2. a quantity represented by a string of numbers that may contain commas and may be wrapped in parentheses

Only the lines you care about actually follow this pattern.

Consider this:

^ *((?:\S+ )+) {2,}(\(?[0-9,]+\)?).*$

View @ Regex101.com

The above regular expression improves the pattern matching in the following ways:

  1. Explicitly match the beginning of the line with ^ and the end of the line with $ to prevent matching across multiple lines. When searching the whole text at once, this requires multiline mode (re.MULTILINE in Python).
  2. Use a non-capturing group to match one or more words, each followed by a single space: (?:\S+ )+
  3. Match non-whitespace characters with \S so that "words" can include punctuation (e.g. : ).
  4. Selectively match only a combination of one or more digits and commas, optionally wrapped in parentheses, with \(?[0-9,]+\)?
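A short sketch of this pattern in Python, assuming the file has already been read into one string. The sample text is abbreviated from the balance sheet, and the trailing space that group 1 keeps from its last word is stripped afterwards.

```python
import re

# re.MULTILINE makes ^ and $ match at every line boundary, not just at the
# start and end of the whole string.
PATTERN = re.compile(r"^ *((?:\S+ )+) {2,}(\(?[0-9,]+\)?).*$", re.MULTILINE)

text = "\n".join([
    "        Notes        2019        2018   ",
    "   Fixed Assets                2,046       1,369   ",
    "   Net current assets (liabilities)     30,644      18,693   ",
    "   Companies Act 2006 relating to small companies   ",
])

# Group 1 ends with the single space that follows its last word, so strip it.
rows = [(label.strip(), quantity) for label, quantity in PATTERN.findall(text)]
print(rows)
# [('Notes', '2019'), ('Fixed Assets', '2,046'),
#  ('Net current assets (liabilities)', '30,644')]
```

The prose line is rejected because nothing in it is separated by two or more spaces and followed by a quantity, while the column header row still gets through.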

But even this returns the unwanted column headers "Notes" and "2019". You can use a negative lookahead, (?!Notes), to prevent matching the line that contains "Notes".

Final solution:

^ *((?:(?!Notes)\S+ )+) {2,}(\(?[0-9,]+\)?).*$
View @ Regex101.com
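Since the goal was a Python dictionary, here is one way the final pattern might be used. This is a sketch under a few assumptions: the file content is already in a string, the sample below is a whitespace-trimmed copy of the example file, and parse_balance_sheet is a name made up for this illustration.

```python
import re

# The literal parentheses around the quantity are escaped: \( and \).
PATTERN = re.compile(
    r"^ *((?:(?!Notes)\S+ )+) {2,}(\(?[0-9,]+\)?).*$",
    re.MULTILINE,
)

SAMPLE = """\
            Micro-entity Balance Sheet as at 31 May 2019
                                                  Notes        2019        2018
                                                                  £           £
   Fixed Assets                                               2,046       1,369
   Current Assets                                            53,790      24,799
   Creditors: amounts falling due within one year          (23,146)     (6,106)
   Net current assets (liabilities)                          30,644      18,693
   Total assets less current liabilities                     32,690      20,062
   Total net assets (liabilities)                            32,690      20,062
   Capital and reserves                                      32,690      20,062
   Companies Act 2006 relating to small companies
   Approved by the Board on 20 December 2019
"""

def parse_balance_sheet(text):
    """Map each row label to the quantity in its first column (as strings)."""
    # Group 1 keeps the space after its last word, so strip the label.
    return {m.group(1).strip(): m.group(2) for m in PATTERN.finditer(text)}

balances = parse_balance_sheet(SAMPLE)
for label, quantity in balances.items():
    print(f"{label!r}: {quantity!r}")
```

Note that the lookahead rejects any word that starts with "Notes", wherever it appears in the label, so a row label containing such a word would be skipped as well.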

You may find it educational to view it as a syntax diagram:

[Image: regular expression syntax diagram of the proposed final solution]

View @ RegExper.com
