
Regular expression to capture a group of words followed by a group of formatted quantities

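Given the contents of a text file (below), I want to extract two values from every line that has the following pattern, with capture groups indicated by [#]:

  1. an unknown number of leading spaces...
  2. [1] a group of words (each word separated by a single space)...
  3. two or more spaces...
  4. [2] a quantity represented by a string of digits, which may contain commas and may be enclosed in parentheses...
  5. two or more spaces...
  6. a quantity that follows the same pattern as the previous one...
  7. an unknown number of trailing spaces.

The goal is to capture the values under the "Notes" and "2019" columns of the text and put them into a Python dictionary.

I tried using the following regular expressions: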

(\w+)\s{1}(\w+)*

(.*?)[ ]{2,}(.*?)[ ]{2,}(.*?)[ ]{2,}(.*)

Sample text file:

                                                    Micro-entity Balance Sheet as at 31 May 2019                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                                                                                                                  Notes            2019           2018                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                                                                                                                                         £              £                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                         Fixed Assets                                                                                             2,046          1,369                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                         Current Assets                                                                                         53,790         24,799                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                         Creditors: amounts falling due within one year                                                        (23,146)        (6,106)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                         Net current assets (liabilities)                                                                       30,644          18,693                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                         Total assets less current liabilities                                                                  32,690         20,062                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                         Total net assets (liabilities)                                                                         32,690         20,062                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                         Capital and reserves                                                                                   32,690         20,062                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                 For the year ending 31 May 2019 the company was entities to exemption under section 477 of the                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                 Companies Act 2006 relating to small companies                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
           ®     The members have not required the company to obtain an audit in accordance with section 476 of                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                 the Companies Act 2006.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
           ®     The director acknowledge their responsibilities for complying with the requirements of the                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                 Companies Act 2006 with respect to accounting                               records and the preparation of accounts.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
           ®     The accounts         have been prepared in accordance with the micro-entity provisions and delivered in                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                 accordance with the provisions applicable to companies subject to the small companies regime.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
       Approved by the Board on 20 December 2019                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
       And signed on their behalf by:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
       Director                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
       This document was delivered using electronic communications and authenticated in accordance with the                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
       registrar's rules relating to electronic form, authentication and manner of delivery under section 1072 of                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
       the Companies Act 2006.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    

Sample valid matches:

"Fixed Assets", "2,046"
"Current Assets", "53,790"
"Creditors: amounts falling due within one year", "(23,146)"
"Net current assets (liabilities)", "30,644"
"Total assets less current liabilities", "32,690"
"Total net assets (liabilities)", "32,690"
"Capital and reserves", "32,690"

You are so close, and yet so far. Why?

Your first regular expression...

(\w+)\s{1}(\w+)*

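...is insufficient, because the two capture groups account for neither the spaces between words in the first case nor the formatting of the quantities in the second.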
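To make that concrete, here is a small sketch of what the first pattern returns for one line shaped like the sample data (whitespace shortened). The variable names are illustrative only.

```python
import re

# The first attempted pattern from the question.
first_attempt = re.compile(r"(\w+)\s{1}(\w+)*")

# One line in the shape of the sample file (whitespace shortened).
line = "   Total assets less current liabilities     32,690     20,062   "

match = first_attempt.search(line)

# The match stops after the second word: the label is split across
# both groups and the quantity is never reached.
print(match.group(0))   # the text the pattern consumed
print(match.groups())   # the two capture groups
```

Neither group holds the full label, and no group holds a quantity.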

Your second regular expression...

(.*?)[ ]{2,}(.*?)[ ]{2,}(.*?)[ ]{2,}(.*)

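...is better, because it does capture the group of words, albeit too eagerly.

Notes:

  1. You don't need capture groups around the leading and trailing whitespace.
  2. You don't need brackets around the space character. Brackets denote a set of characters, but you only have one character in that set.

If you modify it slightly by removing the unnecessary capture groups...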

.*? {2,}(.*?) {2,}(.*?) {2,}.*

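...you can see that it captures the values under "Notes" and "2019", but it also aggressively captures unwanted text.

You could parse these matches and discard the unwanted ones with Python code. Regular expressions aren't required for that step, but you can be more precise with one.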
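The post-filtering approach could be sketched roughly as follows. This is only an illustration under assumptions: the sample lines are shortened stand-ins for the file, and the names `loose`, `quantity`, and `values` are made up for this example.

```python
import re

# Shortened stand-ins for lines of the sample file.
sample = "\n".join([
    "                    Notes     2019      2018   ",
    "   Fixed Assets                2,046     1,369   ",
    "   Creditors: amounts falling due within one year     (23,146)   (6,106)   ",
    "   the Companies Act 2006.     ",
])

# The loose pattern from above, plus a stricter check applied afterwards.
loose = re.compile(r".*? {2,}(.*?) {2,}(.*?) {2,}.*")
quantity = re.compile(r"\(?[0-9,]+\)?")

values = {}
for line in sample.splitlines():
    m = loose.fullmatch(line)
    if m is None:
        continue
    label, amount = m.group(1), m.group(2)
    # Discard the column header and anything whose second group
    # does not look like a formatted quantity.
    if label == "Notes" or not quantity.fullmatch(amount):
        continue
    values[label] = amount

print(values)
```

This works, but the filtering logic ends up duplicating what a tighter pattern can express directly.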

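Your regular expression captures unwanted data because you are needlessly matching any character with .*? when you actually want to restrict the match to:

  1. a group of words (each word separated by a single space)
  2. a quantity represented by a string of digits, which may contain commas and may be enclosed in parentheses

Only the lines you care about actually follow this pattern.

Consider this: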

^ *((?:\S+ )+) {2,}(\(?[0-9,]+\)?).*$

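View @ Regex101.com

The above regular expression improves the pattern matching by:

  1. Explicitly matching the start of line ^ and end of line $ to prevent matching across multiple lines (in Python, this needs the re.MULTILINE flag when the whole file is searched at once).
  2. Using a non-capturing group to match one or more words, each followed by a single space: (?:\S+ )+
  3. Matching non-whitespace characters with \S so that punctuation attached to a "word" (e.g. ":") is captured.
  4. Selectively matching only a combination of one or more digits and commas, optionally enclosed in parentheses: \(?[0-9,]+\)?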
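The same pattern can be written out piece by piece with `re.VERBOSE`, which may make the four points above easier to follow. This is a sketch, not a definitive implementation: the sample lines are shortened stand-ins for the file, and the names `pattern` and `found` are illustrative. Note that verbose mode ignores bare spaces in the pattern, so here the space does have to be written as `[ ]`.

```python
import re

pattern = re.compile(
    r"""
    ^[ ]*               # start of line, then any leading spaces
    ((?:\S+[ ])+)       # group 1: one or more words, each followed by one space
    [ ]{2,}             # at least two more spaces
    (\(?[0-9,]+\)?)     # group 2: digits and commas, optionally in parentheses
    .*$                 # the rest of the line (second quantity, trailing spaces)
    """,
    re.VERBOSE | re.MULTILINE,
)

# Shortened stand-ins for lines of the sample file.
sample = "\n".join([
    "                    Notes     2019      2018   ",
    "                              £         £   ",
    "   Fixed Assets                2,046     1,369   ",
    "   Net current assets (liabilities)      30,644   18,693   ",
    "   the Companies Act 2006.     ",
])

# Group 1 ends with the single space that follows the last word,
# so the label is stripped before use.
found = [(label.strip(), amount) for label, amount in pattern.findall(sample)]
print(found)
```

Because group 1 swallows one space after the last word, the pattern effectively needs three or more spaces between the label and the quantity, which the sample data satisfies.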

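But even this returns the unwanted column headers "Notes" and "2019". You can use a negative lookahead... (?!Notes) ...to prevent matching lines whose label contains a word starting with "Notes".

Final solution: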

^ *((?:(?!Notes)\S+ )+) {2,}(\(?[0-9,]+\)?).*$
View @ Regex101.com
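A minimal sketch of how the final pattern might be used to build the Python dictionary the question asks for. The function names `extract_values` and `parse_amount` are hypothetical, the sample lines are shortened stand-ins for the file, and treating a parenthesised amount as negative is an assumption based on common accounting notation, not something stated in the question.

```python
import re

# The final pattern; re.MULTILINE lets ^ and $ match at every line.
FINAL = re.compile(
    r"^ *((?:(?!Notes)\S+ )+) {2,}(\(?[0-9,]+\)?).*$",
    re.MULTILINE,
)

def extract_values(text):
    """Map each label to the quantity in the first numeric column."""
    # Group 1 keeps one trailing space, so strip it from the label.
    return {m.group(1).strip(): m.group(2) for m in FINAL.finditer(text)}

def parse_amount(raw):
    """Hypothetical helper: '(23,146)' -> -23146 (assumed convention)."""
    negative = raw.startswith("(") and raw.endswith(")")
    value = int(raw.strip("()").replace(",", ""))
    return -value if negative else value

# Shortened stand-ins for lines of the sample file.
sample = "\n".join([
    "              Micro-entity Balance Sheet as at 31 May 2019     ",
    "                    Notes     2019      2018   ",
    "                              £         £   ",
    "   Fixed Assets                2,046     1,369   ",
    "   Creditors: amounts falling due within one year     (23,146)   (6,106)   ",
    "   Capital and reserves          32,690   20,062   ",
    "   Approved by the Board on 20 December 2019     ",
])

values = extract_values(sample)
print(values)
print({label: parse_amount(raw) for label, raw in values.items()})
```

To run it against the real file, the text could be read with `open(path).read()` and passed to `extract_values`.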

You might find it instructive to view the proposed final solution as a syntax diagram. View @ RegExper.com

