
Python: Extracting multiple lines between RegEx Matches

Good evening,

I am converting a PDF into CSV using Python and am using RegEx to extract the information.

The raw text, after extracting text from PDF, could look like this:

Account Transaction Details
Twin Account   123-456-789-1
Date Description Withdrawals Deposits Balance
01 Jan BALANCE B/F 123,456.78  
03 Jan Funds Transfer 195.04 123,456.78  
mBK-4653112690
03 Jan Inward Credit-QUICK 3,000.84 123,456.78  
WIRE OTHR
ANTON HARLEY
Other
03 Jan Funds Trf - SPEED 3,500.00 123,345.78  
PIB8452145632845963
Abricot 480
OTHR Transfer

I used a RegEx [0-3]{1}[0-9]{1}\s[A-Z]{1}[a-z]{2}\s[?A-Za-z]{1,155} and managed to get the needed transactions:

01 Jan BALANCE B/F 123,456.78
03 Jan Funds Transfer 195.04 123,456.78
03 Jan Inward Credit-QUICK 3,000.84 123,456.78
03 Jan Funds Trf - SPEED 3,500.00 123,345.78

However, the additional information between the matches was dropped, because I split the text on \n and then ran the RegEx on each line separately.

How do I code such that I get the additional information that is in-between the RegEx matches, and the additional info is tagged to the previous match? This is my envisaged output:

01 Jan BALANCE B/F 123,456.78
03 Jan Funds Transfer 195.04 123,456.78 mBK-4653112690
03 Jan Inward Credit-QUICK 3,000.84 123,456.78 OTHR ANTON HARLEY Other
03 Jan Funds Trf - SPEED 3,500.00 123,345.78 PIB8452145632845963 Abricot 480 OTHR Transfer

Edit:

I have adapted @dcsuka's solution and got the following:

06 Jan Debit-Consumer 12.60 123,456.78   SNIP AVENU13568100 4265884035605848

06 Jan Inward DR - 828.24 123,456.78   SHIP G12345HUJ ITX

07 Jan Funds Transfer 50.00 123,456.78   Pleasenotethatyouareboundbyadutyundertherulesgoverningtheoperationofthisaccount,tochecktheentriesintheabovestatement. Ifyoudonotnotifyusinwritingofanyerrors, omissionsorunauthoriseddebitswithinfourteen(14)daysofthisstatement,theentriesaboveshallbedeemedvalid,correct,accurateandconclusivelybindinguponyou,andyoushallhaveno claim against the bank in relation thereto. XYZ Ltd  •  80 QuincyPlace ABC Plaza XXX 12345  •  Co. Reg. No. 1234567890Z  •  GST Reg. No. YY-8121234-2  •   www.xyzabc.com

07 Jan Inward CR - SPEED 9,092.06 123,456.78   SALAD SALAS Payment CARL QWE 817264950

How do I remove the excess words " Pleasenotethatyouareboundbyadut... " The only pattern I can see is that it would be a very long word (probably more than 20 characters). Is that the way to go?

Edit2:

@dcsuka has adjusted the code to help remove the 'noise', based on words of more than 20 characters. Thank you dcsuka!
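The adjusted code is not reproduced in this post. As a minimal sketch of the idea only, assuming that everything from the first over-long word onward is footer text, a row could be cut at that word. The name strip_footer and the max_word_len parameter are hypothetical, and the footer in the sample row is abbreviated:

```python
def strip_footer(row, max_word_len=20):
    """Keep the words of `row` up to the first one longer than max_word_len."""
    kept = []
    for word in row.split():
        if len(word) > max_word_len:
            break  # assumption: the rest of the row is statement footer
        kept.append(word)
    return " ".join(kept)


row = (
    "07 Jan Funds Transfer 50.00 123,456.78   "
    "Pleasenotethatyouareboundbyadutyundertherulesgoverningtheoperationofthisaccount,"
    "tochecktheentriesintheabovestatement. XYZ Ltd"
)
print(strip_footer(row))  # 07 Jan Funds Transfer 50.00 123,456.78
```

Cutting at the first long word, rather than dropping only the long words, also removes the short footer words that follow it (such as "XYZ Ltd"). References such as PIB8452145632845963 (19 characters) stay below the threshold, but the cut-off may need tuning for other statements.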

You can try using a positive lookahead for a number after newline when you split the string, to get bigger chunks more reflective of your expected output:

import re

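# text1 holds the raw text extracted from the PDF; split only before lines that start with a number.
split_text = re.split(r"\n(?=\d{1,3}\s)", text1)

# Keep chunks that start with a two-digit day and collapse runs of whitespace.
[" ".join(i.split()) for i in split_text if re.search(r"^\d\d\s", i)]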

# ['01 Jan BALANCE B/F 123,456.78',
#  '03 Jan Funds Transfer 195.04 123,456.78 mBK-4653112690',
#  '03 Jan Inward Credit-QUICK 3,000.84 123,456.78 WIRE OTHR ANTON HARLEY Other',
#  '03 Jan Funds Trf - SPEED 3,500.00 123,345.78 PIB8452145632845963 Abricot 480 OTHR Transfer']
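If you would rather keep splitting on \n as in the question, a rough alternative is to walk the lines and append every non-date line to the most recent transaction. This is only a sketch: DATE_LINE and group_transactions are hypothetical names, and it assumes every transaction line starts with a two-digit day followed by a three-letter month.

```python
import re

# Assumed shape of a transaction line: "03 Jan ..." (two digits, three-letter month).
DATE_LINE = re.compile(r"^\d{2}\s[A-Z][a-z]{2}\s")


def group_transactions(text):
    """Attach continuation lines to the transaction line that precedes them."""
    rows = []
    for line in text.split("\n"):
        line = " ".join(line.split())  # collapse runs of whitespace
        if not line:
            continue
        if DATE_LINE.match(line):
            rows.append(line)          # a new transaction starts here
        elif rows:
            rows[-1] += " " + line     # extra info belongs to the previous match
        # lines before the first transaction (the headers) are skipped
    return rows


sample = """Date Description Withdrawals Deposits Balance
01 Jan BALANCE B/F 123,456.78
03 Jan Funds Transfer 195.04 123,456.78
mBK-4653112690"""
print(group_transactions(sample))
# ['01 Jan BALANCE B/F 123,456.78',
#  '03 Jan Funds Transfer 195.04 123,456.78 mBK-4653112690']
```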

I have looked at this again after gaining more knowledge of regex.

As @dcsuka suggested, I needed a positive lookahead, so that my regex does not consume the delimiter I set at the end of the pattern (the start of the next transaction).

This was the code I used:

line_re = re.compile(r'(^[0-9]{2}) ([A-Z]{1}[a-z]{2}) (.*?)(?=\n[0-9]{2} [A-Z]{1}[a-z]{2}|[A-Za-z]{15,})', flags=re.M | re.S)

First, I grouped them into:

  1. Date (day) using (^[0-9]{2}) , with the '^' to indicate the start of a line, since the day is always 2 digits (01 or 11)
  2. Month using ([A-Z]{1}[a-z]{2}) , since the month would be Dec/ Jan/ Feb...
  3. My main capture using (.*?) , which is the description (and everything after it) in this case
  4. The stopping point using the lookahead (?=\n[0-9]{2} [A-Z]{1}[a-z]{2}|[A-Za-z]{15,}) , which ends the capture just before the next line that starts with a date and month, or before a run of 15 or more letters (such as the long footer words)
  5. Lastly, I set flags=re.M | re.S : re.M (multi-line) lets '^' match at the start of every line, and re.S (single-line, also called DOTALL) lets '.' match newlines, so that (.*?) can run across several lines.

Once done, I called line_re.findall() on the extracted text to collect all the matches.
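Put together, the steps above could look roughly like the sketch below, run on the sample text from the question (trailing spaces omitted). One change is my own assumption and not part of the pattern above: a \Z alternative in the lookahead, so that the last transaction is still captured when neither another date line nor a long footer word follows it.

```python
import re

# Sample statement text from the question.
text = """Account Transaction Details
Twin Account   123-456-789-1
Date Description Withdrawals Deposits Balance
01 Jan BALANCE B/F 123,456.78
03 Jan Funds Transfer 195.04 123,456.78
mBK-4653112690
03 Jan Inward Credit-QUICK 3,000.84 123,456.78
WIRE OTHR
ANTON HARLEY
Other
03 Jan Funds Trf - SPEED 3,500.00 123,345.78
PIB8452145632845963
Abricot 480
OTHR Transfer"""

line_re = re.compile(
    r"(^[0-9]{2}) ([A-Z]{1}[a-z]{2}) (.*?)"
    # Stop before the next date line, a 15+ letter word, or (assumption) end of text.
    r"(?=\n[0-9]{2} [A-Z]{1}[a-z]{2}|[A-Za-z]{15,}|\Z)",
    flags=re.M | re.S,
)

# findall returns one (day, month, rest) tuple per transaction;
# split/join folds the embedded newlines into single spaces.
rows = [
    " ".join(f"{day} {month} {rest}".split())
    for day, month, rest in line_re.findall(text)
]

for row in rows:
    print(row)
# 01 Jan BALANCE B/F 123,456.78
# 03 Jan Funds Transfer 195.04 123,456.78 mBK-4653112690
# 03 Jan Inward Credit-QUICK 3,000.84 123,456.78 WIRE OTHR ANTON HARLEY Other
# 03 Jan Funds Trf - SPEED 3,500.00 123,345.78 PIB8452145632845963 Abricot 480 OTHR Transfer
```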

Hope this helps.
