简体   繁体   中英

Search for a partial string in text and extract numbers from lines above and below the matched pattern

I am trying to write a regular expression that will search based on a string and if it founds even a partial match. I can get extract numbers from lines (2 lines) above and below the matched string or substring.

My text is:

Subtotal AED1,232.20
AED61.61
VAT
5 % Tax:
RECEIPT TOTAL: AED1.293.81

I wish to search for the word VAT and extract all numbers from two lines above and below it.

Expected output:

AED1,232.20
AED61.61
5 % 
AED1.293.81

I am able to extract the entire content but I need the numbers, AED can be dropped or ignored.

My regex is:

((.*\n){2}).*vat(.*\n.*\n.*)

Thanks in advance!

try this:

(?:[a-zA-Z:]*([0-9,.]+)[a-zA-Z:]*)\n(?:[a-zA-Z:]*([0-9,.]+)[a-zA-Z:]*)\nVAT\n(?:[a-zA-Z:]*([0-9,.]+)[a-zA-Z:]*).*\n[^0-9]*(?:[a-zA-Z:]*([0-9,.]+)[a-zA-Z:]*)

This regex can seem too complex or long, but it has better control and returns only numbers, it will be his work.

Regex Demo

You may use this regex in python :

((?:^.*\d.*\n){0,2})VAT((?:\n.*\d.*){0,2})

RegEx Demo

RegEx Details:

  • ((?:^.*\d.*\n){0,2}) : Match 2 leading lines that must contain at least a digit
  • VAT : match text VAT
  • ((?:\n.*\d.*){0,2}) : Match 2 trailing lines that must contain at least a digit

This regex is tailor-made for your input text and expected output:

r'.* (AED\d{1,3}(?:,\d{3})*\.\d{2})\n(AED\d{1,3}(?:,\d{3})*\.\d{2})\nVAT\n(\d{1,2} %) Tax:\n.* (AED\d{1,3}(?:,\d{3})*\.\d{2})'

Your required Regex

It outputs exactly the text you want, without extra words.

It also works with more than one "VAT" in your input text.

Regex Logics:

  • (AED\d{1,3}(?:,\d{3})*\.\d{2}) Match currency code and amount (in one group)
  • (\d{1,2} %) Match VAT %. Supports 1 to 2 digits. You can further enhance it to support decimal point.

Note that the proper regex for currency amount (with comma as thousand separator and exactly 2 decimal points) should be as follows:

r'\d{1,3}(?:,\d{3})*\.\d{2}'

[with (?: expr) to indicate nontagged group so that this subgroup will not be tagged as a match for your re.findall function call.]

In case your input supports currency codes other than 'AED', you can replace 'AED' with [AZ]{3} as currency code should normally be in 3-character capital letters.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM