How to match part of string using regex in python?

Here's the string:

SCOPE OF WORK: Supply &  Flensburg House, MMDA Colony,     PAN#: AAYCS8310G
installation Arumbakkam,Chennai,Tamil Nadu,
  xxxxxx

The things that will change in the string are:

Flensburg House, MMDA Colony,

and

Arumbakkam,Chennai,Tamil Nadu,

These parts of the string can contain letters, numbers, commas, #, - and _.

The remaining parts of the string will stay exactly as they are, including the spacing.

Here's the regex I am using

SCOPE OF WORK: Supply &  [A-Za-z,\s]]*PAN#: [A-Z]{5}[0-9]{4}[A-Z]{1}\n    installation [A-Za-z]\n      xxxxxx

Ultimately what I need to obtain is:

Flensburg House, MMDA Colony,     
installation Arumbakkam,Chennai,Tamil Nadu,

I don't think my regex is entirely right and I need help on how to go about this.

A few things I noticed about your current pattern:

  • You are trying to match more space characters than are present in the text;
  • The character classes for the two substrings differ. The 2nd one is missing the space and the comma and has no quantifier, so it matches only a single character. Both are also currently missing the # symbol and digits (as well as - and _);
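To see the effect of these issues, here is a minimal sketch that runs the pattern from the question, unchanged, against the sample text. The variable names (`original`, `original_match`) are only for illustration. Note that the pattern also contains a stray `]` after the first character class, which Python treats as a literal character rather than as an error.

```python
import re

s = """SCOPE OF WORK: Supply &  Flensburg House, MMDA Colony,     PAN#: AAYCS8310G
installation Arumbakkam,Chennai,Tamil Nadu,
  xxxxxx"""

# The pattern from the question, copied as-is (split over two raw strings only for width).
original = (r'SCOPE OF WORK: Supply &  [A-Za-z,\s]]*PAN#: '
            r'[A-Z]{5}[0-9]{4}[A-Z]{1}\n    installation [A-Za-z]\n      xxxxxx')

# The class matches a single character ("F"), the stray "]" matches zero times,
# and then "PAN#" is expected where the text still has "lensburg ...".
original_match = re.search(original, s)
print(original_match)  # None
```

The pattern compiles without complaint, so the only symptom is that no match is found.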

Assuming you just need to get these two substrings in capture groups (excluding the trailing comma), try:

^SCOPE OF WORK: Supply &  ([\w, #-]+),\s+PAN#: [A-Z]{5}[0-9]{4}[A-Z]\s+installation ([\w, #-]+),\s+x{6}$


  • ^ - Start-line anchor;
  • SCOPE OF WORK: Supply &  - A literal match of this substring, including the two trailing spaces;
  • ([\w, #-]+) - A 1st capture group matching 1+ characters from the given class, where \w is shorthand for [A-Za-z0-9_] (in Python 3 str patterns it also matches non-ASCII word characters unless re.ASCII is set). Together with the comma, space, # and -, this covers all the characters you mentioned;
  • ,\s+PAN#:  - The comma that ends the first part, 1+ whitespace characters, then a literal match of PAN#: and its trailing space;
  • [A-Z]{5}[0-9]{4}[A-Z] - Verifies that what follows is 5 uppercase letters, 4 digits and a single uppercase letter (no need to quantify a single character);
  • \s+installation  - 1+ whitespace characters (including the newline), up to the literal word installation and its trailing space;
  • ([\w, #-]+) - A 2nd capture group to match the same pattern as 1st group;
  • ,\s+x{6} - Match the trailing comma, 1+ whitespace characters and 6 trailing x's;
  • $ - End-line anchor.
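The same breakdown can be kept next to the pattern itself by compiling it with re.VERBOSE. This is only a sketch of an equivalent way to write the pattern, not a different technique; the names `PATTERN` and `verbose_result` are illustrative. In verbose mode, whitespace outside a character class is ignored and an unescaped # starts a comment, so literal spaces are written as `[ ]` and the # in PAN# is escaped.

```python
import re

PATTERN = re.compile(r"""
    ^SCOPE[ ]OF[ ]WORK:[ ]Supply[ ]&[ ]{2}   # fixed prefix, ending in two spaces
    ([\w, #-]+)                              # group 1: first variable part
    ,\s+PAN\#:[ ]                            # closing comma, whitespace, fixed label
    [A-Z]{5}[0-9]{4}[A-Z]                    # 5 uppercase letters, 4 digits, 1 uppercase letter
    \s+installation[ ]                       # newline/spaces, then the fixed word
    ([\w, #-]+)                              # group 2: second variable part
    ,\s+x{6}$                                # closing comma, whitespace, six x's
""", re.VERBOSE)

s = """SCOPE OF WORK: Supply &  Flensburg House, MMDA Colony,     PAN#: AAYCS8310G
installation Arumbakkam,Chennai,Tamil Nadu,
  xxxxxx"""

verbose_result = PATTERN.findall(s)
print(verbose_result)
```

Inside a character class the space and # keep their literal meaning even in verbose mode, which is why `[\w, #-]` can stay as it was.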

import re

s = """SCOPE OF WORK: Supply &  Flensburg House, MMDA Colony,     PAN#: AAYCS8310G
installation Arumbakkam,Chennai,Tamil Nadu,
  xxxxxx"""
  
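# No re.MULTILINE flag is needed: ^ and $ anchor to the start and end of the whole string here.
matches = re.findall(r'^SCOPE OF WORK: Supply &  ([\w, #-]+),\s+PAN#: [A-Z]{5}[0-9]{4}[A-Z]\s+installation ([\w, #-]+),\s+x{6}$', s)

print(matches)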

Prints:

[('Flensburg House, MMDA Colony', 'Arumbakkam,Chennai,Tamil Nadu')]
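If the two parts are needed by name rather than by position, one option is to use named groups and wrap the search in a small helper. The following is a sketch under assumptions: the group names `first` and `second`, the helper `extract_parts`, and the second sample text are all made up for illustration. The sample uses digits, #, - and _ to exercise the full character class.

```python
import re

LINE_RE = re.compile(
    r'^SCOPE OF WORK: Supply &  (?P<first>[\w, #-]+),\s+'
    r'PAN#: [A-Z]{5}[0-9]{4}[A-Z]\s+'
    r'installation (?P<second>[\w, #-]+),\s+x{6}$'
)

def extract_parts(text):
    """Return the two variable parts as a dict, or None if the text does not match."""
    m = LINE_RE.search(text)
    return m.groupdict() if m else None

# Hypothetical input with digits, '#', '-' and '_' in the variable parts.
sample = (
    "SCOPE OF WORK: Supply &  Plot #12-B, Block_4,     PAN#: ABCDE1234F\n"
    "installation Sector-9,Unit 7,Some_Town,\n"
    "  xxxxxx"
)

parts = extract_parts(sample)
print(parts)

# Text that does not follow the fixed layout yields None instead of raising.
no_parts = extract_parts("SCOPE OF WORK: something else")
print(no_parts)
```

Returning None for non-matching text lets the caller decide how to handle unexpected input, instead of getting an empty list as with re.findall.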
