
Matching words at different positions with regex

I'm trying to extract a string pattern like described below.

What I have                    = What I need

FII JS REAL CI  ER 10 115,06   = FII JS REAL CI
IRBBRASIL RE ONNM 100 19,10    = IRBBRASIL RE ON
MAGAZ LUIZA ONNM 40 38,00      = MAGAZ LUIZA ON
IT NOW IBOV CI 3 103,14        = IT NOW IBOV CI
01/00 IT NOW IBOV CI 40 103,13 = IT NOW IBOV CI
ITAUSA PN  EDJ N1 1.600 12,14  = ITAUSA PN
FII JS REAL CI # 10 120,00     = FII JS REAL CI
MAGAZ LUIZA PNNM 40 38,00      = MAGAZ LUIZA PN
01/00 PETROLEO BRA PN 30 14,48 = PETROLEO BRA PN
01/00 PETROLEO BRA PN          = PETROLEO BRA PN
AMBEV S/A ON                   = AMBEV S/A ON

When I try

wordRegex = re.compile(r'(.*)((O|P)N|UNT|CI)')
word = wordRegex.search(cell_value).group(0).strip() 

It works in almost every case, except the ones where I have numbers or dates at the beginning. If I try something like (?<=\/\d{2}\s)(.*)((O|P)N|UNT|CI), it will (obviously) only work in those cases.
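To make the failure concrete, here is a small reproduction of the attempt above. The helper name `extract` is only for illustration. Because `.*` is greedy and starts at the beginning of the string, `group(0)` keeps whatever precedes the name, including a leading date.

```python
import re

# The original attempt: greedy ".*" followed by one of the endings.
word_regex = re.compile(r'(.*)((O|P)N|UNT|CI)')

def extract(cell_value):
    """Return the whole match, stripped, or None when nothing matches."""
    m = word_regex.search(cell_value)
    return m.group(0).strip() if m else None

print(extract("IT NOW IBOV CI 3 103,14"))         # IT NOW IBOV CI
print(extract("01/00 IT NOW IBOV CI 40 103,13"))  # 01/00 IT NOW IBOV CI
```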

I need help to figure out a pattern that works every time.

You can also start the match by skipping any characters other than A-Z and a-z, then use a capture group that starts with an uppercase character A-Z and ends with one of the alternatives.

^[^A-Za-z]*([A-Z/]+(?:\s+[A-Z/]+)*\s+(?:[OP]N|UNT|CI))

The pattern matches:

  • ^ Start of string
  • [^A-Za-z]* Optionally match any char except A-Z or a-z
  • ( Capture group 1
    • [A-Z/]+ Match 1+ times a char A-Z or /
    • (?:\s+[A-Z/]+)* Optionally repeat 1+ whitespace chars followed by 1+ chars A-Z or /
    • \s+(?:[OP]N|UNT|CI) Match 1+ whitespace chars and one of the alternatives
  • ) Close group 1
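The breakdown above can also be written directly into the pattern with Python's re.VERBOSE flag, which ignores whitespace in the pattern and allows inline comments. This is a sketch of the same regex, not a different one; the names `PATTERN` and `extract_name` are only for illustration.

```python
import re

# Same pattern as above, spread over several lines with re.VERBOSE.
PATTERN = re.compile(r"""
    ^                        # start of string
    [^A-Za-z]*               # skip leading non-letters, e.g. "01/00 "
    (                        # capture group 1: the part we want
        [A-Z/]+              #   first word: uppercase letters or "/"
        (?:\s+[A-Z/]+)*      #   optionally more words, separated by whitespace
        \s+(?:[OP]N|UNT|CI)  #   whitespace, then ON, PN, UNT or CI
    )                        # close group 1
""", re.VERBOSE)

def extract_name(cell_value):
    """Return group 1, or None when the string does not match."""
    m = PATTERN.match(cell_value)
    return m.group(1) if m else None

print(extract_name("01/00 IT NOW IBOV CI 40 103,13"))  # IT NOW IBOV CI
print(extract_name("ITAUSA PN  EDJ N1 1.600 12,14"))   # ITAUSA PN
```

Returning None for non-matching rows avoids the AttributeError that calling .group() on a failed match would raise.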


For example

import re

strings = [
    "FII JS REAL CI  ER 10 115,06",
    "IRBBRASIL RE ONNM 100 19,10",
    "MAGAZ LUIZA ONNM 40 38,00",
    "IT NOW IBOV CI 3 103,14",
    "01/00 IT NOW IBOV CI 40 103,13",
    "ITAUSA PN  EDJ N1 1.600 12,14",
    "FII JS REAL CI # 10 120,00",
    "MAGAZ LUIZA PNNM 40 38,00",
    "01/00 PETROLEO BRA PN 30 14,48",
    "01/00 PETROLEO BRA PN",
    "AMBEV S/A ON"
]

pattern = r"^[^A-Za-z]*([A-Z/]+(?:\s+[A-Z/]+)*\s+(?:[OP]N|UNT|CI))"
for s in strings:
    m = re.match(pattern, s)
    if m:
        print(m.group(1))

Output

FII JS REAL CI
IRBBRASIL RE ON
MAGAZ LUIZA ON
IT NOW IBOV CI
IT NOW IBOV CI
ITAUSA PN
FII JS REAL CI
MAGAZ LUIZA PN
PETROLEO BRA PN
PETROLEO BRA PN
AMBEV S/A ON

This returns what you've listed:

([A-Z][A-Z \/]*(CI|ON|PN|UNT))

See it working on regex101.com.

NOTE: The initial [A-Z] prevents the returned group from starting with a space. If you are OK with trimming the result (str.strip() in Python), you can remove it.
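In Python, this pattern would be used with re.search rather than re.match, since it is not anchored and has to skip a leading date on its own. A minimal sketch, with `extract` as an illustrative helper name:

```python
import re

# "/" does not need escaping in a Python pattern.
pattern = re.compile(r"([A-Z][A-Z /]*(CI|ON|PN|UNT))")

def extract(cell_value):
    """Return the first match of group 1, or None when nothing matches."""
    m = pattern.search(cell_value)
    return m.group(1) if m else None

print(extract("01/00 IT NOW IBOV CI 40 103,13"))  # IT NOW IBOV CI
print(extract("MAGAZ LUIZA PNNM 40 38,00"))       # MAGAZ LUIZA PN
```

The greedy [A-Z /]* runs as far as it can and then backtracks to the last position where one of the endings fits, which is why "PNNM" still yields "PN".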


Update: added Portuguese characters as requested.

([A-ZÀÁÂÃÉÊÍÓÔÕÚÜ \/.]*(CI|ON|PN|UNT))

I got the Portuguese characters from this list. I added each one manually because the set does not cover every character from 192 to 220. If you are OK with including the other characters, you can use a range instead.

Also note: I removed the initial [A-Z] because it gets messy once all the new characters are added, so you'll have to trim the result (.strip() in Python) to remove any spaces at the start of the string.
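A short Python sketch of the accented version, with the trimming step applied. The helper name `extract_name` and the "SÃO MARTINHO" input line are made up for illustration and are not from the question's data.

```python
import re

# The character class now includes accented capitals, space, "/" and ".".
pattern = re.compile(r"([A-ZÀÁÂÃÉÊÍÓÔÕÚÜ /.]*(CI|ON|PN|UNT))")

def extract_name(cell_value):
    """Return the stripped match of group 1, or None when nothing matches."""
    m = pattern.search(cell_value)
    # Without the leading [A-Z], the match can start with a space, so strip it.
    return m.group(1).strip() if m else None

print(extract_name("01/00 PETROLEO BRA PN 30 14,48"))  # PETROLEO BRA PN
print(extract_name("SÃO MARTINHO ON 100 25,00"))       # SÃO MARTINHO ON
```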
