
Regex: Break a string based on punctuation

The goal is to split a string in two at a hyphen or slash, but only when one of the key values appears before that hyphen or slash. Otherwise, the string is kept as is.

For instance, the key values are composed of the following:

limitee,corp.,ltee.,co-operative,co-op,ltd.,corp,ltee,coop,ltd,co

The expected splitting behaviour is:

"ABC KINDER LTD./ KINDER ABC LTEE." --> Did not get correct with current regex
The string is split in two because ltd. comes before the slash; as a result, "ABC KINDER LTD." is kept.

"ABC KINDER LTD/KINDER ABC LTEE." --> Did not get correct with current regex
The string is split in two because ltd comes before the slash; as a result, "ABC KINDER LTD" is kept.

"ABC BOOKS OF THE WORLD CORP.-LA COMPAGNIE DES LIVRES DU MONDE" --> Got correct this one in regex
The string is split because corp. comes before the hyphen. The final string is "ABC BOOKS OF THE WORLD CORP."

"ABC CO-OP DISTRICT SCHOOLS/ SCOLAIRES DISTRICTS ABC COOP" --> Did not get correct with current regex
The string "ABC CO-OP DISTRICT SCHOOLS" is kept.

"ABC PRE/SCHOOL DISRICTS" is NOT modified because it does not have any of the keywords before the slash.
This case works as expected with the current regex.

The general rule is that a keyword must appear before the slash or hyphen for the string to be split. Otherwise, the string is kept in its original form. As a side note, a keyword after the slash or hyphen does not affect the result.

I have tried the following regex:

^(.*(?<!\w)(?:limitee|corp\.|ltee\.|co\-op|ltd\.|corp|ltee|coop|ltd)(?![A-Za-z0-9_\/])[A-Za-z0-9.,&]*?)+(?:[-/](\s*.*))?$

However, it only extracts the first section correctly for this string: ABC BOOKS OF WORLD CORP./LA COMPAGNIE DES LIVRES DU MONDE

That works because there are no keywords in the second part. When the second section does contain a keyword, the first group swallows the second part as well and the whole string is kept.

How can I get the first section of the string even though there is a keyword in the second section (after hyphen or slash)?

Update: I removed the optional quantifier from the second group, so it is now mandatory. This change gives correct results, but I am not sure it is efficient:

^(.*(?<!\w)(?:limitee|corp\.|ltee\.|co\-op|ltd\.|corp|ltee|coop|ltd)(?!\w)?[A-Za-z0-9.,& ]*?)(?:[-/](\s*\w+.*))

However, this is still not correct: it fails when one keyword is a substring of another. For instance, after adding co to the keyword list (a substring of co-op), ABC CO-OP DISTRICT SCHOOLS and ABC CO-OPERATIVE DISTRICT SCHOOLS are both converted to ABC CO, which is incorrect.

Thanks:)

It looks like you need

import re
# terms is the keyword list; the longest keywords go first in the alternation
keywords_rx = '|'.join(sorted(map(re.escape, terms), key=len, reverse=True))
reg = r'^([^-/]*?\b({})(?!\w)[^-/]*)(?:[-/]\s*(.*))?$'.format(keywords_rx)

See the regex demo.

The regex matches as follows:

  • ^ - start of a string
  • ([^-/]*?\b({})(?!\w)[^-/]*) - Capturing group 1:
    • [^-/]*? - any zero or more chars other than hyphen and slash, as few as possible due to the non-greedy *? quantifier
    • \b(keyword1|keyword2|...)(?!\w) - Capturing group 2: any of the keywords as whole words
    • [^-/]* - any zero or more chars other than hyphen and slash, as many as possible due to the greedy * quantifier
  • (?:[-/]\s*(.*))? - an optional sequence of
    • [-/] - a hyphen or slash
    • \s* - 0+ whitespaces
    • (.*) - Capturing group 3: any 0 or more characters other than line break chars, as many as possible
  • $ - end of string.
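
Putting the pieces together, here is a minimal sketch of how the pattern could be used on the sample strings from the question. The helper name first_part, the re.IGNORECASE flag (the samples are upper case while the keywords are lower case) and the trailing rstrip() are assumptions added for illustration, not part of the original snippet.

```python
import re

# Keyword list from the question, including the problematic "co".
terms = ["limitee", "corp.", "ltee.", "co-operative", "co-op",
         "ltd.", "corp", "ltee", "coop", "ltd", "co"]

# Escape each keyword and put the longest first, so that "co-op" and
# "co-operative" are tried before their prefix "co".
keywords_rx = "|".join(sorted(map(re.escape, terms), key=len, reverse=True))

# Assumption: case-insensitive matching, since the inputs are upper case.
pattern = re.compile(
    r"^([^-/]*?\b({})(?!\w)[^-/]*)(?:[-/]\s*(.*))?$".format(keywords_rx),
    re.IGNORECASE,
)

def first_part(name):
    """Hypothetical helper: return group 1 when a keyword precedes the
    first hyphen or slash, otherwise return the string unchanged."""
    m = pattern.match(name)
    # rstrip() drops a space left before the separator, e.g. "LTD /".
    return m.group(1).rstrip() if m else name

samples = [
    "ABC KINDER LTD./ KINDER ABC LTEE.",
    "ABC KINDER LTD/KINDER ABC LTEE.",
    "ABC BOOKS OF THE WORLD CORP.-LA COMPAGNIE DES LIVRES DU MONDE",
    "ABC CO-OP DISTRICT SCHOOLS/ SCOLAIRES DISTRICTS ABC COOP",
    "ABC PRE/SCHOOL DISRICTS",
]
for s in samples:
    print(repr(first_part(s)))
```

Sorting by length is what keeps co from cutting co-op short: regex alternation tries the alternatives left to right, and co followed by a hyphen would otherwise satisfy (?!\w) and end group 1 at "ABC CO".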
