
Regex splitting of multiple grouped delimiters

How do you group a combination of delimiters, such as 1. or 2) ?

For example, given a string like '1. I like food. 2. She likes 2 balloons.', how can you separate such a sentence into its numbered items?

As another example, given the input

'1) 3D Technical/Process animations, 2) Explainer videos, 3) Product launch videos'

the output should be

['3D Technical', 'Process animations', 'Explainer videos', 'Product launch videos']

I tried:

import re

a = '1) 3D Technical/Process animations, 2) Explainer videos, 3) Product launch videos'
re.split(r'[1.2.3.,1)2)3)/]+|etc', a)

The output was:

['',
 'D Technical',
 'Process animations',
 ' Explainer videos',
 ' Product launch videos']
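
For context, a likely reason the attempt misbehaves: square brackets define a set of single characters, not a sequence, so [1.2.3.,1)2)3)/] is equivalent to [123.,)/] and matches a lone 3 just as readily as 3). A minimal sketch of that effect (the variable names are only illustrative):

```python
import re

attempt = r'[1.2.3.,1)2)3)/]'   # character class from the attempt above
reduced = r'[123.,)/]'          # the same set with duplicates removed

sample = '1) 3D Technical/Process animations, 2) Explainer videos'

# Both classes match exactly the same characters in the sample.
same_matches = re.findall(attempt, sample) == re.findall(reduced, sample)
print(same_matches)

# A lone digit is a member of the set, so the 3 of "3D" is consumed as a separator.
lost_digit = re.split(attempt + '+', '3D Technical')
print(lost_digit)   # ['', 'D Technical']
```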

Here is a way to split on the numbering itself (also splitting on commas and slashes is covered further down):

import re

a = '1) 3D Technical/Process animations, 2) Explainer videos, 3) Product launch videos'
# Split on the numbering, strip surrounding whitespace, and drop empty strings
r = [s for s in map(str.strip, re.split(r',? *[0-9]+(?:\)|\.) ?', a)) if s]

print(*r, sep='\n')

Output:

3D Technical/Process animations
Explainer videos
Product launch videos
  • The separator pattern r',? *[0-9]+(?:\)|\.) ?' can be broken down as follows:
    • ,? an optional comma before the number (the one trailing the previous item)
    • a space followed by * : zero or more spaces preceding the number
    • [0-9]+ a sequence of at least one digit
    • (?:\)|\.) a closing parenthesis or a period. The ?: at the beginning makes it a non-capturing group, so that re.split doesn't include it in the output
    • a space followed by ? : an optional space after the parenthesis or period (you may want to remove the ? or replace it with a +, depending on your actual data)
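
The breakdown above can also be written directly into the pattern. The following is only a sketch using re.VERBOSE; the name SEPARATOR is made up for illustration, and literal spaces are written as [ ] because verbose mode ignores unescaped whitespace:

```python
import re

# Same separator as above, written with re.VERBOSE so each piece can carry a comment.
SEPARATOR = re.compile(r"""
    ,?            # optional comma left over from the previous item
    [ ]*          # zero or more spaces before the number
    [0-9]+        # the item number: one or more digits
    (?:\)|\.)     # a closing parenthesis or a period, non-capturing
    [ ]?          # optional space after the marker
""", re.VERBOSE)

a = '1) 3D Technical/Process animations, 2) Explainer videos, 3) Product launch videos'
parts = [s for s in map(str.strip, SEPARATOR.split(a)) if s]
print(parts)
# ['3D Technical/Process animations', 'Explainer videos', 'Product launch videos']
```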

The output of re.split is mapped to str.strip to remove leading/trailing spaces. This happens inside a list comprehension that filters out empty strings (e.g. the one preceding the first separator).
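
To see why both clean-up steps are needed, here is a small sketch on a deliberately messy input (the sample string is made up, not from the question). It prints the raw re.split result first, then the cleaned list:

```python
import re

messy = '1)  Apples ,2.Pears  3) Plums '   # hypothetical input with uneven spacing

raw = re.split(r',? *[0-9]+(?:\)|\.) ?', messy)
print(raw)
# ['', ' Apples ', 'Pears', 'Plums ']  (empty first item, stray spaces)

cleaned = [s for s in map(str.strip, raw) if s]
print(cleaned)
# ['Apples', 'Pears', 'Plums']
```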

If commas or slashes without the numbering are also used as separators, you can add that to the pattern:

def splitItems(a):
    # Split on a slash, a comma, or a numbering marker such as "1." or "2)"
    pattern = r'/|,|(?:,? *[0-9]+(?:\)|\.) ?)'
    return [s for s in map(str.strip, re.split(pattern, a)) if s]

Output:

a = '3D Technical/Process animations, Explainer videos, Product launch videos'
print(*splitItems(a),sep='\n')

3D Technical
Process animations
Explainer videos
Product launch videos


a = '1. Hello 2. Hi'
print(*splitItems(a),sep='\n')
Hello
Hi

a = "Great, what's up?! , Awesome"
print(*splitItems(a),sep='\n')
Great
what's up?!
Awesome

a = '1. Medicines2. Devices 3.Products'
print(*splitItems(a),sep='\n')
Medicines
Devices
Products

a = 'ABC/DEF/FGH'
print(*splitItems(a),sep='\n')
ABC
DEF
FGH
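
As a quick check, the combined pattern applied to the numbered input from the question appears to produce exactly the list that was asked for. The function below is copied from above so the snippet runs on its own:

```python
import re

def splitItems(a):
    # Copied from the answer above: split on "/", "," or a numbering marker.
    pattern = r'/|,|(?:,? *[0-9]+(?:\)|\.) ?)'
    return [s for s in map(str.strip, re.split(pattern, a)) if s]

a = '1) 3D Technical/Process animations, 2) Explainer videos, 3) Product launch videos'
result = splitItems(a)
print(result)
# ['3D Technical', 'Process animations', 'Explainer videos', 'Product launch videos']
```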

If your separators are a list of either-or patterns (meaning only one pattern applies consistently for a given string), then you can try them in order of precedence in a loop and return the first split that produces more than one part:

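def splitItems(a):
    # Try the separators in order of precedence: numbering, then commas, then slashes
    for pattern in (r'(?:,? *[0-9]+(?:\)|\.) ?)', r',', r'/'):
        result = [*map(str.strip, re.split(pattern, a))]
        if len(result) > 1: break  # keep the first pattern that actually splits the string
    return [s for s in result if s]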

Output:

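The earlier inputs are still handled (for '3D Technical/Process animations, Explainer videos, Product launch videos' only the commas are used, so '3D Technical/Process animations' stays in one piece), and so is this one: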

a = '1. Arrangement of Loans for Listed Corporates and their Group Companies, 2. Investment Services wherein we assist Corporates, Family Offices, Business Owners and Professionals to invest their   Surplus Funds to invest in different products such as Stocks, Mutual Funds, Bonds, Fixed Deposit, Gold Bonds,PMS etc 3. Estate Planning'
print(*splitItems(a),sep='\n')

Arrangement of Loans for Listed Corporates and their Group Companies
Investment Services wherein we assist Corporates, Family Offices, Business Owners and Professionals to invest their   Surplus Funds to invest in different products such as Stocks, Mutual Funds, Bonds, Fixed Deposit, Gold Bonds,PMS etc
Estate Planning
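
To make the difference between the two approaches concrete, here is a side-by-side sketch. The names split_any, split_first and NUMBERING are made up for this comparison, and the sample string is hypothetical:

```python
import re

NUMBERING = r'(?:,? *[0-9]+(?:\)|\.) ?)'

def split_any(a):
    """Split on every separator type at once (first version above)."""
    return [s for s in map(str.strip, re.split(r'/|,|' + NUMBERING, a)) if s]

def split_first(a):
    """Try separators in order of precedence; keep the first one that splits (loop version)."""
    for pattern in (NUMBERING, r',', r'/'):
        result = [*map(str.strip, re.split(pattern, a))]
        if len(result) > 1:
            break
    return [s for s in result if s]

a = '1. Loans, Bonds 2. Estate Planning'
print(split_any(a))    # ['Loans', 'Bonds', 'Estate Planning']
print(split_first(a))  # ['Loans, Bonds', 'Estate Planning']
```

Which one fits depends on whether commas inside a numbered item are meant to be separators or just punctuation.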
