
Split this string using regular expression - python

Input string
---------------
South Africa 109/0 
Australia 100
Sri Lanka 111
Sri Lanka 331/4

Expected Output
---------------
['South Africa', '109', '0']
['Australia', '100']
['Sri Lanka', '111']
['Sri Lanka', '331', '4']

I tried several regexes but couldn't figure out how to write the correct one. A space delimiter doesn't help in this case, as the country names may or may not contain spaces (South Africa, India). Thanks in advance.

We could use the regex:

r'(\D+)\s(\d+)(?:/(\d+))?'

("a lot of non-digits, followed by a space, followed by a lot of digits, and then optionally followed by a slash and a lot of digits.")

This will return, e.g.:

>>> import re
>>> [re.match(r'(\D+)\s(\d+)(?:/(\d+))?', x).groups()
...  for x in ['South Africa 109/0', 
...            'Australia 100',
...            'Sri Lanka 111',
...            'Sri Lanka 331/4']]
[('South Africa', '109', '0'), 
 ('Australia', '100', None), 
 ('Sri Lanka', '111', None), 
 ('Sri Lanka', '331', '4')]

Notice the None values, which you may need to filter out manually.
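
One possible way to do that filtering is sketched below. The helper name split_score is made up for illustration; it wraps the same pattern and drops any group that did not take part in the match.

```python
import re

# Same pattern as above: non-digits, a whitespace character, digits, optional "/digits".
PATTERN = re.compile(r'(\D+)\s(\d+)(?:/(\d+))?')

def split_score(line):
    """Return the matched groups as a list, without the None placeholders."""
    m = PATTERN.match(line)
    if m is None:
        return None  # the line does not look like "<name> <digits>[/<digits>]"
    return [g for g in m.groups() if g is not None]

lines = ['South Africa 109/0', 'Australia 100', 'Sri Lanka 111', 'Sri Lanka 331/4']
results = [split_score(x) for x in lines]
for row in results:
    print(row)
```

Returning None for a line that does not match is just one choice; raising an exception would work equally well if malformed input should be treated as an error.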

Try:

import re
re.split(r"(?<=[a-zA-Z])\s+(?=\d)|(?<=\d)\s+(?=[a-zA-Z])|/", "South Africa 109/0")
# ['South Africa', '109', '0']

re.compile(r"^([\w\s]+)\s(\d+)\/?(\d+)?")

gives you the three groups. We can decompose it:

  • a group of word characters and whitespace ([\w\s]+) at the beginning of the line (^); note that \w also matches digits and underscores, not only letters
  • a space
  • a group of digits, at least one (\d+)
  • an optional /
  • an optional group of digits (None if absent)
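
As a quick check of that decomposition, here is a small sketch that applies the compiled pattern to two of the sample lines. It is written as a raw string so the backslashes reach the regex engine unchanged; the variable names are only illustrative.

```python
import re

# Raw string: \w, \s and \d are passed through to the regex engine as written.
pattern = re.compile(r"^([\w\s]+)\s(\d+)\/?(\d+)?")

with_wickets = pattern.match("Sri Lanka 331/4").groups()
without_wickets = pattern.match("Australia 100").groups()

print(with_wickets)     # ('Sri Lanka', '331', '4')
print(without_wickets)  # ('Australia', '100', None)
```

The first group initially grabs the digits as well (because \w matches them), and the engine then backtracks until the remaining parts of the pattern can match, which is why the country name still comes out clean for these inputs.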

This is the regex you need:

for match in re.finditer(r"(?m)^(?P<Country>.*?)\s*(?P<Number1>\d+)\s*?/?\s*?(?P<Number2>\d*?)\s*?$", inputText):
    country = match.group("Country")
    number1 = match.group("Number1")
    number2 = match.group("Number2")

You can see the results here.

And here's the explanation of the pattern:

# ^(?P<Country>.*?)\s*(?P<Number1>\d+)\s*?/?\s*?(?P<Number2>\d*?)\s*?$
# 
# Options: ^ and $ match at line breaks
# 
# Assert position at the beginning of a line (at beginning of the string or after a line break character) «^»
# Match the regular expression below and capture its match into backreference with name “Country” «(?P<Country>.*?)»
#    Match any single character that is not a line break character «.*?»
#       Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
# Match a single character that is a “whitespace character” (spaces, tabs, and line breaks) «\s*»
#    Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
# Match the regular expression below and capture its match into backreference with name “Number1” «(?P<Number1>\d+)»
#    Match a single digit 0..9 «\d+»
#       Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
# Match a single character that is a “whitespace character” (spaces, tabs, and line breaks) «\s*?»
#    Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
# Match the character “/” literally «/?»
#    Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
# Match a single character that is a “whitespace character” (spaces, tabs, and line breaks) «\s*?»
#    Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
# Match the regular expression below and capture its match into backreference with name “Number2” «(?P<Number2>\d*?)»
#    Match a single digit 0..9 «\d*?»
#       Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
# Match a single character that is a “whitespace character” (spaces, tabs, and line breaks) «\s*?»
#    Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
# Assert position at the end of a line (at the end of the string or before a line break character) «$»
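
For a self-contained run of the finditer approach, a sketch like the following could be used. The sample text is the input from the question; input_text and rows are names chosen here for illustration. Note that with this pattern Number2 is an empty string, not None, when there is no second number.

```python
import re

# Sample input from the question, one entry per line.
input_text = """South Africa 109/0
Australia 100
Sri Lanka 111
Sri Lanka 331/4"""

PATTERN = re.compile(
    r"(?m)^(?P<Country>.*?)\s*(?P<Number1>\d+)\s*?/?\s*?(?P<Number2>\d*?)\s*?$"
)

rows = []
for match in PATTERN.finditer(input_text):
    row = [match.group("Country"), match.group("Number1")]
    # Number2 is '' when the line has no "/digits" part, so skip it in that case.
    if match.group("Number2"):
        row.append(match.group("Number2"))
    rows.append(row)

for row in rows:
    print(row)
```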

You've got the answers with regex, but I suggest also considering the available builtin str methods (for this use case anyway):

s = 'South Africa 109/0'
country, numbers = s.rsplit(' ', 1)
# ('South Africa', '109/0')
new_list = [country] + numbers.split('/')
# ['South Africa', '109', '0'] 
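
Applied to all the sample lines, the same idea might look like the sketch below. The function name split_score is hypothetical, and the strip() call is an addition: it guards against trailing whitespace (the first sample line in the question ends with a space), which would otherwise make rsplit split at the wrong place.

```python
def split_score(line):
    """Split '<country> <runs>[/<wickets>]' using only str methods."""
    # strip() first, so a trailing space is not mistaken for the separator.
    country, numbers = line.strip().rsplit(' ', 1)
    return [country] + numbers.split('/')

lines = ['South Africa 109/0 ', 'Australia 100', 'Sri Lanka 111', 'Sri Lanka 331/4']
results = [split_score(x) for x in lines]
for row in results:
    print(row)
```

This assumes every line contains at least one space between the name and the score; a line without one would raise a ValueError at the tuple unpacking.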


 