Split this string using regular expression - python

Input string
---------------
South Africa 109/0 
Australia 100
Sri Lanka 111
Sri Lanka 331/4

Expected Output
---------------
['South Africa', '109', '0']
['Australia', '100']
['Sri Lanka', '111']
['Sri Lanka', '331', '4']

I tried several regexes but couldn't figure out how to write the correct one. A space delimiter doesn't help in this case, because the country names may or may not contain spaces (South Africa, India). Thanks in advance.

We could use the regex:

r'(\D+)\s(\d+)(?:/(\d+))?'

("one or more non-digits, followed by a whitespace character, followed by one or more digits, optionally followed by a slash and one or more digits.")

This will return, e.g.:

>>> [re.match(r'(\D+)\s(\d+)(?:/(\d+))?', x).groups() 
...  for x in ['South Africa 109/0', 
...            'Australia 100',
...            'Sri Lanka 111',
...            'Sri Lanka 331/4']]
[('South Africa', '109', '0'), 
 ('Australia', '100', None), 
 ('Sri Lanka', '111', None), 
 ('Sri Lanka', '331', '4')]

Notice the None values, which you may need to filter out manually.
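One way to do that filtering is sketched below. The helper name split_score and the choice to return None for a line that does not match are assumptions made for illustration; they are not part of the answer above.

```python
import re

# Same pattern as above: name, first number, optional "/second number".
PATTERN = re.compile(r'(\D+)\s(\d+)(?:/(\d+))?')

def split_score(line):
    """Return the matched groups as a list, leaving out groups that did not participate."""
    m = PATTERN.match(line)
    if m is None:
        return None  # assumption: signal a non-matching line with None
    return [g for g in m.groups() if g is not None]

lines = ['South Africa 109/0', 'Australia 100', 'Sri Lanka 111', 'Sri Lanka 331/4']
rows = [split_score(x) for x in lines]
print(rows)
```

Filtering on "is not None" rather than on truthiness keeps a captured '0' intact in any case, since only groups that did not take part in the match are None.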

Try:

import re
re.split(r"(?<=[a-zA-Z])\s+(?=\d)|(?<=\d)\s+(?=[a-zA-Z])|/", "South Africa 109/0")
# ['South Africa', '109', '0']

Alternatively, the pattern

re.compile(r"^([\w\s]+)\s(\d+)\/?(\d+)?")

gives you the three groups. We can decompose it:

  • A group of word characters and whitespace ([\w\s]+) at the beginning of the line (^); note that \w also matches digits and underscores, not only letters
  • a space
  • a group of digits, at least one (\d+)
  • an optional /
  • an optional group of digits (None when absent)
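The decomposition above can be checked against the sample input from the question. This is only a sketch; the names score_re, samples and decomposed are made up for the example.

```python
import re

# The pattern from the answer, written as a raw string.
score_re = re.compile(r"^([\w\s]+)\s(\d+)\/?(\d+)?")

samples = ['South Africa 109/0', 'Australia 100', 'Sri Lanka 111', 'Sri Lanka 331/4']

# groups() yields (name, first number, second number or None) for each line.
decomposed = [score_re.match(line).groups() for line in samples]
for groups in decomposed:
    print(groups)
```

The first group initially swallows the digits as well, because \w matches them; the engine then backtracks until the remaining \s(\d+) part can match, which is why the name ends at the last space before the score.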

This is the regex you need:

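import re
inputText = "South Africa 109/0\nAustralia 100\nSri Lanka 111\nSri Lanka 331/4"  # sample input from the question
pattern = r"(?m)^(?P<Country>.*?)\s*(?P<Number1>\d+)\s*?/?\s*?(?P<Number2>\d*?)\s*?$"
for match in re.finditer(pattern, inputText):
    country = match.group("Country")
    number1 = match.group("Number1")
    number2 = match.group("Number2")  # empty string when the line has no second number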

You can see the results here.
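To get from the named groups to the lists the question asks for, something like the following sketch could be used. The variable names input_text and results are assumptions for the example, and the input is the sample from the question.

```python
import re

input_text = """South Africa 109/0
Australia 100
Sri Lanka 111
Sri Lanka 331/4"""

pattern = re.compile(
    r"(?m)^(?P<Country>.*?)\s*(?P<Number1>\d+)\s*?/?\s*?(?P<Number2>\d*?)\s*?$"
)

results = []
for match in pattern.finditer(input_text):
    row = [match.group("Country"), match.group("Number1")]
    # Number2 is an empty string (not None) when no second number is present.
    if match.group("Number2"):
        row.append(match.group("Number2"))
    results.append(row)

print(results)
```

Unlike the optional capturing group in the earlier answers, (?P<Number2>\d*?) always participates in the match, so the absent case shows up as '' and a plain truthiness test is enough.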

And here's the explanation of the pattern:

# ^(?P<Country>.*?)\s*(?P<Number1>\d+)\s*?/?\s*?(?P<Number2>\d*?)\s*?$
# 
# Options: ^ and $ match at line breaks
# 
# Assert position at the beginning of a line (at beginning of the string or after a line break character) «^»
# Match the regular expression below and capture its match into backreference with name “Country” «(?P<Country>.*?)»
#    Match any single character that is not a line break character «.*?»
#       Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
# Match a single character that is a “whitespace character” (spaces, tabs, and line breaks) «\s*»
#    Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
# Match the regular expression below and capture its match into backreference with name “Number1” «(?P<Number1>\d+)»
#    Match a single digit 0..9 «\d+»
#       Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
# Match a single character that is a “whitespace character” (spaces, tabs, and line breaks) «\s*?»
#    Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
# Match the character “/” literally «/?»
#    Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
# Match a single character that is a “whitespace character” (spaces, tabs, and line breaks) «\s*?»
#    Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
# Match the regular expression below and capture its match into backreference with name “Number2” «(?P<Number2>\d*?)»
#    Match a single digit 0..9 «\d*?»
#       Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
# Match a single character that is a “whitespace character” (spaces, tabs, and line breaks) «\s*?»
#    Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
# Assert position at the end of a line (at the end of the string or before a line break character) «$»
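The same explanation can also live inside the pattern itself by using re.VERBOSE, which ignores whitespace in the pattern and allows # comments. This is a sketch of that idea rather than part of the original answer; the name score_line is made up for the example.

```python
import re

score_line = re.compile(r"""
    ^                    # start of a line (because of re.MULTILINE)
    (?P<Country>.*?)     # country name: as few characters as possible
    \s*                  # whitespace between the name and the first number
    (?P<Number1>\d+)     # first number: one or more digits
    \s*? /? \s*?         # optional slash, with optional whitespace around it
    (?P<Number2>\d*?)    # second number: may be empty
    \s*?                 # optional trailing whitespace
    $                    # end of the line
""", re.MULTILINE | re.VERBOSE)

with_second = score_line.search("Sri Lanka 331/4").groupdict()
without_second = score_line.search("Australia 100").groupdict()
print(with_second)
print(without_second)
```

Here the (?m) prefix is replaced by the equivalent re.MULTILINE flag so that both flags are passed in one place.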

You've got the answers with regex, but I suggest also considering the built-in str methods (for this use case, anyway):

s = 'South Africa 109/0'
country, numbers = s.rsplit(' ', 1)
# ('South Africa', '109/0')
new_list = [country] + numbers.split('/')
# ['South Africa', '109', '0'] 
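Applied to every line of the input, the same idea might look like the sketch below. The function name split_with_str_methods is an assumption for the example, and the strip() call is an addition: the first input line in the question ends with a space, which would otherwise make rsplit cut at the wrong place.

```python
def split_with_str_methods(line):
    """Split '<name> <number>[/<number>]' without regular expressions."""
    # strip() first, so a trailing space does not become the split point.
    country, numbers = line.strip().rsplit(' ', 1)
    return [country] + numbers.split('/')

samples = ['South Africa 109/0 ', 'Australia 100', 'Sri Lanka 111', 'Sri Lanka 331/4']
parsed = [split_with_str_methods(s) for s in samples]
print(parsed)
```

A line that contains no space at all would raise ValueError at the unpacking step, so input that may be malformed needs a check before this is called.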
