Split string by comma and space or space

Question

I have two example strings that I would like to split on ", " (if a comma is present) or otherwise on " ".

x = ">Keratyna 5, egzon 2, Homo sapiens"
y = ">101m_A mol:protein length:154  MYOGLOBIN"

The split should be performed just once to recover two pieces of information:

id, description = re.split(pattern, string, maxsplit=1)

For ">Keratyna 5, egzon 2, Homo sapiens" -> [">Keratyna 5", "egzon 2, Homo sapiens"]

For ">101m_A mol:protein length:154 MYOGLOBIN" -> [">101m_A", "mol:protein length:154 MYOGLOBIN"]

I came up with the following patterns: ",\\s+|\\s+", ",\\s+|^,\\s+", "[,]\\s+|[^,]\\s+", but none of these work.

My current workaround relies on catching an exception:

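try:
    # Try ", " first; unpacking raises ValueError if no split happened
    id, description = re.split(r",\s+", string, maxsplit=1)
except ValueError:
    # No comma found: fall back to splitting on whitespace
    id, description = re.split(r"\s+", string, maxsplit=1)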

but honestly I hate this workaround. I haven't found any suitable regex pattern yet. How should I do it?
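
For reference, a quick check of what the three attempted patterns return on the two example strings may make the failure concrete. This is only a sketch and not part of the original post; the names attempts and results are illustrative.

```python
import re

x = ">Keratyna 5, egzon 2, Homo sapiens"
y = ">101m_A mol:protein length:154  MYOGLOBIN"

# The three patterns tried in the question, written as raw strings.
attempts = [r",\s+|\s+", r",\s+|^,\s+", r"[,]\s+|[^,]\s+"]

# re.split cuts at the leftmost match in the string, so a bare \s+
# alternative wins whenever a space appears before the first comma.
results = {p: [re.split(p, s, maxsplit=1) for s in (x, y)] for p in attempts}

for pattern, parts in results.items():
    print(pattern, parts)
```

The first pattern splits x at the space before "5", the second leaves y unsplit (hence the ValueError on unpacking), and the third also consumes the character in front of the whitespace.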

Answer 1

You can use

^((?=.*,)[^,]+|\S+)[\s,]+(.*)

See the regex demo. Details:

^ - start of string
((?=.*,)[^,]+|\S+) - Group 1: if a , occurs anywhere later on the line (the lookahead (?=.*,) scans over zero or more chars other than line break chars), match one or more chars other than , ; failing that, match one or more non-whitespace chars
[\s,]+ - one or more commas/whitespace chars
(.*) - Group 2: zero or more chars other than line break chars as many as possible

See the Python demo:

import re
pattern = re.compile( r'^((?=.*,)[^,]+|\S+)[\s,]+(.*)' )
texts = [">Keratyna 5, egzon 2, Homo sapiens", ">101m_A mol:protein length:154  MYOGLOBIN"]
for text in texts:
    m = pattern.search(text)
    if m:
        id, description = m.groups()
        print(f"ID: '{id}', DESCRIPTION: '{description}'")

Output:

ID: '>Keratyna 5', DESCRIPTION: 'egzon 2, Homo sapiens'
ID: '>101m_A', DESCRIPTION: 'mol:protein length:154  MYOGLOBIN'
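
If the match needs to be reused, it could be wrapped in a small helper. The sketch below uses the pattern above; the function name split_header and the fallback for strings without any separator are assumptions, not part of the original answer.

```python
import re

# Pattern from the answer above, compiled once.
HEADER = re.compile(r'^((?=.*,)[^,]+|\S+)[\s,]+(.*)')

def split_header(text):
    """Return (id, description); description is '' when nothing can be split off."""
    m = HEADER.match(text)
    if m is None:
        # Assumed fallback: no comma or whitespace, so the whole string is the id.
        return text, ""
    return m.group(1), m.group(2)

print(split_header(">Keratyna 5, egzon 2, Homo sapiens"))
print(split_header(">abc"))
```

Returning a tuple in every case avoids the unpacking error that the try/except in the question works around.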

Answer 2

[Doesn't satisfy the question] You just have to check whether a comma is in the string:

def split(n):
    if ',' in n:
        return n.split(', ')
    return n.split(' ')
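
As written, the function splits on every separator. A possible variant that splits only once, as the question asks, is sketched below; the name split_once is illustrative, and it assumes the separator is exactly ", " or a single space.

```python
def split_once(header):
    # Prefer ", " when it is present, otherwise fall back to a single space.
    sep = ", " if ", " in header else " "
    # maxsplit=1 keeps the remainder of the string in one piece.
    return header.split(sep, 1)

print(split_once(">Keratyna 5, egzon 2, Homo sapiens"))
print(split_once(">101m_A mol:protein length:154  MYOGLOBIN"))
```

A string containing neither separator comes back as a one-element list, so callers that unpack two values would still need to handle that case.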

Answer 3

You could split either on the first occurrence of ", " or on a space that has no ", " anywhere to its right, using an alternation:

, | (?!.*?, )

The pattern matches:

", " Match a comma followed by a space
| Or
" (?!.*?, )" Match a space, then a negative lookahead asserting that ", " does not occur anywhere to the right

See a Python demo and a regex demo.

Example

import re

strings = [
    ">Keratyna 5, egzon 2, Homo sapiens",
    ">101m_A mol:protein length:154  MYOGLOBIN"
]

for s in strings:
    print(re.split(r", | (?!.*?, )", s, maxsplit=1))

Output

['>Keratyna 5', 'egzon 2, Homo sapiens']
['>101m_A', 'mol:protein length:154  MYOGLOBIN']
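
To see what the lookahead rejects, it may help to list every separator the pattern would accept when maxsplit is not applied. This is a sketch for illustration only; the variable names separator and matched are not from the original answer.

```python
import re

separator = re.compile(r", | (?!.*?, )")
x = ">Keratyna 5, egzon 2, Homo sapiens"

# Collect every piece of text the pattern accepts as a separator in x.
matched = [m.group() for m in separator.finditer(x)]
print(matched)  # [', ', ', ', ' ']
```

The spaces before "5" and before "2" are skipped because a ", " still follows them, while the final space is accepted, which is why maxsplit=1 is needed to stop at the first separator.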

solution1
1 2022-01-08 21:04:22

solution2
0 2022-01-08 18:08:50

solution3
0 2022-01-08 21:39:35
