Split a string with different condition without removing the character in python

Question

I have a string with parameters in it:

text =  "Uncertain significance PVS1=0 PS=[0, 0, 0, 0, 0] PM=[0, 0, 0, 0, 0, 0, 0] PP=[0, 0, 0, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0]"

I want to remove spaces to obtain all parameters individually in the following way:

pred_res = ["Uncertain significance","PVS1=0","PS=[0, 0, 0, 0, 0]","PM=[0, 0, 0, 0, 0, 0, 0]","PP=[0, 0, 0, 0, 0, 0]","BA1=0","BS=[0, 0, 0, 0, 0]","BP=[0, 0, 0, 0, 0, 0, 0, 0]"]

So far I have used this regex pattern:

pat = re.compile('[a-z]\s[A-Z]|[0-9]\s[A-Z]|]\s[A-Z]')

But it's giving me the result in the following way where it removes characters:

res = ["Uncertain significanc","VS1=","S=[0, 0, 0, 0, 0","M=[0, 0, 0, 0, 0, 0, 0","P=[0, 0, 0, 0, 0, 0","A1=","S=[0, 0, 0, 0, 0","P=[0, 0, 0, 0, 0, 0, 0, 0]"]

So is there a way to prevent this and obtain the result shown in pred_res ?

Answer 1

You can use a lookahead to check that there is an = in the text immediately following a space.

import re
text = 'Uncertain significance PVS1=0 PS=[0, 0, 0, 0, 0] PM=[0, 0, 0, 0, 0, 0, 0] PP=[0, 0, 0, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0]'
pred_res = re.split(r' (?=\w+=)', text)
print(pred_res)
# ['Uncertain significance', 'PVS1=0', 'PS=[0, 0, 0, 0, 0]', 'PM=[0, 0, 0, 0, 0, 0, 0]', 'PP=[0, 0, 0, 0, 0, 0]', 'BA1=0', 'BS=[0, 0, 0, 0, 0]', 'BP=[0, 0, 0, 0, 0, 0, 0, 0]']

Answer 2

Another option could be matching all the separate parts.

\w+=(?:\[[^][]*]|[^][\s]+)|\w+(?: \w+)*(?= \w+=|$)

\w+= Match 1+ word char followed by =
(?: Non capture group
- \[[^][]*] match from [ till ]
- | Or
- [^][\s]+ Match any char except a whitespace char or char [ and ]
) Close the group
| or
\w+(?: \w+)*(?= \w+=|$) Match word chars optionally repeated by a space and word chars asserting word chars followed by = or the end of the string at the right

Regex demo

import re

s = "Uncertain significance PVS1=0 PS=[0, 0, 0, 0, 0] PM=[0, 0, 0, 0, 0, 0, 0] PP=[0, 0, 0, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0]"
pattern = r"\w+=(?:\[[^][]*]|[^][\s]+)|\w+(?: \w+)*(?= \w+=|$)"

pred_res = re.findall(pattern, s)
print(pred_res)

Output

['Uncertain significance', 'PVS1=0', 'PS=[0, 0, 0, 0, 0]', 'PM=[0, 0, 0, 0, 0, 0, 0]', 'PP=[0, 0, 0, 0, 0, 0]', 'BA1=0', 'BS=[0, 0, 0, 0, 0]', 'BP=[0, 0, 0, 0, 0, 0, 0, 0]']

Answer 3

Use

\s+(?=[A-Z])

See regex proof .

EXPLANATION

--------------------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " ") (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (?=                      look ahead to see if there is:
--------------------------------------------------------------------------------
    [A-Z]                    any character of: 'A' to 'Z'
--------------------------------------------------------------------------------
  )                        end of look-ahead

Python code :

import re
test_str = 'Uncertain significance PVS1=0 PS=[0, 0, 0, 0, 0] PM=[0, 0, 0, 0, 0, 0, 0] PP=[0, 0, 0, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0]'
matches = re.split(r'\s+(?=[A-Z])', test_str)
print(matches)

Results :

['Uncertain significance', 'PVS1=0', 'PS=[0, 0, 0, 0, 0]', 'PM=[0, 0, 0, 0, 0, 0, 0]', 'PP=[0, 0, 0, 0, 0, 0]', 'BA1=0', 'BS=[0, 0, 0, 0, 0]', 'BP=[0, 0, 0, 0, 0, 0, 0, 0]']

Split a string with different condition without removing the character in python

Question

3 answers

solution1
4 ACCPTED 2021-04-27 11:04:30

solution2
3 2021-04-27 11:07:04

solution3
1 2021-04-27 22:29:53

Split a string with different condition without removing the character in python

Question

3 answers

solution1 4 ACCPTED 2021-04-27 11:04:30

solution2 3 2021-04-27 11:07:04

solution3 1 2021-04-27 22:29:53

solution1
4 ACCPTED 2021-04-27 11:04:30

solution2
3 2021-04-27 11:07:04

solution3
1 2021-04-27 22:29:53