
Split a string on different conditions without removing characters in Python

I have a string with parameters in it:

text =  "Uncertain significance PVS1=0 PS=[0, 0, 0, 0, 0] PM=[0, 0, 0, 0, 0, 0, 0] PP=[0, 0, 0, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0]"

I want to split it on spaces so that each parameter becomes a separate item, like this:

pred_res = ["Uncertain significance","PVS1=0","PS=[0, 0, 0, 0, 0]","PM=[0, 0, 0, 0, 0, 0, 0]","PP=[0, 0, 0, 0, 0, 0]","BA1=0","BS=[0, 0, 0, 0, 0]","BP=[0, 0, 0, 0, 0, 0, 0, 0]"]

So far I have used this regex pattern:

pat = re.compile('[a-z]\s[A-Z]|[0-9]\s[A-Z]|]\s[A-Z]')

But it removes the characters around each split point and gives me this result:

res = ["Uncertain significanc","VS1=","S=[0, 0, 0, 0, 0","M=[0, 0, 0, 0, 0, 0, 0","P=[0, 0, 0, 0, 0, 0","A1=","S=[0, 0, 0, 0, 0","P=[0, 0, 0, 0, 0, 0, 0, 0]"]

Is there a way to prevent this and obtain the result shown in pred_res?

You can split on a space and use a lookahead to check that the text immediately following it is a run of word characters ending in =. The lookahead consumes nothing, so no characters are lost.

import re
text = 'Uncertain significance PVS1=0 PS=[0, 0, 0, 0, 0] PM=[0, 0, 0, 0, 0, 0, 0] PP=[0, 0, 0, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0]'
pred_res = re.split(r' (?=\w+=)', text)
print(pred_res)
# ['Uncertain significance', 'PVS1=0', 'PS=[0, 0, 0, 0, 0]', 'PM=[0, 0, 0, 0, 0, 0, 0]', 'PP=[0, 0, 0, 0, 0, 0]', 'BA1=0', 'BS=[0, 0, 0, 0, 0]', 'BP=[0, 0, 0, 0, 0, 0, 0, 0]']
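To see why the pattern in the question lost characters, here is a minimal sketch on a shortened input (the shortened string and the variable names are only for illustration). re.split discards everything the pattern matches, so the character on each side of the space has to move into a lookbehind and a lookahead, which match without consuming anything.

```python
import re

# Shortened version of the string from the question, for illustration only.
text = "Uncertain significance PVS1=0 PS=[0, 0] BA1=0"

# Consuming pattern: the character before and after the space is part of
# the match, so re.split throws both away together with the space.
consuming = re.split(r"[a-z0-9\]]\s[A-Z]", text)
print(consuming)
# ['Uncertain significanc', 'VS1=', 'S=[0, 0', 'A1=0']

# Zero-width version: the lookbehind and lookahead only assert what is
# next to the space, so the space is the only character removed.
zero_width = re.split(r"(?<=[a-z0-9\]])\s(?=[A-Z])", text)
print(zero_width)
# ['Uncertain significance', 'PVS1=0', 'PS=[0, 0]', 'BA1=0']
```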

Another option is to match each part separately instead of splitting.

\w+=(?:\[[^][]*]|[^][\s]+)|\w+(?: \w+)*(?= \w+=|$)
  • \w+= Match 1+ word characters followed by =
  • (?: Non-capturing group
    • \[[^][]*] Match from [ up to the next ]
    • | Or
    • [^][\s]+ Match 1+ characters other than whitespace, [ and ]
  • ) Close the group
  • | Or
  • \w+(?: \w+)*(?= \w+=|$) Match word characters, optionally followed by repeated groups of a space and word characters, then assert that what follows on the right is either a space and word characters followed by =, or the end of the string
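The breakdown above can also be kept next to the pattern itself. This is a sketch of the same pattern written with re.VERBOSE, tried on a shortened input; the names verbose_pattern and sample are only for illustration. In verbose mode unescaped spaces are ignored, so each literal space is written as [ ].

```python
import re

# Same pattern as above, spread out with re.VERBOSE so that each piece
# carries its own comment.
verbose_pattern = re.compile(r"""
    \w+=                # key: 1+ word characters followed by =
    (?:
        \[[^][]*]       # bracketed value, up to the next closing bracket
      |
        [^][\s]+        # or a bare value without whitespace or brackets
    )
  |
    \w+(?:[ ]\w+)*      # free text: words separated by single spaces
    (?=[ ]\w+=|$)       # followed by a space and a key, or end of string
""", re.VERBOSE)

# Shortened version of the string from the question, for illustration only.
sample = "Uncertain significance PVS1=0 PS=[0, 0] BA1=0"
parts = verbose_pattern.findall(sample)
print(parts)
# ['Uncertain significance', 'PVS1=0', 'PS=[0, 0]', 'BA1=0']
```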


import re

s = "Uncertain significance PVS1=0 PS=[0, 0, 0, 0, 0] PM=[0, 0, 0, 0, 0, 0, 0] PP=[0, 0, 0, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0]"
pattern = r"\w+=(?:\[[^][]*]|[^][\s]+)|\w+(?: \w+)*(?= \w+=|$)"

pred_res = re.findall(pattern, s)
print(pred_res)

Output

['Uncertain significance', 'PVS1=0', 'PS=[0, 0, 0, 0, 0]', 'PM=[0, 0, 0, 0, 0, 0, 0]', 'PP=[0, 0, 0, 0, 0, 0]', 'BA1=0', 'BS=[0, 0, 0, 0, 0]', 'BP=[0, 0, 0, 0, 0, 0, 0, 0]']

Use

\s+(?=[A-Z])


EXPLANATION

--------------------------------------------------------------------------------
  \s+                      whitespace (\n, \r, \t, \f, and " "), 1 or
                           more times (matching as many as
                           possible)
--------------------------------------------------------------------------------
  (?=                      look ahead to see if there is:
--------------------------------------------------------------------------------
    [A-Z]                    any character of: 'A' to 'Z'
--------------------------------------------------------------------------------
  )                        end of look-ahead

Python code:

import re
test_str = 'Uncertain significance PVS1=0 PS=[0, 0, 0, 0, 0] PM=[0, 0, 0, 0, 0, 0, 0] PP=[0, 0, 0, 0, 0, 0] BA1=0 BS=[0, 0, 0, 0, 0] BP=[0, 0, 0, 0, 0, 0, 0, 0]'
matches = re.split(r'\s+(?=[A-Z])', test_str)
print(matches)

Results:

['Uncertain significance', 'PVS1=0', 'PS=[0, 0, 0, 0, 0]', 'PM=[0, 0, 0, 0, 0, 0, 0]', 'PP=[0, 0, 0, 0, 0, 0]', 'BA1=0', 'BS=[0, 0, 0, 0, 0]', 'BP=[0, 0, 0, 0, 0, 0, 0, 0]']
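One caveat, sketched here on a hypothetical input that is not from the question: \s+(?=[A-Z]) splits before every capitalized word, so it relies on the free-text label containing no uppercase letter after its first word. If that cannot be guaranteed, looking ahead for a key followed by = may be the safer choice.

```python
import re

# Hypothetical input: a label whose second word is capitalized.
sample = "Likely Pathogenic PVS1=1 PS=[0, 1]"

# Splitting before any uppercase letter also breaks the label apart.
by_uppercase = re.split(r"\s+(?=[A-Z])", sample)
print(by_uppercase)
# ['Likely', 'Pathogenic', 'PVS1=1', 'PS=[0, 1]']

# Splitting only before "key=" keeps the label in one piece.
by_key = re.split(r"\s+(?=\w+=)", sample)
print(by_key)
# ['Likely Pathogenic', 'PVS1=1', 'PS=[0, 1]']
```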
