Split string by comma and space or space
I have two example strings, which I would like to split by either ", " (if "," is present) or " ".
x = ">Keratyna 5, egzon 2, Homo sapiens"
y = ">101m_A mol:protein length:154 MYOGLOBIN"
The split should be performed just once to recover two pieces of information:
id, description = re.split(pattern, string, maxsplit=1)
For ">Keratyna 5, egzon 2, Homo sapiens" -> [">Keratyna 5", "egzon 2, Homo sapiens"]
For ">101m_A mol:protein length:154 MYOGLOBIN" -> [">101m_A", "mol:protein length:154 MYOGLOBIN"]
I came up with the following patterns: ",\\s+|\\s+", ",\\s+|^,\\s+", "[,]\\s+|[^,]\\s+", but none of these work.
My current solution relies on catching an exception:
try:
    # Unpacking raises ValueError when there is no comma to split on
    id, description = re.split(r",\s+", description, maxsplit=1)
except ValueError:
    id, description = re.split(r"\s+", description, maxsplit=1)
but honestly I hate this workaround. I haven't found any suitable regex pattern yet. How should I do it?
You can use
^((?=.*,)[^,]+|\S+)[\s,]+(.*)
See the regex demo. Details:
^ - start of string
((?=.*,)[^,]+|\S+) - Group 1: if there is a "," after any zero or more chars other than line break chars, as many as possible, then match one or more chars other than ","; otherwise match one or more non-whitespace chars
[\s,]+ - one or more commas/whitespaces
(.*) - Group 2: zero or more chars other than line break chars, as many as possible
See the Python demo:
import re
pattern = re.compile(r'^((?=.*,)[^,]+|\S+)[\s,]+(.*)')
texts = [">Keratyna 5, egzon 2, Homo sapiens", ">101m_A mol:protein length:154 MYOGLOBIN"]
for text in texts:
    m = pattern.search(text)
    if m:
        # Group 1 is the ID, Group 2 is the description
        id, description = m.groups()
        print(f"ID: '{id}', DESCRIPTION: '{description}'")
Output:
ID: '>Keratyna 5', DESCRIPTION: 'egzon 2, Homo sapiens'
ID: '>101m_A', DESCRIPTION: 'mol:protein length:154 MYOGLOBIN'
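If the same pattern is needed in several places, it can be wrapped in a small helper. The following is only a sketch: the names `HEADER_RE` and `split_header` are hypothetical, and returning an empty description when nothing can be split off is an assumption, not something the question specifies.

```python
import re

# Pattern taken from the answer above; the names below are hypothetical.
HEADER_RE = re.compile(r'^((?=.*,)[^,]+|\S+)[\s,]+(.*)')

def split_header(text):
    """Return (id, description); description is '' when no separator is found."""
    m = HEADER_RE.search(text)
    if m is None:
        # Assumption: a string without any separator is treated as a bare ID.
        return text, ""
    return m.group(1), m.group(2)

print(split_header(">Keratyna 5, egzon 2, Homo sapiens"))
print(split_header(">101m_A mol:protein length:154 MYOGLOBIN"))
print(split_header(">lonely"))
```

Because the helper always returns a 2-tuple, the caller can unpack it without a try/except.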
[Doesn't satisfy question] You just have to check if a comma is in the string
def split(n):
    if ',' in n:
        return n.split(', ')
    return n.split(' ')
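The function above splits at every separator, so it returns more than two pieces for the example strings. A possible variant that keeps the same comma check but splits only once is sketched below; `split_once` is a hypothetical name, and the sketch assumes the separator is exactly ", " or a single space rather than arbitrary whitespace.

```python
def split_once(text):
    # Hypothetical helper: pick the separator first, then split a single time.
    sep = ", " if ", " in text else " "
    # The second argument of str.split is maxsplit.
    return text.split(sep, 1)

print(split_once(">Keratyna 5, egzon 2, Homo sapiens"))
print(split_once(">101m_A mol:protein length:154 MYOGLOBIN"))
```

Note that a string containing neither separator yields a one-element list, so unpacking into two names would still raise ValueError in that case.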
You could either split on the first occurrence of ", " or split on a space that has no occurrence of ", " to its right, using an alternation:
, | (?!.*?, )
The pattern matches:
", " - match a comma followed by a space
| - or
" (?!.*?, )" - match a space, then a negative lookahead asserts that there is no ", " anywhere to the right
See a Python demo and a regex demo.
Example
import re
strings = [
    ">Keratyna 5, egzon 2, Homo sapiens",
    ">101m_A mol:protein length:154 MYOGLOBIN"
]
for s in strings:
    print(re.split(r", | (?!.*?, )", s, maxsplit=1))
Output
['>Keratyna 5', 'egzon 2, Homo sapiens']
['>101m_A', 'mol:protein length:154 MYOGLOBIN']
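To see why the lookahead matters, it may help to compare this pattern with the plain alternation from the question. With maxsplit=1, re.split cuts at the leftmost match of the whole pattern, and in the first example string a space appears before the first comma. A small sketch (variable names are illustrative only):

```python
import re

s = ">Keratyna 5, egzon 2, Homo sapiens"

# Plain alternation from the question: the leftmost match is the space
# after ">Keratyna", so the split happens too early.
naive = re.split(r",\s+|\s+", s, maxsplit=1)

# With the lookahead, a space only matches when no ", " follows it,
# so the first successful match is the ", " itself.
guarded = re.split(r", | (?!.*?, )", s, maxsplit=1)

print(naive)    # ['>Keratyna', '5, egzon 2, Homo sapiens']
print(guarded)  # ['>Keratyna 5', 'egzon 2, Homo sapiens']
```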