Split string by position not character
We know that anchors, word boundaries, and lookarounds match at a position, rather than matching a character.
Is it possible to split a string at such a position with a regex (specifically in Python)?
For example, consider the following string:
"ThisisAtestForchEck,Match IngwithPosition."
So I want the following result (the substrings that start with an uppercase letter that is not preceded by a space):
['Thisis', 'Atest', 'Forch', 'Eck,', 'Match Ingwith', 'Position.']
If I split with grouping I get:
>>> re.split(r'([A-Z])',s)
['', 'T', 'hisis', 'A', 'test', 'F', 'orch', 'E', 'ck,', 'M', 'atch ', 'I', 'ngwith', 'P', 'osition.']
And this is the result with lookaround:
>>> re.split(r'(?<=[A-Z])',s)
['ThisisAtestForchEck,Match IngwithPosition.']
>>> re.split(r'((?<=[A-Z]))',s)
['ThisisAtestForchEck,Match IngwithPosition.']
>>> re.split(r'((?<=[A-Z])?)',s)
['ThisisAtestForchEck,Match IngwithPosition.']
Note that if I want to split at every uppercase letter, including those preceded by a space, e.g.:
['Thisis', 'Atest', 'Forch', 'Eck,', 'Match ', 'Ingwith', 'Position.']
I can use re.findall, viz.:
>>> re.findall(r'([A-Z][^A-Z]*)',s)
['Thisis', 'Atest', 'Forch', 'Eck,', 'Match ', 'Ingwith', 'Position.']
But what about the first example: is it possible to solve it with re.findall?
(?<!\s)(?=[A-Z])
You can use this to split with the regex module, as re (before Python 3.7) does not support splitting on zero-width assertions.
import regex
x="ThisisAtestForchEck,Match IngwithPosition."
print(regex.split(r"(?<![\s])(?=[A-Z])", x, flags=regex.VERSION1))
or, to drop any empty strings from the result:
print([i for i in regex.split(r"(?<![\s])(?=[A-Z])", x, flags=regex.VERSION1) if i])
See demo.
https://regex101.com/r/sJ9gM7/65
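The same zero-width split can probably be done with the standard library alone on newer interpreters. A minimal sketch, assuming Python 3.7 or later, where re.split accepts patterns that match the empty string; the helper name split_before_upper is made up for illustration:

```python
import re

def split_before_upper(text):
    """Split before each uppercase letter that is not preceded by whitespace."""
    # Zero-width pattern: not preceded by whitespace, followed by A-Z.
    parts = re.split(r"(?<!\s)(?=[A-Z])", text)
    # The pattern can also match at position 0, which produces a leading
    # empty string, so empty pieces are dropped.
    return [p for p in parts if p]

s = "ThisisAtestForchEck,Match IngwithPosition."
print(split_before_upper(s))
```

Filtering empty pieces plays the same role as the list comprehension in the regex-module version above.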
A way with re.findall:
re.findall(r'(?:[A-Z]|^[^A-Z\s])[^A-Z\s]*(?:\s+[A-Z][^A-Z]*)*',s)
When you decide to change your approach from split to findall, the first job is to reformulate your requirements: "I want to split the string on each uppercase letter not preceded by a space" => "I want to find one or more space-separated substrings that begin with an uppercase letter, except at the start of the string (if the string doesn't start with an uppercase letter)".
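To see how the reformulated requirement maps onto the findall pattern, here is a small sketch that writes the same pattern with re.VERBOSE so each part can be commented. The names PIECE and s are illustrative only, and the second input is an extra example that does not come from the question:

```python
import re

PIECE = re.compile(r"""
    (?:[A-Z] | ^[^A-Z\s])   # an uppercase letter, or a non-uppercase, non-space char at the very start
    [^A-Z\s]*               # the rest of that word
    (?:\s+[A-Z][^A-Z]*)*    # whitespace + uppercase letter + text up to the next uppercase letter, repeated
""", re.VERBOSE)

s = "ThisisAtestForchEck,Match IngwithPosition."
print(PIECE.findall(s))

# A string that does not start with an uppercase letter still keeps its head.
print(PIECE.findall("startsLower"))
```

The pattern has no capturing groups, so findall returns plain strings rather than tuples.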
I know this might be less convenient because of the tuple nature of the result. But I think that this findall finds what you need:
re.findall(r'((?<!\s)[A-Z]([^A-Z]|(?<=\s)[A-Z])*)', s)
## returns [('Thisis', 's'), ('Atest', 't'), ('Forch', 'h'), ('Eck,', ','), ('Match Ingwith', 'h'), ('Position.', '.')]
This can be used in the following list comprehension to give the desired output:
[val[0] for val in re.findall(r'((?<!\s)[A-Z]([^A-Z]|(?<=\s)[A-Z])*)', s)]
## returns ['Thisis', 'Atest', 'Forch', 'Eck,', 'Match Ingwith', 'Position.']
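If the tuples are unwanted, one option is to remove the capturing groups so that findall returns each whole match as a plain string. This is a variation on the pattern above rather than part of the original answer:

```python
import re

s = "ThisisAtestForchEck,Match IngwithPosition."

# Same logic without capturing groups: the outer group is dropped and the
# inner one becomes (?:...), so findall yields strings instead of tuples.
pattern = r'(?<!\s)[A-Z](?:[^A-Z]|(?<=\s)[A-Z])*'
result = re.findall(pattern, s)
print(result)
```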
And here is a hack that uses split:
re.split(r'((?<!\s)[A-Z]([^A-Z]|(?<=\s)[A-Z])*)', s)[1::3]
## returns ['Thisis', 'Atest', 'Forch', 'Eck,', 'Match Ingwith', 'Position.']
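The [1::3] slice works because re.split inserts the text of every capturing group into its result. A short sketch of that layout, using the same pattern and string (the variable names are illustrative):

```python
import re

s = "ThisisAtestForchEck,Match IngwithPosition."
pattern = r'((?<!\s)[A-Z]([^A-Z]|(?<=\s)[A-Z])*)'

raw = re.split(pattern, s)
# raw repeats in groups of three: the text between matches (empty here,
# because the matches are adjacent), then group 1 (the whole piece),
# then group 2 (the last character matched by the inner group).
for i in range(0, len(raw) - 1, 3):
    between, piece, last_char = raw[i:i + 3]
    print(repr(between), repr(piece), repr(last_char))

pieces = raw[1::3]  # keep only group 1 from each triple
print(pieces)
```

The hack depends on the pattern having exactly two capturing groups; with a different number of groups the slice step would need to change.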