Pythonic sentence splitting on words starting with capital letter
I want to take a few sentences in UTF and split them at the first capitalized word that follows a space.
Examples:
"Tough Fox" -> "Tough", "Fox"
"Nice White Cat" -> "Nice", "White Cat"
"This is a lazy Dog" -> "This is a lazy", "Dog"
"This is hardworking Little Ant" -> "This is hardworking", "Little Ant"
What is the pythonic way to do this kind of split?
I would use re:
>>> import re
>>> l = ["Tough Fox", "Nice White Cat", "This is a lazy Dog" ]
>>> for i in l:
...     print(re.findall("[A-Z][^A-Z]*", i))
...
['Tough ', 'Fox']
['Nice ', 'White ', 'Cat']
['This is a lazy ', 'Dog']
Edit: OK, I think that was a mistake. I am late to this now, and re.split(..., s, maxsplit=1) is IMHO the best way to go, but you can still do it without maxsplit:
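>>> for i in l:
...     print(re.findall("^[^ ]*|[A-Z].*", i))
...
['Tough', 'Fox']
['Nice', 'White Cat']
['This', 'Dog']
Note that the last result drops "is a lazy", so this pattern gives the expected output only when the first word is immediately followed by the capitalized part.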
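One way to avoid maxsplit while still keeping the middle words is to capture both halves in a single match. The following is a minimal sketch and not part of the original answer; the helper name split_at_capital is hypothetical, and it assumes single-line input with ASCII capitals.

```python
import re

# Hypothetical helper: lazily capture everything up to the first
# whitespace run that is followed by an ASCII capital letter.
_PATTERN = re.compile(r"(.*?)\s+([A-Z].*)")

def split_at_capital(s):
    m = _PATTERN.fullmatch(s)
    # Fall back to the whole string when there is nothing to split on.
    return list(m.groups()) if m else [s]

for s in ["Tough Fox", "Nice White Cat", "This is a lazy Dog"]:
    print(split_at_capital(s))
```

The lazy group (.*?) grows one character at a time, so the match settles on the earliest whitespace that precedes a capital letter.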
If you want to split the string at a capital letter that follows whitespace:
import re
s = "Tough Fox"
re.split(r"\s(?=[A-Z])", s, maxsplit=1)
['Tough', 'Fox']
The re.split method is equivalent to Python's built-in str.split, but allows a regular expression to be used as the split pattern.
The regular expression first looks for whitespace ( \s ) as the split pattern. This part is consumed by the re.split operation.
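The (?=...) part is a lookahead assertion. The next character in the string must match it (in this case an uppercase letter, [A-Z]). However, this part is not considered part of the match, so the re.split operation does not consume it.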
maxsplit=1 ensures that only one split occurs (at most two items).
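To make the role of each piece concrete, here is a small sketch that varies one piece at a time. The variable names are only illustrative; the sentence comes from the question's examples.

```python
import re

s = "This is hardworking Little Ant"

# Lookahead plus maxsplit=1: only the whitespace is consumed,
# and the string is cut once.
once = re.split(r"\s(?=[A-Z])", s, maxsplit=1)
print(once)        # ['This is hardworking', 'Little Ant']

# No maxsplit: every whitespace before a capital is a split point.
every = re.split(r"\s(?=[A-Z])", s)
print(every)       # ['This is hardworking', 'Little', 'Ant']

# No lookahead: the capital letter is part of the match and is lost.
consumed = re.split(r"\s[A-Z]", s, maxsplit=1)
print(consumed)    # ['This is hardworking', 'ittle Ant']
```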
Probably something like this:
In [1]: import re
In [2]: def split(s):
   ...:     return re.split(r'\W(?=[A-Z])', s, maxsplit=1)
   ...:
In [3]: l = ["Tough Fox", "Nice White Cat", "This is a lazy Dog" ]
In [4]: for s in l:
   ...:     print(split(s))
   ...:
['Tough', 'Fox']
['Nice', 'White Cat']
['This is a lazy', 'Dog']
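Note that \W matches any non-word character, not just whitespace, so this pattern also splits on punctuation that sits directly before a capital letter. A short sketch of the difference, using a made-up hyphenated input that is not in the question:

```python
import re

s = "Tough-Fox"  # illustrative input, not from the question

# \W treats the hyphen as a separator and consumes it.
non_word = re.split(r"\W(?=[A-Z])", s, maxsplit=1)
print(non_word)     # ['Tough', 'Fox']

# \s only accepts whitespace, so nothing is split here.
whitespace = re.split(r"\s(?=[A-Z])", s, maxsplit=1)
print(whitespace)   # ['Tough-Fox']
```

Which of the two is preferable depends on whether punctuation should count as a word boundary in the input.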
Use re.split() with a limit:
space_split = re.compile(r'\s+(?=[A-Z])')
result = space_split.split(inputstring, 1)
Demo:
>>> import re
>>> space_split = re.compile(r'\s+(?=[A-Z])')
>>> l = ["Tough Fox", "Nice White Cat", "This is a lazy Dog" ]
>>> for i in l:
...     print(space_split.split(i, 1))
...
['Tough', 'Fox']
['Nice', 'White Cat']
['This is a lazy', 'Dog']
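Since the question mentions UTF text, it may be worth noting that [A-Z] only matches ASCII capitals, so a word such as "École" would not trigger a split. The following is a rough sketch of one possible workaround, assuming that str.isupper() is an acceptable definition of "capital letter"; the function name split_at_unicode_capital is hypothetical and not from any of the answers.

```python
import re

def split_at_unicode_capital(s):
    # Walk over each whitespace run and split at the first one that is
    # followed by a character str.isupper() reports as uppercase.
    for m in re.finditer(r"\s+", s):
        if m.end() < len(s) and s[m.end()].isupper():
            return [s[:m.start()], s[m.end():]]
    # No capitalized word after whitespace: return the string unsplit.
    return [s]

print(split_at_unicode_capital("Une petite \u00c9cole"))
print(split_at_unicode_capital("This is a lazy Dog"))
```

For ASCII-only input this should behave like the compiled space_split pattern with a limit of 1.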