Pythonic sentence splitting on words starting with capital letter

I have a couple of UTF-encoded sentences that I want to split at the first capitalized word after the start of the sentence.

Examples:

"Tough Fox" -> "Tough", "Fox"

"Nice White Cat" -> "Nice", "White Cat"

"This is a lazy Dog" -> "This is a lazy", "Dog"

"This is hardworking Little Ant" -> "This is hardworking", "Little Ant"

What is the Pythonic way to do such splitting?

I would use re:

>>> import re
>>> l = ["Tough Fox", "Nice White Cat", "This is a lazy Dog" ]
>>> for i in l:
...   print(re.findall(r"[A-Z][^A-Z]*", i))
... 
['Tough ', 'Fox']
['Nice ', 'White ', 'Cat']
['This is a lazy ', 'Dog']

Edit: Okay, I thought that was a mistake. So now I am a little late, and re.split(..., s, maxsplit=1) is IMHO the best way, but you could still do it without maxsplit:

>>> for i in l:
...   print(re.findall(r"^[^ ]*|[A-Z].*", i))
... 
['Tough', 'Fox']
['Nice', 'White Cat']
['This', 'Dog']
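Note that the third result above drops "is a lazy", because the first alternative only captures the first word. A minimal sketch of a findall variant that keeps the whole leading part is shown below. It assumes words are separated by a single space and that a capitalized word follows one of those spaces; the names sentences, pattern and results are only illustrative.

```python
import re

sentences = ["Tough Fox", "Nice White Cat", "This is a lazy Dog"]

# First alternative: everything from the start of the string up to (but not
# including) the first space that is followed by a capital letter.
# Second alternative: the remainder, starting at that capital letter.
pattern = re.compile(r"^.*?(?= [A-Z])|[A-Z].*")

results = [pattern.findall(s) for s in sentences]
for parts in results:
    print(parts)
```

With these inputs the last sentence comes out as ['This is a lazy', 'Dog'] instead of ['This', 'Dog'].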

If you want to split a string at a capital letter following whitespace (only at the first such position, because of maxsplit=1):

import re

s = "Tough Fox"
re.split(r"\s(?=[A-Z])", s, maxsplit=1)

['Tough', 'Fox']

The re.split function works like the built-in str.split method, but allows a regular expression to be used as the split pattern.

The regex first looks for a whitespace character ( \s ) as the split pattern. This part of the pattern is consumed by the re.split operation.

The (?=...) part is a lookahead assertion. The next character(s) in the string must match it (in this case any capital letter, [A-Z] ). However, this part is not considered part of the match, so it is not consumed by the re.split operation.

The maxsplit=1 argument makes sure that at most one split occurs (so at most two items are returned).
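To make the role of each piece concrete, here is a small sketch comparing the pattern with and without the lookahead, and with and without maxsplit. The variable names are only illustrative.

```python
import re

s = "Nice White Cat"

# Lookahead: only the whitespace is consumed, the capital letter is kept.
with_lookahead = re.split(r"\s(?=[A-Z])", s, maxsplit=1)

# No lookahead: the capital letter is part of the match and is lost.
without_lookahead = re.split(r"\s[A-Z]", s, maxsplit=1)

# No maxsplit: the string is split at every matching position.
without_maxsplit = re.split(r"\s(?=[A-Z])", s)

print(with_lookahead)     # ['Nice', 'White Cat']
print(without_lookahead)  # ['Nice', 'hite Cat']
print(without_maxsplit)   # ['Nice', 'White', 'Cat']
```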

Maybe like this:

In [1]: import re

In [2]: def split(s):
   ...:     return re.split(r'\W(?=[A-Z])', s, 1)
   ...:

In [3]: l = ["Tough Fox", "Nice White Cat", "This is a lazy Dog" ]

In [4]: for s in l:
   ...:     print(split(s))
   ...:
['Tough', 'Fox']
['Nice', 'White Cat']
['This is a lazy', 'Dog']
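One difference from the whitespace-based patterns: \W matches any non-word character, not just whitespace, so punctuation directly before a capital letter also counts as a split point. A short sketch with a made-up input string shows the difference:

```python
import re

s = "well-Known Fact"  # hypothetical input containing a hyphen

# \W also matches the hyphen, so the split happens before "Known".
non_word_split = re.split(r"\W(?=[A-Z])", s, maxsplit=1)

# \s only matches whitespace, so the split happens before "Fact".
whitespace_split = re.split(r"\s(?=[A-Z])", s, maxsplit=1)

print(non_word_split)    # ['well', 'Known Fact']
print(whitespace_split)  # ['well-Known', 'Fact']
```

Which behavior is preferable depends on the input; for the examples in the question both patterns give the same result.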

Use re.split() with a limit:

 space_split = re.compile(r'\s+(?=[A-Z])')
 result = space_split.split(inputstring, 1)

Demo:

>>> import re
>>> space_split = re.compile(r'\s+(?=[A-Z])')
>>> l = ["Tough Fox", "Nice White Cat", "This is a lazy Dog" ]
>>> for i in l:
...     print(space_split.split(i, 1))
... 
['Tough', 'Fox']
['Nice', 'White Cat']
['This is a lazy', 'Dog']
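The question mentions UTF text, while [A-Z] only covers the ASCII capitals. As a rough sketch under that assumption, the same split can be done without a regex by using str.isupper(), which also recognizes non-ASCII uppercase letters. The function name split_at_first_capitalized_word is hypothetical, and unlike the regex versions this one normalizes runs of whitespace to single spaces.

```python
def split_at_first_capitalized_word(sentence):
    """Split before the first capitalized word that is not the first word."""
    words = sentence.split()
    for index, word in enumerate(words):
        # Skip the first word; str.isupper() handles letters such as "É".
        if index > 0 and word[0].isupper():
            return [" ".join(words[:index]), " ".join(words[index:])]
    # No later capitalized word: return the sentence unsplit.
    return [sentence]


for s in ["Tough Fox", "This is hardworking Little Ant", "Très Élégant Chat"]:
    print(split_at_first_capitalized_word(s))
```

For "Très Élégant Chat" this gives ['Très', 'Élégant Chat'], whereas the [A-Z] pattern would only split before "Chat".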
