Pythonic sentence splitting on words starting with capital letter

I have a couple of UTF-encoded sentences that I want to split at the first capitalized word after the start of the sentence.

Examples:

"Tough Fox" -> "Tough", "Fox"

"Nice White Cat" -> "Nice", "White Cat"

"This is a lazy Dog" -> "This is a lazy", "Dog"

"This is hardworking Little Ant" -> "This is hardworking", "Little Ant"

What is the Pythonic way to do this kind of splitting?

I would use re:

>>> import re
>>> l = ["Tough Fox", "Nice White Cat", "This is a lazy Dog" ]
>>> for i in l:
...   print(re.findall(r"[A-Z][^A-Z]*", i))
... 
['Tough ', 'Fox']
['Nice ', 'White ', 'Cat']
['This is a lazy ', 'Dog']

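Edit: Okay, I thought that was a mistake. I am a little late now, and re.split(..., s, maxsplit=1) is IMHO the best way, but you could still do it without maxsplit. Note that this variant keeps only the first word and the part starting at the next capital letter, so the third example loses "is a lazy":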

>>> for i in l:
...   print(re.findall(r"^[^ ]*|[A-Z].*", i))
... 
['Tough', 'Fox']
['Nice', 'White Cat']
['This', 'Dog']
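
To keep the words in between as well, the first alternative can match lazily up to the first whitespace that precedes a capital letter. This is only a sketch for inputs shaped like the examples in the question; the name split_findall is made up for illustration and is not part of the answer above.

```python
import re

# Hypothetical helper, not from the original answer.
# Alternative 1: from the start of the string up to (not including) the first
#                whitespace that is followed by an ASCII capital letter.
# Alternative 2: the rest of the string, starting at a capital letter.
PATTERN = re.compile(r"^.*?(?=\s[A-Z])|[A-Z].*")

def split_findall(s):
    return PATTERN.findall(s)

for s in ["Tough Fox", "Nice White Cat", "This is a lazy Dog"]:
    print(split_findall(s))
# ['Tough', 'Fox']
# ['Nice', 'White Cat']
# ['This is a lazy', 'Dog']
```

A string that contains no capital letter at all yields an empty list with this pattern, which is one reason re.split with maxsplit=1 is the simpler choice.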

If you want to split a string once, at the first whitespace that is followed by a capital letter:

import re

s = "Tough Fox"
re.split(r"\s(?=[A-Z])", s, maxsplit=1)

['Tough', 'Fox']

re.split works like the built-in str.split, but takes a regular expression as the split pattern.

The regex first looks for a whitespace character (\s) as the split pattern. This character is consumed by the re.split operation.

The (?=...) part is a lookahead assertion. The next character(s) in the string must match it (in this case any ASCII capital letter, [A-Z]). However, this part is not considered part of the match, so it is not consumed by the re.split operation.

maxsplit=1 ensures that at most one split occurs, so the result has at most two items.
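
To see what each piece contributes, here is a small comparison using the last example from the question. The variable names are only illustrative.

```python
import re

s = "This is hardworking Little Ant"

# With the lookahead, only the whitespace is consumed; the capital letter stays.
with_lookahead = re.split(r"\s(?=[A-Z])", s, maxsplit=1)

# Without the lookahead, the capital letter is part of the match and is lost.
without_lookahead = re.split(r"\s[A-Z]", s, maxsplit=1)

# Without maxsplit, every whitespace followed by a capital is a split point.
no_limit = re.split(r"\s(?=[A-Z])", s)

print(with_lookahead)     # ['This is hardworking', 'Little Ant']
print(without_lookahead)  # ['This is hardworking', 'ittle Ant']
print(no_limit)           # ['This is hardworking', 'Little', 'Ant']
```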

Maybe like this:

In [1]: import re

In [2]: def split(s):
   ...:     return re.split(r'\W(?=[A-Z])', s, maxsplit=1)
   ...:

In [3]: l = ["Tough Fox", "Nice White Cat", "This is a lazy Dog" ]

In [4]: for s in l:
   ...:     print(split(s))
   ...:
['Tough', 'Fox']
['Nice', 'White Cat']
['This is a lazy', 'Dog']
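
One thing to be aware of: \W matches any non-word character, not just whitespace, so this version also splits on punctuation such as a hyphen. A quick illustration with a made-up input:

```python
import re

s = "Well-Known Fox"  # made-up input, not from the question

# \W treats the hyphen as a separator, so the split happens inside "Well-Known".
by_nonword = re.split(r"\W(?=[A-Z])", s, maxsplit=1)

# \s only matches whitespace, so the hyphenated word stays together.
by_space = re.split(r"\s(?=[A-Z])", s, maxsplit=1)

print(by_nonword)  # ['Well', 'Known Fox']
print(by_space)    # ['Well-Known', 'Fox']
```

Which behaviour is preferable depends on whether hyphenated words should count as one word.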

Use re.split() with a limit:

 space_split = re.compile(r'\s+(?=[A-Z])')
 result = space_split.split(inputstring, 1)

Demo:

>>> import re
>>> space_split = re.compile(r'\s+(?=[A-Z])')
>>> l = ["Tough Fox", "Nice White Cat", "This is a lazy Dog" ]
>>> for i in l:
...     print(space_split.split(i, 1))
... 
['Tough', 'Fox']
['Nice', 'White Cat']
['This is a lazy', 'Dog']
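
Since the question mentions UTF text, note that [A-Z] only matches ASCII capitals, so a word like "Étoile" would not trigger a split. Below is a minimal sketch without regular expressions, assuming Python 3 and that collapsing runs of whitespace into single spaces is acceptable. The function name split_at_capitalized is hypothetical; it relies on str.isupper(), which is Unicode-aware.

```python
def split_at_capitalized(s):
    """Split once before the first word (after the first one) that starts
    with an uppercase letter. Hypothetical helper for illustration."""
    words = s.split()  # splits on any whitespace and drops empty strings
    for i, word in enumerate(words[1:], start=1):
        if word[0].isupper():
            return [" ".join(words[:i]), " ".join(words[i:])]
    return [s]  # no capitalized word after the first: nothing to split

print(split_at_capitalized("Nice White Cat"))     # ['Nice', 'White Cat']
print(split_at_capitalized("Une petite Étoile"))  # ['Une petite', 'Étoile']
```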
