
Python splitting a phrase into words if there is a capital letter, but not if there is a comma between them

I am trying to split a phrase into words.

Example

Input is:

Varie Tutto l'anno Nessuna Triennale,Magistrale Italia Alta

Expected output is:

['Varie ', "Tutto l'anno ", 'Nessuna ', 'Triennale, Magistrale ', 'Italia ', 'Alta']

Disclaimer

I successfully managed to split the phrase into words when there are capital letters, however I realized that some of them have a comma in between (and I actually need to keep that together) so I would like to exclude the splitting in that case. I am just starting with this kind of stuff and this re.split function is really confusing.

What I tried

I have tried the following:

re.split('(?=[A-Z])+|[,]',text )

However, this doesn't do exactly what I want: it also splits when it encounters a comma, whereas I am trying to do the opposite.

How can I do that?

TL;DR: Split on a space using look-arounds: re.compile(r'(?<!,) (?=[A-Z])').split(text). Details are in section 4.

0. Try your regex

Try your regex on regex101 or regexpl.net (with Python flavour).

See the demo :

\b(?<!,)[A-Z]

Uses:

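  1. \b word boundary (a zero-width position between a word character and a non-word character such as whitespace, a comma or a dot, or the start or end of the string)
  2. (?<!,) negative lookbehind: what follows must not be preceded by a comma
  3. [A-Z] character range of uppercase letters (A to Z)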
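
To see the three parts working together, here is a minimal sketch that lists which capital letters the pattern matches in the sample text. The variable names (pattern, matched, m_index) are illustrative only.

```python
import re

text = "Varie Tutto l'anno Nessuna Triennale,Magistrale Italia Alta"
pattern = re.compile(r'\b(?<!,)[A-Z]')

# Every capital letter at a word boundary that is not directly preceded by a comma
matched = [m.group() for m in pattern.finditer(text)]
print(matched)  # ['V', 'T', 'N', 'T', 'I', 'A']

# The M of Magistrale directly follows a comma, so the lookbehind rejects it
m_index = text.index('M')
m_is_matched = any(m.start() == m_index for m in pattern.finditer(text))
print(m_is_matched)  # False
```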

1. Split

In Python:

import re

text = "Varie Tutto l'anno Nessuna Triennale,Magistrale Italia Alta"

words = re.split(r'\b(?<!,)[A-Z]', text)
print(words)

Prints:

['', 'arie ', "utto l'anno ", 'essuna ', 'riennale,Magistrale ', 'talia ', 'lta']

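OK, but there is an empty string at the start of the result, because the pattern matches at position 0.

Whoops, the capital letters got swallowed: [A-Z] is part of the delimiter, and re.split discards whatever the delimiter matches. The splitter should not include [A-Z] in the delimiter.

Related question: In Python, how do I split a string and keep the separators?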

2. Find all

re.findall(r'\b(?<!,)[A-Z][^A-Z]*', text)

Prints:

['Varie ', "Tutto l'anno ", 'Nessuna ', 'Triennale,', 'Italia ', 'Alta']

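But this swallows everything after the comma: Magistrale is missing, so the result contains Triennale, instead of the expected Triennale,Magistrale. The [^A-Z]* part stops at the M, and the lookbehind then prevents a new match from starting there.

Related question: Split a string at uppercase letters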
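
One possible way to keep the comma-joined word with findall is to let the match continue over a capital letter when that capital directly follows a comma. This is only a sketch of that idea, not part of the original answer:

```python
import re

text = "Varie Tutto l'anno Nessuna Triennale,Magistrale Italia Alta"

# After the first capital, accept either a non-capital character
# or a capital letter that directly follows a comma
words = re.findall(r'\b(?<!,)[A-Z](?:[^A-Z]|(?<=,)[A-Z])*', text)
print(words)
# ['Varie ', "Tutto l'anno ", 'Nessuna ', 'Triennale,Magistrale ', 'Italia ', 'Alta']
```

The trailing spaces are kept because [^A-Z] also matches whitespace.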

3. Substitute and split

Use both a lookbehind and a lookahead to insert the NUL character \0 as a temporary delimiter, then split on it.

re.sub(r'(?<!,)\b(?=[A-Z])', '\0', text).split('\0')

Prints:

['', 'Varie ', "Tutto l'anno ", 'Nessuna ', 'Triennale,Magistrale ', 'Italia ', 'Alta']

There is still an empty string at the start of the result, because a delimiter is inserted before the first word.

Related question: Python regex lookbehind and lookahead
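
If the leading empty string is unwanted, it can be filtered out afterwards. The sketch below also assumes Python 3.7 or later, where re.split accepts a pattern that matches the empty string, so the NUL detour is not needed:

```python
import re

text = "Varie Tutto l'anno Nessuna Triennale,Magistrale Italia Alta"

# Zero-width split points: a word boundary not preceded by a comma
# and followed by a capital letter (needs Python 3.7+ for empty matches)
parts = re.split(r'(?<!,)\b(?=[A-Z])', text)

# Drop the empty string produced by the split point at position 0
words = [p for p in parts if p]
print(words)
# ['Varie ', "Tutto l'anno ", 'Nessuna ', 'Triennale,Magistrale ', 'Italia ', 'Alta']
```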

4. Split as intended

re.compile(r'(?<!,) (?=[A-Z])').split(text)

Prints:

['Varie', "Tutto l'anno", 'Nessuna', 'Triennale,Magistrale', 'Italia', 'Alta']

Now all the separating spaces are swallowed (e.g. the result contains Varie without its trailing space), because the space itself was the delimiter.
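
For reuse, the pattern can be compiled once and wrapped in a small function. This is a minimal sketch; the names _SPLITTER and split_phrase are hypothetical and do not come from the question:

```python
import re

# A space that is not preceded by a comma and is followed by a capital letter
_SPLITTER = re.compile(r'(?<!,) (?=[A-Z])')


def split_phrase(text):
    """Hypothetical helper: split text at spaces that precede a capital letter."""
    return _SPLITTER.split(text)


first = split_phrase("Varie Tutto l'anno Nessuna Triennale,Magistrale Italia Alta")
print(first)
# ['Varie', "Tutto l'anno", 'Nessuna', 'Triennale,Magistrale', 'Italia', 'Alta']

# A comma followed by a space is also kept together, thanks to the lookbehind
second = split_phrase("Varie Triennale, Magistrale Italia")
print(second)
# ['Varie', 'Triennale, Magistrale', 'Italia']
```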

I doubt this is the best approach, but it does solve your problem. This is what I came up with:

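import re

# Named text rather than input, to avoid shadowing the built-in input()
text = "Varie Tutto l'anno Nessuna Triennale,Magistrale Italia Alta"
# Turn the whitespace around capitalised chunks into newlines (raw string so \s is a regex escape)
output = re.sub(r'(\s|^)([A-Z][^A-Z]*)($|\s)', r'\n\2\n', text)
output = re.split("\n", output)
# Drop the empty strings left over from the split
output = list(filter(None, output))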

output

['Varie', "Tutto l'anno", 'Nessuna', 'Triennale,Magistrale', 'Italia', 'Alta']
