Python splitting a phrase into words if there is a capital letter, but not if there is a comma between them

Question

I am trying to split a phrase into words.

Example

Input is:

Varie Tutto l'anno Nessuna Triennale,Magistrale Italia Alta

Expected output is:

['Varie ', "Tutto l'anno ", 'Nessuna ', 'Triennale, Magistrale ', 'Italia ', 'Alta']

Disclaimer

I managed to split the phrase at capital letters. However, some words are joined by a comma and I need to keep those together, so I would like to exclude the split in that case. I am just starting out with regular expressions, and the re.split function is really confusing.

What I tried

I have tried the following:

re.split('(?=[A-Z])+|[,]',text )

However, this doesn't do what I want: it also splits when it encounters a comma, which is the opposite of what I am trying to do.

How can I do that?

Answer 1

TL;DR: Split on a space using look-arounds: re.compile(r'(?<!,) (?=[A-Z])').split(text). Details are in section 4.

0. Try your regex

Try your regex on regex101 or regexpl.net (with Python flavour).

See the demo:

\b(?<!,)[A-Z]

Uses:

\b word boundary (the position between a word character and a non-word character such as whitespace, a comma or a dot)
(?<!,) negative lookbehind: what follows must not be preceded by a comma
[A-Z] character range of uppercase letters (A to Z)
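
To see what the negative lookbehind actually filters out, the following minimal sketch compares the capitals matched with and without it. The variable names are illustrative only.

```python
import re

text = "Varie Tutto l'anno Nessuna Triennale,Magistrale Italia Alta"

# Start positions of capitals at a word boundary, without the comma check.
all_capitals = [m.start() for m in re.finditer(r'\b[A-Z]', text)]
# The same search, but rejecting capitals directly preceded by a comma.
kept_capitals = [m.start() for m in re.finditer(r'\b(?<!,)[A-Z]', text)]

# Capitals that the lookbehind rejects.
rejected = [text[i] for i in all_capitals if i not in kept_capitals]
print(rejected)
```

Only the M of Magistrale is rejected, because it is the only capital that directly follows a comma.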

1. Split

In Python:

import re

text = "Varie Tutto l'anno Nessuna Triennale,Magistrale Italia Alta"

words = re.split(r'\b(?<!,)[A-Z]', text)
print(words)

Prints:

['', 'arie ', "utto l'anno ", 'essuna ', 'riennale,Magistrale ', 'talia ', 'lta']

OK, but there is an empty string at the start of the result.

Whoops, the capital letters got swallowed too... let's fix the splitter. It should not consume the [A-Z] as part of the delimiter. Related question:

In Python, how do I split a string and keep the separators?
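
One way to keep the capitals, as a minimal sketch: wrap the [A-Z] in a capturing group so that re.split returns the delimiters as well, then glue each capital back onto the chunk that follows it. The variable names are illustrative, and this pairing assumes the text starts with a capital (anything before the first capital is dropped).

```python
import re

text = "Varie Tutto l'anno Nessuna Triennale,Magistrale Italia Alta"

# With a capturing group, re.split also returns the matched delimiters:
# [before, capital, chunk, capital, chunk, ...]
parts = re.split(r'\b(?<!,)([A-Z])', text)

# Re-attach each captured capital to the chunk that follows it.
# parts[0] (the text before the first capital) is skipped.
words = [capital + rest for capital, rest in zip(parts[1::2], parts[2::2])]
print(words)
```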

2. Find all

re.findall(r'\b(?<!,)[A-Z][^A-Z]*', text)

Prints:

['Varie ', "Tutto l'anno ", 'Nessuna ', 'Triennale,', 'Italia ', 'Alta']

But this drops everything after the comma: Magistrale is missing, so the result contains Triennale, where Triennale,Magistrale was expected.

Related question: Split a string at uppercase letters

3. Substitute and split

Use both a lookbehind and a lookahead to insert the NUL character \0 as a temporary delimiter before each capital, then split on it.

re.sub(r'(?<!,)\b(?=[A-Z])', '\0', text).split('\0')

Prints:

['', 'Varie ', "Tutto l'anno ", 'Nessuna ', 'Triennale,Magistrale ', 'Italia ', 'Alta']

There is still an empty string at the start of the result.
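
The leading empty string appears because the pattern also matches at position 0, before the first capital. A simple way to drop it, sketched here under the assumption that the input never contains a NUL character itself, is to filter out empty items afterwards.

```python
import re

text = "Varie Tutto l'anno Nessuna Triennale,Magistrale Italia Alta"

# Mark every split position with NUL, then split on it.
parts = re.sub(r'(?<!,)\b(?=[A-Z])', '\0', text).split('\0')

# Keep only the non-empty pieces.
words = [part for part in parts if part]
print(words)
```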

Related question: Python regex lookbehind and lookahead

4. Split as intended

re.compile(r'(?<!,) (?=[A-Z])').split(text)

Prints:

['Varie', "Tutto l'anno", 'Nessuna', 'Triennale,Magistrale', 'Italia', 'Alta']

Now the spaces between the words are swallowed (e.g. the result contains Varie without its trailing space), because the space itself is the delimiter.
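
If the trailing spaces from the expected output matter, one option is to split at the empty position between a space and a capital instead of on the space itself. This is only a sketch and it assumes Python 3.7 or later, where re.split accepts patterns that match an empty string.

```python
import re

text = "Varie Tutto l'anno Nessuna Triennale,Magistrale Italia Alta"

# Zero-width split point: preceded by a space, followed by a capital.
# Nothing is consumed, so each word keeps its trailing space.
words = re.split(r'(?<= )(?=[A-Z])', text)
print(words)
```

A capital after a comma is not a split point because it is not preceded by a space, and the start of the string does not match either, so no empty string is produced.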

Answer 2

I doubt this is the best approach, but it does solve your problem. This is what I came up with:

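import re

# Named `text` rather than `input` to avoid shadowing the built-in function.
text = "Varie Tutto l'anno Nessuna Triennale,Magistrale Italia Alta"
# Surround each match (a capital plus the following non-capitals, delimited by
# whitespace or the string ends) with newlines. Raw string keeps \s valid.
output = re.sub(r'(\s|^)([A-Z][^A-Z]*)($|\s)', r'\n\2\n', text)
# Split on the newlines and drop the empty strings.
output = re.split("\n", output)
output = list(filter(None, output))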

output

['Varie', "Tutto l'anno", 'Nessuna', 'Triennale,Magistrale', 'Italia', 'Alta']
