I am trying to split a phrase into words.
Input is:
Varie Tutto l'anno Nessuna Triennale,Magistrale Italia Alta
Expected output is:
['Varie ', "Tutto l'anno ", 'Nessuna ', 'Triennale, Magistrale ', 'Italia ', 'Alta']
I successfully managed to split the phrase into words when there are capital letters, however I realized that some of them have a comma in between (and I actually need to keep that together) so I would like to exclude the splitting in that case. I am just starting with this kind of stuff and this re.split function is really confusing.
I have tried the following:
re.split('(?=[A-Z])+|[,]',text )
However, this doesn't do exactly what I want, because it also splits when it encounters a comma, and I am trying to do the opposite.
How can I do that?
TL;DR: Split on a space with look-arounds: re.compile(r'(?<!,) (?=[A-Z])').split(text). Details are in the fourth approach below.
Try your regex on regex101 or regexpl.net (with Python flavour).
See the demo:
\b(?<!,)[A-Z]
This uses:
\b : word boundary (the position between a word character and a non-word character such as whitespace, a comma, or a dot)
(?<!,) : negative lookbehind: what follows must not be preceded by a comma
[A-Z] : character range of uppercase letters (A to Z)
In Python:
import re
text = "Varie Tutto l'anno Nessuna Triennale,Magistrale Italia Alta"
words = re.split(r'\b(?<!,)[A-Z]', text)
print(words)
Prints:
['', 'arie ', "utto l'anno ", 'essuna ', 'riennale,Magistrale ', 'talia ', 'lta']
OK, but there is an empty string in the result.
Whoops, the capital letters got swallowed too. Let's fix the splitter so that it does not consume the [A-Z] as part of the delimiter:
In Python, how do I split a string and keep the separators?
re.findall(r'\b(?<!,)[A-Z][^A-Z]*', text)
Prints:
['Varie ', "Tutto l'anno ", 'Nessuna ', 'Triennale,', 'Italia ', 'Alta']
But this swallows everything after the comma (e.g. Magistrale is missing: the result has Triennale, instead of the expected Triennale,Magistrale).
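One possible way to keep the findall approach is to let a match continue across a capital letter that directly follows a comma. This is only a sketch that I checked against the sample input from the question; the names pattern and words are just for illustration.

```python
import re

text = "Varie Tutto l'anno Nessuna Triennale,Magistrale Italia Alta"

# Start at a capital letter that is not preceded by a comma, then keep
# consuming either non-capital characters or a capital letter that comes
# directly after a comma.
pattern = re.compile(r"\b(?<!,)[A-Z](?:[^A-Z]|(?<=,)[A-Z])*")

words = pattern.findall(text)
print(words)
# ['Varie ', "Tutto l'anno ", 'Nessuna ', 'Triennale,Magistrale ', 'Italia ', 'Alta']
```

The trailing spaces are kept because [^A-Z] also matches whitespace.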
Split a string at uppercase letters, using both a lookbehind and a lookahead, with the NUL character \0 as a temporary delimiter:
re.sub(r'(?<!,)\b(?=[A-Z])', '\0', text).split('\0')
Prints:
['', 'Varie ', "Tutto l'anno ", 'Nessuna ', 'Triennale,Magistrale ', 'Italia ', 'Alta']
There is still an empty string in the result.
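If the leading empty string is the only problem, it can be dropped afterwards. A minimal sketch, assuming the same sample text; parts and words are illustrative names.

```python
import re

text = "Varie Tutto l'anno Nessuna Triennale,Magistrale Italia Alta"

# Insert a NUL character before every capital letter that starts a word
# and is not preceded by a comma, then split on that NUL character.
parts = re.sub(r'(?<!,)\b(?=[A-Z])', '\0', text).split('\0')

# Drop the empty string produced by the NUL inserted at the very start.
words = [part for part in parts if part]
print(words)
# ['Varie ', "Tutto l'anno ", 'Nessuna ', 'Triennale,Magistrale ', 'Italia ', 'Alta']
```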
Python regex lookbehind and lookahead
re.compile(r'(?<!,) (?=[A-Z])').split(text)
Prints:
['Varie', "Tutto l'anno", 'Nessuna', 'Triennale,Magistrale', 'Italia', 'Alta']
Now all the separating spaces are swallowed (e.g. the result contains 'Varie' instead of 'Varie ') because the space was the delimiter.
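If the trailing spaces should be kept, as in the expected output of the question, one option is to split at the empty position between the whitespace and the capital letter instead of on the space itself. This sketch assumes Python 3.7 or later, where re.split can split on empty matches; on older versions it will not split at all.

```python
import re

text = "Varie Tutto l'anno Nessuna Triennale,Magistrale Italia Alta"

# Zero-width delimiter: the position that is preceded by whitespace and
# followed by a capital letter. Nothing is consumed, so the spaces stay
# attached to the preceding word. A capital letter after a comma is not
# preceded by whitespace, so no split happens there.
words = re.split(r'(?<=\s)(?=[A-Z])', text)
print(words)
# ['Varie ', "Tutto l'anno ", 'Nessuna ', 'Triennale,Magistrale ', 'Italia ', 'Alta']
```

There is no empty string at the start either, because the first capital letter is not preceded by whitespace.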
I firmly believe that what I have come up with is not the best idea, but it is a solution to your problem:
import re
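text = "Varie Tutto l'anno Nessuna Triennale,Magistrale Italia Alta"
# Surround each matched capitalised word with newlines; the whitespace
# around the match is consumed by the pattern.
output = re.sub(r'(\s|^)([A-Z][^A-Z]*)($|\s)', r'\n\2\n', text)
# Split on the newlines and drop the empty strings.
output = [word for word in output.split("\n") if word]
print(output)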
['Varie', "Tutto l'anno", 'Nessuna', 'Triennale,Magistrale', 'Italia', 'Alta']