Split string at capital letter but only if no whitespace

Question

Set-up

I've got a string of names which need to be separated into a list.

Following this answer, I have:

import re
string = 'KreuzbergLichtenbergNeuköllnPrenzlauer Berg'
re.findall('[A-Z][a-z]*', string)

where the last line gives me,

['Kreuzberg', 'Lichtenberg', 'Neuk', 'Prenzlauer', 'Berg']

Problems

1) Whitespace is ignored

'Prenzlauer Berg' is a single name, but the code applies the 'split-at-capital-letter' rule and breaks it at the 'B'.

How do I prevent a split at a capital letter when the preceding character is whitespace?

2) Special characters not handled well

The code cannot handle 'ö'. How do I include such German characters?

I.e., I want to obtain,

['Kreuzberg', 'Lichtenberg', 'Neukölln', 'Prenzlauer Berg']

Answer 1

You can use positive and negative lookbehinds and just list the umlauts explicitly:

>>> import re
>>> string = 'KreuzbergLichtenbergNeuköllnPrenzlauer Berg'
>>> re.findall(r'(?<!\s)[A-ZÄÖÜ](?:[a-zäöüß\s]|(?<=\s)[A-ZÄÖÜ])*', string)
['Kreuzberg', 'Lichtenberg', 'Neukölln', 'Prenzlauer Berg']

(?<!\s)... : matches ... only if it is not preceded by \s (whitespace)

(?<=\s)... : matches ... only if it is preceded by \s

(?:...) : non-capturing group, so that findall returns the whole match rather than just the group's contents
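To see each of those pieces in isolation, here is a minimal sketch. The sample strings 'Ab Cd' and 'AbCd' and the variable names are made up for illustration and are not from the question.

```python
import re

sample = 'Ab Cd'  # made-up sample, not from the question

# Negative lookbehind: capitals NOT preceded by whitespace.
not_after_space = re.findall(r'(?<!\s)[A-Z]', sample)
print(not_after_space)  # ['A']

# Positive lookbehind: capitals that ARE preceded by whitespace.
after_space = re.findall(r'(?<=\s)[A-Z]', sample)
print(after_space)  # ['C']

# With a capturing group, findall returns only the group's text ...
capturing = re.findall(r'[A-Z](b|d)', 'AbCd')
print(capturing)  # ['b', 'd']

# ... whereas a non-capturing group leaves the whole match intact.
non_capturing = re.findall(r'[A-Z](?:b|d)', 'AbCd')
print(non_capturing)  # ['Ab', 'Cd']
```

Note that the pattern in the answer also accepts whitespace inside a name, so an input with trailing whitespace would carry it into the last match.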

Answer 2

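This works:

import re
string = "KreuzbergLichtenbergNeuköllnPrenzlauer Berg"
# Try the two-word form first, then fall back to a single word.
# Note: the range a-ü spans U+0061 to U+00FC, so it also matches non-letters such as '{' and '÷'.
pattern = r"[A-Z][a-ü]+\s[A-Z][a-ü]+|[A-Z][a-ü]+"
re.findall(pattern, string)
# ['Kreuzberg', 'Lichtenberg', 'Neukölln', 'Prenzlauer Berg']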
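The order of the two alternatives matters here, because the regex engine tries them left to right and takes the first one that matches. A small sketch of the difference, assuming the same input string; the helper names `word`, `two_word_first` and `one_word_first` are my own, not part of the answer:

```python
import re

string = "KreuzbergLichtenbergNeuköllnPrenzlauer Berg"
word = "[A-Z][a-ü]+"  # one capitalised word, same class as in the answer

# Two-word alternative first: 'Prenzlauer Berg' stays together.
two_word_first = re.findall(rf"{word}\s{word}|{word}", string)
print(two_word_first)
# ['Kreuzberg', 'Lichtenberg', 'Neukölln', 'Prenzlauer Berg']

# Single-word alternative first: it always wins, so the name is split.
one_word_first = re.findall(rf"{word}|{word}\s{word}", string)
print(one_word_first)
# ['Kreuzberg', 'Lichtenberg', 'Neukölln', 'Prenzlauer', 'Berg']
```

As written, the pattern only joins names of exactly two words; a name with three or more parts would still be split.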

2 answers

solution1
3 ACCEPTED 2017-11-27 10:50:39

solution2
0 2017-11-27 10:56:27