
Split string at capital letter but only if no whitespace

Set-up

I've got a string of names which need to be separated into a list.

Following this answer, I have:

import re

string = 'KreuzbergLichtenbergNeuköllnPrenzlauer Berg'
re.findall('[A-Z][a-z]*', string)

where the last line gives me,

['Kreuzberg', 'Lichtenberg', 'Neuk', 'Prenzlauer', 'Berg']

Problems

1) Whitespace is ignored

'Prenzlauer Berg' is actually one name, but the code splits it according to the 'split-at-capital-letter' rule.

How do I prevent a split at a capital letter when the preceding character is whitespace?

2) Special characters not handled well

The code cannot handle 'ö'. How do I include such 'German' characters?

I.e., I want to obtain:

['Kreuzberg', 'Lichtenberg', 'Neukölln', 'Prenzlauer Berg']

You can use positive and negative lookbehinds and just list the umlauts explicitly:

>>> import re
>>> string = 'KreuzbergLichtenbergNeuköllnPrenzlauer Berg'
>>> re.findall(r'(?<!\s)[A-ZÄÖÜ](?:[a-zäöüß\s]|(?<=\s)[A-ZÄÖÜ])*', string)
['Kreuzberg', 'Lichtenberg', 'Neukölln', 'Prenzlauer Berg']

(?<!\s)... : matches ... that is not preceded by \s

(?<=\s)... : matches ... that is preceded by \s

(?:...) : non-capturing group so as to not mess with the findall results
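To make the pieces easier to follow, here is the same pattern written out in verbose mode, as a minimal sketch. The names NAME_PATTERN and split_names are only illustrative and do not come from the answer.

```python
import re

# The pattern from the answer, spread out with re.VERBOSE so that
# each part can carry its own comment. NAME_PATTERN and split_names
# are hypothetical names chosen for this sketch.
NAME_PATTERN = re.compile(r"""
    (?<!\s)              # start only where the previous character is not whitespace
    [A-ZÄÖÜ]             # an uppercase letter opens a name
    (?:
        [a-zäöüß\s]      # lowercase letters or whitespace continue the name
      | (?<=\s)[A-ZÄÖÜ]  # an uppercase letter continues it only right after whitespace
    )*
""", re.VERBOSE)


def split_names(text):
    """Return every run of text that the pattern treats as one name."""
    return NAME_PATTERN.findall(text)


names = split_names('KreuzbergLichtenbergNeuköllnPrenzlauer Berg')
print(names)
# ['Kreuzberg', 'Lichtenberg', 'Neukölln', 'Prenzlauer Berg']
```

Because \s is part of the continuation class, a name followed by trailing whitespace keeps that whitespace, so calling .strip() on each result may be worthwhile for messier input.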

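This works for the given input:

import re

string = "KreuzbergLichtenbergNeuköllnPrenzlauer Berg"
# Note: the range a-ü spans U+0061 to U+00FC, so it also matches some
# punctuation and accented uppercase letters, not only lowercase letters.
pattern = r"[A-Z][a-ü]+\s[A-Z][a-ü]+|[A-Z][a-ü]+"
re.findall(pattern, string)
#>>>['Kreuzberg', 'Lichtenberg', 'Neukölln', 'Prenzlauer Berg']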
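Both answers above match the names with re.findall. As an alternative sketch that neither answer proposes, the string can also be split at zero-width positions with re.split. This assumes Python 3.7 or later, where re.split accepts patterns that match the empty string, and it assumes that only the uppercase letters listed in UPPER occur. The names UPPER and split_at_capitals are hypothetical.

```python
import re

# Assumption: these are the only uppercase letters that start a name.
UPPER = 'A-ZÄÖÜ'


def split_at_capitals(text):
    """Split before each uppercase letter that directly follows a non-whitespace character."""
    # (?<=\S)  : the previous character exists and is not whitespace
    # (?=[...]): the next character is an uppercase letter
    # Both are zero-width, so no characters are consumed by the split.
    return re.split(rf'(?<=\S)(?=[{UPPER}])', text)


parts = split_at_capitals('KreuzbergLichtenbergNeuköllnPrenzlauer Berg')
print(parts)
# ['Kreuzberg', 'Lichtenberg', 'Neukölln', 'Prenzlauer Berg']
```

Since the lookbehind needs a preceding character, no split happens at the start of the string, and 'Berg' stays attached to 'Prenzlauer' because the character before it is a space.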
