Python: split a string into runs of same-language characters

I want to split strings like "hiسلامaliعلی" into ["hi", "سلام", "ali", "علی"].

The input string contains only English and Persian characters (with or without spaces), and I want to split it into contiguous runs of characters from the same language.

Is there an easy way to extract the contiguous English characters from the string and split off the remaining characters?

You can split on runs of ASCII letters with re.split():

re.split(r'([a-zA-Z]+)', inputstring)

Demo with Python 3:

>>> import re
>>> inputstring = "hiسلامaliعلی"
>>> re.split(r'([a-zA-Z]+)', inputstring)
['', 'hi', 'سلام', 'ali', 'علی']
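The parentheses in the pattern matter here: when the pattern contains a capturing group, re.split() returns the matched text as well as the pieces between matches. A minimal sketch comparing the two forms (the variable names without_group and with_group are only illustrative):

```python
import re

inputstring = "hiسلامaliعلی"

# No capturing group: the ASCII runs act as separators and are discarded.
without_group = re.split(r'[a-zA-Z]+', inputstring)

# Capturing group: the matched ASCII runs are kept in the result list.
with_group = re.split(r'([a-zA-Z]+)', inputstring)

print(without_group)
print(with_group)
```

Only the second form keeps "hi" and "ali", which is why the answer wraps the character class in a group.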

Extending this to the accented letters in the Latin-1 range (note that \xC0-\xFF also covers the two non-letter signs × and ÷):

re.split(r'([a-zA-Z\xC0-\xFF]+)', inputstring)
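To see what the wider character class changes, here is a small sketch using a made-up input that starts with an accented Latin word; the sample string and variable names are assumptions for illustration, not part of the question:

```python
import re

# Hypothetical input: "café" (with a precomposed é, U+00E9) followed by Persian text.
sample = "caf\u00e9سلام"

# ASCII-only class: the run stops at "é", which ends up glued to the Persian text.
ascii_only = re.split(r'([a-zA-Z]+)', sample)

# Latin-1 class: "é" is inside \xC0-\xFF, so the whole word stays together.
latin1 = re.split(r'([a-zA-Z\xC0-\xFF]+)', sample)

# Caveat: the range also contains the multiplication sign (U+00D7).
times_sign_matches = re.fullmatch(r'[a-zA-Z\xC0-\xFF]+', '\u00d7') is not None

print(ascii_only)
print(latin1)
print(times_sign_matches)
```

If your input can contain × or ÷ and you do not want them treated as letters, the range would need to be split around those two code points.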

For Python 2, make sure you use unicode strings and prefix the regular expression with u (the ur prefix is Python 2 only and is a syntax error in Python 3):

re.split(ur'([a-zA-Z\xC0-\xFF]+)', inputstring)

In all cases, if the Latin text is at the start or end of the input, re.split() returns an empty string at that position; you can filter these out with:

result = [s for s in re.split(r'([a-zA-Z\xC0-\xFF]+)', inputstring) if s]
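Since the question mentions that the input may or may not contain spaces, the split and the filter can be wrapped in one helper that also trims whitespace around each piece. This is only a sketch under that assumption; split_by_script and LATIN_RUN are hypothetical names, not something from the re module:

```python
import re

# Hypothetical helper names; the pattern is the Latin-1 one used above.
LATIN_RUN = re.compile(r'([a-zA-Z\xC0-\xFF]+)')

def split_by_script(text):
    """Split text into Latin and non-Latin runs, dropping empty pieces."""
    # Strip surrounding whitespace from every piece, then discard
    # anything that is empty (leading/trailing splits, bare spaces).
    parts = (part.strip() for part in LATIN_RUN.split(text))
    return [part for part in parts if part]

print(split_by_script("hiسلامaliعلی"))
print(split_by_script("hi سلام ali علی"))
```

Both calls produce the same four pieces. Note that whitespace inside a non-Latin run (for example between two Persian words) is left untouched; only the edges of each piece are trimmed.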

 