
RegEx for Ukrainian letters. How to separate Cyrillic words by capital letter?

I have a string with some Cyrillic words in it. Each word starts with a capital letter.

var str = 'ХєлпМіПліз';

I have found this solution: str.match(/[А-Я][а-я]+/g).

But it returns ["Пл"] instead of ["Хєлп", "Мі", "Пліз"]. It seems that it doesn't recognize Ukrainian letters ('і', 'є'), only Russian ones.

So, how do I have to change that regex to include Ukrainian letters?

[А-Я] is not the Cyrillic alphabet, it's just Russian!

Cyrillic is a writing system. It is used in the alphabets of many languages. (Like Latin, which is the character set for West European languages, East European ones, etc.)

To cover both Russian and Ukrainian, use [А-ЯҐЄІЇ].

To add Belarusian: [А-ЯҐЄІЇЎ]
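As a minimal sketch of how the extended class could be applied to the question's string: the helper name `splitByCapital` and the matching lowercase class are assumptions added for illustration. The letters are spelled as `\u` escapes so that a look-alike Latin character cannot slip into the pattern.

```javascript
// Russian range plus the Ukrainian extras, spelled as code points.
const UPPER = '\u0410-\u042F\u0404\u0406\u0407\u0490'; // А-Я Є І Ї Ґ
const LOWER = '\u0430-\u044F\u0454\u0456\u0457\u0491'; // а-я є і ї ґ

// One uppercase letter followed by one or more lowercase letters.
const capitalizedWord = new RegExp(`[${UPPER}][${LOWER}]+`, 'g');

// Hypothetical helper: returns the capitalized words, or [] if none match.
function splitByCapital(text) {
  return text.match(capitalizedWord) || [];
}

console.log(splitByCapital('ХєлпМіПліз')); // [ 'Хєлп', 'Мі', 'Пліз' ]
```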

And for all Cyrillic characters (including the Balkan languages and Old Cyrillic), you can use a Unicode script class, such as \\p{IsCyrillic}
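The \\p{IsCyrillic} spelling is the Java form. In JavaScript engines that implement ES2018 Unicode property escapes, a comparable script class is written \p{Script=Cyrillic} and needs the u flag. A small sketch, with `isCyrillicWord` as a hypothetical helper name:

```javascript
// \p{Script=Cyrillic} requires the "u" flag; without it, \p is read as a plain "p".
const cyrillicOnly = /^\p{Script=Cyrillic}+$/u;

// Hypothetical helper: true when every character belongs to the Cyrillic script.
function isCyrillicWord(word) {
  return cyrillicOnly.test(word);
}

console.log(isCyrillicWord('ХєлпМіПліз')); // true
console.log(isCyrillicWord('Help'));       // false
```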


To deal with Ukrainian separately:

[А-ЩЬЮЯҐЄІЇ] or [А-ЩЬЮЯҐЄІЇа-щьюяґєії] seems to be the full Ukrainian alphabet, 33 letters in each case.
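One way to double-check the count of 33 is to enumerate the Cyrillic block and collect what the class matches. This is only a sketch: `lettersInClass` is a hypothetical helper, and the class is rewritten with code points to rule out look-alike characters.

```javascript
// Uppercase Ukrainian class: А-Щ (U+0410-U+0429), Ь, Ю, Я, then Ґ Є І Ї.
const ukrUpper = /[\u0410-\u0429\u042C\u042E\u042F\u0490\u0404\u0406\u0407]/;

// Hypothetical helper: every character in [from, to] that the regex matches.
function lettersInClass(re, from = 0x0400, to = 0x04ff) {
  const found = [];
  for (let cp = from; cp <= to; cp++) {
    const ch = String.fromCodePoint(cp);
    if (re.test(ch)) found.push(ch);
  }
  return found;
}

const upper = lettersInClass(ukrUpper);
console.log(upper.length); // 33
console.log(upper.join(''));
```

The Russian-only letters Ё, Ъ, Ы and Э fall outside this class, which is what distinguishes it from [А-Я] with the Ukrainian extras added.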

The apostrophe is not a letter, but it is occasionally included in the alphabet because it affects the following vowel. The apostrophe is part of a word, not a divider. It may be written in a few ways:

U+0027 "'" APOSTROPHE
U+0060 "`" GRAVE ACCENT
U+2019 "’" RIGHT SINGLE QUOTATION MARK
U+02BC "ʼ" MODIFIER LETTER APOSTROPHE

and maybe some more.

Yes, the apostrophe is a bit complicated. There is no common standard for it.
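Since several characters can stand in for the apostrophe, one option is to normalize them before matching. The sketch below is an assumption, not an established convention: it rewrites only apostrophe-like characters that sit between two Cyrillic letters, and the choice of U+02BC as the default target is arbitrary. It relies on ES2018 lookbehind and Unicode property escapes.

```javascript
// The four apostrophe variants listed above, only when flanked by Cyrillic letters.
const innerApostrophe =
  /(?<=\p{Script=Cyrillic})[\u0027\u0060\u2019\u02BC](?=\p{Script=Cyrillic})/gu;

// Hypothetical helper: replaces in-word apostrophe variants with one target character.
function normalizeApostrophes(text, target = '\u02BC') {
  return text.replace(innerApostrophe, target);
}

console.log(normalizeApostrophes("з'їсти"));
console.log(normalizeApostrophes('з’їсти'));
```

Quotation marks around a word are left alone, because they do not have a Cyrillic letter on both sides.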

Use \\p{Lu} to match an uppercase letter, \\p{Ll} to match a lowercase letter, or \\p{L} to match any letter.

Update: as written, that works in Java; in JavaScript, \p{...} classes are only recognized in regexes that carry the u flag. Don't forget to include the apostrophe and "ї" (ji) in your regexp.
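In JavaScript engines that implement ES2018 Unicode property escapes, \p{Lu} and \p{Ll} become available once the regex has the u flag. A minimal sketch, with `splitWords` as a hypothetical helper name; note that it is not limited to Cyrillic and also splits Latin text.

```javascript
// One uppercase letter of any script followed by one or more lowercase letters.
const capitalized = /\p{Lu}\p{Ll}+/gu;

// Hypothetical helper: returns the capitalized words, or [] if none match.
function splitWords(text) {
  return text.match(capitalized) || [];
}

console.log(splitWords('ХєлпМіПліз')); // [ 'Хєлп', 'Мі', 'Пліз' ]
console.log(splitWords('HelpMePlease')); // [ 'Help', 'Me', 'Please' ]
```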

The way to solve this is to look at the Unicode table to determine the character ranges you need. If, for example, I use the pattern:

str.match(/[А-Я][а-яєі]+/g)

it works with your example string. (Sorry, I don't know the Ukrainian letters.)
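Instead of reading the Unicode table by eye, the characters that fall outside a range can be listed programmatically. A rough sketch, where `outsideRange` is a hypothetical helper and the default range is а-я (U+0430 to U+044F):

```javascript
// Hypothetical helper: lists the distinct lowercase characters of `text` whose
// code points fall outside [first, last], together with their code points.
function outsideRange(text, first = '\u0430', last = '\u044F') {
  const lo = first.codePointAt(0);
  const hi = last.codePointAt(0);
  const seen = new Set();
  for (const ch of text.toLowerCase()) {
    const cp = ch.codePointAt(0);
    if (cp < lo || cp > hi) {
      seen.add(`${ch} U+${cp.toString(16).toUpperCase().padStart(4, '0')}`);
    }
  }
  return [...seen];
}

console.log(outsideRange('ХєлпМіПліз')); // [ 'є U+0454', 'і U+0456' ]
```

Each character reported this way is a candidate to add to the character class by hand.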

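[А-Я][а-я] really doesn't include Ukrainian letters.

While 'я' is U+044F, 'є' is U+0454 and 'і' is U+0456 (U+0404 for Є), so they fall outside the а-я range. You should include them in the regex by hand:

/[А-ЯЄІ][а-яєі]+/g

Make sure to type the Cyrillic 'і' (U+0456) here, not the look-alike Latin 'i'.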

The Ukrainian alphabet has four letters that the Russian alphabet lacks: [і, є, ї, ґ]. A word can also contain an apostrophe (single quote):

"ґуля, з'їсти, істота, Європа".match(/[а-яієїґ\']+/ig)

The i flag at the end makes the match case-insensitive, so upper-case letters match too, as in "Європа".

works with Ukrainian letters 'i' and others

python
r's/[^а-яА-Я.!?]/./g+' 

Ukrainian only, without Russian:

/[бвгґджзклмнпрстфхцчшщйаеєиіїоуюяь]/gi

