
How to extract words from a text using Python?

I need to extract the words and phrases within a text. For example, the text is:

Привет, hello, как дела? english word, еще одно русское слово, слово-1224, тест 4456

And the script should return the following:

Привет
как
дела
еще
одно
русское
слово
слово-1224

That is, I need to extract from the text all the words that begin with Russian letters ( [а-яА-Яё-] ) and can contain digits and letters of the Russian alphabet. How can this be implemented?

This was a little trickier than I thought, as I have never worked with Cyrillic characters. I believe this should do it:

import re

text = ''  # Set your input Unicode string here.
# The built-in re module does not support \p{IsCyrillic}, so the ranges are spelled out.
words = re.findall(r'[а-яА-ЯёЁ][0-9а-яА-ЯёЁ-]*', text)

for word in words:
    print(word)
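
To check the idea against the example from the question, here is a minimal sketch for Python 3 that uses only the standard `re` module. The function name `extract_russian_words` and the variable `sample` are made up for illustration, and the `(?<!\w)` lookbehind is an added assumption so that a match can only begin at the start of a token (a Cyrillic tail glued to Latin letters or digits is skipped). Note that "тест" also satisfies the rule as stated in the question, so it is returned as well, even though it is missing from the expected list above.

```python
import re

# Explicit Cyrillic ranges: the built-in re module has no \p{...} property classes.
# First character: a Russian letter. Rest: Russian letters, digits, or hyphens.
# (?<!\w) is an assumption: it only lets a match begin at the start of a token.
RUSSIAN_WORD = re.compile(r'(?<!\w)[а-яА-ЯёЁ][0-9а-яА-ЯёЁ-]*')


def extract_russian_words(text):
    """Return every token in text that starts with a Russian letter."""
    return RUSSIAN_WORD.findall(text)


# Sample text taken from the question.
sample = ("Привет, hello, как дела? english word, "
          "еще одно русское слово, слово-1224, тест 4456")

for word in extract_russian_words(sample):
    print(word)
```

Using `*` instead of `+` for the tail means one-letter words such as "я" are matched too.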
