How to extract words from text using Python?

Question

I need to extract the words and phrases from a text. For example, the text is:

Привет, hello, как дела? english word, еще одно русское слово, слово-1224, тест 4456

And the script should return the following:

Привет
как
дела
еще
одно
русское
слово
слово-1224

That is, I need to take from the text all the words that begin with a Russian letter ( [а-яА-Яё-] ) and that may contain digits and letters of the Russian alphabet. How can this be implemented?

Answer 1

It was a little bit trickier than I thought. I have never used Cyrillic characters before. I believe this should do it:

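import re

# Set your input Unicode string here.
text = 'Привет, hello, как дела? english word, еще одно русское слово, слово-1224, тест 4456'
# The built-in re module does not support \p{IsCyrillic}, so the Russian letters are spelled out.
words = re.findall(r'[а-яА-ЯёЁ][0-9а-яА-ЯёЁ-]*', text)

for word in words:
    print(word)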
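
If stricter matching is wanted, the same idea can be wrapped in a small function. This is only a sketch under a few assumptions that are not in the question: the helper name extract_russian_words is made up, "Russian letter" is taken to mean а-я, А-Я plus ё/Ё, a hyphen is allowed only between two word parts, and tokens glued to Latin letters or starting with a digit are skipped entirely.

```python
import re

# Assumption: "Russian letter" means а-я, А-Я and ё/Ё (ё lies outside the а-я range).
_LETTER = 'а-яА-ЯёЁ'

# \b is Unicode-aware for str patterns in Python 3, so a match cannot start
# or end in the middle of a longer run of letters, digits or underscores.
_WORD_RE = re.compile(
    r'\b[' + _LETTER + r'][0-9' + _LETTER + r']*'  # first char is a Russian letter
    r'(?:-[0-9' + _LETTER + r']+)*'                # optional hyphen-joined parts
    r'\b'
)


def extract_russian_words(text):
    """Return the words in text that start with a Russian letter (hypothetical helper)."""
    return _WORD_RE.findall(text)


if __name__ == '__main__':
    sample = ('Привет, hello, как дела? english word, '
              'еще одно русское слово, слово-1224, тест 4456')
    for word in extract_russian_words(sample):
        print(word)
```

On the sample text from the question this also returns тест, since that word begins with a Russian letter, even though it is missing from the expected list in the question.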

Answer 1 (accepted), 2013-03-11 08:05:06
