How to extract words from a text using Python?
I need to extract the words and phrases within a text. For example, the text is:
Привет, hello, как дела? english word, еще одно русское слово, слово-1224, тест 4456
The script should return the following:
Привет
как
дела
еще
одно
русское
слово
слово-1224
That is, I need to take from the text all the words that begin with Russian letters ([а-яА-Яё-]) and that can contain digits and letters of the Russian alphabet. How can this be implemented?
It was a little bit trickier than I thought; I have never worked with Cyrillic characters before. I believe this should do it:
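import re

text = "Привет, hello, как дела?"  # Set your input Unicode string here.
# The standard re module has no \p{IsCyrillic} class, so spell out the ranges.
words = re.findall(r'[а-яА-ЯёЁ][0-9а-яА-ЯёЁ]+', text)
for word in words:
    print(word)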
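The question's sample also contains a hyphenated token (слово-1224) and could contain one-letter words. Below is a minimal Python 3 sketch of one way to cover both, using only the standard `re` module. The function name `extract_russian_words` and the exact pattern are illustrative assumptions rather than part of the original answer. The pattern treats a word as a Russian letter followed by Russian letters or digits, optionally continued by hyphen-separated parts.

```python
import re

# Hypothetical helper (the name is an assumption, not from the original answer).
# A word starts with a Russian letter, continues with Russian letters or digits,
# and may have further hyphen-separated parts such as "-1224".
# The lookbehind stops a match from starting in the middle of another token.
RUSSIAN_WORD = re.compile(
    r'(?<![\w-])[а-яА-ЯёЁ][0-9а-яА-ЯёЁ]*(?:-[0-9а-яА-ЯёЁ]+)*'
)


def extract_russian_words(text):
    """Return all tokens in text that begin with a Russian letter."""
    return RUSSIAN_WORD.findall(text)


sample = ("Привет, hello, как дела? english word, "
          "еще одно русское слово, слово-1224, тест 4456")
for word in extract_russian_words(sample):
    print(word)
```

With the sample text this also returns тест, since that token begins with a Russian letter. The bare number 4456 and the English words are skipped.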