Searching for Unicode characters in Python

Question

I'm working on an NLP project based on Python/NLTK with non-English Unicode text. For that, I need to search for a Unicode string inside a sentence.

There is a .txt file containing some non-English Unicode sentences. Using NLTK's PunktSentenceTokenizer, I split the text into sentences and saved them in a Python list.

from nltk.tokenize import PunktSentenceTokenizer
sentences = PunktSentenceTokenizer().tokenize(text)

Now I can iterate through the list and get each sentence separately.

What I need to do is go through each sentence and identify which words contain the given Unicode characters.

Example:

sentence = 'AASFG BBBSDC FEKGG SDFGF'

Assume the above text is non-English Unicode. I need to find words ending with GF and return the whole word (and maybe the index of that word).

search = 'SDFGF'

Similarly, I need to find words starting with BB and get the whole word.

search2 = 'BBBSDC'

Answer 1

If I understand correctly, you just have to split the sentence into words, loop over each one, and check whether it starts or ends with the required characters, e.g.:

>>> sentence = 'AASFG BBBSDC FEKGG SDFGF'.split()
>>> [word for word in sentence if word.endswith("GF")]
['SDFGF']

The .split() call could probably be replaced with something like nltk.tokenize.word_tokenize(...), applied to the original sentence string.
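
The same check works on real non-English text, since str.startswith and str.endswith compare Unicode code points in Python 3. Here is a minimal sketch, assuming Python 3 and whitespace-separated words. The helper name find_words and the Russian sample sentence are made up for illustration. It also normalises everything to NFC first, on the assumption that the file and the search string might encode accented characters differently (composed vs. decomposed).

```python
import unicodedata


def nfc(text):
    """Normalise text to NFC so equivalent character sequences compare equal."""
    return unicodedata.normalize("NFC", text)


def find_words(sentence, prefix="", suffix=""):
    """Return (index, word) pairs for words that start with prefix and end with suffix.

    Hypothetical helper: an empty prefix or suffix matches every word.
    """
    prefix, suffix = nfc(prefix), nfc(suffix)
    words = nfc(sentence).split()  # plain whitespace tokenisation
    return [
        (idx, word)
        for idx, word in enumerate(words)
        if word.startswith(prefix) and word.endswith(suffix)
    ]


sentence = "кошка собака птица рыбка"  # illustrative sample text
print(find_words(sentence, suffix="ка"))  # [(0, 'кошка'), (1, 'собака'), (3, 'рыбка')]
print(find_words(sentence, prefix="пт"))  # [(2, 'птица')]
```

Returning the index together with the word makes it easy to look at neighbouring words afterwards.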

Update, regarding the comment:

How can I get the word in front of that and the word behind it?

The enumerate function can be used to give each word a number, like this:

>>> print(list(enumerate(sentence)))
[(0, 'AASFG'), (1, 'BBBSDC'), (2, 'FEKGG'), (3, 'SDFGF')]

Then if you do the same loop, but preserve the index:

>>> results = [(idx, word) for (idx, word) in enumerate(sentence) if word.endswith("GG")]
>>> print(results)
[(2, 'FEKGG')]

...you can use the index to get the next or previous item:

>>> for r in results:
...     r_idx = r[0]
...     print("Prev", sentence[r_idx-1])
...     print("Next", sentence[r_idx+1])
...
Prev BBBSDC
Next SDFGF

You'd need to handle the case where the match is the very first or last word (if r_idx == 0, if r_idx == len(sentence) - 1). Otherwise sentence[r_idx-1] silently wraps around to the last word, and sentence[r_idx+1] raises an IndexError.
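
One way to handle those edge cases is sketched below, assuming Python 3. The helper name neighbours is made up for illustration; it returns None wherever there is no previous or next word.

```python
def neighbours(words, idx):
    """Return (previous, next) words around position idx, or None at the edges."""
    prev_word = words[idx - 1] if idx > 0 else None
    next_word = words[idx + 1] if idx < len(words) - 1 else None
    return prev_word, next_word


words = ["AASFG", "BBBSDC", "FEKGG", "SDFGF"]
for idx, word in enumerate(words):
    # Match the first and last words on purpose to exercise both edges.
    if word.startswith("AA") or word.endswith("GF"):
        print(word, neighbours(words, idx))
```

Checking idx > 0 explicitly matters because a negative index such as words[-1] is valid in Python and would return the last word instead of failing.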

Answer 1
1 Accepted 2013-08-04 12:53:48