
How to tokenize continuous words with no whitespace delimiters?

I'm using Python with nltk. I need to process some English text that contains no whitespace, but the word_tokenize function in nltk can't deal with input like this. So how can I tokenize text without any whitespace? Are there any tools in Python for this?

I am not aware of such tools, but the solution to your problem depends on the language.

For the Turkish language you can scan the input text letter by letter and accumulate the letters into a word. When you are sure that the accumulated letters form a valid dictionary word, you save them as a separate token, clear the accumulation buffer, and continue the process with the next word.

You can try this for English, but I assume you may run into situations where the ending of one word is also the beginning of some other dictionary word, and this can cause you some problems.
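
As a rough illustration of this greedy dictionary-matching idea (not something the answer itself provides), here is a minimal sketch in Python, assuming NLTK's `words` corpus is available as the dictionary; `VALID_WORDS` and `greedy_segment` are names made up for this example:

```python
# A rough sketch of the greedy longest-match idea described above.
# Assumes NLTK's "words" corpus is installed (nltk.download('words')).
# VALID_WORDS and greedy_segment are illustrative names, not NLTK APIs.
from nltk.corpus import words

VALID_WORDS = set(w.lower() for w in words.words())

def greedy_segment(text):
    text = text.lower()
    tokens = []
    i = 0
    while i < len(text):
        # Prefer the longest dictionary word starting at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in VALID_WORDS:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word starts here; emit a single character
            # so the loop always makes progress.
            tokens.append(text[i])
            i += 1
    return tokens

print(greedy_segment("tableapplechairtablecupboard"))
# Might print ['table', 'apple', 'chair', 'table', 'cupboard'], but greedy
# matching can mis-split when one word's ending is another word's start.
```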

Maybe the Viterbi algorithm could help? No certainties... but likely better than doing it manually.

This answer to another SO question (and the other high-vote answer there) could help: https://stackoverflow.com/a/481773/583834
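
The linked answer segments text with dynamic programming over word frequencies, which is essentially the Viterbi idea mentioned above. Here is a condensed sketch of that kind of segmenter, assuming you already have unigram word probabilities; a tiny hand-made dict stands in for a real frequency model, and `viterbi_segment` / `WORD_PROB` are illustrative names only:

```python
# A condensed sketch of a Viterbi-style dynamic-programming segmenter,
# roughly in the spirit of the linked answer. WORD_PROB is a toy stand-in
# for real unigram probabilities derived from a corpus.
import math

WORD_PROB = {"table": 0.3, "apple": 0.2, "chair": 0.2, "cup": 0.1,
             "board": 0.1, "cupboard": 0.1}

def viterbi_segment(text):
    n = len(text)
    # best[i] = (log-probability, tokens) for the best segmentation of text[:i]
    best = [(0.0, [])] + [(-math.inf, None)] * n
    for i in range(1, n + 1):
        for j in range(max(0, i - 20), i):  # cap candidate word length at 20
            word = text[j:i]
            if word in WORD_PROB and best[j][1] is not None:
                score = best[j][0] + math.log(WORD_PROB[word])
                if score > best[i][0]:
                    best[i] = (score, best[j][1] + [word])
    return best[n][1]  # None if no full segmentation exists

print(viterbi_segment("tableapplechaircupboard"))
# Likely ['table', 'apple', 'chair', 'cupboard'] with these toy probabilities:
# the single word "cupboard" scores higher than "cup" + "board".
```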
