
How to tokenize continuous words with no whitespace delimiters?

I'm using Python with NLTK. I need to process some English text that contains no whitespace, but NLTK's word_tokenize function can't handle input like this. So how can I tokenize text that has no whitespace? Are there any tools for this in Python?

I am not aware of such tools, but the solution to your problem depends on the language.

For the Turkish language you can scan the input text letter by letter and accumulate letters into a buffer. Once you are sure the accumulated buffer forms a valid word from a dictionary, you save it as a separate token, clear the buffer to start accumulating a new word, and continue the process.
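
For illustration, here is a minimal sketch of that greedy letter-by-letter scan in Python; the `words` set is a hypothetical stand-in for a real dictionary:

```python
def greedy_tokenize(text, dictionary):
    """Scan text letter by letter; emit a token as soon as the
    accumulated buffer matches a dictionary word."""
    tokens = []
    buffer = ""
    for ch in text:
        buffer += ch
        if buffer in dictionary:
            tokens.append(buffer)
            buffer = ""
    return tokens, buffer  # buffer holds any unmatched leftover

# Toy stand-in dictionary (hypothetical):
words = {"gel", "di", "geldi"}
print(greedy_tokenize("geldi", words))  # (['gel', 'di'], '') -- 'geldi' is never considered
```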

You can try this for English, but I suspect you will run into situations where the end of one word is also the beginning of another dictionary word, and that can cause problems (the toy example above shows this: the greedy scan emits 'gel' + 'di' and never considers 'geldi').

Maybe the Viterbi algorithm could help? No certainties, but it would likely do better than a manual greedy scan.
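
As a sketch of that idea: treat segmentation as finding the most probable sequence of words under a unigram model, and solve it with dynamic programming, which is the Viterbi recurrence specialized to this problem. The word counts below are hypothetical toy values; in practice you would estimate them from a corpus.

```python
import math
from functools import lru_cache

# Hypothetical unigram counts; estimate these from a real corpus in practice.
COUNTS = {"sit": 50, "down": 40, "sitdown": 1}
TOTAL = sum(COUNTS.values())

def word_logprob(word):
    # Penalize unseen words, more heavily the longer they are.
    count = COUNTS.get(word, 0)
    if count:
        return math.log(count / TOTAL)
    return math.log(1.0 / (TOTAL * 10 ** len(word)))

@lru_cache(maxsize=None)
def segment(text):
    """Return (log probability, tokens) for the best segmentation of text."""
    if not text:
        return 0.0, ()
    candidates = []
    for i in range(1, len(text) + 1):
        head, tail = text[:i], text[i:]
        tail_lp, tail_tokens = segment(tail)
        candidates.append((word_logprob(head) + tail_lp, (head,) + tail_tokens))
    return max(candidates, key=lambda c: c[0])

print(segment("sitdown")[1])  # ('sit', 'down') under these toy counts
```

With memoization the search is quadratic in the text length rather than exponential, since each suffix is segmented only once.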

This answer to another SO question (along with the other highly voted answer on the same question) could help: https://stackoverflow.com/a/481773/583834
