
Modify Python nltk.word_tokenize to exclude "#" as a delimiter

I am using Python's NLTK library to tokenize my sentences.

If my code is

import nltk

text = "C# billion dollars; we don't own an ounce C++"
print(nltk.word_tokenize(text))

I get this as my output

['C', '#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']

The symbols ;, ., and # are treated as delimiters. Is there a way to remove # from the set of delimiters, the way + isn't one, so that C# appears as a single token just as C++ does?

I want my output to be

['C#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']

I want C# to be considered as one token.

Another idea: instead of altering how the text is tokenized, loop over the tokens afterwards and join every '#' token onto the one preceding it.

from nltk.tokenize import word_tokenize

txt = "C# billion dollars; we don't own an ounce C++"
tokens = word_tokenize(txt)

# Iterate over the original token list; i_offset compensates for the
# fact that each merge shortens the working list by one element.
i_offset = 0
for i, t in enumerate(tokens):
    i -= i_offset
    if t == '#' and i > 0:
        left = tokens[:i - 1]
        joined = [tokens[i - 1] + t]
        right = tokens[i + 1:]
        tokens = left + joined + right
        i_offset += 1

>>> tokens
['C#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']
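The same merge can be written as a single pass that builds a new list, which avoids the index bookkeeping entirely. A sketch (the hard-coded token list is what nltk.word_tokenize produces for the example sentence, per the question):

```python
# Tokens as produced by nltk.word_tokenize on the example sentence
tokens = ['C', '#', 'billion', 'dollars', ';', 'we', 'do', "n't",
          'own', 'an', 'ounce', 'C++']

def merge_hash(tokens):
    """Attach every '#' token to the token that precedes it."""
    merged = []
    for t in tokens:
        if t == '#' and merged:
            merged[-1] += t
        else:
            merged.append(t)
    return merged

print(merge_hash(tokens))
# → ['C#', 'billion', 'dollars', ';', 'we', 'do', "n't", 'own', 'an', 'ounce', 'C++']
```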

Since this is essentially a multi-word tokenization problem, another way is to retokenize the extracted tokens with NLTK's Multi-Word Expression tokenizer:

import nltk

tokens = nltk.word_tokenize("C# billion dollars; we don't own an ounce C++")
mwtokenizer = nltk.MWETokenizer(separator='')
mwtokenizer.add_mwe(('C', '#'))  # tokens are matched case-sensitively
print(mwtokenizer.tokenize(tokens))

NLTK uses regular expressions to tokenize text, so you could use its regexp tokenizer to define your own regexp.

Here is an example where the text is split on any whitespace character (tab, newline, etc.) and on a few other symbols chosen for illustration:

>>> from nltk.tokenize import regexp_tokenize
>>> txt = "C# billion dollars; we don't own an ounce C++"
>>> regexp_tokenize(txt, pattern=r"\s|[\.,;']", gaps=True)
['C#', 'billion', 'dollars', 'we', 'don', 't', 'own', 'an', 'ounce', 'C++']
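Note that this pattern also splits on the apostrophe, which breaks don't into don and t; dropping ' from the character class keeps contractions whole. Since regexp_tokenize with gaps=True behaves like splitting on the pattern, the adjusted behavior can be sketched with the standard library re module alone (an illustration, not part of the original answer):

```python
import re

txt = "C# billion dollars; we don't own an ounce C++"

# Split on whitespace and on . , ; only — the apostrophe is no longer
# a delimiter, so "don't" stays in one piece.
tokens = [t for t in re.split(r"\s|[.,;]", txt) if t]
print(tokens)
# → ['C#', 'billion', 'dollars', 'we', "don't", 'own', 'an', 'ounce', 'C++']
```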
