
Keep trailing punctuation in Python nltk.word_tokenize

There's a ton of material available about removing punctuation, but I can't seem to find anything about keeping it.

If I do:

from nltk import word_tokenize
test_str = "Some Co Inc. Other Co L.P."
word_tokenize(test_str)
Out[1]: ['Some', 'Co', 'Inc.', 'Other', 'Co', 'L.P', '.']

the last "." is pushed into its own token. However, if instead there is another word at the end, the last "." is preserved:

from nltk import word_tokenize
test_str = "Some Co Inc. Other Co L.P. Another Co"
word_tokenize(test_str)
Out[1]: ['Some', 'Co', 'Inc.', 'Other', 'Co', 'L.P.', 'Another', 'Co']

I'd like this to always perform as the second case. For now, I'm hackishly doing:

from nltk import word_tokenize
test_str = "Some Co Inc. Other Co L.P."
word_tokenize(test_str + " |||")

since I feel pretty confident in throwing away "|||" at any given time, but I don't know what other punctuation I might want to preserve that could get dropped. Is there a better way to accomplish this?

It is a quirk of spelling that if a sentence ends with an abbreviated word, we only write one period, not two. NLTK's tokenizer doesn't "remove" it; it splits it off because sentence structure ("a sentence must end with a period or other suitable punctuation") is more important to NLP tools than consistent representation of abbreviations. The tokenizer is smart enough to recognize most abbreviations, so it doesn't separate the final period of "L.P." mid-sentence.

Your solution with ||| results in inconsistent sentence structure, since you now have no sentence-final punctuation. A better solution would be to add the missing period only after abbreviations. Here's one way to do this, ugly but as reliable as the tokenizer's own abbreviation recognizer:

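import nltk

test_str = "Some Co Inc. Other Co L.P."
toks = nltk.word_tokenize(test_str + " .")
if len(toks) > 1 and len(toks[-2]) > 1 and toks[-2].endswith("."):
    pass  # Keep the added period: the text ended in an abbreviation
else:
    toks = toks[:-1]  # Drop the added period: the text did not end in an abbreviation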
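
If you need this in more than one place, the same check can be wrapped in a small helper. This is only a sketch: the function name tokenize_keep_abbrev_period and the idea of passing the tokenizer in as an argument are my own choices for illustration, not part of NLTK.

```python
from typing import Callable, List


def tokenize_keep_abbrev_period(
    text: str, tokenize: Callable[[str], List[str]]
) -> List[str]:
    """Tokenize `text`, keeping the period of a sentence-final abbreviation.

    `tokenize` is any word tokenizer (hypothetical parameter); the intended
    use is to pass nltk.word_tokenize.
    """
    # Append a stand-alone period so the tokenizer sees the last word
    # of the original text in mid-sentence position.
    toks = tokenize(text + " .")
    # If the token before the added period is longer than one character
    # and itself ends in a period, it was kept whole as an abbreviation,
    # so the added period serves as the sentence-final one.
    if len(toks) > 1 and len(toks[-2]) > 1 and toks[-2].endswith("."):
        return toks
    # Otherwise the added period is redundant; drop it.
    return toks[:-1]


# Intended usage (requires nltk and its tokenizer data to be installed):
#   import nltk
#   tokenize_keep_abbrev_period("Some Co Inc. Other Co L.P.", nltk.word_tokenize)
```

Taking the tokenizer as a parameter keeps the helper easy to test and lets you swap in a different tokenizer later.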

PS. The solution you have accepted will completely change the tokenization, leaving all punctuation attached to the adjacent word (along with other undesirable effects like introducing empty tokens). This is most likely not what you want.

Could you use re?

import re

test_str = "Some Co Inc. Other Co L.P."

print(re.split(r'\s', test_str))

This will split the input string based on spacing, retaining your punctuation.
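
One caveat worth seeing in action: splitting on a single whitespace character leaves every punctuation mark glued to its neighbouring word and produces empty tokens wherever two spaces occur in a row. The sample string below is made up for illustration (it adds a comma and a double space to the original example).

```python
import re

# Hypothetical input: note the comma and the double space after "Inc."
text = "Some Co, Inc.  Other Co L.P."

# Splitting on one whitespace character at a time: the comma stays
# attached to "Co", and the double space yields an empty string.
single = re.split(r"\s", text)
print(single)  # ['Some', 'Co,', 'Inc.', '', 'Other', 'Co', 'L.P.']

# str.split() with no arguments collapses runs of whitespace, which
# avoids the empty token but still leaves the comma attached.
collapsed = text.split()
print(collapsed)  # ['Some', 'Co,', 'Inc.', 'Other', 'Co', 'L.P.']
```

If whitespace splitting is all you need, text.split() is the simpler choice; neither variant separates punctuation from words the way a real tokenizer does.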

