
NLTK TweetTokenizer incorrectly separates contractions

I have a personal Python project in which I am trying to tokenize tweets. I am using NLTK's TweetTokenizer to split the tweets into tokens, but contractions are being split incorrectly.

For example: "can't" -> ["can", "'", "t"]

I am struggling to find any documentation on this error. I have pasted relevant code below.

An important note: TweetTokenizer works correctly on strings that I hardcode into my program, but not on strings that originate from Twitter.

from nltk import pos_tag
from nltk.tokenize import TweetTokenizer
def tweetsTagger(tweets): #Tokenizes and tags the tweets
    tweetsTagged = []
    for tweet in tweets: #tweet is a status object from Twitter's Tweepy API
        text = ""
        if hasattr(tweet, 'full_text'):
            text = str(tweet.full_text)
        else:
            text = str(tweet.text)
        tt = TweetTokenizer()
        tweetTokenized = tt.tokenize(text)
        tweetTagged = pos_tag(tweetTokenized)
        tweetsTagged.append(tweetTagged)
    return tweetsTagged

I think the error may have to do with TweetTokenizer not recognizing certain Unicode apostrophes, but I may be wrong about that.

NLTK's TweetTokenizer does not treat typographic (curly) apostrophes such as ’ (U+2019) the way it treats the ASCII apostrophe ' (U+0027). I would advise pre-processing your data to normalize these quotes to their ASCII forms.

For reference:

>>> from nltk.tokenize import TweetTokenizer
>>> TweetTokenizer().tokenize("can't") 
["can't"]
>>> TweetTokenizer().tokenize("can’t") 
['can', '’', 't']

The question "Python: Replace typographical quotes, dashes, etc. with their ascii counterparts" may help with this.
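As a minimal sketch of that pre-processing step, the function below maps a handful of typographic quote characters to ASCII with str.translate. The name normalize_quotes and the exact set of characters in the table are assumptions for illustration, not part of NLTK; extend the table to match whatever shows up in your data.

```python
# Translation table from typographic quotes to ASCII equivalents.
# The choice of characters is an assumption; add more as needed.
_QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark (typographic apostrophe)
    "\u201B": "'",   # single high-reversed-9 quotation mark
    "\u201C": '"',   # left double quotation mark
    "\u201D": '"',   # right double quotation mark
})


def normalize_quotes(text):
    """Return text with curly quotes replaced by straight ASCII quotes."""
    return text.translate(_QUOTE_MAP)


if __name__ == "__main__":
    print(normalize_quotes("I can\u2019t believe it"))
```

The normalized string can then be passed to TweetTokenizer().tokenize(...) in place of the raw tweet text.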

Replace curly quotes with straight quotes:

from nltk import pos_tag
from nltk.tokenize import TweetTokenizer
def tweetsTagger(tweets): #Tokenizes and tags the tweets
    tweetsTagged = []
    for tweet in tweets: #tweet is a status object from Twitter's Tweepy API
        text = ""
        if hasattr(tweet, 'full_text'):
            text = str(tweet.full_text)
        else:
            text = str(tweet.text)
        tt = TweetTokenizer()
        tweetTokenized = tt.tokenize(text.replace("’","'")) # << HERE
        tweetTagged = pos_tag(tweetTokenized)
        tweetsTagged.append(tweetTagged)
    return tweetsTagged
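The replace call above handles only U+2019. To find out which other non-ASCII characters occur in the tweets, something like the following may help; it is a rough diagnostic sketch, and the helper name non_ascii_chars is hypothetical.

```python
import unicodedata
from collections import Counter


def non_ascii_chars(texts):
    """List each non-ASCII character in texts with its code point, name, and count."""
    counts = Counter(ch for text in texts for ch in text if ord(ch) > 127)
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"), n)
        for ch, n in counts.most_common()
    ]


if __name__ == "__main__":
    # Hypothetical sample strings standing in for tweet text.
    for row in non_ascii_chars(["can\u2019t", "don\u2019t \u201Cok\u201D"]):
        print(row)
```

Any quote-like characters it reports can be added to the replacement step before tokenizing.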

