I have a personal Python project in which I am trying to tokenize tweets. I am using NLTK's TweetTokenizer to split the tweets, but contractions are being broken up incorrectly.
Example: "can't" -> ["can", "'", "t"]
I am struggling to find any documentation on this behavior. I have pasted the relevant code below.
An important note: TweetTokenizer works correctly with strings that I hardcode into my program, but not with strings that originate from Twitter.
from nltk import pos_tag
from nltk.tokenize import TweetTokenizer

def tweetsTagger(tweets):  # Tokenizes and tags the tweets
    tweetsTagged = []
    tt = TweetTokenizer()  # one tokenizer can be reused for every tweet
    for tweet in tweets:  # tweet is a status object from Twitter's Tweepy API
        text = ""
        if hasattr(tweet, 'full_text'):
            text = str(tweet.full_text)
        else:
            text = str(tweet.text)
        tweetTokenized = tt.tokenize(text)
        tweetTagged = pos_tag(tweetTokenized)
        tweetsTagged.append(tweetTagged)
    return tweetsTagged
I think the error may have to do with TweetTokenizer not recognizing certain Unicode apostrophes, but I may be wrong about that.
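One way to check this hypothesis is to list the non-ASCII characters in a tweet's text along with their Unicode names. The following is only a sketch using the standard library; describe_non_ascii is a hypothetical helper name, not part of NLTK or Tweepy.

```python
import unicodedata

def describe_non_ascii(text):
    """Return (character, code point, Unicode name) for each non-ASCII character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if ord(ch) > 127
    ]

# The apostrophe below is the typographic one, written as an escape for clarity.
print(describe_non_ascii("can\u2019t"))
```

If the apostrophe in the tweet shows up here as RIGHT SINGLE QUOTATION MARK (U+2019) rather than not appearing at all, the text contains a typographic apostrophe instead of the ASCII one.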
NLTK's TweetTokenizer does not handle typographic (curly) quotes such as ’ (U+2019) the same way it handles the straight ASCII apostrophe. I would advise pre-processing your data to normalize these quotes to their straight equivalents.
For reference:
>>> from nltk.tokenize import TweetTokenizer
>>> TweetTokenizer().tokenize("can't")
["can't"]
>>> TweetTokenizer().tokenize("can’t")
['can', '’', 't']
Perhaps the question "Python: Replace typographical quotes, dashes, etc. with their ascii counterparts" would help with this.
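As a rough illustration of that kind of pre-processing, the sketch below maps a few common typographic quotes to their ASCII counterparts with str.translate. The names QUOTE_MAP and normalize_quotes are hypothetical, and the set of characters covered is an assumption; extend it to whatever actually appears in your data.

```python
# Translation table from typographic quotes to ASCII quotes.
# The selection of characters here is an assumption, not an exhaustive list.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark (the usual curly apostrophe)
    "\u201C": '"',  # left double quotation mark
    "\u201D": '"',  # right double quotation mark
})

def normalize_quotes(text):
    """Return text with the mapped typographic quotes replaced by ASCII ones."""
    return text.translate(QUOTE_MAP)

print(normalize_quotes("can\u2019t"))
```

The normalized string can then be passed to TweetTokenizer().tokenize() in place of the raw tweet text.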
Replace curly quotes with straight quotes:
from nltk import pos_tag
from nltk.tokenize import TweetTokenizer

def tweetsTagger(tweets):  # Tokenizes and tags the tweets
    tweetsTagged = []
    tt = TweetTokenizer()  # one tokenizer can be reused for every tweet
    for tweet in tweets:  # tweet is a status object from Twitter's Tweepy API
        text = ""
        if hasattr(tweet, 'full_text'):
            text = str(tweet.full_text)
        else:
            text = str(tweet.text)
        tweetTokenized = tt.tokenize(text.replace("’", "'"))  # << HERE
        tweetTagged = pos_tag(tweetTokenized)
        tweetsTagged.append(tweetTagged)
    return tweetsTagged