
Text Pre-processing with NLTK

I am practicing using NLTK to remove certain features from raw tweets, and subsequently hope to remove tweets that are (to me) irrelevant (e.g. empty tweets or single-word tweets). However, it seems that some of the single-word tweets are not removed. I am also unable to remove any stopword that sits at the beginning or end of a sentence.

Any advice? At the moment, I would like to pass back a sentence as the output rather than a list of tokenized words.

Any other comments on improving the code (processing time, elegance) are welcome.

import string
import numpy as np
import nltk
from nltk.corpus import stopwords

cache_english_stopwords = stopwords.words('english')
cache_en_tweet_stopwords = stopwords.words('english_tweet')  # custom list; 'english_tweet' is not shipped with NLTK

# For clarity: df is a pandas DataFrame with a 'text' column, among other columns.

def tweet_clean(df):
    temp_df = df.copy()
    # Remove hyperlinks
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace(r'https?:\/\/.*\/\w*', '', regex=True)
    # Remove hashtags
    # temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace(r'#\w*', '', regex=True)
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('#', ' ', regex=True)
    # Remove citations (@handles)
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace(r'@\w*', '', regex=True)
    # Remove tickers
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace(r'\$\w*', '', regex=True)
    # Remove punctuation
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('[' + string.punctuation + ']+', '', regex=True)
    # Remove stopwords
    for tweet in temp_df.loc[:, "text"]:
        tweet_tokenized = nltk.word_tokenize(tweet)
        for w in tweet_tokenized:
            if (w.lower() in cache_english_stopwords) or (w.lower() in cache_en_tweet_stopwords):
                temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace(r'[\W*\s?\n?]' + w + r'[\W*\s?]', ' ', regex=True)
                # print("w in stopword")
    # Remove HTML entity remnants (&amp;, &gt;)
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace(r'\&*[amp]*\;|gt+', '', regex=True)
    # Remove RT
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace(r'\s+rt\s+', '', regex=True)
    # Remove linebreak, tab, carriage return
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace('[\n\t\r]+', ' ', regex=True)
    # Remove "via" followed by whitespace
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace(r'via+\s', '', regex=True)
    # Remove multiple whitespace
    temp_df.loc[:, "text"] = temp_df.loc[:, "text"].replace(r'\s+\s+', ' ', regex=True)
    # Remove single-word tweets
    for tweet_sw in temp_df.loc[:, "text"]:
        tweet_sw_tokenized = nltk.word_tokenize(tweet_sw)
        if len(tweet_sw_tokenized) <= 1:
            temp_df.loc["text"] = np.nan
    # Remove empty rows
    temp_df.loc[(temp_df["text"] == '') | (temp_df["text"] == ' ')] = np.nan
    temp_df = temp_df.dropna()
    return temp_df

What is df? A list of tweets? You should maybe consider cleaning the tweets one after the other rather than as a whole list: it would be easier to have a function tweet_cleaner(single_tweet).
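For example (a minimal sketch; tweet_cleaner here is only a placeholder for the per-tweet logic):

import pandas as pd

def tweet_cleaner(single_tweet):
    # placeholder: clean one tweet and return the cleaned string
    return single_tweet.strip()

df = pd.DataFrame({'text': ['  RT @user hello world  ', 'single']})
# apply() runs the cleaner row by row instead of re-scanning the whole
# column once per stopword
df['text'] = df['text'].apply(tweet_cleaner)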

NLTK provides a TweetTokenizer to clean tweets.
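For instance (strip_handles and reduce_len are the two options used later in this thread):

from nltk.tokenize import TweetTokenizer

# strip_handles drops @mentions; reduce_len caps runs of repeated
# characters at three ("waaaaayyyy" -> "waaayyy")
tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)
print(tknzr.tokenize('@remy: This is waaaaayyyy too much for you!!!!!!'))
# [':', 'This', 'is', 'waaayyy', 'too', 'much', 'for', 'you', '!', '!', '!']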

the "re" package provides good solutions to use regex.

I advise you to create a variable for easier use of temp_df.loc[:, "text"].
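Something like this (a sketch; binding the column once avoids repeating the .loc lookup at every step):

import pandas as pd

temp_df = pd.DataFrame({'text': ['$TSLA to the moon', 'no ticker here']})
text = temp_df.loc[:, "text"]                  # bind the column once
text = text.replace(r'\$\w*', '', regex=True)  # chain the clean-up steps on it
temp_df.loc[:, "text"] = text                  # write the result back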

Deleting stopwords in a sentence is described in "Stopword removal with NLTK": clean_wordlist = [i for i in sentence.lower().split() if i not in stopwords]
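A self-contained version of that line:

from nltk.corpus import stopwords

stoplist = set(stopwords.words('english'))  # a set makes the membership test O(1)
sentence = "This is just some text to illustrate stopword removal"
clean_wordlist = [i for i in sentence.lower().split() if i not in stoplist]
print(clean_wordlist)  # ['text', 'illustrate', 'stopword', 'removal']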

If you want to use regex (with the re module), you can:

  1. create a regex pattern composed of all the stopwords (built once, outside the tweet_clean function): stop_pattern = re.compile('(?siu)' + '|'.join(stoplist))
    ((?siu) turns on dotall, ignore-case and unicode matching; inline flags have to sit at the start of the pattern)

  2. and use this pattern to clean any string: clean_string = stop_pattern.sub('', input_string)

(You can concatenate the two stoplists if having separate ones is not needed; a runnable sketch of the whole pattern follows.)
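Put together, it could look like the sketch below. The \b word boundaries are an addition here so that only whole words are deleted (without them the "is" inside "island" would be eaten too), and re.escape guards against stopwords containing regex metacharacters:

import re
from nltk.corpus import stopwords

stoplist = stopwords.words('english')
# (?siu) = dotall, ignore-case, unicode; \b ... \b restricts matches to whole words
stop_pattern = re.compile(r'(?siu)\b(?:' + '|'.join(map(re.escape, stoplist)) + r')\b')

input_string = "This is an island of text"
clean_string = stop_pattern.sub('', input_string)
print(clean_string)  # '   island  text' (leftover whitespace still needs collapsing)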

To remove one-word tweets you could keep only the ones longer than one word:
if len(tweet_sw_tokenized) > 1: kept_ones.append(tweet_sw)
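In context (a sketch; tweets stands in for the cleaned column):

import nltk

tweets = ["hello", "two words here", ""]
kept_ones = []
for tweet_sw in tweets:
    tweet_sw_tokenized = nltk.word_tokenize(tweet_sw)
    if len(tweet_sw_tokenized) > 1:  # keep only multi-word tweets
        kept_ones.append(tweet_sw)
print(kept_ones)  # ['two words here']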

With the advice from mquantin, I have modified my code to clean tweets individually, as sentences. Here is my amateur attempt with a sample tweet that I believe covers most scenarios (let me know if you encounter any other cases that deserve a clean-up):

import string
import re
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer


cache_english_stopwords = stopwords.words('english')



def tweet_clean(tweet):
    # Remove tickers
    sent_no_tickers = re.sub(r'\$\w*', '', tweet)
    print('No tickers:')
    print(sent_no_tickers)
    tw_tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)
    temp_tw_list = tw_tknzr.tokenize(sent_no_tickers)
    print('Temp_list:')
    print(temp_tw_list)
    # Remove stopwords
    list_no_stopwords = [i for i in temp_tw_list if i.lower() not in cache_english_stopwords]
    print('No Stopwords:')
    print(list_no_stopwords)
    # Remove hyperlinks
    list_no_hyperlinks = [re.sub(r'https?:\/\/.*\/\w*', '', i) for i in list_no_stopwords]
    print('No hyperlinks:')
    print(list_no_hyperlinks)
    # Remove hashtags
    list_no_hashtags = [re.sub(r'#', '', i) for i in list_no_hyperlinks]
    print('No hashtags:')
    print(list_no_hashtags)
    # Remove punctuation and split 's, 't, 've with a space for the filter
    list_no_punctuation = [re.sub(r'[' + string.punctuation + ']+', ' ', i) for i in list_no_hashtags]
    print('No punctuation:')
    print(list_no_punctuation)
    new_sent = ' '.join(list_no_punctuation)
    # Remove any words with 2 or fewer letters
    filtered_list = tw_tknzr.tokenize(new_sent)
    list_filtered = [re.sub(r'^\w\w?$', '', i) for i in filtered_list]
    print('Clean list of words:')
    print(list_filtered)
    filtered_sent = ' '.join(list_filtered)
    # Collapse multiple whitespace
    clean_sent = re.sub(r'\s\s+', ' ', filtered_sent)
    # Remove any whitespace at the front of the sentence
    clean_sent = clean_sent.lstrip(' ')
    print('Clean sentence:')
    print(clean_sent)
    return clean_sent  # return the cleaned sentence rather than only printing it

s0='    RT @Amila #Test\nTom\'s newly listed Co. &amp; Mary\'s unlisted     Group to supply tech for nlTK.\nh.. $TSLA $AAPL https:// t.co/x34afsfQsh'
tweet_clean(s0)
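To tie this back to the original DataFrame workflow: since tweet_clean now returns the cleaned sentence, it can be applied row by row and the empty or single-word results dropped afterwards (a sketch; the 'text' column matches the question's setup):

import pandas as pd

df = pd.DataFrame({'text': [s0, 'RT', '']})
df['text'] = df['text'].apply(tweet_clean)
# drop tweets that came out empty or as a single word
df = df[df['text'].str.split().str.len() > 1]
print(df)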
