I am working with some text for NLP analysis. I have cleaned it by removing non-alphanumeric characters, blanks, duplicate words and stopwords, and have also performed stemming and lemmatization:
from nltk.tokenize import word_tokenize
import nltk.corpus
import re
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
import pandas as pd
data_df = pd.read_csv('path/to/file/data.csv')
stopwords = nltk.corpus.stopwords.words('english')
stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()
# Function to remove duplicate words from a sentence, preserving order
def unique_list(l):
    ulist = []
    for x in l:
        if x not in ulist:
            ulist.append(x)
    return ulist

for i in range(len(data_df)):
    # Convert to lower case and split into individual words using word_tokenize
    sentence = word_tokenize(data_df['O_Q1A'][i].lower())
    # Remove stopwords
    filtered_sentence = [w for w in sentence if w not in stopwords]
    # Remove duplicate words from the sentence
    filtered_sentence = unique_list(filtered_sentence)
    # Remove punctuation (anything that is not a word character or whitespace)
    junk_free_sentence = []
    for word in filtered_sentence:
        junk_free_sentence.append(re.sub(r"[^\w\s]", " ", word))
    # Stem the junk-free sentence
    stemmed_sentence = []
    for w in junk_free_sentence:
        stemmed_sentence.append(stemmer.stem(w))
    # Lemmatize the stemmed sentence
    lemmatized_sentence = []
    for w in stemmed_sentence:
        lemmatized_sentence.append(lemmatizer.lemmatize(w))
    # Write back with .loc to avoid pandas' chained-assignment warning
    data_df.loc[i, 'O_Q1A'] = ' '.join(lemmatized_sentence)
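The top words were then ranked with a simple frequency count along these lines (a sketch; the exact criterion I used is not important here):

from collections import Counter

# Count word frequencies across all cleaned responses
word_counts = Counter()
for cleaned in data_df['O_Q1A']:
    word_counts.update(cleaned.split())

# Show the most frequent words
for word, count in word_counts.most_common(10):
    print(word, count)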
But when I display the top words, I still get some junk like:
ask
much
thank
work
le
know
via
sdh
n
sy
t
n t
recommend
never
Out of these top words, only five are sensible (ask, know, recommend, thank and work). What more do I need to do to retain only meaningful words?
The default NLTK stopword list is a minimal one, and it certainly doesn't contain words like 'ask' and 'much', because those words are not generally meaningless. They are only irrelevant to you, but may not be to others. For your problem, you can always apply your own custom stopword filter after using NLTK. A simple example:
from nltk.corpus import stopwords

def removeStopWords(text):
    # Select the English stopwords
    cachedStopWords = set(stopwords.words('english'))
    # Add your own custom words (extend this tuple with whatever you consider noise)
    cachedStopWords.update(('ask', 'much', 'thank'))
    # Remove stop words
    new_str = ' '.join(word for word in text.split() if word not in cachedStopWords)
    return new_str
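Applied to your pipeline, the filter can simply be mapped over the cleaned column (assuming the data_df from your question):

# Run the custom stopword filter over every cleaned response
data_df['O_Q1A'] = data_df['O_Q1A'].apply(removeStopWords)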
Alternatively, you can edit the NLTK stopword list itself, which is essentially a plain text file stored in the NLTK data directory.
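If you want to locate that file programmatically, something along these lines works (a sketch; it assumes the stopwords corpus has already been downloaded):

import nltk

# Resolve the on-disk location of the English stopword list
# (it lives under the NLTK data directory, e.g. ~/nltk_data)
path = nltk.data.find('corpora/stopwords/english')
print(path)

# The file contains one stopword per line, so words appended to it
# become part of stopwords.words('english') from then on.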