
Unable to remove stopwords (NLP)

I have a CSV file which contains two columns, 'Complaint Details' and 'DispositionCode'. I want to classify the complaint details into 8 different classes of disposition code, such as 'Door locked from inside', 'Vendor error', 'Missing key or lock'... The dataset is shown in the image.

What would be a good method to classify the complaints and measure the accuracy?

As a first step, I am trying to remove stopwords from the complaint details and then use a Naive Bayes classifier.

The code is as follows:

import csv
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
your_list=[]
with open('H:/Project/rash.csv', 'r') as f:
  reader = csv.reader(f)
  your_list = list(reader)
print(your_list)
stop_words=set(stopwords.words("english"))
words= word_tokenize(your_list)
filteredSent=[]
for w in words:
    if w not in stop_words:
       filteredSent.append()
print(filteredSent)

But I am getting the following error:

    for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or bytes-like object

Your code never gets to the stopwords, since the error is due to misusing word_tokenize(). It needs to be called on a single string, not on your whole dataset: your_list is a list of rows, and each row is itself a list of column values. You can tokenize your data like this:

for row in your_list:
    row[0] = word_tokenize(row[0])

You'll now need to rethink the rest of your code. You have a whole list of sentences, not just one. Use a loop like the above so you're examining the words of one sentence at a time.
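As a rough sketch of what that loop could look like, the helper below filters one complaint at a time. The function name remove_stopwords, the sample complaint texts, and the tiny stop-word set are made up for illustration. str.split stands in for the tokenizer so the snippet runs without NLTK data. In the real script you would presumably pass word_tokenize and set(stopwords.words("english")) instead, and skip the first row if the CSV has a header line. Comparing w.lower() matters because NLTK's English stop-word list is all lowercase.

```python
def remove_stopwords(rows, tokenize, stop_words):
    """Return (tokens, label) pairs with stop words dropped from each complaint."""
    cleaned = []
    for text, label in rows:  # assumes exactly two columns per row
        tokens = tokenize(text)  # tokenize ONE string at a time
        kept = [w for w in tokens if w.lower() not in stop_words]
        cleaned.append((kept, label))
    return cleaned


# Stand-ins (assumptions) so the sketch runs without NLTK data; in the real
# script pass word_tokenize and set(stopwords.words("english")) instead.
demo_stop_words = {"the", "is", "from", "a", "was"}
demo_rows = [
    ["The door is locked from inside", "Door locked from inside"],
    ["Vendor delivered the wrong part", "Vendor error"],
]

result = remove_stopwords(demo_rows, str.split, demo_stop_words)
for tokens, label in result:
    print(label, tokens)
```

Keeping each label next to its filtered tokens means the output can later be fed to a classifier without re-reading the file.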
