
How to remove stopwords in gensim?

df_clean['message'] = df_clean['message'].apply(lambda x: gensim.parsing.preprocessing.remove_stopwords(x))

I tried this on a dataframe's column 'message' but I get the error:

TypeError: decoding to str: need a bytes-like object, list found

The df_clean["message"] column contains lists of words rather than strings, which is why the error says "need a bytes-like object, list found".

To fix this, join each list back into a single string with the join() method before calling remove_stopwords:

df_clean['message'] = df_clean['message'].apply(lambda x: gensim.parsing.preprocessing.remove_stopwords(" ".join(x)))

Note that df_clean["message"] will contain strings, not lists, after running this code.

This is not a bug in gensim: the error occurs because at least one value in your message column is a list instead of a string. remove_stopwords tries to convert its argument to a str, and Python raises the TypeError when it is handed a list. Here's a minimal example:

import pandas as pd
from gensim.parsing.preprocessing import remove_stopwords
df = pd.DataFrame([['one', 'two'], ['three', ['four']]], columns=['A', 'B'])
df.A.apply(remove_stopwords) # works fine

df.B.apply(remove_stopwords)

TypeError: decoding to str: need a bytes-like object, list found

The error is telling you that remove_stopwords needs a string and you are passing a list. Before removing stop words, check that every value in the column is a string. See the gensim documentation for remove_stopwords.
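
One way to run that check, sketched with pandas only: flag the rows whose value is not a string, then normalize them before calling remove_stopwords. The df_mixed frame and the to_text helper are hypothetical names used for illustration.

```python
import pandas as pd

# Hypothetical column that mixes plain strings with token lists.
df_mixed = pd.DataFrame({"message": ["one two", ["three", "four"]]})

# True where the value is already a string, False otherwise.
is_str = df_mixed["message"].map(lambda v: isinstance(v, str))

# Rows that would make remove_stopwords raise the TypeError.
non_string_rows = df_mixed[~is_str]

def to_text(value):
    """Join token lists into one string; leave strings untouched."""
    if isinstance(value, (list, tuple)):
        return " ".join(value)
    return value

# After this step every value is a str and is safe to pass to remove_stopwords.
df_mixed["message"] = df_mixed["message"].map(to_text)
```

Inspecting non_string_rows first is useful because it shows whether the lists come from an earlier tokenization step or from bad input data.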
