How to remove stopwords in gensim?

Question

df_clean['message'] = df_clean['message'].apply(lambda x: gensim.parsing.preprocessing.remove_stopwords(x))

I tried this on a dataframe's column 'message' but I get the error:

TypeError: decoding to str: need a bytes-like object, list found

Answer 1

Apparently the df_clean["message"] column contains lists of words rather than strings, which is why the error says "need a bytes-like object, list found".

To fix this, join each list back into a single string with the join() method:

df_clean['message'] = df_clean['message'].apply(lambda x: gensim.parsing.preprocessing.remove_stopwords(" ".join(x)))

Note that df_clean["message"] will contain string objects, not lists, after this code runs.

Answer 2

This is not a gensim bug: the error occurs because at least one value in your message column is a list instead of a string. Here's a minimal pandas example:

import pandas as pd
from gensim.parsing.preprocessing import remove_stopwords
df = pd.DataFrame([['one', 'two'], ['three', ['four']]], columns=['A', 'B'])
df.A.apply(remove_stopwords) # works fine

df.B.apply(remove_stopwords) # fails: the second row of B holds the list ['four']

TypeError: decoding to str: need a bytes-like object, list found

Answer 3

The error is saying that remove_stopwords needs a string and you are passing it a list. Before removing stopwords, check that every value in the column is a string. See the docs.
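One way to run that check, as a rough sketch: count the Python types present in the column and pull out the rows that are not strings. The sample data and the names type_counts and not_str are made up for illustration.

```python
import pandas as pd

# Made-up sample: the second row is a token list instead of a string.
df_clean = pd.DataFrame(
    {"message": ["plain text", ["already", "tokenized"], "more text"]}
)

# Count how many cells hold each Python type (by type name).
type_counts = df_clean["message"].map(lambda v: type(v).__name__).value_counts()
print(type_counts)

# Boolean mask marking the rows whose value is not a string.
not_str = ~df_clean["message"].map(lambda v: isinstance(v, str))
print(df_clean[not_str])
```

If not_str.any() is True, those rows need to be converted (or dropped) before remove_stopwords is applied.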
