Run CountVectorizer on single column Series from two-column dataframe?

Question

How does one convert a single column from a pandas dataframe with multiple columns into a Series for CountVectorizer?

I have a Pandas dataframe with 2 columns x 9372 records (rows):

The first column is called twodig and is an integer
The second column is called descrp and is a varchar

After removing stopwords and special characters, I want to use CountVectorizer on the descrp column only, but still keep twodig.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
bowmatrix = vectorizer.fit_transform(df)

However, CountVectorizer expects a one-dimensional collection of text documents, such as a pandas Series, rather than a whole dataframe, so I tried converting the dataframe into a Series first:

corpus = pd.Series(df)

But when I run the script, I get this error: Wrong number of items passed 2, placement implies 9372

Answer 1

You can select just that column from your DataFrame with df["descrp"], so your code becomes:

import pandas
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
bowmatrix = vectorizer.fit_transform(df["descrp"])
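For anyone who wants to try this end to end, here is a minimal, self-contained sketch. The three-row dataframe is a made-up stand-in for the asker's 9372-row data (only the column names twodig and descrp come from the question), so treat it as an illustration rather than the exact setup.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "twodig": [11, 22, 11],
    "descrp": ["red apple pie", "green apple", "red pie"],
})

# df["descrp"] is already a one-dimensional pandas Series of strings,
# so no pd.Series(df) conversion is needed. Note that iterating over a
# whole DataFrame yields its column labels, not its rows.
vectorizer = CountVectorizer()
bowmatrix = vectorizer.fit_transform(df["descrp"])

print(type(df["descrp"]).__name__)  # Series
print(bowmatrix.shape)              # (3, 4): 3 documents, 4 distinct words
```

The original df is left untouched, so twodig is still available and lines up row by row with bowmatrix.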

Answer 2

You can do something like this, but the result is less than optimal to work with afterwards.

import pandas
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
df["bowmatrix"] = vectorizer.fit_transform(df["descrp"])
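If the goal is to keep twodig next to the word counts, one possible alternative (not part of this answer, just a sketch) is to expand the count matrix into one column per word and concatenate it with twodig. The sample rows and the names counts and combined are hypothetical. The sketch also assumes the vocabulary is small enough to densify with toarray(), which may not hold for a large corpus.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "twodig": [11, 22, 11],
    "descrp": ["red apple pie", "green apple", "red pie"],
})

vectorizer = CountVectorizer()
bowmatrix = vectorizer.fit_transform(df["descrp"])

# vocabulary_ maps each word to its column index in bowmatrix,
# so sorting the words by that index recovers the column order.
words = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Densify for illustration only; this can use a lot of memory on big data.
counts = pd.DataFrame(bowmatrix.toarray(), columns=words, index=df.index)

# One row per record: twodig followed by one count column per word.
combined = pd.concat([df[["twodig"]], counts], axis=1)
print(combined)
```

Reusing df.index for counts keeps the rows aligned during the concat even if the dataframe was filtered or reordered earlier.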

2 answers

solution1
1 ACCEPTED 2019-10-25 20:11:37

solution2
0 2019-10-25 20:21:37
