How does one convert a single column from a pandas dataframe with multiple columns into a Series for CountVectorizer?
I have a pandas DataFrame with 2 columns x 9372 records (rows):
twodig, which is an integer
descrp, which is a varchar
After removing stopwords and special characters, I want to use CountVectorizer on the descrp column only, but still keep twodig.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
bowmatrix = vectorizer.fit_transform(df)
However, as far as I understand, the DataFrame first has to be converted into a pandas Series, which is then passed to CountVectorizer:
corpus = pd.Series(df)
But when I run the script, I get this error: Wrong number of items passed 2, placement implies 9372
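You can select just that column from your DataFrame like this: df["descrp"]
Selecting a single column with single brackets already returns a Series, so no pd.Series(...) conversion is needed.
So your code will be: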
import pandas
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
bowmatrix = vectorizer.fit_transform(df["descrp"])
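To check this on a small scale, here is a minimal sketch. It assumes a made-up three-row frame in place of the real 9372-row one; the sample values are purely illustrative.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for the real DataFrame (sample values are invented)
df = pd.DataFrame({
    "twodig": [11, 22, 33],
    "descrp": ["red apple", "green apple pie", "red cherry pie"],
})

# Single brackets select one column and return a pandas Series
corpus = df["descrp"]

vectorizer = CountVectorizer()
# fit_transform returns a SciPy sparse matrix: one row per record,
# one column per vocabulary term
bowmatrix = vectorizer.fit_transform(corpus)

print(type(corpus).__name__)  # Series
print(bowmatrix.shape)        # (3, 5)
```

The row order of bowmatrix matches the row order of df, so row i of the matrix still belongs to df["twodig"].iloc[i].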
You can also do something like this, but the result is less than optimal to work with afterwards:
import pandas
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
df["bowmatrix"] = vectorizer.fit_transform(df["descrp"])
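Storing the matrix in a single column is awkward because it has one column per vocabulary term. One possible alternative, sketched here under the assumption that the vocabulary is small enough to hold as a dense array, is to expand the counts into their own columns and join them to twodig. The sample frame and the names counts and result are hypothetical.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for the real DataFrame (sample values are invented)
df = pd.DataFrame({
    "twodig": [11, 22, 33],
    "descrp": ["red apple", "green apple pie", "red cherry pie"],
})

vectorizer = CountVectorizer()
bowmatrix = vectorizer.fit_transform(df["descrp"])

# vocabulary_ maps each term to its column index; sort terms by that index
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Densify the sparse matrix (assumption: it fits comfortably in memory)
counts = pd.DataFrame(bowmatrix.toarray(), columns=terms, index=df.index)

# Keep twodig next to the per-term counts
result = pd.concat([df[["twodig"]], counts], axis=1)
print(result)
```

For a large vocabulary, it may be preferable to keep bowmatrix sparse and carry df["twodig"] alongside it as a separate Series, since both share the same row order.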