Run CountVectorizer on a single-column Series from a two-column dataframe?

How does one convert a single column from a multi-column pandas dataframe into a Series for CountVectorizer?

I have a pandas dataframe with 2 columns and 9372 records (rows):

  • The first column is called twodig and is an integer
  • The second column is called descrp and is a varchar (string)

After removing stopwords and special characters, I want to use CountVectorizer on the descrp column only, but still keep twodig.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
# Attempt: pass the whole two-column dataframe
bowmatrix = vectorizer.fit_transform(df)

However, CountVectorizer requires the data to be a pandas Series rather than a dataframe, so I tried converting it first:

corpus = pd.Series(df)

But when I run the script, I get this error: Wrong number of items passed 2, placement implies 9372

You can select just that column from your DataFrame with df["descrp"], so your code becomes:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
# df["descrp"] is already a Series, so it can be passed directly
bowmatrix = vectorizer.fit_transform(df["descrp"])
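For illustration, here is a minimal self-contained sketch of the same idea. The three-row dataframe is made up for the example (the real one has 9372 rows); only the column names twodig and descrp come from the question. Selecting a column with df["descrp"] already yields a Series, so no pd.Series(df) conversion is needed, and row i of the resulting matrix corresponds to row i of df, so twodig stays available alongside it.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for the real dataframe (assumed values, for illustration only).
df = pd.DataFrame({
    "twodig": [11, 11, 23],
    "descrp": ["corn farming", "wheat farming", "road construction"],
})

corpus = df["descrp"]                 # selecting one column yields a pandas Series
vectorizer = CountVectorizer()
bowmatrix = vectorizer.fit_transform(corpus)   # sparse matrix, one row per record

vocab = sorted(vectorizer.vocabulary_)         # the terms learned from descrp
labels = df["twodig"].to_numpy()               # twodig, aligned row by row with bowmatrix

print(type(corpus).__name__, bowmatrix.shape, vocab)
```

Because df itself is never modified, twodig can be used later, for example as a label for each row of bowmatrix.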

You can also do something like this, but the result is less convenient to work with afterwards.

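import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
# Store each row's counts as a plain list so every record gets its own vector
df["bowmatrix"] = vectorizer.fit_transform(df["descrp"]).toarray().tolist()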
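Storing the whole result in a single column, as above, makes the counts hard to query. One possible alternative, sketched here under the assumption that the vocabulary is small enough to hold densely in memory, is to expand the counts into one column per term and place them next to twodig. The toy data and the names terms, counts and result are illustrative and not part of the original answer.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for the real dataframe (assumed values, for illustration only).
df = pd.DataFrame({
    "twodig": [11, 11, 23],
    "descrp": ["corn farming", "wheat farming", "road construction"],
})

vectorizer = CountVectorizer()
bowmatrix = vectorizer.fit_transform(df["descrp"])

# vocabulary_ maps each term to its column index in bowmatrix,
# so sorting the terms by that index recovers the column order.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# One column per term, using the same row index as df.
counts = pd.DataFrame(bowmatrix.toarray(), columns=terms, index=df.index)

# twodig followed by the per-term counts.
result = pd.concat([df[["twodig"]], counts], axis=1)
print(result)
```

For a large vocabulary the dense toarray() call may use a lot of memory; pd.DataFrame.sparse.from_spmatrix(bowmatrix, index=df.index, columns=terms) is an option that keeps the counts sparse.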
