Running CountVectorizer on a single-column Series from a two-column DataFrame?

Question

How do I convert a single column of a multi-column pandas DataFrame into a Series for CountVectorizer?

I have a pandas DataFrame with 2 columns x 9372 records (rows):

The first column is called twodig and is an integer
The second column is called descrp and is a varchar
(image of the DataFrame)

After removing stop words and special characters, I want to run CountVectorizer on the descrp column only, while still keeping twodig.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
bowmatrix = vectorizer.fit_transform(df)

However, running CountVectorizer seems to require converting the DataFrame to a pandas Series first, and then passing that Series to CountVectorizer.

corpus = pd.Series(df)

But when I run the script, it raises this error: Wrong number of items passed 2, placement implies 9372

Answer 1

You can simply take that one column from the DataFrame, like this: df["descrp"]. So your code would be:

import pandas
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
bowmatrix = vectorizer.fit_transform(df["descrp"])
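
For anyone who wants to try this without the original data, here is a minimal self-contained sketch. The three-row DataFrame is a hypothetical stand-in for the asker's 9372-row one, and the `labels` name is only for illustration. The reason passing the whole DataFrame does not work is that iterating over a DataFrame yields its column names, not its rows, so CountVectorizer needs the one-dimensional text column instead.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for the asker's DataFrame (same column names).
df = pd.DataFrame({
    "twodig": [11, 22, 11],
    "descrp": ["red apple", "green apple pie", "red pie"],
})

vectorizer = CountVectorizer()

# df["descrp"] is already a pandas Series of strings, so no pd.Series(...) call is needed.
bowmatrix = vectorizer.fit_transform(df["descrp"])

# The result is a sparse matrix: one row per record, one column per vocabulary term.
print(bowmatrix.shape)

# Row i of bowmatrix corresponds to row i of df, so twodig stays aligned by position.
labels = df["twodig"].to_numpy()
```

Because the row order is preserved, `labels[i]` and `bowmatrix[i]` describe the same record, which is one way to keep twodig without putting it through the vectorizer.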

Answer 2

You can do something like this, but the result is not ideal to work with afterwards.

import pandas 
from sklearn.feature_extraction.text import CountVectorizer 

vectorizer = CountVectorizer() 
df["bowmatrix"] = vectorizer.fit_transform(df["descrp"])

2 answers

Answer 1
1, accepted, 2019-10-25 20:11:37

Answer 2
0, 2019-10-25 20:21:37
