Setting a CountVectorizer result into a pandas.DataFrame

Question

I need to populate a pandas.DataFrame with the feature matrix produced by CountVectorizer.

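import pandas
from sklearn.feature_extraction.text import CountVectorizer

# `text` and `train_x` are the asker's own collections of documents
count_vect = CountVectorizer()
count_vect.fit(text)

xtrain_count = count_vect.transform(train_x)  # returns a scipy csr_matrix
SaveTxt = pandas.DataFrame()
SaveTxt['text']=xtrain_count  # this line raises the ValueError below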

But the last line, SaveTxt['text']=xtrain_count, raises the following error:

 raise ValueError('Cannot set a frame with no defined index '
ValueError: Cannot set a frame with no defined index and a value that cannot be converted to a Series

How should I put the result matrix of CountVectorizer into a DataFrame? The CountVectorizer result is a csr_matrix with about 20000 rows and 200000 columns, and its entries are integers (1 to 6).

Answer 1

pd.DataFrame(my_csr_matrix.todense())

Here is a proof of concept:

import random

import lorem
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

m = 10
random.seed(0)
data = [lorem.paragraph() for _ in range(m)]

cv = CountVectorizer()
cv.fit(data)

df = pd.DataFrame(data=cv.transform(data).todense())

print(df.shape)
print(df.head())

Result:

(10, 27)
   0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26
0  1  2  2  3  3  0  2  0  3  1   2   2   2   1   1   5   3   2   1   3   1   0   2   2   1   4   4
1  0  0  4  1  0  0  1  3  0  3   2   0   1   0   1   1   1   5   3   2   0   0   1   0   0   3   1
2  0  2  3  1  1  1  2  0  2  0   1   1   1   1   1   3   2   0   1   2   1   4   3   0   1   2   5
3  3  3  4  7  1  2  4  2  2  0   1   2   1   1   0   0   0   2   1   3   2   2   2   2   0   3   4
4  2  3  1  2  3  4  1  1  4  3   2   4   2   2   3   3   2   0   2   3   2   5   4   3   2   1   2

Answer 1: -1 2019-08-02 14:37:45
