Implementing bag of words in scikit-learn

Question

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
headers = ['label', 'sms_message']
df = pd.read_csv ('spam.csv', names = headers)
df ['label'] = df['label'].map({'ham': 0, 'spam': 1})
print (df.head(7))
print (df.shape)
count_vector = CountVectorizer()
#count_vector.fit(df)
y = count_vector.fit_transform(df)
count_vector.get_feature_names()
doc_array = y.toarray()
print (doc_array)
frequency_matrix = pd.DataFrame(doc_array, columns = count_vector.get_feature_names())
frequency_matrix

Sample data and output:

   label                                        sms_message
0      0  Go until jurong point, crazy.. Available only ...
1      0                      Ok lar... Joking wif u oni...
2      1  Free entry in 2 a wkly comp to win FA Cup fina...
3      0  U dun say so early hor... U c already then say...

(5573, 2)
[[1 0]
 [0 1]]

label   sms_message
0   1   0
1   0   1

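My question:

My csv file is basically many rows of SMS messages.

I don't understand why the output contains only the column labels rather than the words from each row of SMS text.

Thanks for any help.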

Answer 1

Pass only the sms_message column to the count vectorizer, as shown below.

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = ['Tea is an aromatic beverage..',
        'After water, it is the most widely consumed drink in the world',
        'There are many different types of tea.',
        'Tea has a stimulating effect in humans.',
        'Tea originated in Southwest China during the Shang dynasty'] 

df = pd.DataFrame({'sms_message': docs, 'label': np.random.choice([0, 1], size=5)})

cv = CountVectorizer()
counts = cv.fit_transform(df['sms_message'])

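# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df_counts = pd.DataFrame(counts.toarray(), columns=cv.get_feature_names_out())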
df_counts['label'] = df['label']

Output:

df_counts

Out[26]: 
   after  an  are  aromatic  beverage  ...  types  water  widely  world  label
0      0   1    0         1         1  ...      0      0       0      0      1
1      1   0    0         0         0  ...      0      1       1      1      0
2      0   0    1         0         0  ...      1      0       0      0      1
3      0   0    0         0         0  ...      0      0       0      0      1
4      0   0    0         0         0  ...      0      0       0      0      0

[5 rows x 32 columns]
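
The original code produced only two features because iterating over a DataFrame yields its column labels, not its rows, and CountVectorizer simply iterates over whatever it is given. The sketch below illustrates this with a made-up two-row frame; the sample messages are invented for illustration and are not taken from spam.csv.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data standing in for spam.csv
toy = pd.DataFrame({
    'label': [0, 1],
    'sms_message': ['Ok lar, joking with you', 'Free entry to win a prize'],
})

# Iterating over a DataFrame yields its column labels, not its rows
iterated = list(toy)

# So the vectorizer treats 'label' and 'sms_message' as the two "documents"
cv_frame = CountVectorizer()
cv_frame.fit(toy)
vocab_from_frame = sorted(cv_frame.vocabulary_)

# Passing the column (a Series) iterates over the actual messages
cv_column = CountVectorizer()
cv_column.fit(toy['sms_message'])
vocab_from_column = sorted(cv_column.vocabulary_)

print(iterated)
print(vocab_from_frame)
print(vocab_from_column)
```

The first two printed lists contain only the column names, which matches the 2 x 2 matrix in the question; the third contains the words from the messages themselves.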

Answer 2

Following @KRKirov's answer and passing only the 'sms_message' column to the count vectorizer, I edited my code and got the correct output:

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import numpy as np

headers = ['label', 'sms_message']
df = pd.read_csv ('spam.csv', names = headers)
df ['label'] = df['label'].map({'ham': 0, 'spam': 1})
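# regex=True is required in pandas 2.0+, where str.replace treats the pattern literally by default
df['sms_message'] = df['sms_message'].str.lower().str.replace(r'[^\w\s]', '', regex=True)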

count_vector = CountVectorizer()
y = count_vector.fit_transform(df['sms_message'])
doc_array = y.toarray()

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
frequency_matrix = pd.DataFrame(doc_array, columns=count_vector.get_feature_names_out())
frequency_matrix
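
One caveat about the cleaning step: in pandas 2.0 and later, `Series.str.replace` treats its pattern as a literal string unless `regex=True` is passed, so a punctuation pattern like the one above only takes effect with that flag. A small sketch, using invented messages rather than the real spam.csv contents:

```python
import pandas as pd

# Hypothetical sample messages for illustration only
messages = pd.Series(['Free entry!!! Win a prize...', "Ok lar, don't joke."])

# Lowercase, then drop every character that is neither a word character nor whitespace
cleaned = messages.str.lower().str.replace(r'[^\w\s]', '', regex=True)

print(cleaned.tolist())
```

CountVectorizer lowercases its input by default, so the explicit `.str.lower()` is optional when the text goes straight into the vectorizer.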

Answer 1: score 2, accepted, 2019-12-21 12:53:54
Answer 2: score 0, 2019-12-21 14:33:15
