Reading in a SQL file and using CountVectorizer to get word occurrences

I want to read a SQL file and use CountVectorizer to get word occurrences.

I have the following code so far:

import re
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# q (the SQL query string) and dlconn (the database connection) are defined earlier
df = pd.read_sql(q, dlconn)
print(df)

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(df)

print(X_train_counts.shape)
print(count_vect.vocabulary_)

This gives an output of 'cat': 1, 'dog': 0.

It seems that it is taking just the name of the column animal and counting from there.

How do I get it to access the full column and get a chart that shows every word in the column and its frequency?

According to the CountVectorizer docs, the method fit_transform() expects an iterable of strings. It cannot handle a DataFrame directly.

But iterating over a DataFrame returns the labels of the columns, not the values. I suggest you try df.itertuples() instead.

Try something like this:

value_list = [
    row[0]
    for row in df.itertuples(index=False, name=None)]
print(value_list)
print(type(value_list))
print(type(value_list[0]))

X_train_counts = count_vect.fit_transform(value_list)

Each value in value_list should be of type str. Let us know if that helps.
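
Alternatively, if the text sits in a single column, you can pass that column directly: a pandas Series is an iterable of its values, so CountVectorizer accepts it as-is. A minimal sketch, assuming the column is named animal (adjust to your actual column name):

# 'animal' is a hypothetical column name; replace it with your text column.
# A Series iterates over its values, so fit_transform() can consume it directly.
texts = df['animal'].astype(str)
X_train_counts = count_vect.fit_transform(texts)
print(count_vect.vocabulary_)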


Here is a little example:

>>> import pandas as pd
>>> df = pd.DataFrame(['my big dog', 'my lazy cat'])
>>> df
             0
0   my big dog
1  my lazy cat

>>> value_list = [row[0] for row in df.itertuples(index=False, name=None)]
>>> value_list
['my big dog', 'my lazy cat']

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> cv = CountVectorizer()
>>> x_train = cv.fit_transform(value_list)
>>> x_train
<2x5 sparse matrix of type '<class 'numpy.int64'>'
    with 6 stored elements in Compressed Sparse Row format>
>>> x_train.toarray()
array([[1, 0, 1, 0, 1],
       [0, 1, 0, 1, 1]], dtype=int64)
>>> cv.vocabulary_
{'my': 4, 'big': 0, 'dog': 2, 'lazy': 3, 'cat': 1}

And now you can display the word count of each row (each input string separately):

>>> for word, col in cv.vocabulary_.items():
...     for row in range(x_train.shape[0]):
...         print('word:{:10s} | row:{:2d} | count:{:2d}'.format(word, row, x_train[row,col]))
word:my         | row: 0 | count: 1
word:my         | row: 1 | count: 1
word:big        | row: 0 | count: 1
word:big        | row: 1 | count: 0
word:dog        | row: 0 | count: 1
word:dog        | row: 1 | count: 0
word:lazy       | row: 0 | count: 0
word:lazy       | row: 1 | count: 1
word:cat        | row: 0 | count: 0
word:cat        | row: 1 | count: 1
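
If you prefer a tabular view, you can also wrap the matrix in a DataFrame, with one row per input string and one column per word. This is a sketch that assumes scikit-learn 1.0 or newer, where the vectorizer exposes get_feature_names_out(); older releases use get_feature_names() instead:

# Dense word-count table: rows match the input strings, columns are the vocabulary
counts_df = pd.DataFrame(x_train.toarray(), columns=cv.get_feature_names_out())
print(counts_df)        # per-row counts
print(counts_df.sum())  # total count of each word across all rows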

You can also display the total word count (sum of rows):

>>> x_train_sum = x_train.sum(axis=0)
>>> x_train_sum
matrix([[1, 1, 1, 1, 2]], dtype=int64)
>>> for word, col in cv.vocabulary_.items():
...     print('word:{:10s} | count:{:2d}'.format(word, x_train_sum[0, col]))
word:my         | count: 2
word:big        | count: 1
word:dog        | count: 1
word:lazy       | count: 1
word:cat        | count: 1

>>> with open('my-file.csv', 'w') as f:
...     for word, col in cv.vocabulary_.items():
...         f.write('{};{}\n'.format(word, x_train_sum[0, col]))
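
Finally, since you mentioned a chart: one possible sketch is to collect the totals into a pandas Series and draw a bar plot. This assumes matplotlib is installed (pandas delegates .plot() to it), and the output file name word-frequency.png is just an example:

# Map each word to its total count across all rows
totals = {word: int(x_train_sum[0, col]) for word, col in cv.vocabulary_.items()}
freq = pd.Series(totals).sort_values(ascending=False)
print(freq)

# Draw and save a simple bar chart (requires matplotlib)
ax = freq.plot(kind='bar', title='Word frequency')
ax.figure.savefig('word-frequency.png')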

This should clarify how you can use the tools you have.
