I want to read a SQL file and use CountVectorizer to get word occurrences.
I have the following code so far:
import re
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
df = pd.read_sql(q, dlconn)
print(df)
count_vect = CountVectorizer()
X_train_counts= count_vect.fit_transform(df)
print(X_train_counts.shape)
print(count_vect.vocabulary_)
This gives an output of 'cat': 1, 'dog': 0.
It seems that it is taking just the name of the column, animal, and counting from there.
How do I get it to access the full column and get a chart that shows every word in the column and its frequency?
According to the CountVectorizer docs, the method fit_transform() expects an iterable of strings. It cannot handle a DataFrame directly.
Iterating over a DataFrame yields the column labels, not the values, which is why only the column name ends up in the vocabulary. I suggest you try df.itertuples() instead.
Try something like this:
# take the first column of every row as a plain Python value
value_list = [
    row[0]
    for row in df.itertuples(index=False, name=None)
]
print(value_list)
print(type(value_list))
print(type(value_list[0]))
X_train_counts = count_vect.fit_transform(value_list)
Each value in value_list should be of type str. Let us know if that helps.
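As an alternative to itertuples(), selecting the column by name also gives an iterable of values. The following is a minimal sketch, not the asker's actual data: the DataFrame is a hypothetical stand-in for the result of pd.read_sql(q, dlconn), and the column name animal is taken from the question.

```python
import pandas as pd

# Hypothetical stand-in for df = pd.read_sql(q, dlconn)
df = pd.DataFrame({"animal": ["my big dog", "my lazy cat", None]})

# Iterating over a DataFrame yields its column labels, not its values
labels = list(df)

# Selecting the column gives the values; drop NULLs and force everything to str
texts = df["animal"].dropna().astype(str).tolist()

print(labels)
print(texts)
```

The resulting list texts can be passed to count_vect.fit_transform() in the same way as value_list.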
Here is a little example:
>>> import pandas as pd
>>> df = pd.DataFrame(['my big dog', 'my lazy cat'])
>>> df
             0
0   my big dog
1  my lazy cat
>>> value_list = [row[0] for row in df.itertuples(index=False, name=None)]
>>> value_list
['my big dog', 'my lazy cat']
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> cv = CountVectorizer()
>>> x_train = cv.fit_transform(value_list)
>>> x_train
<2x5 sparse matrix of type '<class 'numpy.int64'>'
with 6 stored elements in Compressed Sparse Row format>
>>> x_train.toarray()
array([[1, 0, 1, 0, 1],
       [0, 1, 0, 1, 1]], dtype=int64)
>>> cv.vocabulary_
{'my': 4, 'big': 0, 'dog': 2, 'lazy': 3, 'cat': 1}
And now you can display the word count of each row (each input string separately):
>>> for word, col in cv.vocabulary_.items():
...     for row in range(x_train.shape[0]):
...         print('word:{:10s} | row:{:2d} | count:{:2d}'.format(word, row, x_train[row, col]))
word:my         | row: 0 | count: 1
word:my         | row: 1 | count: 1
word:big        | row: 0 | count: 1
word:big        | row: 1 | count: 0
word:dog        | row: 0 | count: 1
word:dog        | row: 1 | count: 0
word:lazy       | row: 0 | count: 0
word:lazy       | row: 1 | count: 1
word:cat        | row: 0 | count: 0
word:cat        | row: 1 | count: 1
You can also display the total word count (sum of rows):
>>> x_train_sum = x_train.sum(axis=0)
>>> x_train_sum
matrix([[1, 1, 1, 1, 2]], dtype=int64)
>>> for word, col in cv.vocabulary_.items():
...     print('word:{:10s} | count:{:2d}'.format(word, x_train_sum[0, col]))
word:my         | count: 2
word:big        | count: 1
word:dog        | count: 1
word:lazy       | count: 1
word:cat        | count: 1
The same totals can be written to a semicolon-separated file:
>>> with open('my-file.csv', 'w') as f:
...     for word, col in cv.vocabulary_.items():
...         f.write('{};{}\n'.format(word, x_train_sum[0, col]))
This should clarify how you can use the tools you have.
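If you want the result as a single table of words and frequencies, the totals can also be collected into a DataFrame. This is only a sketch under the same assumptions as the small example above: the input strings are the two sample sentences, and the names texts, totals, words and freq are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Sample input; in practice this would be the list of strings from your column
texts = ["my big dog", "my lazy cat"]

cv = CountVectorizer()
counts = cv.fit_transform(texts)

# Sum each vocabulary column over all rows and flatten to a 1-D array
totals = np.asarray(counts.sum(axis=0)).ravel()

# Order the words by their column index so they line up with totals
words = sorted(cv.vocabulary_, key=cv.vocabulary_.get)

# One row per word, most frequent first
freq = (
    pd.DataFrame({"word": words, "count": totals})
    .sort_values("count", ascending=False, kind="stable")
    .reset_index(drop=True)
)
print(freq)
```

From here, freq.to_csv() writes the table to a file, and if matplotlib is installed, freq.plot.bar(x="word", y="count") should draw it as a bar chart.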