I have a corpus of 2000+ text documents and I'm trying to build a document-term matrix as a pandas DataFrame in the most elegant way. The matrix would look like this:
df = pd.DataFrame(index=['Doc1_name', 'Doc2_name', 'Doc3_name', '...', 'Doc2000_name'],
                  columns=['word1', 'word2', 'word3', '...', 'word50956'])
df.iloc[:, :] = 'count_word'
print(df)
I already have all the documents as full text in a list called "texts". I hope my question is clear enough.
Use sklearn's CountVectorizer:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
df = pd.DataFrame({'texts': ["This is one text (the first one)",
                             "This is the second text",
                             "And, finally, a third text"]})
cv = CountVectorizer()
cv.fit(df['texts'])
results = cv.transform(df['texts'])
print(results.shape)  # sparse matrix, (3, 10)
If the corpus is small enough to fit in memory (and 2000+ documents is small enough), you can convert the sparse matrix into a pandas DataFrame as follows:
features = cv.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0
df_res = pd.DataFrame(results.toarray(), columns=features)
df_res is the result you want:

   and  finally  first  is  one  second  text  the  third  this
0    0        0      1   1    2       0     1    1      0     1
1    0        0      0   1    0       1     1    1      0     1
2    1        1      0   0    0       0     1    0      1     0
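If the vocabulary grows too large, CountVectorizer can trim it at fit time. A minimal sketch (the parameter values here are illustrative, not recommendations):

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["This is one text (the first one)",
         "This is the second text",
         "And, finally, a third text"]

cv_small = CountVectorizer(
    stop_words='english',  # drop English stopwords such as "the" and "and"
    max_df=1.0,            # ignore words in more than this fraction of documents (1.0 is the default)
    min_df=1,              # ignore words in fewer than this many documents (1 is the default)
    max_features=5,        # keep only the 5 most frequent remaining words
)
counts = cv_small.fit_transform(texts)
print(counts.shape)
```

Because max_features caps the vocabulary, the resulting matrix can never have more than 5 columns here, no matter how large the corpus.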
In case you get a MemoryError, you can reduce the vocabulary of words to consider using different parameters of CountVectorizer:

- stop_words='english', to ignore English stopwords (like "the" and "and")
- min_df and max_df, which make CountVectorizer ignore some words based on document frequency (too frequent or too infrequent words, which may be useless)
- max_features, to use only the n most common words

For any not-small corpus of text I would strongly recommend using scikit-learn's CountVectorizer.
It's as simple as:
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer()
word_counts = count_vectorizer.fit_transform(corpus) # list of documents (as strings)
It doesn't give you the DataFrame in exactly your desired structure, but it shouldn't be hard to construct it using the vocabulary_ attribute of count_vectorizer, which maps each term to its column index in the result matrix.
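A sketch of that construction, reusing the three sample texts from above (the document names in the index are placeholders):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["This is one text (the first one)",
          "This is the second text",
          "And, finally, a third text"]
doc_names = ["Doc1_name", "Doc2_name", "Doc3_name"]  # placeholder index labels

count_vectorizer = CountVectorizer()
word_counts = count_vectorizer.fit_transform(corpus)

# vocabulary_ maps each term to its column index in word_counts;
# sorting the terms by that index recovers the column order of the matrix.
columns = sorted(count_vectorizer.vocabulary_,
                 key=count_vectorizer.vocabulary_.get)
df = pd.DataFrame(word_counts.toarray(), index=doc_names, columns=columns)
```

With 2000 documents you would build doc_names from your filenames instead of hard-coding them.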
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
doc_term_matrix = count_vect.fit_transform(df['texts'])
# convert the sparse doc-term matrix to a dense array
df_vector = pd.DataFrame(doc_term_matrix.toarray())
df_vector.columns = count_vect.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0
df_vector.head()
def create_doc_term_matrix(text, vectorizer):
    doc_term_matrix = vectorizer.fit_transform(text)
    return pd.DataFrame(doc_term_matrix.toarray(),
                        columns=vectorizer.get_feature_names_out())