Word count matrix of a document corpus with a pandas DataFrame

I have a corpus of 2000+ text documents and I'm trying to build a word count matrix as a pandas DataFrame in the most elegant way. The matrix would look like this:

import pandas as pd

df = pd.DataFrame(index=['Doc1_name', 'Doc2_name', 'Doc3_name', '...', 'Doc2000_name'],
                  columns=['word1', 'word2', 'word3', '...', 'word50956'])
df.iloc[:, :] = 'count_word'  # each cell would hold that word's count in that document
print(df)

I already have the full text of every document in a list called "texts". I don't know if my question is clear enough.

Use sklearn's CountVectorizer:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer


df = pd.DataFrame({'texts': ["This is one text (the first one)",
                             "This is the second text",
                             "And, finally, a third text"
                            ]})

cv = CountVectorizer()
cv.fit(df['texts'])

results = cv.transform(df['texts'])

print(results.shape)  # Sparse matrix, (3, 10)

If the corpus is small enough to fit in memory (and 2000+ documents is small enough), you can convert the sparse matrix into a pandas DataFrame as follows:

features = cv.get_feature_names()  # in scikit-learn >= 1.2, use cv.get_feature_names_out()
df_res = pd.DataFrame(results.toarray(), columns=features)

df_res is the result you want:

df_res
   and  finally  first  is  one  second  text  the  third  this
0    0        0      1   1    2       0     1    1      0     1
1    0        0      0   1    0       1     1    1      0     1
2    1        1      0   0    0       0     1    0      1     0
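
If you also want the document names as the row index, as in the question's sketch, you can pass them when building the DataFrame. doc_names below is a hypothetical list, one name per row of the matrix:

doc_names = ['Doc1_name', 'Doc2_name', 'Doc3_name']  # hypothetical: one name per document
df_res = pd.DataFrame(results.toarray(), columns=features, index=doc_names)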

In case you get a MemoryError, you can reduce the vocabulary of words to consider using different parameters of CountVectorizer (see the sketch after this list):

  1. Set stop_words='english' to ignore English stopwords (like "the" and "and").
  2. Use min_df and max_df, which make CountVectorizer ignore words based on document frequency (words that are too frequent or too rare, which may be useless).
  3. Use max_features to keep only the n most frequent words.
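
A minimal sketch combining those parameters (the threshold values below are illustrative assumptions, not tuned recommendations):

cv = CountVectorizer(stop_words='english',  # drop English stopwords such as "the" and "and"
                     min_df=2,              # ignore words appearing in fewer than 2 documents
                     max_df=0.95,           # ignore words appearing in over 95% of documents
                     max_features=10000)    # keep only the 10000 most frequent words
results = cv.fit_transform(texts)           # texts: the list of 2000+ document strings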

For any non-trivial corpus of text I would strongly recommend using scikit-learn's CountVectorizer.

It's as simple as:

from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()
word_counts = count_vectorizer.fit_transform(corpus) # list of documents (as strings)

It doesn't give you the DataFrame in exactly your desired structure, but it shouldn't be hard to construct it using the vocabulary_ attribute of count_vectorizer, which maps each term to its column index in the result matrix.
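
For example, a minimal sketch of that construction, sorting the vocabulary by column index to recover the column order (the doc_names list is an assumption for illustration, not part of the original answer):

import pandas as pd

# vocabulary_ maps each term to its column index; sort terms by that index
columns = sorted(count_vectorizer.vocabulary_, key=count_vectorizer.vocabulary_.get)

# assumed placeholder names, one per document in the corpus
doc_names = [f'Doc{i+1}_name' for i in range(word_counts.shape[0])]
df = pd.DataFrame(word_counts.toarray(), index=doc_names, columns=columns)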

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
doc_term_matrix = count_vect.fit_transform(df['texts'])

# convert the sparse document-term matrix to a dense array
df_vector = pd.DataFrame(doc_term_matrix.toarray())
df_vector.columns = count_vect.get_feature_names()  # get_feature_names_out() in scikit-learn >= 1.2
df_vector.head()

def create_doc_term_matrix(text, vectorizer):
    doc_term_matrix = vectorizer.fit_transform(text)
    return pd.DataFrame(doc_term_matrix.toarray(),
                        columns=vectorizer.get_feature_names())
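
A quick usage sketch of that helper, assuming texts is the list of document strings from the question:

vectorizer = CountVectorizer()
dtm = create_doc_term_matrix(texts, vectorizer)
print(dtm.shape)  # (number of documents, vocabulary size)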
