Word count matrix of a document corpus with a pandas DataFrame

I have a corpus of 2000+ text documents and I'm trying to build a word count matrix as a pandas DataFrame in the most elegant way. The matrix would look like this:

import pandas as pd

df = pd.DataFrame(index=['Doc1_name', 'Doc2_name', 'Doc3_name', '...', 'Doc2000_name'],
                  columns=['word1', 'word2', 'word3', '...', 'word50956'])
df.iloc[:, :] = 'count_word'  # each cell would hold that word's count in that document
print(df)

I already have the full text of every document in a list called "texts". I don't know if my question is clear enough.

Use sklearn's CountVectorizer:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer


df = pd.DataFrame({'texts': ["This is one text (the first one)",
                             "This is the second text",
                             "And, finally, a third text"
                            ]})

cv = CountVectorizer()
cv.fit(df['texts'])

results = cv.transform(df['texts'])

print(results.shape)  # Sparse matrix, (3, 10)

If the corpus is small enough to fit in memory (and 2000+ documents is small enough), you can convert the sparse matrix into a pandas DataFrame as follows:

features = cv.get_feature_names()  # in scikit-learn >= 1.2, use cv.get_feature_names_out()
df_res = pd.DataFrame(results.toarray(), columns=features)

df_res is the result you want:

df_res
   and  finally  first  is  one  second  text  the  third  this
0    0        0      1   1    2       0     1    1      0     1
1    0        0      0   1    0       1     1    1      0     1
2    1        1      0   0    0       0     1    0      1     0
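
If you also want the document names as the row index, as in the question's sketch, you can pass them when building the DataFrame. doc_names below is a hypothetical list, one name per row of the matrix:

doc_names = ['Doc1_name', 'Doc2_name', 'Doc3_name']  # hypothetical: one name per document
df_res = pd.DataFrame(results.toarray(), columns=features, index=doc_names)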

In case you get a MemoryError, you can reduce the vocabulary of words to consider using different parameters of CountVectorizer (see the sketch after this list):

  1. Set stop_words='english' to ignore English stopwords (like "the" and "and").
  2. Use min_df and max_df, which make CountVectorizer ignore words based on document frequency (words that are too frequent or too rare, which may be useless).
  3. Use max_features to keep only the n most frequent words.
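
A minimal sketch combining those parameters (the threshold values below are illustrative assumptions, not tuned recommendations):

cv = CountVectorizer(stop_words='english',  # drop English stopwords such as "the" and "and"
                     min_df=2,              # ignore words appearing in fewer than 2 documents
                     max_df=0.95,           # ignore words appearing in over 95% of documents
                     max_features=10000)    # keep only the 10000 most frequent words
results = cv.fit_transform(texts)           # texts: the list of 2000+ document strings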

For any non-trivial corpus of text I would strongly recommend using scikit-learn's CountVectorizer.

It's as simple as:

from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()
word_counts = count_vectorizer.fit_transform(corpus) # list of documents (as strings)

It doesn't give you the DataFrame in exactly your desired structure, but it shouldn't be hard to construct it using the vocabulary_ attribute of count_vectorizer, which maps each term to its column index in the result matrix.
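
For example, a minimal sketch of that construction, sorting the vocabulary by column index to recover the column order (the doc_names list is an assumption for illustration, not part of the original answer):

import pandas as pd

# vocabulary_ maps each term to its column index; sort terms by that index
columns = sorted(count_vectorizer.vocabulary_, key=count_vectorizer.vocabulary_.get)

# assumed placeholder names, one per document in the corpus
doc_names = [f'Doc{i+1}_name' for i in range(word_counts.shape[0])]
df = pd.DataFrame(word_counts.toarray(), index=doc_names, columns=columns)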

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
doc_term_matrix = count_vect.fit_transform(df['texts'])

# convert the sparse document-term matrix to a dense array
df_vector = pd.DataFrame(doc_term_matrix.toarray())
df_vector.columns = count_vect.get_feature_names()  # get_feature_names_out() in scikit-learn >= 1.2
df_vector.head()

def create_doc_term_matrix(text, vectorizer):
    doc_term_matrix = vectorizer.fit_transform(text)
    return pd.DataFrame(doc_term_matrix.toarray(),
                        columns=vectorizer.get_feature_names())
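
A quick usage sketch of that helper, assuming texts is the list of document strings from the question:

vectorizer = CountVectorizer()
dtm = create_doc_term_matrix(texts, vectorizer)
print(dtm.shape)  # (number of documents, vocabulary size)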
