
How to classify and count the number of words in Python

I have a dataset of comments from Twitter (e.g. 10 instances). I want to classify and count similar words using scikit-learn in Python, with output like the following:

**Dataset:** 
    comment_text
    r u cmng or u not cmng
    I am fine, r u fine
    my frnd is gr8, wll dn.
    we r nt going tday
    I have a fever.

The output should look like this:

 Words    Count

u         3
r         3
i         2
cmng      2
fine,     1
wll       1
have      1
fever.    1
not       1
tday      1
my        1
we        1
a         1
or        1
nt        1
going     1
fine      1
dn.       1
gr8,      1
frnd      1
am        1
is        1
dtype: int64

I used this code, but it shows the wrong output:

    from sklearn.feature_extraction.text import TfidfVectorizer

    text = train_dataset_male['comment_text']
    print(text)
    vectorizer = TfidfVectorizer()
    # tokenize and build vocab
    vectorizer.fit(text)
    # summarize
    print(vectorizer.vocabulary_)
    print(vectorizer.idf_)
    # encode document
    vector = vectorizer.transform([text[0]])
    # summarize encoded vector
    print(vector.shape)
    print(vector.toarray())

Python's standard library has a neat module called "collections" for this kind of task. It provides Counter, a dictionary subclass that keeps track of individual items and counts how many times each one appears in an iterable (list, tuple, etc.).

so...

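    from collections import Counter

    # `dataset` is assumed to be an iterable of comment strings; lowercase
    # and split it so that words, not whole comments, are counted
    words = " ".join(dataset).lower().split()
    text_counter = Counter(words)
    # to access the number of times the word "you" is seen (None if absent)
    text_counter.get("you")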
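
As a fuller sketch of the Counter approach, the following tallies the five sample comments from the question. The names "comments" and "count_words" are illustrative and not part of the original code, and it assumes that lowercasing and splitting on whitespace is the tokenization you want.

```python
from collections import Counter

# Sample comments copied from the question; in practice this would be
# the comment_text column of the DataFrame (an assumption here).
comments = [
    "r u cmng or u not cmng",
    "I am fine, r u fine",
    "my frnd is gr8, wll dn.",
    "we r nt going tday",
    "I have a fever.",
]


def count_words(texts):
    """Lowercase each text, split on whitespace, and tally the tokens."""
    counter = Counter()
    for text in texts:
        counter.update(text.lower().split())
    return counter


word_counts = count_words(comments)

# most_common() yields (word, count) pairs, highest count first.
print(f"{'Words':<10}Count")
for word, count in word_counts.most_common():
    print(f"{word:<10}{count}")
```

Because the split is on whitespace only, punctuation stays attached, so "fine," and "fine" are counted separately, as in the expected output in the question. Stripping punctuation before splitting would merge them.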
