I have a dataset of comments from Twitter (e.g. 10 instances). I want to count how many times each word occurs, using scikit-learn in Python, and get output like the following:
**Dataset:**
comment_text
r u cmng or u not cmng
I am fine, r u fine
my frnd is gr8, wll dn.
we r nt going tday
I have a fever.
The output should look like this:
Words Count
u 3
r 3
i 2
cmng 2
fine, 1
wll 1
have 1
fever. 1
not 1
tday 1
my 1
we 1
a 1
or 1
nt 1
going 1
fine 1
dn. 1
gr8, 1
frnd 1
am 1
is 1
dtype: int64
I used this code, but it shows the wrong output:
from sklearn.feature_extraction.text import TfidfVectorizer

text = train_dataset_male['comment_text']
print(text)
vectorizer = TfidfVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# summarize
print(vectorizer.vocabulary_)
print(vectorizer.idf_)
# encode document
vector = vectorizer.transform([text[0]])
# summarize encoded vector
print(vector.shape)
print(vector.toarray())
Python has a neat module in the standard library called "collections" for this type of thing. It provides Counter, a dict subclass that keeps track of individual items and counts how many times each one appears in an iterable (list, tuple, etc.). To count words rather than whole comments, split each comment into words before passing it to Counter.
so...
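from collections import Counter
# `text` is the comment_text column from the question;
# lowercase everything and split on whitespace to get individual words
words = " ".join(text).lower().split()
text_counter = Counter(words)
# to access the number of times the word "u" is seen (None if it never appears)
text_counter.get("u")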
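For reference, here is a minimal self-contained sketch of the same idea, run on the five sample comments from the question. The list `comments` and the helper `count_words` are names made up for this illustration; with a real DataFrame you would pass the comment_text column instead. It assumes that lowercasing and splitting on whitespace is the tokenization you want.

```python
from collections import Counter

# Sample comments copied from the question; in practice this would be
# the comment_text column (an assumption about how the data is stored).
comments = [
    "r u cmng or u not cmng",
    "I am fine, r u fine",
    "my frnd is gr8, wll dn.",
    "we r nt going tday",
    "I have a fever.",
]


def count_words(lines):
    """Count lowercase, whitespace-separated tokens across all lines."""
    counts = Counter()
    for line in lines:
        # update() adds one to the count of every token in the list
        counts.update(line.lower().split())
    return counts


word_counts = count_words(comments)

# most_common() yields (word, count) pairs, highest count first
for word, count in word_counts.most_common():
    print(word, count)
```

Because the split is on whitespace only, punctuation stays attached to the word, so "fine," and "fine" are counted separately, as in the expected output in the question. Strip punctuation first if you want them merged. This may also explain the "wrong" output from TfidfVectorizer: its default token pattern keeps only tokens of two or more alphanumeric characters, so single-letter words such as "u" and "r" are dropped, and it reports TF-IDF weights rather than raw counts.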