How can I use the TF-IDF vectorizer from the scikit-learn library to extract unigrams and bigrams from tweets? I want to train a classifier with the output.
This is the code from scikit-learn:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
TfidfVectorizer has an ngram_range parameter that determines the range of n-grams included as features in the final matrix. In your case, you want (1, 2) to go from unigrams to bigrams:
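import pandas as pd

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus).toarray()
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(X, columns=vectorizer.get_feature_names_out())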
        and  and this  document  document is     first  first document  \
0  0.000000  0.000000  0.314532     0.000000  0.388510        0.388510
1  0.000000  0.000000  0.455513     0.356824  0.000000        0.000000
2  0.357007  0.357007  0.000000     0.000000  0.000000        0.000000
3  0.000000  0.000000  0.282940     0.000000  0.349487        0.349487

         is    is the   is this       one  ...       the  the first  \
0  0.257151  0.314532  0.000000  0.000000  ...  0.257151   0.388510
1  0.186206  0.227756  0.000000  0.000000  ...  0.186206   0.000000
2  0.186301  0.227873  0.000000  0.357007  ...  0.186301   0.000000
3  0.231322  0.000000  0.443279  0.000000  ...  0.231322   0.349487
...
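Since the goal is to train a classifier on these features, here is a minimal sketch of how the vectorizer could feed one. The tweets, the labels, and the choice of LogisticRegression are all assumptions made up for illustration; none of them come from the question, and any scikit-learn classifier could take its place.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, invented for illustration only.
tweets = [
    'I love this phone',
    'What a great day',
    'I hate waiting in line',
    'This is a terrible movie',
]
labels = [1, 1, 0, 0]  # assumed labels: 1 = positive, 0 = negative

# The pipeline fits the vectorizer on the training text and passes
# the unigram + bigram TF-IDF matrix straight to the classifier.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(),
)
model.fit(tweets, labels)

# New text is vectorized with the vocabulary learned during fit.
predictions = model.predict(['I love this movie'])
```

Wrapping both steps in a pipeline means the same fitted vocabulary is reused at prediction time, so you do not have to call transform yourself.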
According to the documentation (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html), you specify the n-grams when initializing TfidfVectorizer: TfidfVectorizer(ngram_range=(min_n, max_n)).
ngram_range sets the lower and upper boundary of the range of n-values for the n-grams to be extracted: (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.
So the answer would be vectorizer = TfidfVectorizer(ngram_range=(1, 2))
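To see the effect of the three settings side by side, a small sketch like the following can help. It reuses the corpus from the question; the helper name vocabulary_for is hypothetical and exists only for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

def vocabulary_for(ngram_range):
    """Hypothetical helper: return the sorted n-grams learned for a given range."""
    vectorizer = TfidfVectorizer(ngram_range=ngram_range)
    vectorizer.fit(corpus)
    # vocabulary_ maps each extracted n-gram to its column index.
    return sorted(vectorizer.vocabulary_)

unigrams = vocabulary_for((1, 1))  # single words only
both = vocabulary_for((1, 2))      # single words and word pairs
bigrams = vocabulary_for((2, 2))   # word pairs only

print(len(unigrams), len(both), len(bigrams))
```

With the default tokenizer, the (1, 2) vocabulary is simply the union of the other two, which is why it is the setting to use when both unigrams and bigrams are wanted as features.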