I have a script that lists the top n words (the words with the highest chi-squared values). However, instead of extracting a fixed number n of words, I want to extract all the words whose p-value is smaller than 0.05, i.e. the words for which the null hypothesis is rejected.
Here is my code:
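from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2
# vectorize the top 100000 n-grams (unigrams to trigrams)
tfidf = TfidfVectorizer(max_features=100000, ngram_range=(1, 3))
X_tfidf = tfidf.fit_transform(df.review_text)
y = df.label
# chi2 returns (statistics, p-values); keep only the statistics here
chi2score = chi2(X_tfidf, y)[0]
# get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0
scores = list(zip(tfidf.get_feature_names_out(), chi2score))
# sort ascending by chi-squared; a new name avoids shadowing the chi2 function
sorted_scores = sorted(scores, key=lambda x: x[1])
allchi2 = list(zip(*sorted_scores))
# lists top 20 words
allchi2 = allchi2[0][-20:]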
So in this case, instead of listing the top 20 words, I want all the words that reject the null hypothesis, i.e. all the words in the reviews that are dependent on the sentiment class (positive or negative).
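from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2
# vectorize top 100000 words
tfidf = TfidfVectorizer(max_features=100000, ngram_range=(1, 3))
X_tfidf = tfidf.fit_transform(df.review_text)
y = df.label
# chi2 returns two arrays: the chi-squared statistics and their p-values
chi2_score, pval_score = chi2(X_tfidf, y)
# keep only the (feature, p-value) pairs with p < 0.05
# (get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names())
feature_pval_items = filter(lambda x: x[1] < 0.05, zip(tfidf.get_feature_names_out(), pval_score))
# sort by p-value, most significant first
you_want_feature_pval_items = sorted(feature_pval_items, key=lambda x: x[1])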