List all the words in a corpus that reject the null hypothesis under a chi-squared test

Question

I have a script that lists the top n words (the words with the highest chi-squared values). However, instead of extracting a fixed number n of words, I want to extract every word whose p-value is smaller than 0.05, i.e. every word that rejects the null hypothesis.

Here is my code:

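from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2

# vectorize the top 100000 n-grams (unigrams to trigrams)
tfidf = TfidfVectorizer(max_features=100000, ngram_range=(1, 3))
X_tfidf = tfidf.fit_transform(df.review_text)
y = df.label
chi2score = chi2(X_tfidf, y)[0]
# get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
scores = list(zip(tfidf.get_feature_names_out(), chi2score))
# sort ascending by score; named sorted_scores so it does not shadow the chi2 function
sorted_scores = sorted(scores, key=lambda x: x[1])
allchi2 = list(zip(*sorted_scores))

# list the top 20 words
allchi2 = allchi2[0][-20:]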

So in this case, instead of listing the top 20 words, I want all the words that reject the null hypothesis, i.e. all the words in the reviews that depend on the sentiment class (positive or negative).

Answer 1

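chi2 returns the p-values as its second result, so filter on those instead of slicing the sorted scores:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2

# vectorize the top 100000 n-grams (unigrams to trigrams)
tfidf = TfidfVectorizer(max_features=100000, ngram_range=(1, 3))
X_tfidf = tfidf.fit_transform(df.review_text)
y = df.label
# chi2() returns two arrays: the chi-squared statistics and their p-values
chi2_score, pval_score = chi2(X_tfidf, y)
# get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
names = tfidf.get_feature_names_out()
# keep only the (word, p-value) pairs with p < 0.05
feature_pval_items = [(word, p) for word, p in zip(names, pval_score) if p < 0.05]
# sort ascending by p-value, so the most significant words come first
significant_items = sorted(feature_pval_items, key=lambda x: x[1])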
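
A note on how the p-value filter relates to the sorted scores in the question: for a binary label such as positive/negative, scikit-learn's chi2 computes its p-values with one degree of freedom (number of classes minus one), so p < 0.05 is the same as a chi-squared score above roughly 3.84. The sketch below illustrates this using only the standard library. The helper chi2_pvalue_df1 and the (word, score) pairs are made up for illustration, and the whole thing assumes exactly two classes.

```python
import math


def chi2_pvalue_df1(stat):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom.

    Assumes a binary label (two classes), which gives one degree of freedom.
    """
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical (word, chi-squared score) pairs, not taken from any real corpus.
scores = [("great", 12.5), ("the", 0.3), ("awful", 4.1), ("movie", 3.2)]

# Keep the words whose p-value falls below the 0.05 significance level.
significant = [(word, s) for word, s in scores if chi2_pvalue_df1(s) < 0.05]

for word, s in significant:
    print(word, s, chi2_pvalue_df1(s))
```

With more than two classes the degrees of freedom change, so the 3.84 cutoff no longer applies and filtering directly on the p-values returned by chi2 is the safer route.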
