TF-IDF by string line rather than whole text document
I have implemented TF-IDF in a simple program, but I want to compute TF-IDF for each line rather than for the whole file.
I used from sklearn.feature_extraction.text import TfidfVectorizer and studied the following link as an example: tf-idf feature weights.
Here is my code:
from sklearn.feature_extraction.text import TfidfVectorizer
f1 = open('testDB.txt','r')
a = []
for line in f1:
    a.append(line.strip())
f1.close()
f2 = open('testDB1.txt','r')
b = []
for line in f2:
    b.append(line.strip())
f2.close()
for i in range(min(len(a), len(b))):
    vectorizer = TfidfVectorizer(min_df=1)
    X = vectorizer.fit_transform(a, b)
    idf = vectorizer.idf_
    print dict(zip(vectorizer.get_feature_names(), idf))
The text files contain:
testDB.txt =
hello my name is tom
epping is based just outside of london football
epping football club is really bad
testDB1.txt =
hello my name is tom
i live in chelmsford and i play football
chelmsford is a lovely city
Output:
{u'based': 1.6931471805599454, u'name': 1.6931471805599454, u'just': 1.6931471805599454, u'outside': 1.6931471805599454, u'club': 1.6931471805599454, u'of': 1.6931471805599454, u'is': 1.0, u'football': 1.2876820724517808, u'epping': 1.2876820724517808, u'bad': 1.6931471805599454, u'london': 1.6931471805599454, u'tom': 1.6931471805599454, u'my': 1.6931471805599454, u'hello': 1.6931471805599454, u'really': 1.6931471805599454}
{u'based': 1.6931471805599454, u'name': 1.6931471805599454, u'just': 1.6931471805599454, u'outside': 1.6931471805599454, u'club': 1.6931471805599454, u'of': 1.6931471805599454, u'is': 1.0, u'football': 1.2876820724517808, u'epping': 1.2876820724517808, u'bad': 1.6931471805599454, u'london': 1.6931471805599454, u'zain': 1.6931471805599454, u'my': 1.6931471805599454, u'hello': 1.6931471805599454, u'really': 1.6931471805599454}
{u'based': 1.6931471805599454, u'name': 1.6931471805599454, u'just': 1.6931471805599454, u'outside': 1.6931471805599454, u'club': 1.6931471805599454, u'of': 1.6931471805599454, u'is': 1.0, u'football': 1.2876820724517808, u'epping': 1.2876820724517808, u'bad': 1.6931471805599454, u'london': 1.6931471805599454, u'tom': 1.6931471805599454, u'my': 1.6931471805599454, u'hello': 1.6931471805599454, u'really': 1.6931471805599454}
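As you can see, it computes TF-IDF over the whole document rather than for each line of the two text files. I tried using a for loop to handle each line, but I can't figure out what is going wrong.
Ideally, the output would print the TF-IDF for each line, for example:
u'hello': 0.23123, u'my': 0.3123123, u'name': 0.2313213, u'is': 0.3213132, u'tom': 0.3214344
and so on.
If anyone could help me or offer any advice, that would be great.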
Hmm... you are passing a and b here:
for i in range(min(len(a), len(b))):
    vectorizer = TfidfVectorizer(min_df=1)
    X = vectorizer.fit_transform(a, b)
    idf = vectorizer.idf_
    print dict(zip(vectorizer.get_feature_names(), idf))
Both a and b are arrays (lists of strings). You could do something like this instead:
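for i in range(min(len(a), len(b))):
    vectorizer = TfidfVectorizer(min_df=1)
    # fit_transform expects an iterable of documents, so the i-th line of
    # each file goes into one list; a second positional argument would be
    # treated as y and ignored
    X = vectorizer.fit_transform([a[i], b[i]])
    idf = vectorizer.idf_
    print dict(zip(vectorizer.get_feature_names(), idf))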
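A note on why the output in the question never mentions any word from testDB1.txt: in scikit-learn, the second positional parameter of fit_transform is y, which TfidfVectorizer ignores, so only a is vectorized. The printed values are consistent with the library's default smoothed IDF, ln((1 + n) / (1 + df)) + 1, computed over the n = 3 lines of testDB.txt. Below is a minimal sketch of that arithmetic using only the standard library; the helper name smoothed_idf is made up for illustration and is not part of scikit-learn.

```python
import math


def smoothed_idf(n_docs, doc_freq):
    """Smoothed IDF as used by default (smooth_idf=True) in scikit-learn:
    ln((1 + n_docs) / (1 + doc_freq)) + 1."""
    return math.log((1.0 + n_docs) / (1.0 + doc_freq)) + 1.0


# testDB.txt has 3 lines, so each line is one "document".
n_lines = 3

# Document frequencies counted by hand from the three lines of testDB.txt.
doc_freqs = {
    "is": 3,        # appears in every line
    "football": 2,  # appears in lines 2 and 3
    "hello": 1,     # appears in line 1 only
}

idf_values = {term: smoothed_idf(n_lines, df) for term, df in doc_freqs.items()}

for term, value in sorted(idf_values.items()):
    # Roughly 1.0 for "is", 1.2877 for "football", 1.6931 for "hello".
    print(term, value)
```

These match the 1.0, 1.2876820724517808 and 1.6931471805599454 in the printed dictionaries, which is what you would expect if b never entered the vocabulary.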
That said, as mentioned in the comments, it is still not clear what you are trying to do...
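If the goal is one set of TF-IDF weights per line, one possible reading is to treat every line as its own document, fit the vectorizer once on all lines, and then read the non-zero entries of each row of the resulting matrix. The following is only a sketch under that assumption. It inlines the six lines from the question instead of reading the files, uses Python 3 syntax, and the names lines, terms and per_line are illustrative. Note that idf_ holds only the IDF part, while the matrix returned by fit_transform holds the actual TF-IDF weights.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumption: each line of both files is treated as a separate document.
lines = [
    "hello my name is tom",
    "epping is based just outside of london football",
    "epping football club is really bad",
    "hello my name is tom",
    "i live in chelmsford and i play football",
    "chelmsford is a lovely city",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(lines)  # sparse CSR matrix, one row per line

# Newer scikit-learn releases provide get_feature_names_out();
# older ones only have get_feature_names().
if hasattr(vectorizer, "get_feature_names_out"):
    terms = vectorizer.get_feature_names_out()
else:
    terms = vectorizer.get_feature_names()

per_line = []
for i in range(X.shape[0]):
    # Slice the CSR arrays to get the non-zero columns and values of row i.
    start, end = X.indptr[i], X.indptr[i + 1]
    cols = X.indices[start:end]
    vals = X.data[start:end]
    weights = {str(terms[j]): float(w) for j, w in zip(cols, vals)}
    per_line.append(weights)
    print(lines[i], "->", weights)
```

With the default settings, single-character tokens such as "i" and "a" are dropped by the tokenizer, and each row is L2-normalised, so the weights within a line are comparable to each other but are not raw tf * idf products.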