TF-IDF per line of text rather than for the whole file

Question

I have implemented TF-IDF in a simple program, but I want to calculate the TF-IDF per line rather than for the whole file.

I used from sklearn.feature_extraction.text import TfidfVectorizer and followed this question as an example: "tf-idf feature weights using sklearn.feature_extraction.text.TfidfVectorizer".

This is my code:

from sklearn.feature_extraction.text import TfidfVectorizer

f1 = open('testDB.txt','r')
a = []  
for line in f1:
    a.append(line.strip())
f1.close()

f2 = open('testDB1.txt','r')
b = []
for line in f2:
    b.append(line.strip())
f2.close()

for i in range(min(len(a), len(b))):
    vectorizer = TfidfVectorizer(min_df=1)
    X = vectorizer.fit_transform(a, b)
    idf = vectorizer.idf_
    print dict(zip(vectorizer.get_feature_names(), idf))

The text files contain:

testDB.txt =
hello my name is tom
epping is based just outside of london football
epping football club is really bad

testDB1.txt = 
hello my name is tom
i live in chelmsford and i play football
chelmsford is a lovely city

The output:

{u'based': 1.6931471805599454, u'name': 1.6931471805599454, u'just': 1.6931471805599454, u'outside': 1.6931471805599454, u'club': 1.6931471805599454, u'of': 1.6931471805599454, u'is': 1.0, u'football': 1.2876820724517808, u'epping': 1.2876820724517808, u'bad': 1.6931471805599454, u'london': 1.6931471805599454, u'tom': 1.6931471805599454, u'my': 1.6931471805599454, u'hello': 1.6931471805599454, u'really': 1.6931471805599454}
{u'based': 1.6931471805599454, u'name': 1.6931471805599454, u'just': 1.6931471805599454, u'outside': 1.6931471805599454, u'club': 1.6931471805599454, u'of': 1.6931471805599454, u'is': 1.0, u'football': 1.2876820724517808, u'epping': 1.2876820724517808, u'bad': 1.6931471805599454, u'london': 1.6931471805599454, u'zain': 1.6931471805599454, u'my': 1.6931471805599454, u'hello': 1.6931471805599454, u'really': 1.6931471805599454}
{u'based': 1.6931471805599454, u'name': 1.6931471805599454, u'just': 1.6931471805599454, u'outside': 1.6931471805599454, u'club': 1.6931471805599454, u'of': 1.6931471805599454, u'is': 1.0, u'football': 1.2876820724517808, u'epping': 1.2876820724517808, u'bad': 1.6931471805599454, u'london': 1.6931471805599454, u'tom': 1.6931471805599454, u'my': 1.6931471805599454, u'hello': 1.6931471805599454, u'really': 1.6931471805599454}

As you can see, it computes the TF-IDF over the whole document for both text files rather than per line. I tried to implement the per-line calculation with the for loop, but I cannot figure out what is wrong.

Ideally the output would print the TF-IDF per line, e.g.

u'hello': 0.23123, u'my': 0.3123123, u'name': 0.2313213, u'is': 0.3213132, u'tom': 0.3214344

etc.

If anyone can help or offer any advice, that would be great.

Answer 1

Ehm... here you're passing a and b:

for i in range(min(len(a), len(b))):
    vectorizer = TfidfVectorizer(min_df=1)
    X = vectorizer.fit_transform(a, b)
    idf = vectorizer.idf_
    print dict(zip(vectorizer.get_feature_names(), idf))
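A note on why that call prints the same dictionary on every iteration: in scikit-learn the second positional argument of fit and fit_transform is y, which TfidfVectorizer ignores, so b never reaches the vectorizer and the loop refits on the whole of a each time. The sketch below tries to show this under the assumption of Python 3 and a reasonably recent scikit-learn; the file contents from the question are inlined as lists instead of being read from disk.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# The lines of testDB.txt and testDB1.txt from the question, inlined.
a = [
    "hello my name is tom",
    "epping is based just outside of london football",
    "epping football club is really bad",
]
b = [
    "hello my name is tom",
    "i live in chelmsford and i play football",
    "chelmsford is a lovely city",
]

# The second positional argument is `y`; TfidfVectorizer does not use it.
with_b = TfidfVectorizer().fit(a, b)
without_b = TfidfVectorizer().fit(a)

# Both fits learn exactly the same vocabulary and IDF values.
same_vocab = with_b.vocabulary_ == without_b.vocabulary_
same_idf = bool((with_b.idf_ == without_b.idf_).all())
print(same_vocab, same_idf)

# Words that occur only in b were never seen by the vectorizer.
print("chelmsford" in with_b.vocabulary_)
```

Note also that idf_ is a property of the fitted corpus as a whole, so it cannot differ from line to line; the per-line values live in the matrix returned by fit_transform.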

Both a and b are arrays (lists of strings), not single lines. What you could do is this:

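for i in range(min(len(a), len(b))):
    vectorizer = TfidfVectorizer(min_df=1)
    # fit_transform expects a list of documents and ignores its second
    # argument (y), so pass the i-th line of each file together in one list
    X = vectorizer.fit_transform([a[i], b[i]])
    idf = vectorizer.idf_
    print dict(zip(vectorizer.get_feature_names(), idf))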

But as mentioned in the comments, it is not clear what you are trying to do...
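If the goal is the output format shown in the question (one dictionary of weights per line), one possible reading is to treat every line as its own document and read the TF-IDF weights out of the rows of the matrix returned by fit_transform. This is only a sketch under that assumption, written for Python 3; the names lines, terms and per_line are made up for illustration, and the weights depend on the vectorizer defaults (L2 normalisation, smoothed IDF), so they will not match the placeholder numbers in the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Each line of testDB.txt is treated as one document (an assumption).
lines = [
    "hello my name is tom",
    "epping is based just outside of london football",
    "epping football club is really bad",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(lines)  # sparse CSR matrix, one row per line

# vocabulary_ maps term -> column index; invert it to read rows back.
terms = {index: term for term, index in vectorizer.vocabulary_.items()}

per_line = []
for i in range(X.shape[0]):
    # indptr delimits the slice of non-zero entries that belongs to row i.
    start, end = X.indptr[i], X.indptr[i + 1]
    weights = {
        terms[int(j)]: float(w)
        for j, w in zip(X.indices[start:end], X.data[start:end])
    }
    per_line.append(weights)
    print(weights)
```

Because "is" occurs in every line, its weight within a line comes out lower than that of a word such as "tom", which occurs in only one.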

Answer 1
1 2015-04-08 11:37:45
