Calculate tf-idf of strings
I have two documents, doc1.txt and doc2.txt. Their contents are:
#doc1.txt
very good, very bad, you are great
#doc2.txt
very bad, good restaurent, nice place to visit
I want to split my corpus on the delimiter ', ' so that my final DocumentTermMatrix becomes:
        terms
docs    very good   very bad    you are great   good restaurent   nice place to visit
doc1    tf-idf      tf-idf      tf-idf          0                 0
doc2    0           tf-idf      0               tf-idf            tf-idf
I know how to compute a DocumentTermMatrix of individual words (using http://scikit-learn.org/stable/modules/feature_extraction.html ), but I don't know how to compute a DocumentTermMatrix of strings (multi-word phrases) in Python.
You can pass the analyzer argument of TfidfVectorizer a function that extracts features in a customized way:
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['very good, very bad, you are great',
        'very bad, good restaurent, nice place to visit']

# Split each document on ', ' so whole phrases become terms
tfidf = TfidfVectorizer(analyzer=lambda d: d.split(', ')).fit(docs)
print(tfidf.get_feature_names())  # use get_feature_names_out() on scikit-learn >= 1.0
The features produced are:
['good restaurent', 'nice place to visit', 'very bad', 'very good', 'you are great']
In case you really cannot afford loading all of the data into memory, here is a workaround:
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['doc1.txt', 'doc2.txt']

def extract(filename):
    # Read a single file and split it into comma-separated phrase features
    features = []
    with open(filename) as f:
        for line in f:
            features += line.strip().split(', ')
    return features

tfidf = TfidfVectorizer(analyzer=extract).fit(docs)
print(tfidf.get_feature_names())  # use get_feature_names_out() on scikit-learn >= 1.0
This loads one document at a time, without ever holding all of the documents in memory at once.