Calculate tf-idf of strings
I have two documents, doc1.txt and doc2.txt. Their contents are:
#doc1.txt
very good, very bad, you are great
#doc2.txt
very bad, good restaurent, nice place to visit
I want to split my corpus on the delimiter ', ' so that my final DocumentTermMatrix becomes:
        terms
docs    very good   very bad    you are great   good restaurent   nice place to visit
doc1    tf-idf      tf-idf      tf-idf          0                 0
doc2    0           tf-idf      0               tf-idf            tf-idf
I know how to compute a DocumentTermMatrix of individual words (using http://scikit-learn.org/stable/modules/feature_extraction.html ), but I don't know how to compute a DocumentTermMatrix of strings (multi-word phrases) in Python.
You can pass the analyzer argument of TfidfVectorizer a function that extracts features in a customized way:
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['very good, very bad, you are great',
        'very bad, good restaurent, nice place to visit']

# Split each document on ', ' so whole phrases become terms
tfidf = TfidfVectorizer(analyzer=lambda d: d.split(', ')).fit(docs)
print(tfidf.get_feature_names())  # use get_feature_names_out() on scikit-learn >= 1.0
The features produced are:
['good restaurent', 'nice place to visit', 'very bad', 'very good', 'you are great']
In case you really cannot afford loading all of the data into memory, here is a workaround:
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['doc1.txt', 'doc2.txt']

def extract(filename):
    # Read a single file and split it into comma-separated phrase features
    features = []
    with open(filename) as f:
        for line in f:
            features += line.strip().split(', ')
    return features

tfidf = TfidfVectorizer(analyzer=extract).fit(docs)
print(tfidf.get_feature_names())  # use get_feature_names_out() on scikit-learn >= 1.0
This loads one document at a time, without ever holding all of the documents in memory at once.