Tfidf of a list of documents

I have a list of documents (the TDT2 corpus) and I want to get a vocabulary from it using tfidf. Using textblob is taking forever, and given the speed I don't see it producing a vocabulary in under 5-6 days. Is there any other technique to go about this? I came across scikit-learn's tfidf technique, but I am afraid it too will take the same amount of time.

    from sklearn.feature_extraction.text import CountVectorizer

    results = []
    with open("/Users/mxyz/Documents/wholedata/X_train.txt") as f:
        for line in f:
            results.append(line.strip().split('\n'))

    blob = []
    for line in results:
        blob.append(line)

    count_vect = CountVectorizer()

    counts = count_vect.fit_transform(blob)
    print(counts.shape)

This keeps producing an error about not accepting a list, and that the list does not have lower.
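For context, a minimal reproduction of that error, assuming CountVectorizer's default settings (whose preprocessing calls .lower() on each document):

    from sklearn.feature_extraction.text import CountVectorizer

    # A list of lists -- the shape the loop above actually builds.
    docs = [['first document'], ['second document']]

    # CountVectorizer expects an iterable of strings; its default
    # preprocessing calls .lower() on each item, so a list raises:
    # AttributeError: 'list' object has no attribute 'lower'
    CountVectorizer().fit_transform(docs)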

I assume results should just be a list, not a list of lists? If that's the case, change:

    results.append(line.strip().split('\n'))

to:

    results.extend(line.strip().split('\n'))

append is adding the whole list returned by split as a single element in the results list; extend is adding the items from that list to results individually.
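A quick illustration of the difference, using a hypothetical one-line input:

    results = []
    results.append('one line'.strip().split('\n'))
    print(results)   # [['one line']] -- the returned list is a single element

    results = []
    results.extend('one line'.strip().split('\n'))
    print(results)   # ['one line'] -- the list's items are added one by one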

Side-note: As written

    blob = []
    for line in results:
        blob.append(line)

is just doing a shallow copy of results the slow way. You can replace that with either blob = results[:] or blob = list(results) (the latter is slower, but if you didn't know what sort of iterable results was and needed it to be a list and nothing else, that's how you'd do it).
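Putting it all together, a minimal sketch of the corrected pipeline; since the stated goal is a tfidf vocabulary, this swaps in scikit-learn's TfidfVectorizer (an assumption on my part, as the question only shows CountVectorizer):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # One document per line, read into a flat list of strings.
    docs = []
    with open("/Users/mxyz/Documents/wholedata/X_train.txt") as f:
        for line in f:
            line = line.strip()
            if line:                        # skip blank lines
                docs.append(line)

    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(docs)  # sparse n_docs x n_terms matrix
    print(tfidf.shape)
    print(len(vectorizer.vocabulary_))      # size of the learned vocabulary

On a flat list of strings, scikit-learn's vectorizers are typically far faster than looping over textblob documents, so the 5-6 day estimate should not carry over.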
