Tfidf of a list of documents

I have a list of documents (the TDT2 corpus) and I want to get a vocabulary from it using tfidf. Using textblob is taking forever, and given the speed I don't see it producing a vocabulary in under 5-6 days. Is there any other technique to go about this? I came across scikit-learn's tfidf technique, but I am afraid it too will take the same amount of time.

    from sklearn.feature_extraction.text import CountVectorizer

    results = []
    with open("/Users/mxyz/Documents/wholedata/X_train.txt") as f:
        for line in f:
            results.append(line.strip().split('\n'))

    blob = []
    for line in results:
        blob.append(line)

    count_vect = CountVectorizer()

    counts = count_vect.fit_transform(blob)
    print(counts.shape)

This keeps producing an error about not accepting a list, and that the list does not have lower.
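For context, a minimal reproduction of that error, assuming CountVectorizer's default settings (whose preprocessing calls .lower() on each document):

    from sklearn.feature_extraction.text import CountVectorizer

    # A list of lists -- the shape the loop above actually builds.
    docs = [['first document'], ['second document']]

    # CountVectorizer expects an iterable of strings; its default
    # preprocessing calls .lower() on each item, so a list raises:
    # AttributeError: 'list' object has no attribute 'lower'
    CountVectorizer().fit_transform(docs)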

I assume results should just be a list, not a list of lists? If that's the case, change:

    results.append(line.strip().split('\n'))

to:

    results.extend(line.strip().split('\n'))

append is adding the whole list returned by split as a single element in the results list; extend is adding the items from that list to results individually.
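A quick illustration of the difference, using a hypothetical one-line input:

    results = []
    results.append('one line'.strip().split('\n'))
    print(results)   # [['one line']] -- the returned list is a single element

    results = []
    results.extend('one line'.strip().split('\n'))
    print(results)   # ['one line'] -- the list's items are added one by one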

Side-note: As written

    blob = []
    for line in results:
        blob.append(line)

is just doing a shallow copy of results the slow way. You can replace that with either blob = results[:] or blob = list(results) (the latter is slower, but if you didn't know what sort of iterable results was and needed it to be a list and nothing else, that's how you'd do it).
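Putting it all together, a minimal sketch of the corrected pipeline; since the stated goal is a tfidf vocabulary, this swaps in scikit-learn's TfidfVectorizer (an assumption on my part, as the question only shows CountVectorizer):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # One document per line, read into a flat list of strings.
    docs = []
    with open("/Users/mxyz/Documents/wholedata/X_train.txt") as f:
        for line in f:
            line = line.strip()
            if line:                        # skip blank lines
                docs.append(line)

    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(docs)  # sparse n_docs x n_terms matrix
    print(tfidf.shape)
    print(len(vectorizer.vocabulary_))      # size of the learned vocabulary

On a flat list of strings, scikit-learn's vectorizers are typically far faster than looping over textblob documents, so the 5-6 day estimate should not carry over.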
