Tfidf of a list of documents
I have a list of documents (the TDT2 corpus) and I want to build a vocabulary from it using tf-idf. Using TextBlob is taking forever, and at its current speed I don't expect it to produce a vocabulary in under 5-6 days. Is there any other technique for this? I came across scikit-learn's tf-idf implementation, but I am afraid it will take just as long.
from sklearn.feature_extraction.text import CountVectorizer

results = []
with open("/Users/mxyz/Documents/wholedata/X_train.txt") as f:
    for line in f:
        results.append(line.strip().split('\n'))

blob = []
for line in results:
    blob.append(line)

count_vect = CountVectorizer()
counts = count_vect.fit_transform(blob)
print(counts.shape)
This keeps producing an error saying it does not accept a list, and that a list has no attribute 'lower'.
I assume results should just be a list of strings, not a list of lists? If that's the case, change:

results.append(line.strip().split('\n'))

to:

results.extend(line.strip().split('\n'))

append is adding the whole list returned by split as a single element in the results list; extend is adding the items from that list to results individually.
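The difference is easy to see on a single line of input; this minimal sketch shows what each call leaves in results:

```python
# append: the list returned by split becomes ONE element of results
results = []
results.append("some document text".strip().split('\n'))
print(results)  # [['some document text']] -- a list of lists

# extend: the items of the split list are added individually
results = []
results.extend("some document text".strip().split('\n'))
print(results)  # ['some document text'] -- a flat list of strings
```

CountVectorizer expects an iterable of strings, which is why the flat list from extend works and the nested list from append raises the "no attribute 'lower'" error (it tries to lowercase a list instead of a string).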
Side-note: As written,

blob = []
for line in results:
    blob.append(line)

is just doing a shallow copy of results the slow way. You can replace that with either blob = results[:] or blob = list(results) (the latter is slower, but if you didn't know what sort of iterable results was and needed it to be a list and nothing else, that's how you'd do it).
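Putting the fixes together: since the goal is a tf-idf vocabulary, scikit-learn's TfidfVectorizer (which combines CountVectorizer with a tf-idf transform) can do it in one step. A minimal sketch; the file path is the asker's, so a small inline document list stands in for the corpus here:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# In the real script, read one document per line from the corpus file:
# with open("/Users/mxyz/Documents/wholedata/X_train.txt") as f:
#     docs = [line.strip() for line in f]
docs = [
    "the quick brown fox",
    "the lazy dog",
    "quick quick dog",
]

vect = TfidfVectorizer()
tfidf = vect.fit_transform(docs)  # sparse matrix of shape (n_docs, n_vocab_terms)

print(tfidf.shape)
print(sorted(vect.vocabulary_))  # the learned vocabulary, term -> column index
```

Since docs is a flat list of strings, there is no list-of-lists problem, and the vocabulary comes out of the fitted vectorizer for free via vocabulary_.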