python gensim word2vec gives TypeError: object of type 'generator' has no len() on custom class
I am trying to get word2vec to work in Python 3, but since my dataset is too large to fit comfortably in memory I am loading it via an iterator (from zip files). However, when I run it I get this error:
    Traceback (most recent call last):
      File "WordModel.py", line 85, in <module>
        main()
      File "WordModel.py", line 15, in main
        word2vec = gensim.models.Word2Vec(data,workers=cpu_count())
      File "/home/thijser/.local/lib/python3.7/site-packages/gensim/models/word2vec.py", line 783, in __init__
        fast_version=FAST_VERSION)
      File "/home/thijser/.local/lib/python3.7/site-packages/gensim/models/base_any2vec.py", line 759, in __init__
        self.build_vocab(sentences=sentences, corpus_file=corpus_file, trim_rule=trim_rule)
      File "/home/thijser/.local/lib/python3.7/site-packages/gensim/models/base_any2vec.py", line 936, in build_vocab
        sentences=sentences, corpus_file=corpus_file, progress_per=progress_per, trim_rule=trim_rule)
      File "/home/thijser/.local/lib/python3.7/site-packages/gensim/models/word2vec.py", line 1591, in scan_vocab
        total_words, corpus_count = self._scan_vocab(sentences, progress_per, trim_rule)
      File "/home/thijser/.local/lib/python3.7/site-packages/gensim/models/word2vec.py", line 1576, in _scan_vocab
        total_words += len(sentence)
    TypeError: object of type 'generator' has no len()
Here is the code:
    import zipfile
    import os
    from ast import literal_eval
    from lxml import etree
    import io
    import gensim
    from multiprocessing import cpu_count

    def main():
        data = TrainingData("/media/thijser/Data/DataSets/uit2")
        print(len(data))
        word2vec = gensim.models.Word2Vec(data, workers=cpu_count())
        word2vec.save('word2vec.save')

    class TrainingData:
        size = -1

        def __init__(self, dirname):
            self.data_location = dirname

        def __len__(self):
            if self.size < 0:
                for zipfile in self.get_zips_in_folder(self.data_location):
                    for text_file in self.get_files_names_from_zip(zipfile):
                        self.size = self.size + 1
            return self.size

        def __iter__(self):  # might not fit in memory otherwise
            yield self.get_data()

        def get_data(self):
            for zipfile in self.get_zips_in_folder(self.data_location):
                for text_file in self.get_files_names_from_zip(zipfile):
                    yield self.preproccess_text(text_file)

        def stripXMLtags(self, text):
            tree = etree.parse(text)
            notags = etree.tostring(tree, encoding='utf8', method='text')
            return notags.decode("utf-8")

        def remove_newline(self, text):
            text.replace("\\n", " ")
            return text

        def preproccess_text(self, text):
            text = self.stripXMLtags(text)
            text = self.remove_newline(text)
            return text

        def get_files_names_from_zip(self, zip_location):
            files = []
            archive = zipfile.ZipFile(zip_location, 'r')
            for info in archive.infolist():
                files.append(archive.open(info.filename))
            return files

        def get_zips_in_folder(self, location):
            zip_files = []
            for root, dirs, files in os.walk(location):
                for name in files:
                    if name.endswith((".zip")):
                        filepath = root + "/" + name
                        zip_files.append(filepath)
            return zip_files

    main()
Running

    for d in data:
        for dd in d:
            print(type(dd))

does show me that dd is of type string and contains the correct preprocessed strings (with a length somewhere between 50 and 5000 words each).
Update after discussion:
Your TrainingData class's __iter__() function isn't providing a generator which returns each text in turn, but rather a generator which returns a single other generator. (There's one too many levels of yield.) That's not what Word2Vec is expecting.
Changing the body of your __iter__() method to simply...

    return self.get_data()

...so that __iter__() is a synonym for your get_data(), and just returns the same text-by-text generator that get_data() does, should help.
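Putting the fix together, here is a minimal, self-contained sketch of the corrected iterator pattern. It stubs out the zip-file walk with an in-memory list of raw texts (the class and method names mirror the question, but the stand-in data is illustrative), and also splits each text into tokens, since Word2Vec expects each corpus item to be a list of string tokens:

```python
# Sketch of the corrected pattern: __iter__() returns get_data()'s
# generator directly, instead of yielding it as a single item.
class TrainingData:
    def __init__(self, texts):
        self.texts = texts  # stand-in for walking the zip files

    def get_data(self):
        for text in self.texts:
            # Each yielded item is a list of string tokens.
            yield text.split()

    def __iter__(self):
        # No extra `yield` level -- just hand back the generator.
        return self.get_data()

data = TrainingData(["first document here", "second document"])
for sentence in data:
    print(sentence)
# ['first', 'document', 'here']
# ['second', 'document']
```

Because __iter__() builds a fresh generator on every call, the corpus is also restartable, which gensim needs (it iterates once for the vocabulary scan and again for each training pass).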
Original answer:
You're not showing the TrainingData.preproccess_text() (sic) method, referenced inside get_data(), which is what is actually creating the data Word2Vec is processing. And it's that data that's generating the error.
Word2Vec requires its sentences corpus be an iterable sequence (for which a generator would be appropriate) where each individual item is a list-of-string-tokens.
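To make that requirement concrete, here is a small illustrative contrast (good_corpus/bad_corpus are hypothetical names, not anything from your code): the vocabulary scan calls len() on each corpus item, which works on a list but fails on a generator:

```python
def good_corpus():
    for text in ["first doc", "second doc here"]:
        yield text.split()               # item is a list -> len() works

def bad_corpus():
    for text in ["first doc", "second doc here"]:
        yield (w for w in text.split())  # item is a generator -> no len()

first_good = next(good_corpus())
print(len(first_good))  # 2

first_bad = next(bad_corpus())
try:
    len(first_bad)
except TypeError as e:
    print(e)  # object of type 'generator' has no len()
```

The second case reproduces exactly the TypeError in your traceback, at the same `total_words += len(sentence)` step.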
From that error, it looks like the individual items in your TrainingData sequence may themselves be generators, rather than lists with a readable len().
(Separately, if perchance you're choosing to use generators there because the individual texts may be very, very long, be aware that gensim Word2Vec and related classes only train on individual texts with a length up to 10000 word-tokens. Any words past the 10000th will be silently ignored. If that's a concern, your source texts should be pre-broken into individual texts of 10000 tokens or fewer.)
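If that pre-breaking is needed, one simple approach is to slice each token list into pieces of at most 10000 tokens before yielding them. This is just a sketch (MAX_TOKENS and chunk_tokens are illustrative names, not gensim API):

```python
MAX_TOKENS = 10000  # gensim's per-text training limit

def chunk_tokens(tokens, limit=MAX_TOKENS):
    """Yield successive slices of `tokens`, each at most `limit` long."""
    for start in range(0, len(tokens), limit):
        yield tokens[start:start + limit]

# A 25000-token text becomes three texts, so nothing is silently dropped.
long_text = ["tok"] * 25000
print([len(c) for c in chunk_tokens(long_text)])  # [10000, 10000, 5000]
```

In the corpus class above, get_data() would then `yield from chunk_tokens(tokens)` instead of yielding each over-long token list whole.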