
python gensim word2vec gives TypeError: object of type 'generator' has no len() on custom dataclass

I am trying to get word2vec to work in Python 3, but my dataset is too large to fit easily in memory, so I am loading it via an iterator (from zip files). However, when I run it I get the following error:

Traceback (most recent call last):
  File "WordModel.py", line 85, in <module>
    main()
  File "WordModel.py", line 15, in main
    word2vec = gensim.models.Word2Vec(data,workers=cpu_count())
  File "/home/thijser/.local/lib/python3.7/site-packages/gensim/models/word2vec.py", line 783, in __init__
    fast_version=FAST_VERSION)
  File "/home/thijser/.local/lib/python3.7/site-packages/gensim/models/base_any2vec.py", line 759, in __init__
    self.build_vocab(sentences=sentences, corpus_file=corpus_file, trim_rule=trim_rule)
  File "/home/thijser/.local/lib/python3.7/site-packages/gensim/models/base_any2vec.py", line 936, in build_vocab
    sentences=sentences, corpus_file=corpus_file, progress_per=progress_per, trim_rule=trim_rule)
  File "/home/thijser/.local/lib/python3.7/site-packages/gensim/models/word2vec.py", line 1591, in scan_vocab
    total_words, corpus_count = self._scan_vocab(sentences, progress_per, trim_rule)
  File "/home/thijser/.local/lib/python3.7/site-packages/gensim/models/word2vec.py", line 1576, in _scan_vocab
    total_words += len(sentence)
TypeError: object of type 'generator' has no len()
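For context, Python's len() is simply not defined for generator objects, so any corpus item that is itself a generator will trigger this error when _scan_vocab tries to add up sentence lengths. A minimal illustration (not from the original post):

def make_tokens():
    yield "some"
    yield "words"

gen = make_tokens()
len(gen)  # raises: TypeError: object of type 'generator' has no len()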

Here is the code:

import zipfile
import os
from ast import literal_eval

from lxml import etree
import io
import gensim

from multiprocessing import cpu_count


def main():
    data = TrainingData("/media/thijser/Data/DataSets/uit2")
    print(len(data))
    word2vec = gensim.models.Word2Vec(data,workers=cpu_count())
    word2vec.save('word2vec.save')




class TrainingData:

    size=-1

    def __init__(self, dirname):
        self.data_location = dirname

    def __len__(self):
        if self.size<0: 

            for zipfile in self.get_zips_in_folder(self.data_location): 
                for text_file in self.get_files_names_from_zip(zipfile):
                    self.size=self.size+1
        return self.size            

    def __iter__(self): #might not fit in memory otherwise
        yield self.get_data()

    def get_data(self):


        for zipfile in self.get_zips_in_folder(self.data_location): 
            for text_file in self.get_files_names_from_zip(zipfile):
                yield self.preproccess_text(text_file)


    def stripXMLtags(self,text):

        tree=etree.parse(text)
        notags=etree.tostring(tree, encoding='utf8', method='text')
        return notags.decode("utf-8") 

    def remove_newline(self,text):
        text.replace("\\n"," ")
        return text

    def preproccess_text(self,text):
        text=self.stripXMLtags(text)
        text=self.remove_newline(text)

        return text




    def get_files_names_from_zip(self,zip_location):
        files=[]
        archive = zipfile.ZipFile(zip_location, 'r')

        for info in archive.infolist():
            files.append(archive.open(info.filename))

        return files

    def get_zips_in_folder(self, location):
        zip_files = []
        for root, dirs, files in os.walk(location):
            for name in files:
                if name.endswith(".zip"):
                    filepath = root + "/" + name
                    zip_files.append(filepath)
        return zip_files

main()


for d in data:
    for dd in d:
        print(type(dd))

This shows me that dd is of type string and contains the correct preprocessed texts (each between 50 and 5000 words long).

Update after discussion:

Your TrainingData class's __iter__() function isn't providing a generator which returns each text in turn, but rather a generator which returns a single other generator. (There's one too many levels of yield.) That's not what Word2Vec is expecting.

Changing the body of your __iter__() method to simply...

return self.get_data()

...so that __iter__() is a synonym for your get_data(), and just returns the same text-by-text generator that get_data() does, should help.
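A minimal sketch of the corrected methods, with get_data() also adjusted to yield token lists as described in the original answer below (the whitespace .split() tokenization is an assumption; the post never shows how the texts are tokenized):

    def __iter__(self):
        # hand back the text-by-text generator directly; no extra yield level
        return self.get_data()

    def get_data(self):
        for zip_path in self.get_zips_in_folder(self.data_location):
            for text_file in self.get_files_names_from_zip(zip_path):
                text = self.preproccess_text(text_file)
                # Word2Vec wants each item to be a list of string tokens,
                # not one long string; plain whitespace splitting assumed here
                yield text.split()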

Original answer:

You're not showing the TrainingData.preproccess_text() (sic) method, referenced inside get_data(), which is what actually creates the data Word2Vec is processing. And it's that data that's generating the error.

Word2Vec requires its sentences corpus to be an iterable sequence (for which a generator would be appropriate) where each individual item is a list-of-string-tokens.
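In other words, each item the corpus yields should already be a tokenized list, roughly like this toy in-memory example (not the poster's data):

import gensim

sentences = [
    ["the", "quick", "brown", "fox"],
    ["jumps", "over", "the", "lazy", "dog"],
]
model = gensim.models.Word2Vec(sentences, min_count=1)
print(model.wv.most_similar("fox"))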

From that error, it looks like the individual items in your TrainingData sequence may themselves be generators, rather than lists with a readable len().

(Separately, if perchance you're choosing to use generators there because the individual texts may be very, very long, be aware that gensim Word2Vec and related classes only train on individual texts with a length of up to 10000 word-tokens. Any words past the 10000th will be silently ignored. If that's a concern, your source texts should be pre-broken into individual texts of 10000 tokens or fewer.)
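If some source texts are longer than that, one way to pre-split them (again assuming whitespace tokenization) is a small helper along these lines:

MAX_TOKENS = 10000  # gensim's per-text training limit mentioned above

def split_long_text(tokens, max_tokens=MAX_TOKENS):
    # break one long token list into consecutive chunks of at most max_tokens
    for start in range(0, len(tokens), max_tokens):
        yield tokens[start:start + max_tokens]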
