
Importing CSV takes too long

The Problem

I am writing an App Engine Karaoke Catalogs app. The app is very simple: in the first release, it offers the ability to import CSV song lists into catalogs and display them.

I am having a problem with the CSV import: it takes a very long time (14 hours) to import 17,500 records in my development environment. In the production environment, it imports about 1,000 records and then crashes with a 500 error. I went through the logs but did not find any useful clues.

The Code

import csv
from StringIO import StringIO

import webapp2
from google.appengine.ext import ndb


class Song(ndb.Model):
    sid     = ndb.IntegerProperty()
    title   = ndb.StringProperty()
    singer  = ndb.StringProperty()
    preview = ndb.StringProperty()

    @classmethod
    def new_from_csv_row(cls, row, parent_key):
        song = Song(
                sid=int(row['sid']),
                title=row['title'],
                singer=row['singer'],
                preview=row['preview'],
                key=ndb.Key(Song, row['sid'], parent=parent_key))
        return song

class CsvUpload(webapp2.RequestHandler):
    def get(self):
        # code omitted for brevity
        pass

    def post(self):
        catalog = get_catalog(…) # retrieve old catalog or create new

        # upfile is the contents of the uploaded file, not the filename
        # because the form uses enctype="multipart/form-data"
        upfile = self.request.get('upfile')

        # Create the songs
        csv_reader = csv.DictReader(StringIO(upfile))
        for row in csv_reader:
            song = Song.new_from_csv_row(row, catalog.key)
            song.put()

        self.redirect('/upload')

Sample Data

sid,title,singer,preview
19459,Zoom,Commodores,
19460,Zoot Suit Riot,Cherry Poppin Daddy,
19247,You Are Not Alone,Michael Jackson,Another day has gone. I'm still all alone

Notes

  • In the development environment, I tried to import up to 17,500 records and experienced no crashing.
  • At first, records are created and inserted quickly, but as the datastore grows into the thousands of records, the time it takes to create and insert a record increases to a few seconds.

How do I speed up the import operation? Any suggestion, hint, or tip will be greatly appreciated.

Update

I followed Murph's advice and used a KeyProperty to link a song back to the catalog. The result is about 4 minutes and 20 seconds for 17,500 records, a huge improvement. That means I did not fully understand how NDB works in App Engine, and I still have a long way to go.

While a big improvement, 4+ minutes is admittedly still too long. I am now looking into Tim's and Dave's advice to further shorten the perceived response time of my app.

In Google App Engine's Datastore, writes to an Entity Group are restricted to 1 write per second.

Since you are specifying a "parent" key for every Song, they all end up in one Entity Group, which is very slow.

Would it be acceptable for your use to just use a KeyProperty to keep track of that relationship? That would be much faster, though the data might have more consistency issues.
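
For reference, a minimal sketch of what the KeyProperty approach could look like; the 'Catalog' kind name and the catalog_key property are assumptions, since the question only shows get_catalog():

from google.appengine.ext import ndb


class Song(ndb.Model):
    sid         = ndb.IntegerProperty()
    title       = ndb.StringProperty()
    singer      = ndb.StringProperty()
    preview     = ndb.StringProperty()
    # Plain reference instead of an ancestor key: each Song is the root of
    # its own entity group, so puts are not throttled to ~1/sec per catalog.
    catalog_key = ndb.KeyProperty(kind='Catalog')  # 'Catalog' kind is assumed

    @classmethod
    def new_from_csv_row(cls, row, catalog_key):
        return cls(sid=int(row['sid']),
                   title=row['title'],
                   singer=row['singer'],
                   preview=row['preview'],
                   catalog_key=catalog_key)


def songs_in_catalog(catalog_key):
    # Ordinary (non-ancestor) query; results are only eventually consistent.
    return Song.query(Song.catalog_key == catalog_key).fetch()

The trade-off is exactly the one mentioned above: you lose the strong consistency of ancestor queries, but every Song write goes to its own entity group, so the 1-write-per-second limit no longer applies to the import as a whole.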

In addition to the other answer re: entity groups, if the import process is going to take longer than 60 seconds, use a task; then you have a 10 minute run time.

Store the CSV as a BlobProperty in an entity (if it is <1MB compressed), or in GCS for larger files, then fire off a task which retrieves the CSV from storage and does the processing.
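
A rough sketch of that flow using the deferred library (the CsvFile kind and process_csv function are made-up names, and deferred has to be enabled as a builtin in app.yaml):

import csv
from StringIO import StringIO

import webapp2
from google.appengine.ext import deferred, ndb


class CsvFile(ndb.Model):
    # Holds the raw upload so the task can read it back later (entity must stay <1MB).
    data = ndb.BlobProperty(compressed=True)


def process_csv(csv_key, catalog_key):
    # Runs inside a task queue request, so it gets ~10 minutes instead of 60s.
    # Uses the KeyProperty version of Song sketched in the previous answer.
    blob = csv_key.get()
    for row in csv.DictReader(StringIO(blob.data)):
        Song.new_from_csv_row(row, catalog_key).put()


class CsvUpload(webapp2.RequestHandler):
    def post(self):
        catalog = get_catalog(…)  # unchanged from the question
        stored = CsvFile(data=self.request.get('upfile'))
        stored.put()
        # Hand the slow work to the task queue and return to the user immediately.
        deferred.defer(process_csv, stored.key, catalog.key)
        self.redirect('/upload')

deferred.defer pickles its arguments into the task payload, so passing the two keys (rather than the CSV text itself) keeps the task payload small.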

First off, Tim is on the right track. If you can't get the work done within 60 seconds, defer it to a task. But if you can't get the work done within 10 minutes, fall back on App Engine MapReduce, which apportions the work of processing your CSV across multiple tasks. Consult the demo program, which has some of the pieces that you would need.

For the development-time slowness, are you using the --use_sqlite option when starting the dev_appserver?

Murph touches on the other part of your problem. Using entity groups, you're rate-limited on how many inserts you can do per entity group. Trying to insert 17,500 rows using a single parent isn't going to work at all well. That'll take about 5 hours.

So, do you really need consistent reads? If this is a one-time upload, can you do non-ancestor inserts (with the catalog as a property), then wait a bit for the data to become eventually consistent? This simplifies querying.

If you really, absolutely need consistent reads, you'll probably need to split your writes across multiple parent keys. This will increase your write rate, at the expense of making your ancestor queries more complicated.
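
If you go that route, one possible shape for it is sketched below; the shard count, the CatalogShard kind, and the key naming are invented here purely for illustration:

from google.appengine.ext import ndb

NUM_SHARDS = 20  # illustrative; more shards means a higher aggregate write rate


def shard_parent_key(catalog_key, sid):
    # Deterministically spread songs across NUM_SHARDS synthetic parents.
    shard = sid % NUM_SHARDS
    return ndb.Key('CatalogShard', '%s-%d' % (catalog_key.id(), shard))


def put_song_sharded(row, catalog_key):
    # Uses the original ancestor-style Song model from the question.
    parent = shard_parent_key(catalog_key, int(row['sid']))
    Song(sid=int(row['sid']),
         title=row['title'],
         singer=row['singer'],
         preview=row['preview'],
         parent=parent).put()


def songs_in_catalog_consistent(catalog_key):
    # A strongly consistent read now means one ancestor query per shard,
    # merged on the client side.
    songs = []
    for shard in range(NUM_SHARDS):
        parent = ndb.Key('CatalogShard', '%s-%d' % (catalog_key.id(), shard))
        songs.extend(Song.query(ancestor=parent).fetch())
    return songs

With 20 shards the import is limited to roughly 20 writes per second instead of 1, at the cost of fanning every consistent read out over all the shards.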
