
Importing CSV takes too long

The Problem

I am writing an App Engine Karaoke Catalogs app. The app is very simple: in the first release, it offers the ability to import CSV song lists into catalogs and display them.

I am having a problem with the CSV import: it takes a very long time (14 hours) to import 17,500 records in my development environment. In the production environment, it imports about 1,000 records and then crashes with a 500 error. I went through the logs but did not find any useful clues.

The Code

import csv
from StringIO import StringIO

import webapp2
from google.appengine.ext import ndb

class Song(ndb.Model):
    sid     = ndb.IntegerProperty()
    title   = ndb.StringProperty()
    singer  = ndb.StringProperty()
    preview = ndb.StringProperty()

    @classmethod
    def new_from_csv_row(cls, row, parent_key):
        song = Song(
                sid=int(row['sid']),
                title=row['title'],
                singer=row['singer'],
                preview=row['preview'],
                key=ndb.Key(Song, row['sid'], parent=parent_key))
        return song

class CsvUpload(webapp2.RequestHandler):
    def get(self):
        # code omitted for brevity
        pass

    def post(self):
        catalog = get_catalog(…) # retrieve old catalog or create new

        # upfile is the contents of the uploaded file, not the filename
        # because the form uses enctype="multipart/form-data"
        upfile = self.request.get('upfile')

        # Create the songs
        csv_reader = csv.DictReader(StringIO(upfile))
        for row in csv_reader:
            song = Song.new_from_csv_row(row, catalog.key)
            song.put()

        self.redirect('/upload')

Sample Data

sid,title,singer,preview
19459,Zoom,Commodores,
19460,Zoot Suit Riot,Cherry Poppin Daddy,
19247,You Are Not Alone,Michael Jackson,Another day has gone. I'm still all alone

Notes

  • In the development environment, I tried importing up to 17,500 records and never experienced a crash.
  • At first, records are created and inserted quickly, but as the database grows into the thousands, the time it takes to create and insert each record increases to a few seconds.

How do I speed up the import operation? Any suggestion, hint, or tip will be greatly appreciated.

Update

I followed Murph's advice and used a KeyProperty to link a song back to its catalog. The result is about 4 minutes and 20 seconds for 17,500 records, a huge improvement. It also means I had not fully understood how NDB works in App Engine, and I still have a lot to learn.

While a big improvement, 4+ minutes is admittedly still too long. I am now looking into Tim's and Dave's advice to further shorten the perceived response time of my app.

In Google App Engine's Datastore, writes within a single entity group are limited to about one write per second.

Since you are specifying a "parent" key for every Song, they all end up in one entity group, which is very slow.

Would it be acceptable for your use to just use a KeyProperty to keep track of that relationship? That would be much faster, though the data might have more consistency issues.
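
For illustration, a minimal sketch of that change, assuming a Catalog kind already exists; the property name catalog_key is an assumption, not something from the question:

# Rough sketch, not the poster's actual code: with no parent key, each Song
# is its own entity group, so the one-write-per-second-per-group limit no
# longer serializes the whole import.
from google.appengine.ext import ndb

class Song(ndb.Model):
    catalog_key = ndb.KeyProperty(kind='Catalog')  # link instead of a parent
    sid     = ndb.IntegerProperty()
    title   = ndb.StringProperty()
    singer  = ndb.StringProperty()
    preview = ndb.StringProperty()

    @classmethod
    def new_from_csv_row(cls, row, catalog_key):
        return cls(
                sid=int(row['sid']),
                title=row['title'],
                singer=row['singer'],
                preview=row['preview'],
                catalog_key=catalog_key)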

In addition to the other answer about entity groups: if the import process is going to take longer than 60 seconds, use a task; then you have a 10-minute run time.

Store the CSV as a BlobProperty on an entity (if it compresses to under 1 MB), or in GCS for larger files, then fire off a task that retrieves the CSV from storage and does the processing.
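
A minimal sketch of that idea, assuming the compressed CSV fits in a single entity and reusing the Song model from the question; the CsvStash kind and process_csv function are made-up names:

import csv
from StringIO import StringIO

from google.appengine.ext import deferred, ndb

class CsvStash(ndb.Model):
    # Compressed CSV payload; the entity must stay under 1 MB.
    data = ndb.BlobProperty(compressed=True)

def process_csv(stash_key, catalog_key):
    # Runs on a task queue, so it gets a 10-minute deadline instead of 60 s.
    stash = stash_key.get()
    for row in csv.DictReader(StringIO(stash.data)):
        Song.new_from_csv_row(row, catalog_key).put()

# In CsvUpload.post(), instead of parsing the upload inline:
#     stash = CsvStash(data=self.request.get('upfile'))
#     stash.put()
#     deferred.defer(process_csv, stash.key, catalog.key)
#     self.redirect('/upload')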

First off, Tim is on the right track. If you can't get the work done within 60 seconds, defer it to a task. But if you can't get it done within 10 minutes, fall back on App Engine MapReduce, which apportions the work of processing your CSV across multiple tasks. Consult the demo program, which has some of the pieces you would need.

For development-time slowness, are you using the --use_sqlite option when starting the dev_appserver?

Murph touches on the other part of your problem. Using entity groups, you're rate-limited on how many inserts you can do per entity group. Trying to insert 17,500 rows using a single parent isn't going to work well at all: at roughly one write per second, that would take about 5 hours.

So, do you really need consistent reads? If this is a one-time upload, can you do non-ancestor inserts (with the catalog as a property) and then wait a bit for the data to become eventually consistent? That also simplifies querying.
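
For example, with the catalog stored on each Song as a KeyProperty (as sketched above), the read side becomes a single, eventually consistent query:

songs = Song.query(Song.catalog_key == catalog.key).fetch()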

If you really, absolutely need consistent reads, you'll probably need to split your writes across multiple parent keys. This will increase your write rate, at the expense of making your ancestor queries more complicated.
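
A rough, purely illustrative sketch of that splitting; the CatalogShard kind and NUM_SHARDS value are assumptions:

import random
from google.appengine.ext import ndb

NUM_SHARDS = 8  # each shard root is its own entity group, so roughly 8x the write rate

def shard_parent_key(catalog_id, shard=None):
    # Each shard is a separate root key, hence a separate entity group.
    if shard is None:
        shard = random.randint(0, NUM_SHARDS - 1)
    return ndb.Key('CatalogShard', '%s-%d' % (catalog_id, shard))

# Writes pick a random shard:
#     song = Song.new_from_csv_row(row, shard_parent_key(catalog.key.id()))
# A strongly consistent read then has to run one ancestor query per shard
# and merge the results, which is the added complexity mentioned above.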
