
Large Import into django postgres database

I have a CSV file with 4,500,000 rows that needs to be imported into my Django Postgres database. The file includes relations, so it isn't as easy as using COPY to import the CSV file straight into the database.

If I wanted to load it straight into Postgres, I could change the CSV file to match the database tables, but I'm not sure how to build the relationships, since I need to know each inserted id in order to do that.

Is there a way to generate SQL inserts that will get the last id and use it in later statements?

I initially wrote this using the Django ORM, but it's going to take way too long to do that, and it seems to be slowing down. I removed all of my indexes and constraints, so that shouldn't be the issue.

The database is running locally on my machine. I figured once I get the data into a database, it wouldn't be hard to dump and reload it on the production database.

So how can I get this data into my database with the correct relationships?

Note that I don't know Java, so the answer suggested here isn't very practical for me: Django with huge mysql database

EDIT: Here are more details:

I have a model something like this:

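from django.db import models

# Office and Job are defined first so Person can reference them directly
class Office(models.Model):
    address = models.CharField(max_length=100)

class Job(models.Model):
    title = models.CharField(max_length=100)

class Person(models.Model):
    name = models.CharField(max_length=100)
    offices = models.ManyToManyField(Office)
    # on_delete is mandatory from Django 2.0; CASCADE was the old implicit default
    job = models.ForeignKey(Job, on_delete=models.CASCADE)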

So I have a person who can have 1 job but many offices. (My real model has more fields, but you get the idea).

My CSV file is something like this:

name,office_1,office_2,job
hailey,"123 test st","222 USA ave.",Programmer

There are more fields than that, but I'm only including the relevant ones.

So I need to make the person object and the office objects and relate them. The job objects are already created so all I need to do there is find the job and save it as the person's job.

The original data was not in a database before this. Only the flat file. We are trying to make it relational so there is more flexibility.

Thanks!!!

Well, this is a tough one.

When you say relations, are they all in a single CSV file? I mean something like this, presuming a simple data model with a relation to itself?

id;parent_id;name
4;1;Frank
1;;George
2;1;Costanza
3;1;Stella

If this is the case and it's out of order, I would write a Python script to reorder these and then import them.
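A reordering script of that kind could look roughly like the sketch below. It assumes the `id;parent_id;name` layout from the sample above, with an empty `parent_id` marking a root row; the function name `order_parents_first` is made up for illustration.

```python
import csv
import io
from collections import deque

def order_parents_first(rows):
    """Return the rows ordered so that every parent comes before its children."""
    children = {}
    for row in rows:
        children.setdefault(row["parent_id"], []).append(row)

    ordered = []
    queue = deque(children.get("", []))  # roots have an empty parent_id
    while queue:
        row = queue.popleft()
        ordered.append(row)
        # Rows whose parent id never appears in the file are never queued,
        # so they are left out of the result.
        queue.extend(children.get(row["id"], []))
    return ordered

sample = "id;parent_id;name\n4;1;Frank\n1;;George\n2;1;Costanza\n3;1;Stella\n"
rows = list(csv.DictReader(io.StringIO(sample), delimiter=";"))
ordered = order_parents_first(rows)
```

With the sample data, George (the only root) comes out first, followed by his three children in file order. This holds everything in memory, which is probably fine for a few million short rows but is worth checking against the real file.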

I had a scenario a while back where I had a number of CSV files, each coming from an individual model. I loaded the first parent one, then the second, and so on.

We wrote custom importers here that would read the data from a single CSV and do some processing on it, such as checking whether a record already existed, whether certain values were valid, etc. There was one method for each CSV file.

For CSVs that were big enough, we just split them into smaller files (around 200k records each) and processed them one after the other. The difference is that all the data this big CSV depended on was already in the database, imported by the method described above.
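As a rough sketch of the batching idea: instead of writing separate smaller files, which is what we actually did, the version below reads one file and yields batches of rows in memory. The helper name `iter_chunks` is hypothetical, and the batch size would be something like 200000 for a real run.

```python
import csv
import io
from itertools import islice

def iter_chunks(rows, size):
    """Yield lists of at most `size` rows from any iterable of rows."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk

# Tiny stand-in for the real file; a real run would open the CSV from disk.
sample = io.StringIO("name,job\na,x\nb,y\nc,z\n")
reader = csv.reader(sample)
header = next(reader)  # keep the header out of the batches
chunk_sizes = [len(chunk) for chunk in iter_chunks(reader, 2)]
```

Each batch can then be handed to the importer on its own, so a failure part way through only costs one batch rather than the whole run.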

Without an example, I can't comment much more.

EDIT

Well, since you gave us your model, and based on the fact that the job model is already there, I would go for something like this:

  1. Create a custom function, even if it is one you only invoke from the shell, that receives a single line of the file.
  2. In that function, work out how many offices the person is related to. Check whether each office already exists in the DB. If so, use it to relate the person and the office. If not, create it and then relate them.
  3. Look up the job. Does it exist? Then use it. If not, create it and then use it.

Something like this:

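import csv

# Person, Office and Job are the models from the question

def process_line(line):
    # csv.reader copes with the quoted, comma-separated fields of the sample
    data = next(csv.reader([line]))

    # the job is a required foreign key, so resolve it before saving the person;
    # get_or_create returns a (object, created) tuple
    # (column positions follow the sample header; adjust them for the real file)
    job, _ = Job.objects.get_or_create(title=data[-1])

    person = Person(name=data[0], job=job)
    # fill in the other person details that are in the CSV
    person.save() # you'll need to save to use the m2m

    offices = get_offices_from_line(line) # returns the plain data, not office instances

    for address in offices:
        # the returned object is usable whether it was found or just created
        office, _ = Office.objects.get_or_create(address=address)
        person.offices.add(office)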

Be aware that the function above has not been tested and is not guarded against any kind of error. You'll need to:

  1. Do that yourself;
  2. Create the function that identifies the offices each person has. I don't know the data, but perhaps if you start at the field after the last non-office column and read until the first field after all the offices, you'll be able to grab all of them;
  3. Create a function that parses the file as a whole, iterates over the lines and passes them along to your shiny import function.
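Points 2 and 3 could be sketched roughly as follows. This assumes the header from the question (`name,office_1,office_2,job`), so that office columns can be recognised by an `office_` prefix, and it passes parsed rows rather than raw lines to the import function. The names `get_offices_from_row` and `import_rows` are made up for illustration, and the callback here only records what it receives in place of the real database work.

```python
import csv
import io

def get_offices_from_row(row):
    """Collect the non-empty values of every column named office_<something>."""
    return [
        value
        for key, value in row.items()
        if key and key.startswith("office_") and value
    ]

def import_rows(csv_file, process_row):
    """Call process_row(row, offices) for each data row; return how many were handled."""
    handled = 0
    for row in csv.DictReader(csv_file):
        process_row(row, get_offices_from_row(row))
        handled += 1
    return handled

# Stand-in for the real file, using the sample line from the question.
sample = io.StringIO(
    "name,office_1,office_2,job\n"
    'hailey,"123 test st","222 USA ave.",Programmer\n'
)
seen = []
handled = import_rows(
    sample,
    lambda row, offices: seen.append((row["name"], offices, row["job"])),
)
```

In a real run, the callback would do the `get_or_create` calls and the saving, and `csv_file` would be the opened CSV file rather than an in-memory string.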

Here are the docs for get_or_create: https://docs.djangoproject.com/en/1.8/ref/models/querysets/#get-or-create


 