
Load PostgreSQL database with data from a NetCDF file

I have a netCDF file with eight variables. (Sorry, I can't share the actual file.) Each variable has two dimensions, time and station. Time currently has 14 steps and station holds 38000 different ids. So for 38000 different "locations" (really just ids) we have 8 variables and 14 different times.

$ncdump -h stationdata.nc
netcdf stationdata {
dimensions:
    station = 38000 ;
    name_strlen = 40 ;
    time = UNLIMITED ; // (14 currently)
variables:
    int time(time) ;
            time:long_name = "time" ;
            time:units = "seconds since 1970-01-01" ;
    char station_name(station, name_strlen) ;
            station_name:long_name = "station_name" ;
            station_name:cf_role = "timeseries_id" ;
    float var1(time, station) ;
            var1:long_name = "Variable 1" ;
            var1:units = "m3/s" ;
    float var2(time, station) ;
            var2:long_name = "Variable 2" ;
            var2:units = "m3/s" ;
...

This data needs to be loaded into a PostgreSQL database so that it can be joined to some geometries matching the station_name for later visualization.

Currently I have done this in Python with the netCDF4 module. It works, but it takes forever! Right now I am looping like this:

times = rootgrp.variables['time']
stations = rootgrp.variables['station_name']
var1 = rootgrp.variables['var1']

for timeindex, time in enumerate(times):
    for stationindex, stationnamearr in enumerate(stations):
        var1val = var1[timeindex][stationindex]
        print "INSERT INTO ncdata (validtime, stationname, var1) \
            VALUES ('%s','%s', %s);" % \
            (time, stationnamearr, var1val)

This takes several minutes to run on my machine, and I have a feeling it could be done in a much cleverer way.

Does anyone have an idea how this can be done in a smarter way? Preferably in Python.

Not sure this is the right way to do it, but I found a good way to solve this and thought I should share it.

In the first version the script took about one hour to run. After rewriting the code it now runs in less than 30 seconds!

The big thing was to use numpy arrays and transpose the variable arrays from the NetCDF reader into rows, and then stack all the columns into one matrix. This matrix was then loaded into the db using psycopg2's copy_from function. I got the code for that from this question:

Use binary COPY table FROM with psycopg2

Parts of my code:

import cStringIO

import numpy as np
import psycopg2
from netCDF4 import Dataset, chartostring, num2date

rootgrp = Dataset('stationdata.nc')

dates = num2date(rootgrp.variables['time'][:], units=rootgrp.variables['time'].units)
# Station names as a 1-D string array (chartostring is one way to build it;
# adjust to however your file stores them)
stationnames = chartostring(rootgrp.variables['station_name'][:])
var1 = rootgrp.variables['var1']
var2 = rootgrp.variables['var2']

cpy = cStringIO.StringIO()

for timeindex, time in enumerate(dates):

    # One column of identical timestamps for this time step
    validtimes = np.empty(var1[timeindex].size, dtype="object")
    validtimes.fill(time)

    #  Transpose and stack the arrays of parameters
    #    [a,a,a,a]        [[a,b,c],
    #    [b,b,b,b]  =>     [a,b,c],
    #    [c,c,c,c]         [a,b,c],
    #                      [a,b,c]]

    a = np.hstack((
              validtimes.reshape(validtimes.size, 1),
              stationnames.reshape(stationnames.size, 1),
              var1[timeindex].reshape(var1[timeindex].size, 1),
              var2[timeindex].reshape(var2[timeindex].size, 1)
    ))

    # Fill the cStringIO with a tab-separated text representation of the array
    for row in a:
        cpy.write(row[0].strftime("%Y-%m-%d %H:%M") + '\t' + row[1] + '\t' + '\t'.join([str(x) for x in row[2:]]) + '\n')


conn = psycopg2.connect("host=postgresserver dbname=nc user=user password=passwd")
curs = conn.cursor()

cpy.seek(0)
curs.copy_from(cpy, 'ncdata', columns=('validtime', 'stationname', 'var1', 'var2'))
conn.commit()
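
For reference, copy_from expects the target table to exist already. The question doesn't show the schema of ncdata, so the following is only a guessed-at layout matching the columns written above (timestamp for the valid time, text for the station id, real for the float variables):

import psycopg2

conn = psycopg2.connect("host=postgresserver dbname=nc user=user password=passwd")
curs = conn.cursor()
curs.execute("""
    CREATE TABLE IF NOT EXISTS ncdata (
        validtime   timestamp,
        stationname text,
        var1        real,
        var2        real
    )
""")
conn.commit()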

There are a few simple improvements you can make to speed this up. All of these are independent; you can try all of them or just a couple to see if that's fast enough. They're in roughly ascending order of difficulty:

  • Use the psycopg2 database driver; it's faster.
  • Wrap the whole block of inserts in a transaction. If you're using psycopg2, you're already doing this: it auto-opens a transaction that you have to commit at the end.
  • Collect several rows' worth of values in an array and do a multi-valued INSERT every n rows.
  • Use more than one connection to do the inserts via helper processes - see the multiprocessing module (a sketch follows this list). Threads won't work as well because of GIL (global interpreter lock) issues.
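
A minimal sketch of that last point, assuming the rows have already been built from the netCDF file; the worker count, chunk size and executemany-based insert are illustrative choices, and each worker could just as well use copy_from or a multi-valued INSERT:

import multiprocessing

import psycopg2

worker_conn = None  # one connection per worker process

def init_worker():
    global worker_conn
    worker_conn = psycopg2.connect("host=postgresserver dbname=nc user=user password=passwd")

def insert_chunk(rows):
    # rows is a list of (validtime, stationname, var1) tuples
    curs = worker_conn.cursor()
    curs.executemany(
        "INSERT INTO ncdata (validtime, stationname, var1) VALUES (%s, %s, %s)",
        rows)
    worker_conn.commit()

if __name__ == '__main__':
    # Replace this dummy data with the rows read from the netCDF file
    all_rows = [('2013-01-01 00:00', 'station_%d' % i, 1.0) for i in range(100000)]
    chunks = [all_rows[i:i + 5000] for i in range(0, len(all_rows), 5000)]

    pool = multiprocessing.Pool(processes=4, initializer=init_worker)
    pool.map(insert_chunk, chunks)
    pool.close()
    pool.join()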

If you don't want to use one big transaction, you can set synchronous_commit = off and set a commit_delay so the connection can return before the disk flush actually completes. This won't help you much if you're doing all the work in one transaction.
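
If you go that route, synchronous_commit can be turned off per session from psycopg2 with a plain SET statement; this is just a sketch of that setting (commit_delay is a server-side parameter and changing it may require extra privileges):

import psycopg2

conn = psycopg2.connect("host=postgresserver dbname=nc user=user password=passwd")
curs = conn.cursor()

# Don't wait for the WAL flush on every commit; a crash can lose the most
# recent commits, but it cannot corrupt the database.
curs.execute("SET synchronous_commit TO off")

# ... run the per-row or batched INSERTs and commits here ...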

Multi-valued inserts

Psycopg2 doesn't directly support multi-valued INSERT, but you can just write:

curs.execute("""
INSERT INTO blah(a,b) VALUES
(%s,%s),
(%s,%s),
(%s,%s),
(%s,%s),
(%s,%s);
""", parms);

and loop with something like:

parms = []
rownum = 0
for x in input_data:
    parms.extend([x.firstvalue, x.secondvalue])
    rownum += 1
    if rownum % 5 == 0:
        curs.execute("""INSERT ...""", tuple(parms))
        del parms[:]
# Remember to flush any leftover rows (fewer than 5) with a shorter INSERT
# after the loop ends.
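
As a side note, newer psycopg2 releases (2.7 and later) ship psycopg2.extras.execute_values, which does this batching for you. A minimal sketch using the same hypothetical blah table and input_data as above:

import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("host=postgresserver dbname=nc user=user password=passwd")
curs = conn.cursor()

# The single %s placeholder is expanded into batches of row value lists
execute_values(curs,
               "INSERT INTO blah (a, b) VALUES %s",
               [(x.firstvalue, x.secondvalue) for x in input_data],
               page_size=1000)
conn.commit()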

Organize your loop to access all the variables for each time. In other words, read and write a record at a time rather than a variable at a time. This can speed things up enormously, especially if the source netCDF dataset is stored on a file system with large disk blocks, e.g. 1 MB or larger. For an explanation of why this is faster and a discussion of the order-of-magnitude resulting speedups, see this NCO speedup discussion, starting with entry 7.
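
A minimal sketch of that access pattern, reusing the variable names from the question; writing out the per-time-step rows is left to one of the methods above:

from netCDF4 import Dataset

rootgrp = Dataset('stationdata.nc')
var1 = rootgrp.variables['var1']
var2 = rootgrp.variables['var2']

for timeindex in range(len(rootgrp.variables['time'])):
    # One slice per variable per time step: the whole station axis is read
    # at once instead of element by element.
    var1_row = var1[timeindex, :]
    var2_row = var2[timeindex, :]
    # ... build and write the output rows for this time step here ...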
