
Optimize Netcdf to Python code

I have a Python-to-SQL script that reads a netCDF file and inserts climatic data into a PostgreSQL table, one row at a time. This of course takes forever, and now I would like to figure out how to optimize this code. I have been thinking about building a huge list and then using the COPY command, but I am unsure how that would work. Another way might be to write to a CSV file and then copy that CSV file into the Postgres database using PostgreSQL's COPY command. I guess that would be quicker than inserting one row at a time.

If you have any suggestions on how this could be optimized, then I would really appreciate it. The netcdf file is available here (need to register though): http://badc.nerc.ac.uk/browse/badc/cru/data/cru_ts/cru_ts_3.21/data/pre

# NetCDF to PostGreSQL database
# CRU-TS 3.21 precipitation and temperature data. From NetCDF to database table
# Requires Python2.6, Postgresql, Psycopg2, Scipy
# Tested using Vista 64bit.

# Import modules
import psycopg2, time, datetime
from scipy.io import netcdf

# Establish connection
db1 = psycopg2.connect("host=192.168.1.162 dbname=dbname user=username password=password")
cur = db1.cursor()
### Create Table
print str(time.ctime())+ " Creating precip table."
cur.execute("DROP TABLE IF EXISTS precip;")
cur.execute("CREATE TABLE precip (gid serial PRIMARY KEY not null, year int, month int, lon decimal, lat decimal, pre decimal);")

### Read netcdf file
f = netcdf.netcdf_file('/home/username/output/project_v2/inputdata/precipitation/cru_ts3.21.1901.2012.pre.dat.nc', 'r')
##
### Create lathash
print str(time.ctime())+ " Looping through lat coords."
temp = f.variables['lat'].data.tolist()
lathash = {}
for entry in temp:
    print str(entry)
    lathash[temp.index(entry)] = entry
##
### Create lonhash
print str(time.ctime())+ " Looping through long coords."
temp = f.variables['lon'].data.tolist()
lonhash = {}
for entry in temp:
    print str(entry)
    lonhash[temp.index(entry)] = entry
##
### Loop through every observation. Set timedimension and lat and long observations.
for _month in xrange(1344):

    if _month < 528:
        print(str(_month))
        print("Not yet")
    else:
        thisyear = int((_month)/12+1901)
        thismonth = ((_month) % 12)+1
        thisdate = datetime.date(thisyear,thismonth, 1)
        print(str(thisdate))
        _time = int(_month)
        for _lon in xrange(720):
            for _lat in xrange(360):
                data = [int(thisyear), int(thismonth), lonhash[_lon], lathash[_lat], f.variables[('pre')].data[_time, _lat, _lon]]
                cur.execute("INSERT INTO precip (year, month, lon, lat, pre) VALUES "+str(tuple(data))+";")


db1.commit()
cur.execute("CREATE INDEX idx_precip ON precip USING btree(year, month, lon, lat, pre);")
cur.execute("ALTER TABLE precip ADD COLUMN geom geometry;")
cur.execute("UPDATE precip SET geom = ST_SetSRID(ST_Point(lon,lat), 4326);")
cur.execute("CREATE INDEX idx_precip_geom ON precip USING gist(geom);")


db1.commit()
cur.close()
db1.close()            
print str(time.ctime())+ " Done!"

Use psycopg2's copy_from.

It expects a file-like object, but that can be your own class that reads and processes the input file and returns it on demand via the read() and readline() methods.
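For illustration, a minimal sketch of such a wrapper. RowFile and the rows generator are hypothetical names, not part of psycopg2, and the line format assumes the default tab-separated text layout that copy_from expects:

class RowFile(object):
    """File-like wrapper so copy_from() can pull generated lines on demand."""
    def __init__(self, rows):
        self._rows = rows   # iterator yielding lines such as "1945\t1\t-179.75\t-89.75\t12.3\n"
        self._buf = ''

    def read(self, size=-1):
        # copy_from() calls read(size) repeatedly until it returns ''.
        while size < 0 or len(self._buf) < size:
            try:
                self._buf += next(self._rows)
            except StopIteration:
                break
        if size < 0:
            out, self._buf = self._buf, ''
        else:
            out, self._buf = self._buf[:size], self._buf[size:]
        return out

    def readline(self):
        # psycopg2 also expects readline(); serve one buffered line at a time.
        while '\n' not in self._buf:
            try:
                self._buf += next(self._rows)
            except StopIteration:
                break
        if '\n' in self._buf:
            line, _, self._buf = self._buf.partition('\n')
            return line + '\n'
        line, self._buf = self._buf, ''
        return line

# Usage, assuming `rows` is a generator over the parsed netCDF observations:
# cur.copy_from(RowFile(rows), 'precip', columns=('year', 'month', 'lon', 'lat', 'pre'))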

If you're not confident doing that, you could - as you said - generate a CSV tempfile and then COPY that. For very best performance you'd generate the CSV (Python's csv module is useful), then copy it to the server and use server-side COPY thetable FROM '/local/path/to/file', thus avoiding any network overhead.
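A rough sketch of the CSV route, reusing f, lonhash, lathash, cur and db1 from the script above; the loop bounds are copied from it, and the server-side variant assumes the file sits on the database host and is readable by the PostgreSQL server process:

import csv, tempfile

# Write the parsed observations to a temporary CSV file first.
# 'wb' suits Python 2's csv module; on Python 3 use mode='w', newline=''.
tmp = tempfile.NamedTemporaryFile(mode='wb', suffix='.csv', delete=False)
writer = csv.writer(tmp)
pre = f.variables['pre'].data
for _month in range(528, 1344):
    year, month = _month // 12 + 1901, _month % 12 + 1
    for _lon in range(720):
        for _lat in range(360):
            writer.writerow([year, month, lonhash[_lon], lathash[_lat],
                             pre[_month, _lat, _lon]])
tmp.close()

# Client-side: stream the file through COPY ... FROM STDIN.
with open(tmp.name) as fh:
    cur.copy_expert("COPY precip (year, month, lon, lat, pre) FROM STDIN WITH CSV", fh)
db1.commit()

# Server-side alternative (no network overhead, file must be on the database host):
# cur.execute("COPY precip (year, month, lon, lat, pre) FROM %s WITH CSV", (tmp.name,))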

Most of the time it's easier to use copy ... from stdin via something like psql's \copy or psycopg2's copy_from, and that's plenty fast enough. Especially if you couple it with producer/consumer feeding via Python's multiprocessing module (not as complicated as it sounds), so your code that parses the input isn't stuck waiting while the database writes rows.
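One hedged sketch of that producer/consumer split, reusing the file path, connection string and loop bounds from the script above; chunking by month and the queue size of 4 are arbitrary choices:

import psycopg2
from multiprocessing import Process, Queue
from scipy.io import netcdf
from cStringIO import StringIO   # Python 2; use io.StringIO on Python 3

NC_PATH = '/home/username/output/project_v2/inputdata/precipitation/cru_ts3.21.1901.2012.pre.dat.nc'

def producer(queue):
    """Parse the netCDF file and push one month of tab-separated rows at a time."""
    f = netcdf.netcdf_file(NC_PATH, 'r')
    lats = f.variables['lat'].data.tolist()
    lons = f.variables['lon'].data.tolist()
    pre = f.variables['pre'].data
    for _month in range(528, 1344):
        year, month = _month // 12 + 1901, _month % 12 + 1
        lines = []
        for _lon in range(720):
            for _lat in range(360):
                lines.append('%d\t%d\t%s\t%s\t%s\n' % (
                    year, month, lons[_lon], lats[_lat], pre[_month, _lat, _lon]))
        queue.put(''.join(lines))
    queue.put(None)                      # sentinel: producer is done

def consumer(queue):
    """COPY each chunk into the table while the producer keeps parsing."""
    db = psycopg2.connect("host=192.168.1.162 dbname=dbname user=username password=password")
    cur = db.cursor()
    while True:
        chunk = queue.get()
        if chunk is None:
            break
        cur.copy_from(StringIO(chunk), 'precip',
                      columns=('year', 'month', 'lon', 'lat', 'pre'))
    db.commit()
    cur.close()
    db.close()

if __name__ == '__main__':
    q = Queue(maxsize=4)                 # bounded queue caps memory use
    p = Process(target=producer, args=(q,))
    p.start()
    consumer(q)
    p.join()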

For some more advice on speeding up bulk loading see How to speed up insertion performance in PostgreSQL - but I can see you're already doing at least some of that right, like creating indexes at the end and batching work into transactions.

I had a similar demand, and I rewrote the Numpy array into a PostgreSQL binary input file format. The main drawback is that all columns of the target table need to be inserted, which gets tricky if you need to encode your geometry WKB, however you can use a temporary unlogged table to load the netCDF file into, then select that data into another table with the proper geometry type.

Details here: https://stackoverflow.com/a/8150329/327026
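A rough sketch of the staging-table step in the same style as the script above; the binary COPY encoding itself is covered in the linked answer. precip_stage is a hypothetical name, it assumes the geom column already exists on precip, and UNLOGGED tables require PostgreSQL 9.1+:

# Bulk-load into an unlogged staging table without the geometry column,
# then rewrite into the final table while computing the geometry once.
cur.execute("""
    CREATE UNLOGGED TABLE precip_stage
        (year int, month int, lon decimal, lat decimal, pre decimal);
""")
# ... COPY (text, CSV or binary format) into precip_stage here ...
cur.execute("""
    INSERT INTO precip (year, month, lon, lat, pre, geom)
    SELECT year, month, lon, lat, pre,
           ST_SetSRID(ST_Point(lon, lat), 4326)
    FROM precip_stage;
""")
cur.execute("DROP TABLE precip_stage;")
db1.commit()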
