Creating and populating a table is slow and unstable
When using csv_copy to create/populate a table, I notice that it is sometimes extremely slow. The core code and some sample outputs are below.
I have two questions:
Code:
def create_populate_table(table_name,fields,types,cur):
    sql = 'CREATE TABLE IF NOT EXISTS ' + table_name + ' (\n'
    for i in xrange(len(fields)):
        if i==0:
            sql += fields[i]+' '+types[i]+' NOT NULL PRIMARY KEY,\n'
        elif i==len(fields)-1:
            sql += fields[i]+' '+types[i]+')'
        else:
            sql += fields[i]+' '+types[i]+',\n'
    #print sql
    cur.execute(sql)
    conn.commit()
    print "Table ",table_name," created ",timer()
    cur.execute("SELECT count(*) from "+table_name)
    if cur.fetchone()[0]>0:
        return
    # populate data into created table
    fr= open(file, 'r')
    fr.readline()
    # parse and convert data into unicode
    #data = unicode_csv_reader(fr, delimiter='|')
    # anything can be used as a file if it has .read() and .readline() methods
    data = StringIO.StringIO()
    s=''.join(fr.readlines())
    while(s.find('\r\n')<>-1):
        s=s.replace('\r\n','\n')
    #timer()
    while(s.find('||')<>-1 or s.find('|\n')<>-1 ):
        s=s.replace('||','|0|')
        s=s.replace('|\n','|0\n')
    #timer()
    #print s.split('\t')[:2]
    #exit(0)
    data.write(s)
    data.seek(0)
    try:
        cur.copy_from(data, table_name,sep='|')
        conn.commit()
        print "Table ",table_name," populated ",timer()
    except psycopg2.DatabaseError, e:
        if conn:
            conn.rollback()
        print 'Error %s' % e
    fr.close()
The outputs I see:
ME_Features_20121001.txt Table ME_Features_20121001 created 1.44s None Table ME_Features_20121001 populated 1.48s None
FM_Features_20121001.txt Table FM_Features_20121001 created 67.92s None Table FM_Features_20121001 populated 0.22s None
NationalFile_20121001.txt (700mb) Table NationalFile_20121001 created 9.34s None Table NationalFile_20121001 populated 4963.18s None
NJ_Features_20121001.txt Table NJ_Features_20121001 created 1.65s None Table NJ_Features_20121001 populated 41.11s None
PW_Features_20121001.txt Table PW_Features_20121001 created 1.73s None Table PW_Features_20121001 populated 0.20s None
How is timer() defined? My blind guess (as you didn't provide its code) is that this function calls print directly to output the measured time but doesn't return anything explicitly, hence None is printed. If it's still unclear, look at the example below:
>>> def test():
...     print 'test'
...
>>> print 'This is a', test()
This is a test
None
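If that guess is right, one possible fix is to have the timing helper return its reading instead of printing it. The sketch below is written in Python 3 syntax; the name elapsed() and the module-level _last variable are hypothetical, since the real timer() was not shown.

```python
import time

# Hypothetical module-level state: the moment of the previous reading.
_last = time.perf_counter()

def elapsed():
    """Return the seconds since the previous call as a formatted string."""
    global _last
    now = time.perf_counter()
    delta = now - _last
    _last = now
    # Returning the value (rather than printing it) means the caller's
    # print receives a string, so no stray None appears in the output.
    return "%.2fs" % delta

print("Table", "demo_table", "created", elapsed())
```

Because elapsed() returns a string, the whole message comes out on a single line with no trailing None.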
I'm not sure what you mean by saying that the time varies for creating and populating tables. The time needed to populate a table obviously depends on the amount of data to insert. The time needed to create a table should be more or less the same in each case, so the 67.92s output does look suspicious, but... are you sure it's measured properly?
Again, my blind guess is that timer() prints the time since the last call. Perhaps you should explicitly reset it before starting the operation you want to measure? I guess that those 60 seconds were spent before calling create_populate_table()...
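To make sure only the operation of interest is measured, the clock can be started immediately before it. A minimal sketch in Python 3 syntax follows; the timed() helper, the results dict, and the sleep calls standing in for real work are all hypothetical and not part of the original code.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results=None):
    """Time only the enclosed block; the clock starts on entry."""
    start = time.perf_counter()
    try:
        yield
    finally:
        seconds = time.perf_counter() - start
        if results is not None:
            results[label] = seconds
        print("%s took %.2fs" % (label, seconds))

results = {}
time.sleep(0.3)   # stands in for unrelated work done before the call
with timed("create", results):
    time.sleep(0.05)   # stands in for CREATE TABLE plus commit
```

With this shape, time spent before entering the with block (for example, reading a previous file) cannot be attributed to the table creation step.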