Script optimization for massive updates
I'm working on a script that creates a hash of some data and saves it in the database. The data comes from a SQL query that joins roughly 300k rows with 500k rows. While parsing the results, I create the hash value and write the update to the database through a second connection handler (reusing the first one gives me an "Unread results" error).

After a lot of investigation, what gives me the best performance is the following: re-running the SELECT query in chunks of 50,000 rows, and committing every 200 UPDATE queries.

Here's my script:
commit = ''
stillgoing = True
limit1 = 0
limit2 = 50000
i = 0
while stillgoing:
    j = 0
    # rerun select query every 50000 results
    getProductsQuery = ("SELECT distinct(p.id), p.desc, p.macode, p.manuf, "
                        "u.unit, p.weight, p.number, att1.attr as attribute1, p.vcode, att2.attr as attribute2 "
                        "FROM p "
                        "LEFT JOIN att1 on p.id = att1.attid and att1.attrkey = 'PARAM' "
                        "LEFT JOIN att2 on p.id = att2.attid and att2.attrkey = 'NODE' "
                        "LEFT JOIN u on p.id = u.umid and u.lang = 'EN' "
                        "limit "+str(limit1)+", "+str(limit2))
    db.query(getProductsQuery)
    row = db.fetchone()
    while row is not None:
        i += 1
        j += 1
        id = str(row[0])
        # create hash value
        to_hash = '.'.join([helper.tostr(s) for s in row[1:]])
        hash = hashlib.md5(to_hash.encode('utf-8')).hexdigest()
        # set query
        updQuery = ("update hashtable set hash='"+hash+"' where id="+id+" limit 1")
        # commit every 200 queries
        commit = 'no'
        if (i % 200 == 0):
            i = 0
            commit = 'yes'
        # db2 is a second instance of db connexion
        # home made db connexion class
        # query function takes two parameters: query, boolean for commit
        db2.query(updQuery, commit)
        row = db.fetchone()
    if commit == 'no':
        db2.cnx.commit()
    if j < limit2:
        stillgoing = False
    else:
        limit1 += limit2
Currently the script takes between 1.5 and 2 hours to run completely. This is the best performance I've achieved since the very first version of the script. Is there anything I can do to make it run faster?
I think you should be able to do this entirely within MySQL:
updateProductsQuery = (
    "UPDATE hashtable AS h "
    "JOIN p ON h.id = p.id "
    "LEFT JOIN att1 on p.id = att1.attid and att1.attrkey = 'PARAM' "
    "LEFT JOIN att2 on p.id = att2.attid and att2.attrkey = 'NODE' "
    "LEFT JOIN u on p.id = u.umid and u.lang = 'EN' "
    "SET h.hash = MD5(CONCAT_WS('.', p.desc, p.macode, p.manuf, u.unit, p.weight, "
    "p.number, att1.attr, p.vcode, att2.attr)) "
    "LIMIT " + str(limit1) + ", " + str(limit2))
... LIMIT 0,200 -- touches 200 rows
... LIMIT 200,200 -- touches 400 rows
... LIMIT 400,200 -- touches 600 rows
... LIMIT 600,200 -- touches 800 rows
...
Get the picture? LIMIT + OFFSET is O(N*N): quadratically slow, because each query has to scan past all the rows skipped by the OFFSET before it reaches the rows it actually touches.
To get it down to O(N), you need to do a single linear scan. If the single query (with no LIMIT/OFFSET) takes too long, then walk through the table in 'chunks':
... WHERE id BETWEEN 1 AND 200 -- 200 rows
... WHERE id BETWEEN 201 AND 400 -- 200 rows
... WHERE id BETWEEN 401 AND 600 -- 200 rows
... WHERE id BETWEEN 601 AND 800 -- 200 rows
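For illustration, here is a minimal sketch of driving that chunked UPDATE from Python. It assumes the mysql.connector driver, an integer PRIMARY KEY id on hashtable, and the table and column names from the query above; the original post uses a home-made connection class, so treat the connection details as placeholders.

import mysql.connector

# assumed connection parameters; adapt to your own setup
cnx = mysql.connector.connect(user='user', password='pass', database='mydb')
cnx.autocommit = True   # each chunked UPDATE commits on its own
cur = cnx.cursor()

CHUNK = 10000           # ids per UPDATE statement; tune to taste

# find the id range to walk (assumes an integer PRIMARY KEY on hashtable)
cur.execute("SELECT MIN(id), MAX(id) FROM hashtable")
min_id, max_id = cur.fetchone()

update_chunk = (
    "UPDATE hashtable AS h "
    "JOIN p ON h.id = p.id "
    "LEFT JOIN att1 ON p.id = att1.attid AND att1.attrkey = 'PARAM' "
    "LEFT JOIN att2 ON p.id = att2.attid AND att2.attrkey = 'NODE' "
    "LEFT JOIN u ON p.id = u.umid AND u.lang = 'EN' "
    "SET h.hash = MD5(CONCAT_WS('.', p.desc, p.macode, p.manuf, u.unit, "
    "p.weight, p.number, att1.attr, p.vcode, att2.attr)) "
    "WHERE h.id BETWEEN %s AND %s")

# walk the table one id range at a time: a single linear pass overall
start = min_id
while start <= max_id:
    cur.execute(update_chunk, (start, start + CHUNK - 1))
    start += CHUNK

cur.close()
cnx.close()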
I blog about such here. If the table you are updating is InnoDB and has PRIMARY KEY(id), then chunking by id is very efficient.
You could have autocommit=1 so that each 200-row UPDATE automatically COMMITs.
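As a concrete illustration (again assuming the mysql.connector driver from the sketch above; the original post uses a home-made connection class), autocommit can be switched on either from the driver or as a plain SQL statement:

# enable autocommit on the connection: every UPDATE commits immediately,
# no explicit cnx.commit() needed
cnx.autocommit = True

# equivalent per-session setting issued as SQL
cur.execute("SET autocommit = 1")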
Oh, your tables are using the antique engine, MyISAM? Well, it will run reasonably well.