Script optimization for massive updates
I'm working on a script that creates a hash of some data and saves it in the database. The data comes from a SQL query that joins roughly 300k rows with 500k rows. While parsing the results, I create the hash value and write the update to the database through a second connection handler (reusing the first one gives me an "Unread results" error).

After a lot of investigation, what gives me the best performance is the following: re-running the SELECT query in chunks of 50,000 rows, and committing every 200 UPDATE queries.

Here's my script:
commit = ''
stillgoing = True
limit1 = 0
limit2 = 50000
i = 0
while stillgoing:
    j = 0
    # rerun select query every 50000 results
    getProductsQuery = ("SELECT distinct(p.id), p.desc, p.macode, p.manuf, "
                        "u.unit, p.weight, p.number, att1.attr as attribute1, p.vcode, att2.attr as attribute2 "
                        "FROM p "
                        "LEFT JOIN att1 on p.id = att1.attid and att1.attrkey = 'PARAM' "
                        "LEFT JOIN att2 on p.id = att2.attid and att2.attrkey = 'NODE' "
                        "LEFT JOIN u on p.id = u.umid and u.lang = 'EN' "
                        "limit "+str(limit1)+", "+str(limit2))
    db.query(getProductsQuery)
    row = db.fetchone()
    while row is not None:
        i += 1
        j += 1
        id = str(row[0])
        # create hash value
        to_hash = '.'.join([helper.tostr(s) for s in row[1:]])
        hash = hashlib.md5(to_hash.encode('utf-8')).hexdigest()
        # set query
        updQuery = ("update hashtable set hash='"+hash+"' where id="+id+" limit 1")
        # commit every 200 queries
        commit = 'no'
        if (i % 200 == 0):
            i = 0
            commit = 'yes'
        # db2 is a second instance of db connexion
        # home made db connexion class
        # query function takes two parameters: query, boolean for commit
        db2.query(updQuery, commit)
        row = db.fetchone()
    if commit == 'no':
        db2.cnx.commit()
    if j < limit2:
        stillgoing = False
    else:
        limit1 += limit2
Currently the script takes between 1.5 and 2 hours to run completely. This is the best performance I've achieved since the very first version of the script. Is there anything I can do to make it run faster?
I think you should be able to do this entirely within MySQL:
updateProductsQuery = (
    "UPDATE hashtable AS h "
    "JOIN p ON h.id = p.id "
    "LEFT JOIN att1 on p.id = att1.attid and att1.attrkey = 'PARAM' "
    "LEFT JOIN att2 on p.id = att2.attid and att2.attrkey = 'NODE' "
    "LEFT JOIN u on p.id = u.umid and u.lang = 'EN' "
    "SET h.hash = MD5(CONCAT_WS('.', p.desc, p.macode, p.manuf, u.unit, p.weight, "
    "p.number, att1.attr, p.vcode, att2.attr)) "
    "LIMIT " + str(limit1) + ", " + str(limit2))
... LIMIT 0,200 -- touches 200 rows
... LIMIT 200,200 -- touches 400 rows
... LIMIT 400,200 -- touches 600 rows
... LIMIT 600,200 -- touches 800 rows
...
Get the picture? LIMIT + OFFSET is O(N*N): quadratically slow, because each query has to scan past all the rows skipped by the OFFSET before it reaches the rows it actually touches.
To get it down to O(N), you need to do a single linear scan. If the single query (with no LIMIT/OFFSET) takes too long, then walk through the table in 'chunks':
... WHERE id BETWEEN 1 AND 200 -- 200 rows
... WHERE id BETWEEN 201 AND 400 -- 200 rows
... WHERE id BETWEEN 401 AND 600 -- 200 rows
... WHERE id BETWEEN 601 AND 800 -- 200 rows
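For illustration, here is a minimal sketch of driving that chunked UPDATE from Python. It assumes the mysql.connector driver, an integer PRIMARY KEY id on hashtable, and the table and column names from the query above; the original post uses a home-made connection class, so treat the connection details as placeholders.

import mysql.connector

# assumed connection parameters; adapt to your own setup
cnx = mysql.connector.connect(user='user', password='pass', database='mydb')
cnx.autocommit = True   # each chunked UPDATE commits on its own
cur = cnx.cursor()

CHUNK = 10000           # ids per UPDATE statement; tune to taste

# find the id range to walk (assumes an integer PRIMARY KEY on hashtable)
cur.execute("SELECT MIN(id), MAX(id) FROM hashtable")
min_id, max_id = cur.fetchone()

update_chunk = (
    "UPDATE hashtable AS h "
    "JOIN p ON h.id = p.id "
    "LEFT JOIN att1 ON p.id = att1.attid AND att1.attrkey = 'PARAM' "
    "LEFT JOIN att2 ON p.id = att2.attid AND att2.attrkey = 'NODE' "
    "LEFT JOIN u ON p.id = u.umid AND u.lang = 'EN' "
    "SET h.hash = MD5(CONCAT_WS('.', p.desc, p.macode, p.manuf, u.unit, "
    "p.weight, p.number, att1.attr, p.vcode, att2.attr)) "
    "WHERE h.id BETWEEN %s AND %s")

# walk the table one id range at a time: a single linear pass overall
start = min_id
while start <= max_id:
    cur.execute(update_chunk, (start, start + CHUNK - 1))
    start += CHUNK

cur.close()
cnx.close()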
I blog about such here. If the table you are updating is InnoDB and has PRIMARY KEY(id), then chunking by id is very efficient.
You could have autocommit=1 so that each 200-row UPDATE automatically COMMITs.
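As a concrete illustration (again assuming the mysql.connector driver from the sketch above; the original post uses a home-made connection class), autocommit can be switched on either from the driver or as a plain SQL statement:

# enable autocommit on the connection: every UPDATE commits immediately,
# no explicit cnx.commit() needed
cnx.autocommit = True

# equivalent per-session setting issued as SQL
cur.execute("SET autocommit = 1")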
Oh, your tables are using the antique engine, MyISAM? Well, it will run reasonably well.