
Script optimization for massive updates

I'm working on a script which creates a hash of some data and saves it in the database. The data needed comes from a SQL query which joins around 300k rows with 500k rows. While parsing the results, I create the hash value and update the database using a second connection handler (using the first one gives me an "Unread results" error).

After a lot of investigation, I found that what gives me the best results in terms of performance is the following:

  • Restart the select query every x iterations; otherwise, updates become a lot slower after a certain time.
  • Commit only every 200 queries instead of committing on every query.
  • The tables used for the select query are MyISAM and are indexed on the primary key and the fields used in the join.
  • My hash table is InnoDB and only the primary key (id) is indexed.

Here's my script:

import hashlib

# db, db2 (home-made db connection class) and helper are assumed to be set up above

commit = ''
stillgoing = True
limit1 = 0
limit2 = 50000
i = 0
while stillgoing:    
    j = 0    
    # rerun select query every 50000 results
    getProductsQuery = ("SELECT distinct(p.id), p.desc, p.macode, p.manuf, "
        "u.unit, p.weight, p.number, att1.attr as attribute1, p.vcode, att2.attr as attribute2 "
        "FROM p "
        "LEFT JOIN att1 on p.id = att1.attid and att1.attrkey = 'PARAM' "
        "LEFT JOIN att2 on p.id = att2.attid and att2.attrkey = 'NODE' "
        "LEFT JOIN u on p.id = u.umid and u.lang = 'EN' "
        "limit "+str(limit1)+", "+str(limit2))                           
    db.query(getProductsQuery)
    row = db.fetchone()              
    while row is not None:
        i += 1
        j += 1
        id = str(row[0])
        # create hash value
        to_hash = '.'.join( [ helper.tostr(s) for s in row[1:]] )
        hash = hashlib.md5(to_hash.encode('utf-8')).hexdigest()
        # set query
        updQuery = ("update hashtable set hash='"+hash+"' where id="+id+" limit 1" )         
        # commit every 200 queries
        commit = 'no'
        if (i%200==0):
            i = 0
            commit = 'yes'
        # db2 is a second instance of db connexion
        # home made db connexion class
        # query function takes two parameters: query, boolean for commit
        db2.query(updQuery,commit)            
        row = db.fetchone()        
    if commit == 'no':
        db2.cnx.commit()            
    if j < limit2:
        stillgoing = False
    else:
        limit1 += limit2

Currently the script takes between 1.5 and 2 hours to run completely. This is the best performance I have gotten since the very first version of the script. Is there anything I can do to make it run faster?

I think you should be able to do this entirely within MySQL:

updateProductsQuery = ("""
    UPDATE hashtable AS h
    JOIN p ON h.id = p.id
    LEFT JOIN att1 on p.id = att1.attid and att1.attrkey = 'PARAM'
    LEFT JOIN att2 on p.id = att2.attid and att2.attrkey = 'NODE'
    LEFT JOIN u on p.id = u.umid and u.lang = 'EN'
    SET h.hash = MD5(CONCAT_WS('.', p.desc, p.macode, p.manuf, u.unit, p.weight, p.number, att1.attr, p.vcode, att2.attr))
    LIMIT """ + str(limit1) + ", " + str(limit2))
... LIMIT 0,200  -- touches 200 rows
... LIMIT 200,200  -- touches 400 rows
... LIMIT 400,200  -- touches 600 rows
... LIMIT 600,200  -- touches 800 rows
...

Get the picture? LIMIT + OFFSET is O(N*N). Quadratically slow.
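To put a rough number on that (a sketch only; it reuses the question's ~300k rows and 50,000-row chunk size, and the cost model simply assumes the server has to read past the OFFSET rows before returning each chunk):

# rough cost model: OFFSET-based chunking vs. keyed chunking
rows = 300000
chunk = 50000

offset_scanned = sum(offset + chunk for offset in range(0, rows, chunk))
keyed_scanned = rows  # WHERE id BETWEEN ... reads each row only once

print(offset_scanned)  # 1050000 rows touched
print(keyed_scanned)   # 300000 rows touched

So the OFFSET version touches roughly 3.5 times as many rows here, and the ratio keeps growing as the table grows.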

To get it down to O(N), you need to do a single linear scan. If the single query (with no LIMIT/OFFSET) takes too long, then walk through the table in 'chunks':

... WHERE id BETWEEN 1 AND 200  -- 200 rows
... WHERE id BETWEEN 201 AND 400  -- 200 rows
... WHERE id BETWEEN 401 AND 600  -- 200 rows
... WHERE id BETWEEN 601 AND 800  -- 200 rows

I blog about such here. If the table you are updating is InnoDB and has PRIMARY KEY(id), then chunking by id is very efficient.

You could have autocommit=1 so that each 200-row UPDATE automatically COMMITs.
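If you want to drive that chunked UPDATE from Python instead, a minimal sketch could look like the following. It reuses the home-made connection class from the question (db2.query(query, commit)), assumes it also exposes fetchone() like db does, and keeps the table/column names from the question; the MIN/MAX query is only added here to find the id range to walk:

CHUNK = 200

# find the id range to walk (assumes ids are reasonably dense)
db2.query("SELECT MIN(id), MAX(id) FROM hashtable", 'no')
min_id, max_id = db2.fetchone()

low = min_id
while low <= max_id:
    high = low + CHUNK - 1
    updQuery = ("UPDATE hashtable AS h "
        "JOIN p ON h.id = p.id "
        "LEFT JOIN att1 on p.id = att1.attid and att1.attrkey = 'PARAM' "
        "LEFT JOIN att2 on p.id = att2.attid and att2.attrkey = 'NODE' "
        "LEFT JOIN u on p.id = u.umid and u.lang = 'EN' "
        "SET h.hash = MD5(CONCAT_WS('.', p.desc, p.macode, p.manuf, u.unit, "
        "p.weight, p.number, att1.attr, p.vcode, att2.attr)) "
        "WHERE h.id BETWEEN " + str(low) + " AND " + str(high))
    # commit per chunk (or rely on autocommit=1 and pass 'no')
    db2.query(updQuery, 'yes')
    low = high + 1

Each iteration hashes and updates at most 200 rows, touches them only once, and never rescans earlier rows, so the whole pass stays O(N).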

Oh, your tables are using the antique engine, MyISAM? Well, it will run reasonably well.
