
InnoDB Table Bulk Insert

I have a MySQL data table with about half a billion rows in it. We need to run calculations on this data by reading it, and the calculated data (which is a standardized form of the original data) needs to be written into another InnoDB table.

Our current setup is a virtual cloud containing a machine as well as the database, so the machine-to-DB connection is very fast.

The calculations on the data (as well as reading it) are very fast; the bottleneck of the entire process is the insertion of the standardized data into the InnoDB tables (the standardized data contains a few indexes, though not long ones, which slows down the insertion).

Unfortunately, we cannot modify certain system variables, such as innodb_log_file_size (we are using Amazon AWS), which would help increase insert performance.

What would be our best bet to push all this data into MySQL? Since the calculation process is straightforward, I can pretty much write a Python script that takes the standardized data and outputs it in any format. Inserting this data on the fly as the calculations occur is painfully slow, and it gets slower with time.

I guess the question then is: what is the best process (in terms of input format and actual import) for inserting bulk data into InnoDB tables?

In this case, since you are not doing anything to the base table, and most likely only update the data in the secondary InnoDB table on a scheduled-interval basis, I would prefer the steps below (a rough sketch of steps 1 and 2 follows the list):

  1. Take a mysqldump with a --where option (e.g. --where "id > 91919" or --where "update_time > now() - interval 1 hour"). If possible, avoid locking the table as well.
  2. Restore the data to a temp DB table.
  3. Do your calculation on the temp DB and update the secondary table.
  4. Drop the temp DB/table you created.
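A rough, hypothetical sketch of steps 1 and 2 from a Python driver script (the database name source_db, table events, temp database temp_db, dump file path, and the one-hour window are placeholders, and credentials are assumed to come from a .my.cnf-style option file rather than the command line):

```python
import subprocess

# Step 1: dump only the recently changed rows, without locking the source table.
# --single-transaction gives a consistent InnoDB snapshot instead of table locks.
dump_cmd = [
    "mysqldump",
    "--single-transaction",
    "--no-create-info",                               # data only; the temp table already exists
    "--where=update_time > now() - interval 1 hour",  # same idea as the --where example above
    "source_db", "events",                            # placeholder database / table names
]
with open("/tmp/events_delta.sql", "w") as dump_file:
    subprocess.run(dump_cmd, stdout=dump_file, check=True)

# Step 2: restore the dump into a temp database. Steps 3 and 4 (run the calculation,
# update the secondary table, drop the temp table) would follow from here.
with open("/tmp/events_delta.sql") as dump_file:
    subprocess.run(["mysql", "temp_db"], stdin=dump_file, check=True)
```

The point of the --where filter is simply that only the new or changed rows travel over the wire, so the temp table stays small relative to the half-billion-row base table.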

My first instinct was to ask you to tune your buffer variables, but since you say you can't change much of the server configuration, here is another option...

Do the calculation and dump the output into a CSV. You would use the 'SELECT ... INTO OUTFILE' command for this. Then you'd connect to the target InnoDB table, execute 'set autocommit=0', and follow it with 'load data local infile' to load this CSV back into the target table. Finally, turn autocommit back to 1.
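A minimal sketch of that load step, assuming the calculated rows are written to a local CSV by the Python script and that the mysql-connector-python driver is available (the connection details, table name standardized_data, and column list are made up for illustration):

```python
import csv

import mysql.connector  # assumes the mysql-connector-python package is installed

# Write the calculated rows to a CSV first (schematic; calculated_rows stands in
# for whatever the calculation step actually produces).
calculated_rows = [(1, 0.42), (2, 0.57)]
with open("/tmp/standardized.csv", "w", newline="") as f:
    csv.writer(f).writerows(calculated_rows)

# Load the CSV in one shot instead of row-by-row inserts.
conn = mysql.connector.connect(
    host="db-host", user="app", password="secret",
    database="target_db", allow_local_infile=True,    # required for LOAD DATA LOCAL
)
cur = conn.cursor()
cur.execute("SET autocommit = 0")                     # batch the whole load into one transaction
cur.execute(
    "LOAD DATA LOCAL INFILE '/tmp/standardized.csv' "
    "INTO TABLE standardized_data "
    "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n' "
    "(id, score)"                                     # placeholder column list
)
conn.commit()
cur.execute("SET autocommit = 1")                     # turn autocommit back on
cur.close()
conn.close()
```

Note that LOAD DATA LOCAL also has to be permitted on the server side (the local_infile setting), which may or may not be adjustable in a managed environment.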

There are many other options I could suggest (such as the right partitioning scheme, primary-key-order inserts, etc.), but I'd need to know the structure of your DB, the incoming dataset, and the indexes for that.

Is yours time-series data? I had a similar issue last week. After I loaded into partitions, it became faster. I also optimized my settings based on http://www.ajaydivakaran.com/2013/03/12/mysql-innodb-when-inserts-start-slowing-down/ , but if you can't optimize the settings, use partitioning for faster inserts.
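If the data is time based, here is a hedged sketch of what a range-partitioned target table might look like (the schema, column names, and partition boundaries are invented for illustration; nothing here comes from the question):

```python
import mysql.connector  # assumes the mysql-connector-python package is installed

# Hypothetical range-partitioned target table, partitioned by day on a time column.
# The primary key starts with the time column, so rows inserted in time order append
# to the newest partition instead of scattering across the whole index.
DDL = """
CREATE TABLE standardized_data (
    id          BIGINT   NOT NULL,
    event_time  DATETIME NOT NULL,
    score       DOUBLE   NOT NULL,
    PRIMARY KEY (event_time, id)
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(event_time)) (
    PARTITION p20201101 VALUES LESS THAN (TO_DAYS('2020-11-02')),
    PARTITION p20201102 VALUES LESS THAN (TO_DAYS('2020-11-03')),
    PARTITION pmax      VALUES LESS THAN MAXVALUE
)
"""

conn = mysql.connector.connect(host="db-host", user="app",
                               password="secret", database="target_db")
cur = conn.cursor()
cur.execute(DDL)
cur.close()
conn.close()
```

Inserts that arrive roughly in primary-key order then touch only the newest partition and its index pages, which is what makes the bulk load faster as the table grows.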
