
Optimizing MySQL InnoDB insert through PHP

I have a cron job script, written in PHP, with the following requirements:

  • Step 1 (DB server 1): Get some data from multiple tables (we have a lot of data here).
  • Step 2 (Application server): Perform some calculations.
  • Step 3 (DB server 2): After the calculation, insert that data into another database (MySQL) / table (InnoDB) for reporting purposes. This table contains 97 columns, actually different rates, which cannot be normalized further. It lives on a different physical DB server that hosts only one database.

The script worked fine during development, but in production Step 1 returned approximately 50 million records. As a result, the script ran for around 4 days and then failed. (A rough estimate: at the current rate it would have taken roughly 171 days to finish.)

Just for the record, we are using prepared statements, and Step 1 fetches data in batches of 1,000 records at a time.
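For context, here is a minimal sketch of what such a chunked fetch typically looks like with PDO. The table and column names are made up for illustration, and the LIMIT/OFFSET pattern shown here is exactly what one of the answers below points out as a scaling problem:

    <?php
    // Sketch only: fetch source rows in chunks of 1,000 with a prepared statement.
    // Host, table and column names are hypothetical.
    $pdo = new PDO('mysql:host=db-server-1;dbname=source;charset=utf8mb4', 'user', 'pass');
    $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    $batchSize = 1000;
    $offset    = 0;
    $stmt = $pdo->prepare('SELECT id, rate_a, rate_b FROM source_rates ORDER BY id LIMIT :lim OFFSET :off');

    do {
        $stmt->bindValue(':lim', $batchSize, PDO::PARAM_INT);
        $stmt->bindValue(':off', $offset, PDO::PARAM_INT);
        $stmt->execute();
        $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

        // Step 2: perform the calculations on $rows, then hand them to the insert stage.
        $offset += $batchSize;
    } while (count($rows) === $batchSize);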

What we have done so far

Optimization step 1: multiple values per insert and drop all indexes

Some tests showed that the insert (Step 3 above) was taking most of the time (more than 95%). To optimize it, after some googling, we dropped all indexes from the table and, instead of one insert query per row, we now issue one insert query per 100 rows. This gave us somewhat faster inserts, but as per a rough estimate it would still take 90 days to run the cron once, and we need to run it once every month because new data becomes available monthly.
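A rough sketch of that multi-row insert, assuming a PDO connection to DB server 2 and hypothetical column names (the real table has 97 columns):

    <?php
    // Sketch only: batch 100 rows into a single multi-row INSERT.
    // $pdo is a PDO connection to DB server 2; table/column names are hypothetical.
    function insertBatch(PDO $pdo, array $rows): void
    {
        if (empty($rows)) {
            return;
        }
        $columns      = ['col1', 'col2', 'col3'];   // ... up to 97 columns in the real table
        $rowHolder    = '(' . implode(',', array_fill(0, count($columns), '?')) . ')';
        $placeholders = implode(',', array_fill(0, count($rows), $rowHolder));

        $sql  = 'INSERT INTO report_rates (' . implode(',', $columns) . ') VALUES ' . $placeholders;
        $stmt = $pdo->prepare($sql);

        $values = [];
        foreach ($rows as $row) {
            foreach ($columns as $col) {
                $values[] = $row[$col];      // flatten rows into one bound-parameter list
            }
        }
        $stmt->execute($values);
    }

Wrapping several such batches in a single transaction and committing once also reduces InnoDB's per-statement flush overhead.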

Optimization step 2: instead of writing to the DB, write to a CSV file and then import it into MySQL using a Linux command.

This step does not seem to be working. Writing 30,000 rows to the CSV file took 16 minutes, and we still need to import that CSV file into MySQL. We use a single file handle for all write operations.
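For reference, a sketch of the CSV approach with hypothetical paths and a hypothetical table; note that LOCAL INFILE has to be enabled on both the client (the PDO option shown below) and the server:

    <?php
    // Sketch only: stream calculated rows to CSV, then bulk-load them.
    // $calculatedRows, the file path and the table layout are hypothetical.
    $pdo = new PDO(
        'mysql:host=db-server-2;dbname=reporting;charset=utf8mb4',
        'user',
        'pass',
        [PDO::MYSQL_ATTR_LOCAL_INFILE => true]   // required for LOAD DATA LOCAL INFILE
    );

    $fh = fopen('/tmp/report_rates.csv', 'w');
    foreach ($calculatedRows as $row) {
        fputcsv($fh, $row);                      // one write per row through a single file handle
    }
    fclose($fh);

    // Bulk import; usually far faster than row-by-row INSERTs.
    $pdo->exec(
        "LOAD DATA LOCAL INFILE '/tmp/report_rates.csv'
         INTO TABLE report_rates
         FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
         LINES TERMINATED BY '\\n'"
    );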

Current state

It seems I'm now clueless about what else can be done. Some key requirements:

  • The script needs to insert approximately 50,000,000 records (and this will increase over time).
  • Each record has 97 columns; we can skip some, but 85 columns at the minimum are required.
  • Based on the input, we could break the script into three different crons running on three different servers, but the insert has to be done on one DB server (the master), so I am not sure whether that would help.

However:

  • We are open to changing the database/storage engine (including NoSQL).
  • In production we could have multiple database servers, but the insert has to be done on the master only. All read operations can be directed to slaves; these are minimal and occasional (just to generate reports).

Question

I don't need a descriptive answer, but can someone briefly suggest a possible solution? I just need some optimization hints and I will do the remaining R&D.

We are open to everything: changing the database/storage engine, server optimization / multiple servers (both DB and application), changing the programming language, or whatever is the best configuration for the above requirements.

Final expectation: the cron must finish within 24 hours at most.

Edit regarding optimization step 2

To further understand why generating the CSV is taking so long, I created a replica of my code with only the necessary parts. That code is on GitHub: https://github.com/kapilsharma/xz

The output file of the experiment is https://github.com/kapilsharma/xz/blob/master/csv/output500000_batch5000.txt

If you check the above file, you will see that I am inserting 500,000 records and fetching 5,000 records from the database at a time, so the loop runs 100 times. The first loop iteration took 0.25982284545898 seconds, but the 100th took 3.9140808582306 seconds. I assume this is because of system resources and/or the file size of the CSV file. In that case it becomes more of a programming question than a DB optimization one. Still, can someone suggest why it is taking more time in later loops?

If needed, the whole code is committed except the CSV files and the SQL file generated to create the dummy DB, as these files are very big. However, they can easily be generated with the code.

I had a mailer cron job in CakePHP which failed on merely fetching 600 rows and sending email to the registered users. It couldn't even perform the job in batch operations. We finally opted for Mandrill, and since then everything has gone well.

I'd suggest (considering it a bad idea to touch the legacy system in production):

  • Build a micro solution in Golang or Node.js, considering performance benchmarks; as database interaction is involved, you'll be fine with either of these. Have this micro solution perform the cron job (fetch + calculate).
  • Reporting from NoSQL will be challenging, so you should try out available services like Google BigQuery. Have the cron job store the data in BigQuery and you should get a huge performance improvement, even when generating reports.

or

  • For each row inserted into your original DB server 1, set up a messaging mechanism that performs the operations of the cron job every time an insert is made (a sort of trigger) and stores the result in your reporting server. Possible services you can use are Google Pub/Sub or Pusher. I think the per-insert time consumption will be pretty low. (You can also use an async service setup to do the task of storing into the reporting database.)

Hope this helps.

Using OFFSET and LIMIT to walk through a table is O(N*N), which is much slower than you want or expect.

Instead, walk through the table "remembering where you left off". It is best to use the PRIMARY KEY for that. Since the id looks like an AUTO_INCREMENT without gaps, the code is simple. My blog discusses this (and more complex chunking techniques).

It won't be a full 100 (500K/5K) times as fast, but it will be noticeably faster.
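A minimal sketch of that "remember where you left off" approach, assuming id is the AUTO_INCREMENT primary key (other names are illustrative). Each query seeks directly to the next chunk instead of rescanning all previously skipped rows:

    <?php
    // Sketch only: keyset pagination instead of LIMIT/OFFSET.
    // $pdo is a PDO connection as in the earlier sketches.
    $batchSize = 5000;
    $lastId    = 0;
    $stmt = $pdo->prepare('SELECT id, rate_a, rate_b FROM source_rates
                           WHERE id > :last_id ORDER BY id LIMIT :lim');

    do {
        $stmt->bindValue(':last_id', $lastId, PDO::PARAM_INT);
        $stmt->bindValue(':lim', $batchSize, PDO::PARAM_INT);
        $stmt->execute();
        $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

        if ($rows) {
            $last   = end($rows);
            $lastId = (int) $last['id'];     // remember where we left off
            // process / write this chunk ...
        }
    } while (count($rows) === $batchSize);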

This is a very broad question. I'd start by working out what the bottleneck is with the "insert" statement. Run the code and use whatever your operating system gives you to see what the machine is doing.

If the bottleneck is CPU, you need to find the slowest part and speed it up. That's unlikely, given your sample code, but possible.

If the bottleneck is I/O or memory, you're almost certainly going to need either better hardware or a fundamental redesign.

The obvious way to redesign this is to find a way to handle only the deltas in the 50M records. For instance, if you can write to an audit table whenever a record changes, your cron job can look at that audit table and pick out any data that was modified since the last batch run.
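A sketch of what a delta-only run could look like, assuming a hypothetical audit_log table that is populated whenever a source record changes (all names here are illustrative):

    <?php
    // Sketch only: process just the rows changed since the last batch run.
    // $pdo is a PDO connection; 'audit_log' and its columns are hypothetical.
    // Take the watermark before querying so concurrent changes are not missed.
    $watermarkFile = '/var/run/report_cron.last';
    $newWatermark  = date('Y-m-d H:i:s');
    $lastRun = file_exists($watermarkFile)
        ? trim(file_get_contents($watermarkFile))
        : '1970-01-01 00:00:00';

    $stmt = $pdo->prepare('SELECT DISTINCT record_id FROM audit_log WHERE changed_at > :last_run');
    $stmt->execute([':last_run' => $lastRun]);
    $changedIds = $stmt->fetchAll(PDO::FETCH_COLUMN);

    // Re-fetch, recalculate and re-insert only these records,
    // then persist the new watermark for the next run.
    file_put_contents($watermarkFile, $newWatermark);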
