
Optimizing insertion of data into MySQL

function generateRandomData() {
    @$db = new mysqli('localhost', 'XXX', 'XXX', 'scores');
    if (mysqli_connect_errno()) {
        echo 'Failed to connect to database. Please try again later.';
        exit;
    }

    $query = "insert into scoretable values(?,?,?)";

    for ($a = 0; $a < 1000000; $a++) {
        $stmt = $db->prepare($query);

        $id = rand(1, 75000);
        $score = rand(1, 100000);
        $time = rand(1367038800, 1369630800);

        $stmt->bind_param("iii", $id, $score, $time);
        $stmt->execute();
    }
}

I am trying to populate a data table in MySQL with a million rows of data. However, this process is extremely slow. Is there anything obvious I'm doing wrong that I could fix to make it run faster?

You are doing several things wrong. The first thing you have to take into account is which MySQL storage engine you're using.

The default one is InnoDB; previously the default engine was MyISAM.

I'll write this answer under the assumption that you're using InnoDB, which you should be using for a plethora of reasons. InnoDB operates in something called autocommit mode. That means that every query you make is wrapped in a transaction. To translate that into a language that us mere mortals can understand - every query you run without an explicit BEGIN WORK; block is its own transaction - ergo, MySQL will wait until the hard drive confirms the data has been written before it returns.
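
To make that concrete, here is a minimal sketch (not the asker's exact code; it assumes the same scoretable(id, score, time) layout and credentials as in the question) of one insert committed implicitly by autocommit, followed by two inserts sharing a single explicit transaction:

$db = new mysqli('localhost', 'XXX', 'XXX', 'scores');

$stmt = $db->prepare("insert into scoretable values (?,?,?)");
$id = 1; $score = 50; $time = 1367038800;
$stmt->bind_param("iii", $id, $score, $time);

// Autocommit (the default): this execute() is its own transaction,
// so MySQL waits for the disk flush before returning.
$stmt->execute();

// Explicit transaction: both executes share a single flush at commit time.
$db->begin_transaction();
$id = 2;
$stmt->execute();
$id = 3;
$stmt->execute();
$db->commit();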

Knowing that hard drives are slow (and mechanical ones are still the most widely used), your inserts will only be as fast as the hard drive is. A mechanical hard drive can usually perform about 300 input/output operations per second, so assuming you can do about 300 inserts a second - yes, you'll be waiting quite a while to insert 1 million records.

So, knowing how things work - you can use them to your advantage.

The amount of data that the HDD will write per transaction will be generally very small (4KB or even less), and knowing today's HDDs can write over 100MB/sec - that indicates that we should wrap several queries into a single transaction.

That way MySQL will send quite a bit of data and wait for the HDD to confirm it wrote everything and that the whole world is fine and dandy.

So, assuming you have 1M rows you want to populate - you'll execute 1M queries. If each transaction wraps 1000 queries before committing, you only perform about 1000 write (flush) operations instead of 1 million.

That way, your code becomes something like this:

(I am not familiar with the mysqli interface, so function names might be wrong, and since I'm typing this without actually running the code, the example might not work - use it at your own risk.)

function generateRandomData()
{
    $db = new mysqli('localhost','XXX','XXX','scores');

    if(mysqli_connect_errno()) {
        echo 'Failed to connect to database. Please try again later.';
        exit;
    }

    $query = "insert into scoretable values(?,?,?)";

    // We prepare ONCE, that's the point of prepared statements
    $stmt = $db->prepare($query);

    $start = 0;
    $top = 1000000;

    for($a = $start; $a < $top; $a++)
    {
        // If this is the very first iteration, start the transaction
        if($a == 0)
        {
            $db->begin_transaction();
        }
        $id = rand(1,75000);

        $score = rand(1,100000);

        $time = rand(1367038800 ,1369630800);

        $stmt->bind_param("iii",$id,$score,$time);
        $stmt->execute();


        // Commit on every thousandth query
        if( ($a % 1000) == 0 && $a != ($top - 1) )
        {
            $db->commit();

            $db->begin_transaction();
        }

        // If this is the very last query, then we just need to commit and end
        if($a == ($top - 1) )
        {
            $db->commit();
        }
    }
}

DB querying involves many interrelated tasks, which makes it an 'expensive' process - even more so when it comes to inserts and updates.

Running a single query is the best way to enhance performance.

You can build the statement up inside the loop and run it once.

e.g.

$query = "insert into scoretable values ";
for($a = 0; $a < 1000000; $a++)
{
$values = " ('".$?."','".$?."','".$?."'), ";
$query.=$values;
...


}
...
//remove the last comma
...
$stmt = $db->prepare($query);
...
$stmt->execute();

As hinted in the comments, you need to reduce the number of queries by concatenating as many inserts as possible together. In PHP, it is easy to achieve that:

$query = "insert into scoretable values";
for($a = 0; $a < 1000000; $a++) {
    $id = rand(1,75000);
    $score = rand(1,100000);
    $time = rand(1367038800 ,1369630800);
    $query .= "($id, $score, $time),";
}
$query[strlen($query)-1]= ' '; // overwrite the trailing comma with a space

There is a limit on the maximum size of the queries you can execute, which is directly related to the max_allowed_packet server setting (this page of the MySQL documentation describes how to tune that setting to your advantage).

Therefore, you will have to reduce the loop count above to reach an appropriate query size, and repeat the process until the total number of rows you want has been inserted, by wrapping that code in another loop, as sketched below.
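
As a rough illustration (a sketch only - the chunk size of 10000 is an assumption chosen to stay well below a typical max_allowed_packet, and the table layout is taken from the question):

$db = new mysqli('localhost', 'XXX', 'XXX', 'scores');

$total = 1000000;
$chunk = 10000;

for ($done = 0; $done < $total; $done += $chunk) {
    $query = "insert into scoretable values";
    for ($a = 0; $a < $chunk; $a++) {
        $id    = rand(1, 75000);
        $score = rand(1, 100000);
        $time  = rand(1367038800, 1369630800);
        $query .= "($id, $score, $time),";
    }
    // drop the trailing comma, then send one multi-row insert per chunk
    $query = rtrim($query, ',');
    $db->query($query);
}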

Another practice is to disable check constraints on the table you wish to do bulk insert:

ALTER TABLE yourtablename DISABLE KEYS;
SET FOREIGN_KEY_CHECKS=0;
-- bulk insert comes here
SET FOREIGN_KEY_CHECKS=1;
ALTER TABLE yourtablename ENABLE KEYS;
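
If you drive this from PHP, the same statements can simply be issued around the bulk insert (a sketch, assuming $db is the mysqli connection and $query is the multi-row insert built above, using the question's scoretable):

$db->query("ALTER TABLE scoretable DISABLE KEYS");
$db->query("SET FOREIGN_KEY_CHECKS=0");

$db->query($query); // the bulk insert

$db->query("SET FOREIGN_KEY_CHECKS=1");
$db->query("ALTER TABLE scoretable ENABLE KEYS");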

This practice however must be done carefully, especially in your case, since you generate the values randomly. If any of the columns you generate is covered by a unique key, you cannot use that technique with your query as it is, as it may produce duplicate key inserts. You probably want to add an IGNORE clause to it:

$query = "insert INGORE into scoretable values";

This will cause the server to silently ignore duplicate entries on unique keys. To reach the total number of required inserts, just loop as many times as needed to fill up the remaining missing rows.

I suppose that the only place where you could have a unique key constraint is on the id column. In that case, you will never be able to reach the number of rows you wish to have, since 1,000,000 is way above the range of random values you generate for that field (1 to 75,000). Consider raising that limit, or better yet, generate your ids differently (perhaps simply by using a counter, which will make sure every record uses a different key).
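
A minimal sketch of the counter approach (hypothetical; it reuses the multi-row insert technique above, with only score and time left random):

$query = "insert into scoretable values";
for ($a = 1; $a <= 10000; $a++) {
    $score = rand(1, 100000);
    $time  = rand(1367038800, 1369630800);
    $query .= "($a, $score, $time),"; // the loop counter doubles as a unique id
}
$query = rtrim($query, ',');
$db->query($query); // assumes $db is an open mysqli connection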

Have a look at this gist I've created. It takes about 5 minutes to insert a million rows on my laptop.
