
MySQL check large amount of data

I have two tables, named actual and check.

Table actual contains 50 million rows, and each row contains a 32-bit hash.

Table check contains 10 million rows, and each row contains a 32-bit hash.

I have to verify whether the hashes from the check table are in the actual table.

I tried a MySQL JOIN query like this:

SELECT *
  FROM `check`
  LEFT JOIN `actual`
    ON `check`.hash = `actual`.hash;

Even on a machine with 16 GB of RAM, MySQL crashes.

I also tried a PHP script, after adding fields to table check so that its fields are hash, status, and found.

status and found default to 0; the PHP script checks each record and sets status to 1, and found to 1 if the hash is found.

Is there a faster way to check millions of records?

Another approach I have tried is INSERT IGNORE on unique hashes and counting how many rows were not inserted, but it is a complex process.

The PHP code I am using is below, but it is very slow:

$sql = "SELECT * FROM `check` where status = 0 LIMIT 0, 1";
$result = $conn->query($sql);

if ($result->num_rows > 0) {
  while($row = $result->fetch_assoc()) {

    $check = "SELECT * FROM `actual` where hash = '".$row["hash"]."'";
    $checkx = $conn->query($check);

    $checky = "UPDATE `check` SET `status` = 1, `found` = 0 WHERE hash = '".$row["hash"]."'";
    $conn->query($checky);
    if ($checkx->num_rows > 0) {
      $checky = "UPDATE `check` SET `status` = 1, `found` = 1 WHERE hash = '".$row["hash"]."'";
      $conn->query($checky);
    }
  }
}

If I've understood you right, a sub-query is all you need:

UPDATE `check` SET status=1, found=1 WHERE hash IN (SELECT hash FROM actual)

I don't have enough data to do a meaningful performance comparison - try it and see.

Edit: With a clearer idea of the requirement gleaned by looking at the PHP solution, here's an updated query:

UPDATE `check` SET status=1, found=(hash IN (SELECT hash FROM actual)) WHERE status=0

Note:

  • It's important that actual.hash is indexed, or searching the actual table will take an age.
  • Depending on the balance between checked and unchecked rows in check, it might be worth indexing check.status too. If most rows are unchecked there will be no benefit, but it could work well if there are only a few unchecked ones. On the other hand, the UPDATE changes status, so that index has to be maintained on every write, which could be significantly slower. You'd need to experiment with your data set to find out.
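
As a rough sketch of the indexing advice above: the index names idx_hash and idx_status are made up for illustration, and the statements assume neither index exists yet. Building an index on a 50-million-row table can itself take a while, so this is something to try on a copy first.

```sql
-- Index the lookup column so each hash probe into actual can use the index.
-- idx_hash is a hypothetical name.
ALTER TABLE `actual` ADD INDEX idx_hash (hash);

-- Optional: only likely to help when few rows in check still have status = 0.
-- idx_status is a hypothetical name.
ALTER TABLE `check` ADD INDEX idx_status (status);

-- Look at the plan before running the real UPDATE
-- (EXPLAIN on an UPDATE needs MySQL 5.6 or later).
EXPLAIN UPDATE `check`
   SET status = 1, found = (hash IN (SELECT hash FROM actual))
 WHERE status = 0;
```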

Use a multi-table UPDATE instead of IN ( SELECT... )
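
A minimal sketch of what such a multi-table UPDATE might look like, assuming the hash, status and found columns described in the question and an index on actual.hash. The aliases c and a are only for readability.

```sql
-- LEFT JOIN keeps every unchecked row of check; a.hash is NULL when
-- there is no matching hash in actual.
UPDATE `check` AS c
  LEFT JOIN `actual` AS a
    ON a.hash = c.hash
   SET c.status = 1,
       c.found  = (a.hash IS NOT NULL)   -- evaluates to 1 if matched, else 0
 WHERE c.status = 0;
```

Whether this beats the IN ( SELECT ... ) form depends on the MySQL version and the indexes, so it is worth comparing the two with EXPLAIN.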

Also

  • What version of MySQL?
  • Please provide SHOW CREATE TABLE. We need to see the engine, indexes, datatypes, etc.
  • SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

What do you mean by "crashing"? Reboot? mysqld died? Or simply that the query took forever?

Once we have optimized the query, if it is still too slow, I will show you how to do it in stages. That will probably involve writing the SQL directly rather than going through PHP.
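
One possible reading of "in stages" is to run the update in bounded chunks, so that no single statement touches all 10 million rows. The sketch below is an assumption, not the answerer's method: it supposes that check has an integer primary key named id (not mentioned in the question), and the chunk size of 100000 is arbitrary.

```sql
-- Process one slice of check by primary-key range.
-- id is a hypothetical integer primary key; adjust the bounds per chunk.
UPDATE `check` AS c
  LEFT JOIN `actual` AS a
    ON a.hash = c.hash
   SET c.status = 1,
       c.found  = (a.hash IS NOT NULL)
 WHERE c.id >= 1
   AND c.id <  100001;

-- Repeat with the next range (100001 to 200000, and so on)
-- until the highest id in check has been covered.
```

Each chunk is a short statement, which keeps locks and undo space small and lets the job be stopped and resumed.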


 