
How do I guarantee that I get 1000 rows from a SELECT query? MySQL

Questions with answers that instruct on the use of LIMIT are not working for my situation, but it could be I misunderstand the limitations and imperfections of LIMIT. Or, maybe I am doing the query wrong, which makes this question legitimate.

I have a table called "child_pages" which contains a field called "url", the value of which is a url that should be scraped. Upon scraping the page belonging to that url, the resulting content html is stored in a field called "content". The child_pages table has 200,000 records.

The table also has "scanned" and "processed" fields, both tinyint, where "1" means yes: the row was scanned, or the row was processed.

One script, which I have set up as a local service (Windows) will read through the child_pages table and read the value from the url field, then perform the scrape, and finally store the resulting html into the content field. When this is done, the "scanned" field will be marked "1".

Now another script is also running separately, which queries the child_pages table looking for all records that are scanned='1', but processed='0'. From that result set I'm going to read the html value of the content field from the non-processed records, finally doing something with the data I extract from the "content" field html.

This is my query:

$sql = "SELECT id,content FROM child_pages WHERE scanned='1' AND processed='0' LIMIT 1000";

I've noticed that the processing is extremely slow: only one to a few records are processed every five seconds. How can that be, I thought, when I'm selecting 1000 rows at a time?

So I output a loop counter from inside the while loop ($lookedat in the code below), and found that it does not reach 1000 but rather something like 60.

I have queried the child_pages table and found that 95% of the records are processed='0', so there are plenty of records to accommodate a LIMIT 1000.

Is there a way to force my query to return 1000 rows?
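
Note that LIMIT 1000 is only an upper bound: the query returns at most 1000 rows, and fewer when fewer rows satisfy the WHERE clause. Counting processed='0' alone does not show how many rows are also scanned='1', so one possible explanation is that only about 60 rows matched both conditions at that moment. A minimal sketch of that check follows. It uses an in-memory SQLite database through PDO (an assumption, so that it runs without a MySQL server), and the sample rows are made up for illustration; the COUNT(*) queries themselves should work the same way on MySQL.

```php
<?php
// Hypothetical stand-in for the real table: in-memory SQLite instead of MySQL.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec("CREATE TABLE child_pages (
    id INTEGER PRIMARY KEY,
    content TEXT,
    scanned INTEGER NOT NULL DEFAULT 0,
    processed INTEGER NOT NULL DEFAULT 0
)");

// Made-up sample: 100 rows, none processed, only the first 6 scanned so far.
$ins = $db->prepare("INSERT INTO child_pages (content, scanned) VALUES (?, ?)");
for ($i = 1; $i <= 100; $i++) {
    $ins->execute(['<html></html>', $i <= 6 ? 1 : 0]);
}

// Rows that are merely unprocessed.
$unprocessed = (int) $db->query(
    "SELECT COUNT(*) FROM child_pages WHERE processed = 0"
)->fetchColumn();

// Rows that satisfy BOTH conditions of the original query.
$eligible = (int) $db->query(
    "SELECT COUNT(*) FROM child_pages WHERE scanned = 1 AND processed = 0"
)->fetchColumn();

// LIMIT caps the result; it cannot produce rows that do not match.
$rows = $db->query(
    "SELECT id, content FROM child_pages WHERE scanned = 1 AND processed = 0 LIMIT 1000"
)->fetchAll(PDO::FETCH_ASSOC);

echo "unprocessed: $unprocessed, eligible: $eligible, returned: " . count($rows) . "\n";
```

If the second count is close to the number of rows the loop sees, the query is behaving as written and the bottleneck is how quickly the scanner marks rows as scanned.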

Full query loop:

$start = "<div id=\"detailtable\">";
$stop = "</table></td></tr></table></div>";
$sql = "SELECT id,content FROM child_pages WHERE scanned='1' AND processed='0' LIMIT 1000";
$stmt = $db->query($sql);
$new = 0;
$lookedat = 0;
while ($row = $stmt->fetch(PDO::FETCH_ASSOC)){
  $lookedat++;
  $content = $row['content'];
  $cid = $row['id'];
  $mark1 = strpos($content,$start);
  $mark2 = strpos($content,$stop);
  //echo $mark1 . ", " . $mark2;
  $segment = substr( $content,$mark1, ($mark2 - $mark1) + strlen($stop) );
  $doc = new simple_html_dom($segment);
  if ( ! is_null($doc->find("div[id=detailtable]", 0)) ){
    $detailtable = $doc->find("div[id=detailtable]", 0);
    if (is_object($detailtable)){ // find(..., 0) returns a single node or null, so count() does not apply here
        $e = $detailtable->children();
        $children = $e[0]->find('.data');
        $count = 0;
        $insert['processed_thru']       = trim($children[0]->plaintext);
        $insert['document_number_j']    = trim($children[1]->plaintext);
        $insert['status']               = trim($children[2]->plaintext);
        $insert['case_number']          = trim($children[3]->plaintext);
        $insert['name_of_court']        = trim($children[4]->plaintext);
        $insert['file_date']            = trim($children[5]->plaintext);
        $insert['date_of_entry']        = trim($children[6]->plaintext);
        $insert['expiration_date']      = trim($children[7]->plaintext);
        $insert['amount_due']           = trim(str_replace("$","",$children[8]->plaintext));
        $insert['interest_rate']        = trim($children[9]->plaintext);
        $insert['plaintiff']            = trim($children[10]->plaintext);

        $insert['defendant'] = "";

        for($iii=11;$iii<count($children) ;$iii++){
            $insert['defendant'] .= trim($children[$iii]->plaintext);
        }

        // strpos() returns 0 for a match at the start of the string, so compare with === false
        if( $insert['status'] !== "TERMINATED" &&
            strpos($insert['plaintiff'],"STATE OF FLORIDA") === false &&
            strpos($insert['plaintiff'],"DEPARTMENT OF REVENUE") === false &&
            strpos($insert['plaintiff'],"DEPARTMENT OF ENVIRONMENTAL PROTECTION") === false){

            //net elements here

            /*echo "<pre>";
            print_r($insert);*/

            // table: cases2 columns:  id,processed_thru,document_number_j,status,case_number,name_of_court,file_date,date_of_entry,expiration_date,amount_due,interest_rate,plaintiff,defendant
            $colstring = "processed_thru,document_number_j,status,case_number,name_of_court,file_date,date_of_entry,expiration_date,amount_due,interest_rate,plaintiff,defendant";
            $prepareColString = ":processed_thru,:document_number_j,:status,:case_number,:name_of_court,:file_date,:date_of_entry,:expiration_date,:amount_due,:interest_rate,:plaintiff,:defendant";
            $table = "cases";

            foreach($insert as $k=>$v){
                ${"$k"} = trim(preg_replace( '/\h+/', ' ', $v ));
            }

            $stmt2 = $db->prepare("INSERT INTO $table ($colstring) VALUES ($prepareColString)");
            $stmt2->bindParam(':document_number_j', $document_number_j);
            $stmt2->bindParam(':processed_thru', $processed_thru);
            $stmt2->bindParam(':status', $status);
            $stmt2->bindParam(':case_number', $case_number);
            $stmt2->bindParam(':name_of_court', $name_of_court);
            $stmt2->bindParam(':file_date', $file_date);
            $stmt2->bindParam(':date_of_entry', $date_of_entry);
            $stmt2->bindParam(':expiration_date', $expiration_date);
            $stmt2->bindParam(':amount_due', $amount_due);
            $stmt2->bindParam(':interest_rate', $interest_rate);
            $stmt2->bindParam(':plaintiff', $plaintiff);
            $stmt2->bindParam(':defendant', $defendant);
            $stmt2->execute();

            $new++;
        }
    }
  }
  $processed = 1;
  $stmt3 = $db->prepare("UPDATE child_pages SET processed=:processed WHERE id=:id");
  $stmt3->bindParam(':id', $cid);
  $stmt3->bindParam(':processed', $processed);
  $stmt3->execute();
}
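
Part of the per-row cost in the loop above may come from calling prepare() for every row and, on a transactional engine such as InnoDB with autocommit on, committing each INSERT and UPDATE separately. A rough sketch of preparing the statements once and wrapping the batch in a single transaction follows. It is not a drop-in replacement: the HTML parsing is replaced by a placeholder, the cases table is reduced to two columns, and an in-memory SQLite database stands in for MySQL so that the snippet runs by itself. All of that is assumed for illustration, and whether it helps in practice depends on where the time is actually spent.

```php
<?php
// Hypothetical stand-in database: in-memory SQLite instead of MySQL.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec("CREATE TABLE child_pages (id INTEGER PRIMARY KEY, content TEXT, scanned INTEGER, processed INTEGER)");
$db->exec("CREATE TABLE cases (id INTEGER PRIMARY KEY, status TEXT, plaintiff TEXT)");

// Made-up rows; 'status|plaintiff' replaces the scraped HTML for brevity.
$seed = $db->prepare("INSERT INTO child_pages (content, scanned, processed) VALUES (?, 1, 0)");
foreach (['ACTIVE|ACME BANK', 'TERMINATED|ACME BANK', 'ACTIVE|STATE OF FLORIDA'] as $c) {
    $seed->execute([$c]);
}

// Fetch the whole batch first so the SELECT is finished before any writes start.
$rows = $db->query(
    "SELECT id, content FROM child_pages WHERE scanned = 1 AND processed = 0 ORDER BY id LIMIT 1000"
)->fetchAll(PDO::FETCH_ASSOC);

// Prepare each statement once, outside the loop, and reuse it for every row.
$insertCase = $db->prepare("INSERT INTO cases (status, plaintiff) VALUES (:status, :plaintiff)");
$markDone   = $db->prepare("UPDATE child_pages SET processed = 1 WHERE id = :id");

$new = 0;
$db->beginTransaction();
try {
    foreach ($rows as $row) {
        // Placeholder for the real HTML parsing step.
        [$status, $plaintiff] = explode('|', $row['content']);

        // strpos() returns 0 for a match at position 0, hence the strict === false.
        if ($status !== 'TERMINATED' && strpos($plaintiff, 'STATE OF FLORIDA') === false) {
            $insertCase->execute([':status' => $status, ':plaintiff' => $plaintiff]);
            $new++;
        }
        $markDone->execute([':id' => $row['id']]);
    }
    $db->commit();   // one commit for the whole batch
} catch (Throwable $e) {
    $db->rollBack(); // leave the batch unprocessed if anything fails
    throw $e;
}

echo "looked at: " . count($rows) . ", new: $new\n";
```

Timing the parsing step and the database step separately would show which of the two dominates before restructuring the real script.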

Accessory data:

RECORDS SCANNED : 60

NEW CASE RECORDS : 8

COMPUTATIONS IN ms : 422
SYSTEM CALLS IN ms : 15

Total execution time in seconds: 129.66131019592

Code that outputs the accessory data (the timing setup is placed at the top of the script):

// At start of script
$time_start = microtime(true);
$rustart = getrusage();

function rutime($ru, $rus, $index) {
    return ($ru["ru_$index.tv_sec"]*1000 + intval($ru["ru_$index.tv_usec"]/1000))
 -  ($rus["ru_$index.tv_sec"]*1000 + intval($rus["ru_$index.tv_usec"]/1000));
}


echo "<p>RECORDS SCANNED : $lookedat </p>";
echo "<p>NEW CASE RECORDS : $new </p>";

$ru = getrusage();

echo "<p>COMPUTATIONS IN ms : " . rutime($ru, $rustart, "utime") . "</p>";
echo "<p>SYSTEM CALLS IN ms : " . rutime($ru, $rustart, "stime") . "</p>";

// Anywhere else in the script
echo '<p>Total execution time in seconds: ' . (microtime(true) - $time_start) . "</p>";

If your goal is to show all 1000 rows on one page, you could load them in portions: show about 100 rows when the page opens, then load more via AJAX as the user scrolls down.

1st portion

$sql = "SELECT id,content FROM child_pages WHERE scanned='1' AND processed='0' LIMIT 0,100";

2nd portion

$sql = "SELECT id,content FROM child_pages WHERE scanned='1' AND processed='0' LIMIT 100,100";

etc...
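
MySQL's two-argument form is LIMIT offset, row_count, so each portion of 100 rows keeps the count at 100 and advances only the offset. A small sketch of building those clauses follows. The helper name pageLimit is hypothetical, and the ORDER BY id is an added assumption, since without an ORDER BY the order of rows from one portion to the next is not guaranteed.

```php
<?php
// Hypothetical helper: returns the "LIMIT offset, row_count" clause for a 1-based page number.
function pageLimit(int $page, int $perPage = 100): string
{
    if ($page < 1 || $perPage < 1) {
        throw new InvalidArgumentException('page and perPage must be >= 1');
    }
    // The offset is the number of rows to skip; the count stays the same on every page.
    $offset = ($page - 1) * $perPage;
    return "LIMIT $offset, $perPage";
}

$base = "SELECT id,content FROM child_pages WHERE scanned='1' AND processed='0' ORDER BY id ";

// Print the queries for the first three portions.
foreach ([1, 2, 3] as $page) {
    echo $base . pageLimit($page) . "\n";
}
```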

Hope it helped.
