Copying images from a live server to local

Question

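I have about 600k image URLs spread across different tables, and I am downloading all the images with the code below, which works fine. (I know FTP would be the best option, but for some reason I cannot use it.)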

$queryRes = mysql_query("SELECT url FROM tablName LIMIT 50000"); // everytime I am using LIMIT
while ($row = mysql_fetch_object($queryRes)) {
    $info = pathinfo($row->url);
    $fileName = $info['filename'];
    $fileExtension = $info['extension'];

    try {
        copy("http:".$row->url, "img/$fileName"."_".$row->id.".".$fileExtension);
    } catch(Exception $e) {
        echo "<br/>\n unable to copy '$fileName'. Error:$e";
    }
}

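The problems are:

After some time, say 10 minutes, the script gives a 503 error, but it still continues downloading images. Why? Shouldn't it stop copying?
It also does not download all the images; each time there is a difference of 100 to 150 images. How can I trace which images were not downloaded?

I hope I have explained this well.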

Answer 1

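First, copy() does not throw any exception, so you are not doing any error handling. That is why your script keeps running.

Second, you should use file_get_contents or, even better, cURL.

For example, you can try this function (I know, it opens and closes cURL every time; it is just an example I found here: https://stackoverflow.com/a/6307010/1164866)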

function getimg($url) {
    $headers = array();
    $headers[] = 'Accept: image/gif, image/x-bitmap, image/jpeg, image/pjpeg';
    $headers[] = 'Connection: Keep-Alive';
    $headers[] = 'Content-type: application/x-www-form-urlencoded;charset=UTF-8';
    $user_agent = 'php';
    $process = curl_init($url);
    curl_setopt($process, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($process, CURLOPT_HEADER, 0);
    curl_setopt($process, CURLOPT_USERAGENT, $user_agent);
    curl_setopt($process, CURLOPT_TIMEOUT, 30);
    curl_setopt($process, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($process, CURLOPT_FOLLOWLOCATION, 1);
    // the image data as a string, or false on failure
    $return = curl_exec($process);
    curl_close($process);
    return $return;
}

Or even try curl_multi_exec and download your files in parallel, which will be much faster.

Take a look here:

http://www.php.net/manual/en/function.curl-multi-exec.php
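
To make the parallel-download suggestion a bit more concrete, here is a minimal sketch of how curl_multi could be used. The function name fetch_parallel and its parameters are my own invention, not something from the question or the linked manual page, and it assumes the URLs are already complete (with the "http:" prefix added).

```php
<?php
// Hypothetical helper: fetch several URLs in parallel with curl_multi.
// Returns an array with the same keys as $urls; each value is the response
// body on success, or false if the transfer failed or was not HTTP 200.
function fetch_parallel(array $urls, $timeout = 30)
{
    $multi = curl_multi_init();
    $handles = array();

    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
        curl_multi_add_handle($multi, $ch);
        $handles[$key] = $ch;
    }

    // Drive all transfers until none is still running.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0) {
            curl_multi_select($multi, 1.0); // wait for activity, no busy loop
        }
    } while ($running > 0 && $status === CURLM_OK);

    // curl_multi_info_read() reports the cURL result code of each
    // finished transfer; remember the ones that did not end with CURLE_OK.
    $failed = array();
    while (($info = curl_multi_info_read($multi)) !== false) {
        if ($info['result'] !== CURLE_OK) {
            $key = array_search($info['handle'], $handles, true);
            if ($key !== false) {
                $failed[$key] = true;
            }
        }
    }

    $results = array();
    foreach ($handles as $key => $ch) {
        $ok = !isset($failed[$key])
            && curl_getinfo($ch, CURLINFO_HTTP_CODE) === 200;
        $results[$key] = $ok ? curl_multi_getcontent($ch) : false;
        curl_multi_remove_handle($multi, $ch);
    }

    return $results;
}
```

You would probably call this with a small slice of URLs at a time (say 10 to 20) rather than all 50000 at once, and write each non-false result to disk with file_put_contents. The keys whose value is false are the images to retry or log.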

Edit:

To track the files that could not be downloaded, you need to do something like this:

$queryRes = mysql_query("SELECT id, url FROM tablName LIMIT 50000"); // I use LIMIT every time; id is selected because the file name needs it
while($row = mysql_fetch_object($queryRes)) {

    $info = pathinfo($row->url);    
    $fileName = $info['filename'];
    $fileExtension = $info['extension'];    

    if (!@copy("http:".$row->url, "img/$fileName"."_".$row->id.".".$fileExtension)) {
       $errors= error_get_last();
       echo "COPY ERROR: ".$errors['type'];
       echo "<br />\n".$errors['message'];
       // you can add whatever code you want here: output to the console, log to a file, or call exit() to stop downloading
    }
}

More info: http://www.php.net/manual/es/function.copy.php#83955

Answer 2

I have not used copy myself; I use file_get_contents, and it works fine with remote servers.

Edit:

It also returns false on failure. So...

if( false === file_get_contents(...) )
    trigger_error(...);
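
To spell out the snippet above, here is a small sketch of a download helper built on file_get_contents. The name save_remote and its two parameters are hypothetical, and it assumes allow_url_fopen is enabled so that file_get_contents can read http:// URLs.

```php
<?php
// Hypothetical helper: copy one remote (or local) file to $dest.
// Returns true on success, false if the read or the write failed.
function save_remote($url, $dest)
{
    // @ suppresses the warning; the strict === check tells a failed read
    // (false) apart from a successfully read empty file ('').
    $data = @file_get_contents($url);
    if ($data === false) {
        return false;
    }
    return file_put_contents($dest, $data) !== false;
}
```

In the download loop you could call save_remote("http:" . $row->url, $target) and append $row->id to a list (or a log file) whenever it returns false, which gives you the set of images that still need to be fetched.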

Answer 3

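I think 50000 is too large. Network access is time-consuming: downloading one image may take more than 100 ms (depending on your network conditions), so 50000 images, even in the most stable case (no timeouts or other errors), may take 50000 * 100 / 1000 / 60 = 83 minutes, which is really long for a script like PHP. If you run this script as CGI (not CLI), you normally get only 30 seconds by default (without set_time_limit). So I suggest making this script a cron job that runs every 10 seconds and fetches about 50 URLs each time (standard cron schedules no finer than one minute, so in practice that means once a minute, or a loop with sleep).
To make the script fetch only some images each time, you have to remember which ones have already been processed (successfully). For example, you can add a flag column to the URL table: by default flag = 1; if the URL is processed successfully it becomes 2, otherwise it becomes 3, which means something went wrong with that URL. Each time, the script selects only the rows with flag = 1 (rows with 3 could be included too, but sometimes the URL itself is wrong, so retrying will not help).
The copy function is too simple. I suggest using cURL; it is more reliable, and you get the full network information about the download.

Here is the code: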

//only fetch 50 urls each time
$queryRes = mysql_query ( "select id, url from tablName where flag=1 limit  50" );

//just prefer absolute path
$imgDirPath = dirname ( __FILE__ ) . '/';

while ( $row = mysql_fetch_object ( $queryRes ) )
{
    $info = pathinfo ( $row->url );
    $fileName = $info ['filename'];
    $fileExtension = $info ['extension'];

    //url in the table is like //www.example.com???
    $result = fetchUrl ( "http:" . $row->url, 
            $imgDirPath . "img/$fileName" . "_" . $row->id . "." . $fileExtension );

    if ($result !== true)
    {
        echo "<br/>\n unable to copy '$fileName'. Error:$result";
        //update flag to 3, finish this func yourself
        set_row_flag ( 3, $row->id );
    }
    else
    {
        //update flag to 2
        set_row_flag ( 2, $row->id );
    }
}

function fetchUrl($url, $saveto)
{
    $ch = curl_init ( $url );

    curl_setopt ( $ch, CURLOPT_FOLLOWLOCATION, true );
    curl_setopt ( $ch, CURLOPT_MAXREDIRS, 3 );
    curl_setopt ( $ch, CURLOPT_HEADER, false );
    curl_setopt ( $ch, CURLOPT_RETURNTRANSFER, true );
    curl_setopt ( $ch, CURLOPT_CONNECTTIMEOUT, 7 );
    curl_setopt ( $ch, CURLOPT_TIMEOUT, 60 );

    $raw = curl_exec ( $ch );

    $error = false;

    if (curl_errno ( $ch ))
    {
        $error = curl_error ( $ch );
    }
    else
    {
        $httpCode = curl_getinfo ( $ch, CURLINFO_HTTP_CODE );

        if ($httpCode != 200)
        {
            $error = 'HTTP code not 200: ' . $httpCode;
        }
    }

    curl_close ( $ch );

    if ($error)
    {
        return $error;
    }

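    // a failed write must not be reported as a successful download
    if (file_put_contents ( $saveto, $raw ) === false)
    {
        return 'unable to write ' . $saveto;
    }

    return true;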
}
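
The answer leaves set_row_flag for the reader to finish. Here is one way it might look, as a sketch only: it uses PDO instead of the mysql_* functions, so it takes the connection as an extra third parameter that the calls above do not pass, and it assumes the table is tablName with integer id and flag columns. The constant names are mine.

```php
<?php
// Flag values following the convention described in the answer.
const FLAG_PENDING = 1; // not processed yet
const FLAG_DONE    = 2; // downloaded successfully
const FLAG_FAILED  = 3; // download failed

// Sketch of set_row_flag(); the PDO handle is an assumed extra parameter.
// Returns true if the UPDATE statement executed.
function set_row_flag($flag, $id, PDO $db)
{
    $stmt = $db->prepare('UPDATE tablName SET flag = ? WHERE id = ?');
    return $stmt->execute(array((int) $flag, (int) $id));
}
```

With this version the calls in the loop would become set_row_flag(FLAG_FAILED, $row->id, $db) and set_row_flag(FLAG_DONE, $row->id, $db), where $db is a PDO connection opened once before the loop.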

Answer 4

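Strictly checking the return value of mysql_fetch_object is better, IMO, because many similar functions may return non-boolean values that evaluate to false when checked loosely (e.g., via !=).
You do not fetch the id attribute in your query. Your code should not work as you wrote it.
You do not define an order for the rows in the result. It is almost always desirable to have an explicit order.
The LIMIT clause causes only a limited number of rows to be processed. If I understand correctly, you want to process all the URLs.
You are using a deprecated API to access MySQL. You should consider using a more modern one. See the database FAQ @ PHP.net. I did not fix this one.
As has been said several times already, copy does not throw; it returns a success indicator.
The variable expansion was clumsy. This one is a purely cosmetic change, though.
To make sure the generated output reaches the user as soon as possible, use flush. If you use output buffering (ob_start etc.), that needs to be handled too.

After applying the fixes, the code now looks like this: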

$queryRes = mysql_query("SELECT id, url FROM tablName ORDER BY id");
while (($row = mysql_fetch_object($queryRes)) !== false) {
    $info = pathinfo($row->url);
    $fn = $info['filename'];
    if (copy(
        'http:' . $row->url,
        "img/{$fn}_{$row->id}.{$info['extension']}"
    )) {
        echo "success: $fn\n";
    } else {
        echo "fail: $fn\n";
    }
    flush();
}

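Problem #2 is solved by this. You will see which files were copied and which were not. If the process (and its output) stops prematurely, you know the ID of the last processed row and can query the database for the higher (unprocessed) ones. Another approach would be to add a boolean column copied to tablName and update it immediately after a file is copied successfully. You may then want to change the query in the code above so that it excludes rows that already have copied = 1.

Problem #1 is addressed in "Long computation in PHP results in 503 error" here on SO and in "503 service unavailable when debugging PHP script in Zend Studio" on SU. I would recommend splitting the large batch into smaller ones, launched at a fixed interval. Cron seems to be the best choice to me. Is there any need to launch this huge batch from the browser? It will run for a very long time.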
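
As an illustration of "query the database for the higher IDs" combined with smaller batches, here is a sketch of a query helper. It is not part of the fixed code above: it uses PDO rather than the deprecated mysql_* API, and the function name next_batch, its parameters, and the batch size are assumptions of mine.

```php
<?php
// Sketch: return the next $limit rows whose id is greater than $lastId,
// in id order, so a run can resume where the previous one stopped.
function next_batch(PDO $db, $lastId, $limit = 50)
{
    $stmt = $db->prepare(
        'SELECT id, url FROM tablName WHERE id > :last ORDER BY id LIMIT :lim'
    );
    // Bind as integers so LIMIT receives a number, not a quoted string.
    $stmt->bindValue(':last', (int) $lastId, PDO::PARAM_INT);
    $stmt->bindValue(':lim', (int) $limit, PDO::PARAM_INT);
    $stmt->execute();
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}
```

A cron-driven script could store the id of the last row it handled (in a small file or a one-row table), call next_batch with that value on the next run, and stop once the function returns an empty array.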

Answer 5

It is better to handle this batch by batch.

The actual script: table structure

CREATE TABLE IF NOT EXISTS `images` (
  `id` int(60) NOT NULL AUTO_INCREMENT,
  `link` varchar(1024) NOT NULL,
  `status` enum('not fetched','fetched') NOT NULL DEFAULT 'not fetched',
  `timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
);

Script

<?php
// how many images to download in one go?
$limit = 100;
/* if set to true, the scraper reloads itself. Good for running on localhost without cron job support. Just keep the browser open and the script runs by itself ( javascript is needed) */ 
$reload = false;
// to prevent php timeout
set_time_limit(0);
// db connection ( you need pdo enabled)   
try {
    $host = 'localhost';
    $dbname = 'mydbname';
    $user = 'root';
    $pass = '';
    $DBH = new PDO("mysql:host=$host;dbname=$dbname", $user, $pass);
    $DBH->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
} catch (PDOException $e) {
    // without a connection nothing below can work, so stop here
    die($e->getMessage());
}

// get n number of images that are not fetched
$query = $DBH->prepare("SELECT * FROM images WHERE  status = 'not fetched' LIMIT {$limit}");
$query->execute();
$files = $query->fetchAll();
// if no result, don't run
if(empty($files)){
    echo 'All files have been fetched!!!';
    die();
}
// where to save the images?
$savepath = dirname(__FILE__).'/scrapped/';
// fetch 'em!
foreach ($files as $file) {
    // get_url_content uses curl. Function defined later on
    $content = get_url_content($file['link']);
    // download failed? skip it; the row stays 'not fetched' for the next run
    if ($content === false) {
        continue;
    }
    // get the file name from the url. You can use a random name too.
    /* assuming the image url as http:// abc . com/images/myimage.png , if we explode the string by /, the last element of the exploded array would have the filename */
    $url_parts_array = explode('/', $file['link']);
    $filename = $url_parts_array[count($url_parts_array) - 1];
    // save fetched image
    file_put_contents($savepath . $filename, $content);
    // did the image save?
    if (file_exists($savepath . $filename)) {
        // yes? Okay, let's save the status
        $query = $DBH->prepare("UPDATE images SET status = 'fetched' WHERE id = ?");
        $query->execute(array($file['id']));
        // output the name of the file that just got downloaded
        echo $file['link'];
        echo '<br/>';
    }
}

// function definition get_url_content()
function get_url_content($url) {
    // ummm let's make our bot look like human
    $agent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_VERBOSE, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    curl_setopt($ch, CURLOPT_URL, $url);
    // response body as a string, or false on failure
    $content = curl_exec($ch);
    curl_close($ch);
    return $content;
}
//reload enabled? Reload!
if($reload)
    echo '<script>location.reload(true);</script>';

Answer 6

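503 is a fairly generic error, and in this case it probably means that something timed out. This could be your web server, a proxy somewhere along the way, or even PHP.

You need to identify which component is timing out. If it is PHP, you can use set_time_limit.

Another option might be to break the work up so that each request processes only one file and then redirects back to the same script to continue with the rest. You would have to maintain, somehow, a list of which files have been processed between calls. Or process in order of database ID and pass the last used ID to the script when you redirect.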
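
A rough sketch of the "pass the last used ID when you redirect" idea follows. The query parameter name last_id and both function names are hypothetical; nothing here comes from the question's code.

```php
<?php
// Sketch: build the URL the script redirects to after one unit of work.
// 'last_id' is an assumed query parameter name.
function next_url($scriptUrl, $lastId)
{
    return $scriptUrl . '?' . http_build_query(array('last_id' => (int) $lastId));
}

// Read the starting point from the request parameters (e.g. $_GET).
// Anything missing or invalid means "start from the beginning" (0).
function read_last_id(array $query)
{
    $value = isset($query['last_id']) ? $query['last_id'] : 0;
    $id = filter_var($value, FILTER_VALIDATE_INT);
    return ($id !== false && $id > 0) ? $id : 0;
}
```

The script would start with $lastId = read_last_id($_GET), process the next row with a higher ID, and finish with header('Location: ' . next_url('/download.php', $newLastId)); exit;. Keep in mind that browsers typically stop following after a limited number of consecutive redirects, so for hundreds of thousands of files this probably works better with a small batch per request, or from cron as suggested in the other answers.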


Answer 1: score 3, accepted, 2013-12-13 03:21:02
Answer 2: score 0, 2013-12-05 06:24:18
Answer 3: score 0, 2013-12-09 07:30:00
Answer 4: score 0, 2013-12-13 21:10:59
Answer 5: score 0, 2013-12-14 18:31:59
Answer 6: score -1, 2013-12-09 07:12:45
