Why do I receive an empty array after web-scraping with CURL then filtering with regex?

I'm new to cURL and regex syntax. I tried to get the image names from this Amazon page, but I failed, and I don't know why I always get an empty array.

Here is the code:

$curl = curl_init(); //$curl is going to be data type curl resource

$search_string = "aser";
$url = "https://www.amazon.com/s/field-keywords=$search_string";

curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);

$result = curl_exec($curl);

preg_match_all('!https://images-na.ssl-images-amazon.com/images/I/[^\s]*?._AC_US200_.jpg!', $result, $matches);

$images = array_values(array_unique($matches[0]));
print_r($images);

curl_close($curl);

This is what I get when I print_r($images):

Array ( )

OK, I found out that $result contained a reCAPTCHA page, so I added:

curl_setopt($curl, CURLOPT_COOKIE, true);

Thank you for the help, even though I still get an empty array on other sites that don't even use reCAPTCHA.

I've baked in some conditionals to help process unsuccessful outcomes.

Your regex pattern can be tuned up slightly by escaping the dots ( \. ) so that they match only a literal dot rather than any character, by replacing your negated character class [^\s] with \S, and by removing the lazy modifier on the quantifier ( *? to * ). These adjustments will improve pattern brevity, accuracy and performance.
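To make the effect of escaping the dots concrete, here is a minimal sketch that runs both patterns against a made-up string. The sample text is hypothetical (it is not taken from the real Amazon response); its second URL replaces every dot with another character, so a strict pattern should reject it.

```php
<?php
// Hypothetical sample: the first URL is well-formed, the second has its dots
// replaced by other characters and should not count as an image URL.
$sample = 'src="https://images-na.ssl-images-amazon.com/images/I/312aWjJbA6L._AC_US200_.jpg" '
        . 'alt="https://images-naXssl-images-amazonXcom/images/I/fake_AC_US200_Zjpg"';

// Original pattern: every unescaped dot matches any character.
$loose  = '!https://images-na.ssl-images-amazon.com/images/I/[^\s]*?._AC_US200_.jpg!';
// Tuned pattern: dots escaped, \S instead of [^\s], greedy quantifier.
$strict = '~https://images-na\.ssl-images-amazon\.com/images/I/\S*\._AC_US200_\.jpg~';

$looseCount  = preg_match_all($loose, $sample, $looseMatches);
$strictCount = preg_match_all($strict, $sample, $strictMatches);

echo "Loose pattern matches: ", $looseCount, PHP_EOL;   // 2
echo "Strict pattern matches: ", $strictCount, PHP_EOL; // 1
```

The loose pattern accepts the malformed URL because each bare dot is a wildcard; the tuned pattern only accepts the real one.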

Writing preg_match_all() inside of a conditional statement is important because it will eliminate the possibility of generating a Notice when you try to access or process $matches .
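As a rough illustration of that guard, the sketch below wraps the call in a small helper. The function name findAllMatches() is hypothetical and not part of the original answer; it relies only on the documented behaviour that preg_match_all() returns the number of full matches (possibly 0) or false on failure.

```php
<?php
// Hypothetical helper (not from the original answer): returns all full-pattern
// matches, or an empty array when nothing matched or the regex engine failed.
function findAllMatches(string $pattern, string $subject): array
{
    $count = preg_match_all($pattern, $subject, $matches);

    if ($count === false) {
        // false signals an error; preg_last_error() returns the PCRE error code.
        echo "Regex error code: ", preg_last_error(), PHP_EOL;
        return [];
    }
    if ($count === 0) {
        echo "No matches", PHP_EOL;
        return [];
    }
    // Only touch $matches once we know there is something in it.
    return $matches[0];
}

$found = findAllMatches('~\d+~', 'a1b22c333'); // ['1', '22', '333']
$none  = findAllMatches('~\d+~', 'abc');       // [] (prints "No matches")
```

Separating the false case from the 0 case makes it easier to tell a broken pattern apart from a page that simply has no matching images.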

I am also changing array_values(array_unique()) to array_keys(array_flip()) because array_unique() is not known for its speed.
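The two de-duplication approaches can be compared side by side with a short sketch. The input list here is invented for illustration. One caveat to keep in mind: array_flip() only accepts string and integer values, and numeric strings come back as integers, so the swap is safe for URL strings like these but not for arbitrary data.

```php
<?php
// Made-up list with duplicates, standing in for $matches[0].
$urls = ['a.jpg', 'b.jpg', 'a.jpg', 'c.jpg', 'b.jpg'];

// Original approach: drop duplicates, then reindex.
$viaUnique = array_values(array_unique($urls));

// Alternative: flip values into keys (duplicate keys collapse), then read the keys back.
$viaFlip = array_keys(array_flip($urls));

var_export($viaUnique === $viaFlip); // true
```

Both results keep the first occurrence of each value in its original order, which is why the two forms are interchangeable here.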

Code:

$search_string = "aser";
$url = "https://www.amazon.com/s/field-keywords=$search_string";

if (!$ch = curl_init()) {
    echo "Failed to generate curl handle";
} else {
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_COOKIE, true);
    if (!$result = curl_exec($ch)) {
        echo "CURL error: " , curl_error($ch);
    } else {
        // var_export($result);
        if (!$count = preg_match_all('~https://images-na\.ssl-images-amazon\.com/images/I/\S*\._AC_US200_\.jpg~', $result, $matches)) {
            echo "No matches from CURL result";
        } else {
            $unique_matches = array_keys(array_flip($matches[0]));
            echo "Number of matches (including duplicates): " , $count;
            echo "<br>Number of unique matches: " , sizeof($unique_matches);
            echo "<pre>";
                var_export($unique_matches);
            echo "</pre>";
        }
    }
    curl_close($ch);
}

Output (today):

Number of matches (including duplicates): 105
Number of unique matches: 51
array (
  0 => 'https://images-na.ssl-images-amazon.com/images/I/312aWjJbA6L._AC_US200_.jpg',
  1 => 'https://images-na.ssl-images-amazon.com/images/I/41vvgZSuo+L._AC_US200_.jpg',
  2 => 'https://images-na.ssl-images-amazon.com/images/I/51akl1-JppL._AC_US200_.jpg',
  3 => 'https://images-na.ssl-images-amazon.com/images/I/41hY4JMK9DL._AC_US200_.jpg',
  4 => 'https://images-na.ssl-images-amazon.com/images/I/51grWJDfRqL._AC_US200_.jpg',
  5 => 'https://images-na.ssl-images-amazon.com/images/I/618HsMLxiRL._AC_US200_.jpg',
  6 => 'https://images-na.ssl-images-amazon.com/images/I/51Xk7SB4XcL._AC_US200_.jpg',
  7 => 'https://images-na.ssl-images-amazon.com/images/I/41XD8vzETkL._AC_US200_.jpg',
  8 => 'https://images-na.ssl-images-amazon.com/images/I/515Llv02R-L._AC_US200_.jpg',
  9 => 'https://images-na.ssl-images-amazon.com/images/I/51PShds9wgL._AC_US200_.jpg',
  10 => 'https://images-na.ssl-images-amazon.com/images/I/21A8BB4Rr8L._AC_US200_.jpg',
  11 => 'https://images-na.ssl-images-amazon.com/images/I/41FgGD-l6IL._AC_US200_.jpg',
  12 => 'https://images-na.ssl-images-amazon.com/images/I/51cWC51Cz2L._AC_US200_.jpg',
  13 => 'https://images-na.ssl-images-amazon.com/images/I/41GSAH9C+FL._AC_US200_.jpg',
  14 => 'https://images-na.ssl-images-amazon.com/images/I/41FzWLl4rgL._AC_US200_.jpg',
  15 => 'https://images-na.ssl-images-amazon.com/images/I/41ej5-EYX4L._AC_US200_.jpg',
  16 => 'https://images-na.ssl-images-amazon.com/images/I/51cxADccMiL._AC_US200_.jpg',
  17 => 'https://images-na.ssl-images-amazon.com/images/I/51G7mMSXgCL._AC_US200_.jpg',
  18 => 'https://images-na.ssl-images-amazon.com/images/I/51baxIno6CL._AC_US200_.jpg',
  19 => 'https://images-na.ssl-images-amazon.com/images/I/31mPoO28QnL._AC_US200_.jpg',
  20 => 'https://images-na.ssl-images-amazon.com/images/I/41pZ4eg6PiL._AC_US200_.jpg',
  21 => 'https://images-na.ssl-images-amazon.com/images/I/51C8rmac8GL._AC_US200_.jpg',
  22 => 'https://images-na.ssl-images-amazon.com/images/I/61dDvHqYFaL._AC_US200_.jpg',
  23 => 'https://images-na.ssl-images-amazon.com/images/I/41sMpLjlXCL._AC_US200_.jpg',
  24 => 'https://images-na.ssl-images-amazon.com/images/I/51iWS9LJFBL._AC_US200_.jpg',
  25 => 'https://images-na.ssl-images-amazon.com/images/I/115DauVSG3L._AC_US200_.jpg',
  26 => 'https://images-na.ssl-images-amazon.com/images/I/21dMy9USZIL._AC_US200_.jpg',
  27 => 'https://images-na.ssl-images-amazon.com/images/I/51Rm4-vT2dL._AC_US200_.jpg',
  28 => 'https://images-na.ssl-images-amazon.com/images/I/51YWdlSwfEL._AC_US200_.jpg',
  29 => 'https://images-na.ssl-images-amazon.com/images/I/51EH7k5FpxL._AC_US200_.jpg',
  30 => 'https://images-na.ssl-images-amazon.com/images/I/41igaez7uIL._AC_US200_.jpg',
  31 => 'https://images-na.ssl-images-amazon.com/images/I/418QEnTiW7L._AC_US200_.jpg',
  32 => 'https://images-na.ssl-images-amazon.com/images/I/51KHWYGSWKL._AC_US200_.jpg',
  33 => 'https://images-na.ssl-images-amazon.com/images/I/41YSiBizmDL._AC_US200_.jpg',
  34 => 'https://images-na.ssl-images-amazon.com/images/I/41NI6VgawgL._AC_US200_.jpg',
  35 => 'https://images-na.ssl-images-amazon.com/images/I/41g86u-lDnL._AC_US200_.jpg',
  36 => 'https://images-na.ssl-images-amazon.com/images/I/51Dw7RNztAL._AC_US200_.jpg',
  37 => 'https://images-na.ssl-images-amazon.com/images/I/31yOzULiuJL._AC_US200_.jpg',
  38 => 'https://images-na.ssl-images-amazon.com/images/I/41cwE0JAc7L._AC_US200_.jpg',
  39 => 'https://images-na.ssl-images-amazon.com/images/I/51FczAZusTL._AC_US200_.jpg',
  40 => 'https://images-na.ssl-images-amazon.com/images/I/5123tSQVLhL._AC_US200_.jpg',
  41 => 'https://images-na.ssl-images-amazon.com/images/I/21qE9DbUPOL._AC_US200_.jpg',
  42 => 'https://images-na.ssl-images-amazon.com/images/I/51bmfezfl6L._AC_US200_.jpg',
  43 => 'https://images-na.ssl-images-amazon.com/images/I/41WlXMEj--L._AC_US200_.jpg',
  44 => 'https://images-na.ssl-images-amazon.com/images/I/61yxq875hwL._AC_US200_.jpg',
  45 => 'https://images-na.ssl-images-amazon.com/images/I/216na69C7UL._AC_US200_.jpg',
  46 => 'https://images-na.ssl-images-amazon.com/images/I/316I0rZ2DVL._AC_US200_.jpg',
  47 => 'https://images-na.ssl-images-amazon.com/images/I/31+YG+B0nJL._AC_US200_.jpg',
  48 => 'https://images-na.ssl-images-amazon.com/images/I/41NANHOzveL._AC_US200_.jpg',
  49 => 'https://images-na.ssl-images-amazon.com/images/I/41FPdhl6vlL._AC_US200_.jpg',
  50 => 'https://images-na.ssl-images-amazon.com/images/I/21w5Rqsuc-L._AC_US200_.jpg',
)

Change your regex to:

preg_match_all('/"https:\/\/images-na\.ssl-images-amazon\.com\/images\/I\/[^"]*_AC_US200_\.jpg"/',$result,$matches);
