
Image headers returning text/html

I'm attempting to retrieve images from a web page, and it has been working well so far, except that one of the sites I am looking at is serving images as Content-Type: text/html, causing my script to reject them as not real images.

This is the code snippet I am using to determine content-type:

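$accepted_mime = array('image/gif', 'image/jpeg', 'image/jpg', 'image/png');
$headers = get_headers($image);

// Find the Content-Type header (get_headers() returns false on failure)
$num_headers = is_array($headers) ? count($headers) : 0;
for ($x = 0; $x < $num_headers; $x++) {
    // Header names are case-insensitive; capture only the media type,
    // ignoring any parameters such as "; charset=..."
    if (preg_match('/^Content-Type:\s*([^;\s]+)/i', $headers[$x], $mime_type)
        && in_array(strtolower($mime_type[1]), $accepted_mime)) {
        return true;
    }
}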

For the sites I've tried, this works properly (results such as image/gif, image/png, etc.), but mpaa.org seems to serve its images with type text/html. Is this normal?

I added a print_r to see the header array returned by get_headers:

Array
(
    [0] => http://www.mpaa.org/templates/images/header_mpaa_logo.gif
    [1] => Array
        (
            [0] => HTTP/1.1 200 OK
            [1] => Server: nginx/1.2.0
            [2] => Date: Sat, 17 Nov 2012 17:19:06 GMT
            [3] => Content-Type: text/html
            [4] => Connection: close
            [5] => P3P: CP="NON DSP COR ADMa OUR IND UNI COM NAV INT"
            [6] => Cache-Control: no-cache, no-store, must-revalidate
            [7] => Pragma: no-cache
        )

)

I could easily add text/html to my list of accepted content types, but that's definitely not the ideal solution ;) Does anyone know why mpaa.org serves its images with this Content-Type? Is it regular practice to do so (perhaps with legacy websites/servers)?

Thanks :)

The wonderful MPAA is using user-agent sniffing or checking cookies to determine if your browser supports JavaScript. Since you are not specifying a user-agent string or sending cookies, they assume you don't have JavaScript and return a page saying that, instead of the original image.

If you load this with a browser, you'll note that you do get image/gif, and the image you are after: http://www.mpaa.org/templates/images/header_mpaa_logo.gif

If you make that same request with cURL or Fiddler, or with some other oddball user-agent string, you get this HTML page instead:

This site requires JavaScript and Cookies to be enabled. Please change your browser settings or upgrade your browser.
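One way to test this theory is to send a browser-like User-Agent along with the header request. The following is only a sketch: the helper names and the User-Agent string are made up for illustration, passing a context to get_headers() assumes PHP 7.1 or later, and there is no guarantee that this particular server will accept it (it may also want cookies).

```php
<?php
// Hypothetical helper: extract the media type from raw header lines.
// Header names are case-insensitive, and parameters such as
// "; charset=..." are dropped. If several Content-Type headers are
// present (e.g. after redirects), the last one wins.
function content_type_from_headers(array $headers): ?string
{
    $found = null;
    foreach ($headers as $line) {
        if (preg_match('/^Content-Type:\s*([^;\s]+)/i', $line, $m)) {
            $found = strtolower($m[1]);
        }
    }
    return $found;
}

// Hypothetical helper: fetch headers while identifying as a browser.
// The User-Agent value is an arbitrary example, not a required string.
function fetch_headers_as_browser(string $url)
{
    $context = stream_context_create([
        'http' => [
            'method'     => 'GET',
            'user_agent' => 'Mozilla/5.0 (compatible; ExampleImageFetcher/1.0)',
            'timeout'    => 10,
        ],
    ]);
    // Returns an array of header lines, or false on failure.
    return get_headers($url, false, $context);
}
```

Usage would look something like `$headers = fetch_headers_as_browser($image);` followed by `content_type_from_headers($headers ?: [])`, then comparing the result against the accepted list.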

Don't rely on headers. They can be changed easily and, as you are seeing now, are not reliable.

I would do it like this:

  • Download the image
  • Check whether the downloaded file really is an image (for example with getimagesize)
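The two steps above could be sketched roughly as follows. This is an illustration under assumptions, not a definitive implementation: the function names are invented here, and it uses getimagesizefromstring() (available since PHP 5.4) so the bytes can be checked in memory instead of being written to disk first.

```php
<?php
// Hypothetical helper: download the raw bytes, or null on failure.
function download_bytes(string $url): ?string
{
    $data = @file_get_contents($url);
    return $data === false ? null : $data;
}

// Hypothetical helper: detect the MIME type from the bytes themselves,
// ignoring whatever Content-Type the server claimed.
function image_mime_from_bytes(string $data): ?string
{
    if ($data === '') {
        return null;
    }
    // Returns false when the data is not a recognised image format.
    $info = @getimagesizefromstring($data);
    return $info === false ? null : $info['mime'];
}

// Hypothetical helper: true only if the bytes are an image of an accepted type.
function is_accepted_image(string $data, array $accepted): bool
{
    $mime = image_mime_from_bytes($data);
    return $mime !== null && in_array($mime, $accepted, true);
}
```

With this approach, the "This site requires JavaScript" page would be rejected because it does not parse as an image, while a real GIF served with a wrong Content-Type header would still be accepted.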
