
Why is file_get_contents returning strange characters?

I am trying to fetch and parse http://www.desi-tashan.com/category/pakistan-tvs/aaj-tv/3-idiots/ with file_get_contents.

But it returns very unusual characters and symbols.

Whereas if I fetch http://www.desi-tashan.com/, it works fine. Could someone tell me why this is happening?

Is there any encoding or decoding involved?

The page seems to be built with WordPress.

The content you see is gzipped.

You might be interested in gzdecode or zlib_decode (please note that zlib support in PHP is not enabled by default).

Your code might look like this:

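$url = 'http://www.desi-tashan.com/category/pakistan-tvs/aaj-tv/3-idiots/';
// Fetch the raw (gzip-compressed) response body.
$content = file_get_contents($url);
// Decompress it; this requires the zlib extension.
$decoded_content = gzdecode($content); // or zlib_decode($content);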

Another solution here on Stack Overflow adds the HTTP header Accept-Encoding to the request, telling the server NOT to gzip the response.

However, it doesn't work on www.desi-tashan.com: the server ignores the Accept-Encoding header and always returns gzipped content.
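
For reference, the header-based approach can be sketched with a stream context. This is only a minimal sketch under assumptions, not the code from the linked answer: the helper name build_identity_context is hypothetical, and "identity" is the standard Accept-Encoding value for asking for an uncompressed response. A server that ignores the header, as this one does, will still send gzip.

```php
<?php
// Hypothetical helper: builds a stream context that asks the server
// for an uncompressed ("identity") response body.
function build_identity_context()
{
    return stream_context_create([
        'http' => [
            'method' => 'GET',
            'header' => "Accept-Encoding: identity\r\n",
        ],
    ]);
}

// Usage (performs a network request, so it is left commented out):
// $content = file_get_contents($url, false, build_identity_context());
```

Because the server is free to disregard the header, it is still worth checking whether the response came back compressed before using it.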

I've seen this happen on sites where the web server is misconfigured and sends back a compressed page whether or not the client indicates that it can cope with that. (A client indicates this with the Accept-Encoding header, which file_get_contents won't send by default.) This generally works in web browsers, as they either request the page compressed by default, or they cope with a gzipped response even if they didn't ask for one.

(Incidentally, if you are on a Unix-derived system, you can easily confirm that what comes back is gzipped by saving it to a file and then running the file command on it. Or just look at the first couple of bytes of the result yourself: gzip data starts with 1F 8B.)
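
The same byte check can be done in PHP before deciding whether to decode. The following is a minimal sketch, assuming the zlib extension is available; the helper names looks_gzipped and maybe_gunzip are hypothetical and not part of PHP.

```php
<?php
// Hypothetical helper: returns true when $data begins with the
// gzip magic bytes 0x1F 0x8B.
function looks_gzipped(string $data): bool
{
    return strncmp($data, "\x1f\x8b", 2) === 0;
}

// Hypothetical helper: decodes $data only when it looks gzipped.
// Requires the zlib extension for gzdecode().
function maybe_gunzip(string $data): string
{
    if (!looks_gzipped($data)) {
        return $data;
    }
    $decoded = gzdecode($data);
    // gzdecode() returns false on failure; fall back to the raw bytes.
    return $decoded === false ? $data : $decoded;
}

// Usage (performs a network request, so it is left commented out):
// $html = maybe_gunzip(file_get_contents($url));
```

Checking the magic bytes first means the same code keeps working if the server later starts returning plain text.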

Rather than unzip the content manually, I'd personally use PHP's curl library instead. You can configure that to request the content gzipped, and if you do, it will transparently uncompress the result for you:

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://actualidad.rt.com/actualidad');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_ENCODING, 'gzip');
$content = curl_exec($ch);

This is more future-proof than manually decoding the result: if the web server is later configured properly and sends plain text to clients that can't handle gzip, this code will still request and decode the compressed version.

