
How to determine if a string was compressed?

How can I determine whether a string was compressed with gzcompress (apart from comparing the sizes of the string before and after calling gzuncompress, or would that be the proper way of doing it)?

PRE:
I guess that if you send a request, you can immediately look into $http_response_header to see whether one of the items in the array is a variation of Content-Encoding: gzip. But this is LAME!
There is a far better method.
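
For reference, the header-based check mentioned above could be sketched roughly like this. The function name headers_say_gzip is hypothetical, the sample header lines are made up for illustration, and the pattern only handles a single coding value (gzip or x-gzip), so treat it as a sketch rather than a complete parser.

```php
<?php
// Hypothetical helper: true if any header line declares "Content-Encoding: gzip".
// $headers is expected to look like $http_response_header (one header per item).
function headers_say_gzip(array $headers): bool {
    foreach ($headers as $line) {
        if (preg_match('/^Content-Encoding:\s*(?:x-)?gzip\s*$/i', $line) === 1) {
            return true;
        }
    }
    return false;
}

// Made-up sample data standing in for $http_response_header.
$sample = ['HTTP/1.1 200 OK', 'Content-Type: text/html', 'content-encoding: gzip'];

var_dump(headers_say_gzip($sample));               // bool(true)
var_dump(headers_say_gzip(['HTTP/1.1 200 OK']));   // bool(false)
```

This only tells you what the server claims, which is why inspecting the bytes themselves is more robust.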

Here is HOW TO...

Check if it's GZIP. Like a BOSS!

According to the GZIP RFC (RFC 1952):

The header of gzip content looks like this

+---+---+---+---+---+---+---+---+---+---+
|ID1|ID2|CM |FLG|     MTIME     |XFL|OS | (more-->)
+---+---+---+---+---+---+---+---+---+---+

ID1 and ID2 identify the content as GZIP, and CM states the compression method: the value 8 means deflate, which is what GZIP customarily uses with all web servers.

oh! and they have fixed values:

  • The value of ID1 is "\x1f"
  • The value of ID2 is "\x8b"
  • The value of CM is "\x08" (or just 8...)

almost there:

$is_gzip = 0 === mb_strpos($mystery_string, "\x1f" . "\x8b" . "\x08", 0, "US-ASCII");
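
A byte-wise variant of the same check, as a minimal sketch. It swaps mb_strpos for strncmp, which compares raw bytes and does not depend on any mbstring encoding; that substitution is mine, not part of the original answer. The function name looks_like_gzip is hypothetical, and gzencode() is used here only to produce sample data.

```php
<?php
// Hypothetical helper: true if $data starts with the gzip magic bytes
// ID1 (0x1f), ID2 (0x8b) and CM = 8 (deflate).
function looks_like_gzip(string $data): bool {
    return strncmp($data, "\x1f\x8b\x08", 3) === 0;
}

$plain = 'hello world';
$gz    = gzencode($plain);          // gzencode() emits a gzip stream

var_dump(looks_like_gzip($gz));     // bool(true)
var_dump(looks_like_gzip($plain));  // bool(false)
var_dump(looks_like_gzip(''));      // bool(false)
```

Matching three bytes is only a strong hint: arbitrary binary data can start with the same bytes, so a failed decode afterwards is still possible.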

Working example

<?php
/** @link https://gist.github.com/eladkarako/d8f3addf4e3be92bae96#file-checking_gzip_like_a_boss-php */

date_default_timezone_set("Asia/Jerusalem");

while (ob_get_level() > 0) ob_end_flush();
mb_language("uni");
@mb_internal_encoding('UTF-8');
setlocale(LC_ALL, 'en_US.UTF-8');

header('Time-Zone: Asia/Jerusalem');
header('Charset: UTF-8');
// No 'Content-Encoding: UTF-8' header: Content-Encoding names a compression coding (e.g. gzip), not a charset.
header('Content-Type: text/plain; charset=UTF-8');
header('Access-Control-Allow-Origin: *');

function get($url, $cookie = '') {
  $html = @file_get_contents($url, false, stream_context_create([
    'http' => [
      'method' => "GET",
      'header' => implode("\r\n", [''
        , 'Pragma: no-cache'
        , 'Cache-Control: no-cache'
        , 'User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2310.0 Safari/537.36'
        , 'DNT: 1'
        , 'Accept-Language: en-US,en;q=0.8'
        , 'Accept: text/plain'
        , 'X-Forwarded-For: ' . implode(', ', array_unique(array_filter(array_map(function ($item) { return filter_input(INPUT_SERVER, $item, FILTER_SANITIZE_SPECIAL_CHARS); }, ['HTTP_X_FORWARDED_FOR', 'REMOTE_ADDR', 'HTTP_CLIENT_IP', 'SERVER_ADDR', 'REMOTE_ADDR']), function ($item) { return null !== $item; })))
        , 'Referer: http://eladkarako.com'
        , 'Connection: close'
        , 'Cookie: ' . $cookie
        , 'Accept-Encoding: gzip'
      ])
    ]]));

  $is_gzip = 0 === mb_strpos($html, "\x1f" . "\x8b" . "\x08", 0, "US-ASCII");

  // zlib_decode() detects the format itself; its second parameter is a maximum output length, not an encoding constant.
  return $is_gzip ? zlib_decode($html) : $html;
}

$html = get('http://www.pogdesign.co.uk/cat/');

echo $html;

What do we see here that is worth mentioning?

  • Start by initializing the PHP engine to use UTF-8 (since we don't really know whether the web server will return GZIP content).
  • Providing the header Accept-Encoding: gzip tells the web server that it may output GZIP content.
  • Discovering GZIP content (you should use the multi-byte functions with ASCII encoding).
  • Finally, returning the plain output is easy using the ZLIB functions.

A string and a compressed string are both simply sequences of bytes. You cannot really distinguish one sequence of bytes from another sequence of bytes. You should know whether a blob of bytes represents a compressed format or not from accompanying metadata.

If you really need to guess programmatically, you have several things you can try:

  • Try to uncompress the string and see if the uncompress operation succeeds. If it fails, the bytes probably did not represent a compressed string.
  • Try to check for obvious "weird" bytes such as anything below 0x20. Those bytes aren't typically used in regular text. There's no real guarantee that they occur in a compressed string, though.
  • Use mb_check_encoding to see whether a string is valid in the encoding you suspect it to be in. If it isn't, it's probably compressed (or you checked for the wrong encoding). The caveat is that virtually any byte sequence is valid in virtually every single-byte encoding, so this will only work for multi-byte encodings.
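
The three guesses above can be sketched as follows. Both helper names (try_gzuncompress, has_control_bytes) are hypothetical, and the choice to treat tab, line feed and carriage return as ordinary text is an assumption of this sketch.

```php
<?php
// Hypothetical helper: returns the uncompressed string if $data decodes
// with gzuncompress(), or null if it does not.
function try_gzuncompress(string $data): ?string {
    // @ suppresses the warning gzuncompress() raises on invalid input.
    $out = @gzuncompress($data);
    return $out === false ? null : $out;
}

// Hypothetical helper: true if $data contains a byte below 0x20 other than
// tab, line feed or carriage return (assumed here to be normal in text).
function has_control_bytes(string $data): bool {
    return preg_match('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', $data) === 1;
}

$packed = gzcompress('some text');

var_dump(try_gzuncompress($packed));          // string(9) "some text"
var_dump(try_gzuncompress('some text'));      // NULL
var_dump(has_control_bytes("plain text\n"));  // bool(false)
var_dump(mb_check_encoding('some text', 'UTF-8')); // bool(true)
```

Only the first helper gives a reasonably firm answer; the other two checks are hints at best, for the reasons given in the list.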

This works fine for me:

if (@gzuncompress($_xml) !== false) {
   // compressed string
}

You can simply try gzuncompress() on the data as noted by @DiDiegodaFonseca. If it fails, it was not made by gzcompress() , or it was not faithfully transmitted.

If you really want to, you can check the first two bytes for a zlib header (not a gzip header, as incorrectly suggested in the accepted answer). gzcompress() produces a zlib stream, not a gzip stream. gzencode() is what produces a gzip stream. gzdeflate() produces a raw deflate stream.
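
To see the difference between the three functions, a small sketch that prints the first two bytes of each output and shows that each format needs its matching decoder. The sample text is arbitrary.

```php
<?php
$text = 'hello hello hello';

$zlib = gzcompress($text); // zlib stream (RFC 1950)
$gzip = gzencode($text);   // gzip stream (RFC 1952)
$raw  = gzdeflate($text);  // raw deflate stream, no header

echo bin2hex(substr($zlib, 0, 2)), "\n"; // two-byte zlib header
echo bin2hex(substr($gzip, 0, 2)), "\n"; // 1f8b, the gzip magic bytes
echo bin2hex(substr($raw, 0, 2)), "\n";  // already compressed data

// Each format round-trips through its own decoder.
var_dump(gzuncompress($zlib) === $text); // bool(true)
var_dump(gzdecode($gzip) === $text);     // bool(true)
var_dump(gzinflate($raw) === $text);     // bool(true)

// gzuncompress() rejects a gzip stream.
var_dump(@gzuncompress($gzip));          // bool(false)
```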

RFC 1950 describes the zlib header. It is two bytes, where the two bytes taken as a big-endian 16-bit unsigned integer must be a multiple of 31. In addition to checking that, you can check that the low four bits of the first byte are 8 (1000), and that the high bit is zero.
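
The header checks just described could be written like this, as a minimal sketch. The function name looks_like_zlib is hypothetical. Passing these checks does not prove the data is a zlib stream, since other data can satisfy them by chance, so a real gzuncompress() call remains the final word.

```php
<?php
// Hypothetical helper applying the RFC 1950 header checks described above.
function looks_like_zlib(string $data): bool {
    if (strlen($data) < 2) {
        return false;
    }
    $cmf = ord($data[0]); // first header byte
    $flg = ord($data[1]); // second header byte

    return (($cmf << 8) | $flg) % 31 === 0  // big-endian 16-bit value is a multiple of 31
        && ($cmf & 0x0F) === 8              // low four bits of the first byte: 8 (deflate)
        && ($cmf & 0x80) === 0;             // high bit of the first byte is zero
}

var_dump(looks_like_zlib(gzcompress('abc'))); // bool(true)
var_dump(looks_like_zlib(gzencode('abc')));   // bool(false), gzip header instead
var_dump(looks_like_zlib('hello'));           // bool(false)
```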
