
Parsing Amazon S3 log files (PHP)

I'm looking to parse Amazon S3 log files, which are space-delimited. The only problem is that some of the space-delimited fields themselves contain spaces. How would I go about parsing a file like this?

450227804f8fd31c931036b020ddd0003a03b421d8c669d8858c7c15d72c renderd [10/Apr/2014:19:32:23 +0000] 75.256.56.200 450227804f8fd31c931036b020000343afa03b421d8c669d8858c7c15d72c 0231400AA3D3533C REST.GET.OBJECT Trailer.mp4 "GET /Trailer.mp4?AWSAccessKeyId=AKIAJFV33YRQMN63AQCQ&Expires=1397159234&Signature=8ipN9ymsB5gCzxChTu9lD6ZMrdA%3D HTTP/1.1" 206 - 5016183 16149754 216682 39 "http://example.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36" -
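
To make the problem concrete, here is a minimal sketch using a shortened, illustrative row (not a complete S3 log line). Splitting on spaces breaks the bracketed timestamp and the quoted request into several pieces each:

```php
<?php
// Shortened, illustrative row with four logical fields:
// bucket, [time], "request", status.
$line = 'renderd [10/Apr/2014:19:32:23 +0000] "GET /Trailer.mp4 HTTP/1.1" 206';

// A naive split on single spaces ignores the brackets and quotes.
$parts = explode(' ', $line);

echo count($parts), "\n"; // 7 pieces instead of 4 fields
print_r($parts);
```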

You can use a regular expression with named groups to split each log line into its fields.

Here is an example in PHP that does this:

<?php 
$string ='450227804f8fd31c931036b020ddd0003a03b421d8c669d8858c7c15d72c renderd [10/Apr/2014:19:32:23 +0000] 75.256.56.200 450227804f8fd31c931036b020000343afa03b421d8c669d8858c7c15d72c 0231400AA3D3533C REST.GET.OBJECT Trailer.mp4 "GET /Trailer.mp4?AWSAccessKeyId=AKIAJFV33YRQMN63AQCQ&Expires=1397159234&Signature=8ipN9ymsB5gCzxChTu9lD6ZMrdA%3D HTTP/1.1" 206 - 5016183 16149754 216682 39 "http://example.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36" -';

// The final group uses \S+ (not \S) so that a multi-character version ID is captured in full.
$pattern = '/(?P<owner>\S+) (?P<bucket>\S+) (?P<time>\[[^]]*\]) (?P<ip>\S+) (?P<requester>\S+) (?P<reqid>\S+) (?P<operation>\S+) (?P<key>\S+) (?P<request>"[^"]*") (?P<status>\S+) (?P<error>\S+) (?P<bytes>\S+) (?P<size>\S+) (?P<totaltime>\S+) (?P<turnaround>\S+) (?P<referrer>"[^"]*") (?P<useragent>"[^"]*") (?P<version>\S+)/';

preg_match($pattern, $string, $matches);
print_r($matches);

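I slightly modified Jeremy Quinton's answer so that the captured values are cleaner: the surrounding brackets and quotes are no longer part of the matches, and the request is split into a method and a path.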

  <?php 
  $string ='450227804f8fd31c931036b020ddd0003a03b421d8c669d8858c7c15d72c renderd [10/Apr/2014:19:32:23 +0000] 75.256.56.200 450227804f8fd31c931036b020000343afa03b421d8c669d8858c7c15d72c 0231400AA3D3533C REST.GET.OBJECT Trailer.mp4 "GET /Trailer.mp4?AWSAccessKeyId=AKIAJFV33YRQMN63AQCQ&Expires=1397159234&Signature=8ipN9ymsB5gCzxChTu9lD6ZMrdA%3D HTTP/1.1" 206 - 5016183 16149754 216682 39 "http://example.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36" -';

  // The final group uses \S+ (not \S) so that a multi-character version ID is captured in full.
  $pattern = '/(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^]]*)\] (?P<ip>\S+) (?P<requester>\S+) (?P<reqid>\S+) (?P<operation>\S+) (?P<key>\S+) "(?P<method>[^ ]*) (?P<path>[^"]*)" (?P<status>\S+) (?P<error>\S+) (?P<bytes>\S+) (?P<size>\S+) (?P<totaltime>\S+) (?P<turnaround>\S+) "(?P<referrer>[^"]*)" "(?P<useragent>[^"]*)" (?P<version>\S+)/';

  preg_match($pattern, $string, $matches);
  print_r($matches);

  ?>

  Result:
  Array
  (
      [0] => 450227804f8fd31c931036b020ddd0003a03b421d8c669d8858c7c15d72c renderd [10/Apr/2014:19:32:23 +0000] 75.256.56.200 450227804f8fd31c931036b020000343afa03b421d8c669d8858c7c15d72c 0231400AA3D3533C REST.GET.OBJECT Trailer.mp4 "GET /Trailer.mp4?AWSAccessKeyId=AKIAJFV33YRQMN63AQCQ&Expires=1397159234&Signature=8ipN9ymsB5gCzxChTu9lD6ZMrdA%3D HTTP/1.1" 206 - 5016183 16149754 216682 39 "http://example.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36" -
      [owner] => 450227804f8fd31c931036b020ddd0003a03b421d8c669d8858c7c15d72c
      [1] => 450227804f8fd31c931036b020ddd0003a03b421d8c669d8858c7c15d72c
      [bucket] => renderd
      [2] => renderd
      [time] => 10/Apr/2014:19:32:23 +0000
      [3] => 10/Apr/2014:19:32:23 +0000
      [ip] => 75.256.56.200
      [4] => 75.256.56.200
      [requester] => 450227804f8fd31c931036b020000343afa03b421d8c669d8858c7c15d72c
      [5] => 450227804f8fd31c931036b020000343afa03b421d8c669d8858c7c15d72c
      [reqid] => 0231400AA3D3533C
      [6] => 0231400AA3D3533C
      [operation] => REST.GET.OBJECT
      [7] => REST.GET.OBJECT
      [key] => Trailer.mp4
      [8] => Trailer.mp4
      [method] => GET
      [9] => GET
      [path] => /Trailer.mp4?AWSAccessKeyId=AKIAJFV33YRQMN63AQCQ&Expires=1397159234&Signature=8ipN9ymsB5gCzxChTu9lD6ZMrdA%3D HTTP/1.1
      [10] => /Trailer.mp4?AWSAccessKeyId=AKIAJFV33YRQMN63AQCQ&Expires=1397159234&Signature=8ipN9ymsB5gCzxChTu9lD6ZMrdA%3D HTTP/1.1
      [status] => 206
      [11] => 206
      [error] => -
      [12] => -
      [bytes] => 5016183
      [13] => 5016183
      [size] => 16149754
      [14] => 16149754
      [totaltime] => 216682
      [15] => 216682
      [turnaround] => 39
      [16] => 39
      [referrer] => http://example.com
      [17] => http://example.com
      [useragent] => Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36
      [18] => Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36
      [version] => -
      [19] => -
  )

Amazon has since appended more fields to the log format, so here is a new regex that includes them:

  • hostid
  • sigversion
  • ciphersuite
  • authtype
  • hostheader
  • tlsversion

There are some other changes too:

  • The previous regex by Yi did not match a log row at all if the method+path, referrer or useragent values were not surrounded by quotes, which is often the case when a value is empty (recorded as a single dash).
  • The path captured by the previous regex has the HTTP protocol version appended to the end, so I have split this off into a separate protocol value.

Updated Regex

$pattern = '/(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^]]*)\] (?P<ip>\S+) (?P<requester>\S+) (?P<reqid>\S+) (?P<operation>\S+) (?P<key>\S+) (-|"-"|"(?P<method>[^ ]*) (?P<path>\S+) (?P<protocol>[^"]*)") (?P<status>\S+) (?P<error>\S+) (?P<bytes>\S+) (?P<size>\S+) (?P<totaltime>\S+) (?P<turnaround>\S+) (-|"(?P<referrer>[^"]*)") (-|"(?P<useragent>[^"]*)") (?P<version>\S+) (?P<hostid>\S+) (?P<sigversion>\S+) (?P<ciphersuite>\S+) (?P<authtype>\S+) (?P<hostheader>\S+) (?P<tlsversion>\S+)/';
preg_match($pattern, $string, $matches);
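
As a quick check of the two changes described above, here is a minimal sketch that runs the updated pattern against one row with quoted values and one row where the request, referrer and user agent are bare dashes. Both sample rows are shortened and illustrative. In particular, the six trailing values (host ID through TLS version) and the IDs are made up, not taken from a real log.

```php
<?php
// Updated pattern from above, repeated so this snippet runs on its own.
$pattern = '/(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^]]*)\] (?P<ip>\S+) (?P<requester>\S+) (?P<reqid>\S+) (?P<operation>\S+) (?P<key>\S+) (-|"-"|"(?P<method>[^ ]*) (?P<path>\S+) (?P<protocol>[^"]*)") (?P<status>\S+) (?P<error>\S+) (?P<bytes>\S+) (?P<size>\S+) (?P<totaltime>\S+) (?P<turnaround>\S+) (-|"(?P<referrer>[^"]*)") (-|"(?P<useragent>[^"]*)") (?P<version>\S+) (?P<hostid>\S+) (?P<sigversion>\S+) (?P<ciphersuite>\S+) (?P<authtype>\S+) (?P<hostheader>\S+) (?P<tlsversion>\S+)/';

// Leading fields shared by both rows (made-up owner, requester and IP values).
$head = 'ownerid renderd [10/Apr/2014:19:32:23 +0000] 192.0.2.10 requesterid 0231400AA3D3533C REST.GET.OBJECT Trailer.mp4';
// Version ID plus the six newer fields; all values here are made up.
$tail = '- hostid123= SigV4 ECDHE-RSA-AES128-GCM-SHA256 AuthHeader renderd.s3.amazonaws.com TLSv1.2';

$quoted = $head . ' "GET /Trailer.mp4 HTTP/1.1" 206 - 5016183 16149754 216682 39 "http://example.com" "Mozilla/5.0 (Macintosh)" ' . $tail;
$dashed = $head . ' - 206 - 5016183 16149754 216682 39 - - ' . $tail;

$results = [];
foreach (['quoted' => $quoted, 'dashed' => $dashed] as $label => $row) {
    if (preg_match($pattern, $row, $m) !== 1) {
        echo $label, ": no match\n";
        continue;
    }
    $results[$label] = $m;
    printf(
        "%s: method='%s' path='%s' protocol='%s' tlsversion='%s'\n",
        $label, $m['method'], $m['path'], $m['protocol'], $m['tlsversion']
    );
}
```

Both rows should match. For the dashed row, the method, path and protocol groups take no part in the match, so preg_match() reports them as empty strings.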

You can replace the empty values (dashes) and filter out the duplicate numeric indices from the $matches array like so:

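// Drop the numeric duplicates that preg_match() adds alongside the named keys,
// then turn the "-" placeholders into empty strings.
// ARRAY_FILTER_USE_KEY requires PHP 5.6 or later.
$matches = array_map(
    function($val) { return $val === '-' ? '' : $val; },
    array_filter(
        $matches,
        function($key) { return !is_numeric($key); },
        ARRAY_FILTER_USE_KEY
    )
);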
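
Putting the pattern and the cleanup together, a whole log file could be processed along these lines. This is only a sketch: parse_s3_log_file is a hypothetical helper name, the sample row uses made-up values, and rows that do not match the pattern are simply skipped.

```php
<?php
// Updated pattern from above, repeated so this snippet runs on its own.
$pattern = '/(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^]]*)\] (?P<ip>\S+) (?P<requester>\S+) (?P<reqid>\S+) (?P<operation>\S+) (?P<key>\S+) (-|"-"|"(?P<method>[^ ]*) (?P<path>\S+) (?P<protocol>[^"]*)") (?P<status>\S+) (?P<error>\S+) (?P<bytes>\S+) (?P<size>\S+) (?P<totaltime>\S+) (?P<turnaround>\S+) (-|"(?P<referrer>[^"]*)") (-|"(?P<useragent>[^"]*)") (?P<version>\S+) (?P<hostid>\S+) (?P<sigversion>\S+) (?P<ciphersuite>\S+) (?P<authtype>\S+) (?P<hostheader>\S+) (?P<tlsversion>\S+)/';

// Hypothetical helper: returns one cleaned, name-keyed array per matching row.
function parse_s3_log_file($path, $pattern)
{
    $lines = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        return []; // unreadable file: treat as having no rows
    }

    $rows = [];
    foreach ($lines as $line) {
        if (preg_match($pattern, $line, $matches) !== 1) {
            continue; // skip rows the pattern does not recognise
        }
        // Same cleanup as above: keep named keys only, map "-" to ''.
        $rows[] = array_map(
            function ($val) { return $val === '-' ? '' : $val; },
            array_filter(
                $matches,
                function ($key) { return !is_numeric($key); },
                ARRAY_FILTER_USE_KEY
            )
        );
    }
    return $rows;
}

// Demo with a temporary file; every value in the sample row is made up.
$row = 'ownerid renderd [10/Apr/2014:19:32:23 +0000] 192.0.2.10 requesterid 0231400AA3D3533C REST.GET.OBJECT Trailer.mp4 "GET /Trailer.mp4 HTTP/1.1" 206 - 5016183 16149754 216682 39 - - - hostid123= SigV4 ECDHE-RSA-AES128-GCM-SHA256 AuthHeader renderd.s3.amazonaws.com TLSv1.2';
$tmp = tempnam(sys_get_temp_dir(), 's3log');
file_put_contents($tmp, $row . "\nnot a log row\n");

$rows = parse_s3_log_file($tmp, $pattern);
unlink($tmp);

print_r($rows);
```

Reading the file with file() keeps the sketch short. For very large logs, reading line by line with fgets() would avoid loading everything into memory at once.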
