
Parsing Amazon S3 log files (PHP)

I'm looking to parse the Amazon S3 log files, which are space-delimited. The only problem is that some of the space-delimited fields themselves contain spaces. How would I go about parsing a file like this?

450227804f8fd31c931036b020ddd0003a03b421d8c669d8858c7c15d72c renderd [10/Apr/2014:19:32:23 +0000] 75.256.56.200 450227804f8fd31c931036b020000343afa03b421d8c669d8858c7c15d72c 0231400AA3D3533C REST.GET.OBJECT Trailer.mp4 "GET /Trailer.mp4?AWSAccessKeyId=AKIAJFV33YRQMN63AQCQ&Expires=1397159234&Signature=8ipN9ymsB5gCzxChTu9lD6ZMrdA%3D HTTP/1.1" 206 - 5016183 16149754 216682 39 "http://example.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36" -

You could probably use a regular expression to parse the log file and extract the individual fields.

Here is an example in PHP that does that:

<?php 
$string ='450227804f8fd31c931036b020ddd0003a03b421d8c669d8858c7c15d72c renderd [10/Apr/2014:19:32:23 +0000] 75.256.56.200 450227804f8fd31c931036b020000343afa03b421d8c669d8858c7c15d72c 0231400AA3D3533C REST.GET.OBJECT Trailer.mp4 "GET /Trailer.mp4?AWSAccessKeyId=AKIAJFV33YRQMN63AQCQ&Expires=1397159234&Signature=8ipN9ymsB5gCzxChTu9lD6ZMrdA%3D HTTP/1.1" 206 - 5016183 16149754 216682 39 "http://example.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36" -';

$pattern = '/(?P<owner>\S+) (?P<bucket>\S+) (?P<time>\[[^]]*\]) (?P<ip>\S+) (?P<requester>\S+) (?P<reqid>\S+) (?P<operation>\S+) (?P<key>\S+) (?P<request>"[^"]*") (?P<status>\S+) (?P<error>\S+) (?P<bytes>\S+) (?P<size>\S+) (?P<totaltime>\S+) (?P<turnaround>\S+) (?P<referrer>"[^"]*") (?P<useragent>"[^"]*") (?P<version>\S)/';

preg_match($pattern, $string, $matches);
print_r($matches);

I slightly modified Jeremy Quinton's answer so that it matches better: the surrounding brackets and quotes are left out of the captured values, and the request is split into a method and a path.

  <?php 
  $string ='450227804f8fd31c931036b020ddd0003a03b421d8c669d8858c7c15d72c renderd [10/Apr/2014:19:32:23 +0000] 75.256.56.200 450227804f8fd31c931036b020000343afa03b421d8c669d8858c7c15d72c 0231400AA3D3533C REST.GET.OBJECT Trailer.mp4 "GET /Trailer.mp4?AWSAccessKeyId=AKIAJFV33YRQMN63AQCQ&Expires=1397159234&Signature=8ipN9ymsB5gCzxChTu9lD6ZMrdA%3D HTTP/1.1" 206 - 5016183 16149754 216682 39 "http://example.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36" -';

  $pattern = '/(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^]]*)\] (?P<ip>\S+) (?P<requester>\S+) (?P<reqid>\S+) (?P<operation>\S+) (?P<key>\S+) "(?P<method>[^ ]*) (?P<path>[^"]*)" (?P<status>\S+) (?P<error>\S+) (?P<bytes>\S+) (?P<size>\S+) (?P<totaltime>\S+) (?P<turnaround>\S+) "(?P<referrer>[^"]*)" "(?P<useragent>[^"]*)" (?P<version>\S)/';

  preg_match($pattern, $string, $matches);
  print_r($matches);

  ?>

  Result:
  Array
  (
      [0] => 450227804f8fd31c931036b020ddd0003a03b421d8c669d8858c7c15d72c renderd [10/Apr/2014:19:32:23 +0000] 75.256.56.200 450227804f8fd31c931036b020000343afa03b421d8c669d8858c7c15d72c 0231400AA3D3533C REST.GET.OBJECT Trailer.mp4 "GET /Trailer.mp4?AWSAccessKeyId=AKIAJFV33YRQMN63AQCQ&Expires=1397159234&Signature=8ipN9ymsB5gCzxChTu9lD6ZMrdA%3D HTTP/1.1" 206 - 5016183 16149754 216682 39 "http://example.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36" -
      [owner] => 450227804f8fd31c931036b020ddd0003a03b421d8c669d8858c7c15d72c
      [1] => 450227804f8fd31c931036b020ddd0003a03b421d8c669d8858c7c15d72c
      [bucket] => renderd
      [2] => renderd
      [time] => 10/Apr/2014:19:32:23 +0000
      [3] => 10/Apr/2014:19:32:23 +0000
      [ip] => 75.256.56.200
      [4] => 75.256.56.200
      [requester] => 450227804f8fd31c931036b020000343afa03b421d8c669d8858c7c15d72c
      [5] => 450227804f8fd31c931036b020000343afa03b421d8c669d8858c7c15d72c
      [reqid] => 0231400AA3D3533C
      [6] => 0231400AA3D3533C
      [operation] => REST.GET.OBJECT
      [7] => REST.GET.OBJECT
      [key] => Trailer.mp4
      [8] => Trailer.mp4
      [method] => GET
      [9] => GET
      [path] => /Trailer.mp4?AWSAccessKeyId=AKIAJFV33YRQMN63AQCQ&Expires=1397159234&Signature=8ipN9ymsB5gCzxChTu9lD6ZMrdA%3D HTTP/1.1
      [10] => /Trailer.mp4?AWSAccessKeyId=AKIAJFV33YRQMN63AQCQ&Expires=1397159234&Signature=8ipN9ymsB5gCzxChTu9lD6ZMrdA%3D HTTP/1.1
      [status] => 206
      [11] => 206
      [error] => -
      [12] => -
      [bytes] => 5016183
      [13] => 5016183
      [size] => 16149754
      [14] => 16149754
      [totaltime] => 216682
      [15] => 216682
      [turnaround] => 39
      [16] => 39
      [referrer] => http://example.com
      [17] => http://example.com
      [useragent] => Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36
      [18] => Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36
      [version] => -
      [19] => -
  )
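
To run this over a whole log file rather than a single string, the same pattern can be applied line by line. The following is a minimal sketch, not a tested production parser: `parse_s3_log_line` and `parse_s3_log_lines` are hypothetical helper names, and `access.log` is a placeholder path. Lines that do not match are skipped here; whether to skip or report them is a design choice.

```php
<?php
// Hypothetical helper: applies the pattern from the answer above to one line.
// Returns the $matches array, or null when the line does not match.
function parse_s3_log_line(string $line): ?array
{
    $pattern = '/(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^]]*)\] (?P<ip>\S+) (?P<requester>\S+) (?P<reqid>\S+) (?P<operation>\S+) (?P<key>\S+) "(?P<method>[^ ]*) (?P<path>[^"]*)" (?P<status>\S+) (?P<error>\S+) (?P<bytes>\S+) (?P<size>\S+) (?P<totaltime>\S+) (?P<turnaround>\S+) "(?P<referrer>[^"]*)" "(?P<useragent>[^"]*)" (?P<version>\S)/';

    if (preg_match($pattern, $line, $matches) !== 1) {
        return null;
    }
    return $matches;
}

// Hypothetical helper: parses any iterable of lines (an array, a file object, ...).
function parse_s3_log_lines(iterable $lines): array
{
    $rows = [];
    foreach ($lines as $line) {
        // Drop the trailing newline that file iteration leaves on each line.
        $row = parse_s3_log_line(rtrim((string) $line, "\r\n"));
        if ($row !== null) {
            $rows[] = $row;
        }
    }
    return $rows;
}

// Usage (placeholder file name):
// $rows = parse_s3_log_lines(new SplFileObject('access.log'));
```

`SplFileObject` iterates over a file one line at a time, so the whole log does not have to be loaded into memory before matching.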

Amazon has since appended more fields to the log, so here is a new regex that includes them:

  • hostid
  • sigversion
  • ciphersuite
  • authtype
  • hostheader
  • tlsversion

There are some other changes too:

  • The last regex by Yi did not match a log row at all if the method+path, referrer or useragent values were not surrounded by quotes. That is often the case when a value is empty, because an empty value is recorded as a single dash.
  • The path now has the HTTP protocol version appended to the end, so I have split this off into a new protocol value.

Updated regex:

$pattern = '/(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^]]*)\] (?P<ip>\S+) (?P<requester>\S+) (?P<reqid>\S+) (?P<operation>\S+) (?P<key>\S+) (-|"-"|"(?P<method>[^ ]*) (?P<path>\S+) (?P<protocol>[^"]*)") (?P<status>\S+) (?P<error>\S+) (?P<bytes>\S+) (?P<size>\S+) (?P<totaltime>\S+) (?P<turnaround>\S+) (-|"(?P<referrer>[^"]*)") (-|"(?P<useragent>[^"]*)") (?P<version>\S+) (?P<hostid>\S+) (?P<sigversion>\S+) (?P<ciphersuite>\S+) (?P<authtype>\S+) (?P<hostheader>\S+) (?P<tlsversion>\S+)/';
preg_match($pattern, $string, $matches);
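
As an illustration of the two changes described above, the sketch below runs the updated pattern against a log row whose referrer is an unquoted dash. The row is made up for this example: the IDs, bucket name, host ID and host header are placeholders, not real log data.

```php
<?php
$pattern = '/(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^]]*)\] (?P<ip>\S+) (?P<requester>\S+) (?P<reqid>\S+) (?P<operation>\S+) (?P<key>\S+) (-|"-"|"(?P<method>[^ ]*) (?P<path>\S+) (?P<protocol>[^"]*)") (?P<status>\S+) (?P<error>\S+) (?P<bytes>\S+) (?P<size>\S+) (?P<totaltime>\S+) (?P<turnaround>\S+) (-|"(?P<referrer>[^"]*)") (-|"(?P<useragent>[^"]*)") (?P<version>\S+) (?P<hostid>\S+) (?P<sigversion>\S+) (?P<ciphersuite>\S+) (?P<authtype>\S+) (?P<hostheader>\S+) (?P<tlsversion>\S+)/';

// Made-up sample row (placeholder values) that includes the six newer fields.
// Note the referrer: a bare dash with no quotes around it.
$string = '79a59df900b949e5 examplebucket [06/Feb/2019:00:00:38 +0000] 192.0.2.3 79a59df900b949e5 3E57427F3EXAMPLE REST.GET.OBJECT photo.jpg "GET /photo.jpg HTTP/1.1" 200 - 2662992 3462992 70 10 - "curl/7.15.1" - EXAMPLEHOSTID= SigV4 ECDHE-RSA-AES128-GCM-SHA256 AuthHeader examplebucket.s3.amazonaws.com TLSV1.2';

if (preg_match($pattern, $string, $matches) === 1) {
    // The request line is now split three ways instead of two.
    echo $matches['method'], "\n";     // GET
    echo $matches['path'], "\n";       // /photo.jpg
    echo $matches['protocol'], "\n";   // HTTP/1.1

    // A group that did not take part in the match comes back as an empty
    // string, so the dash referrer yields '' rather than '-'.
    var_dump($matches['referrer']);    // string(0) ""

    echo $matches['tlsversion'], "\n"; // TLSV1.2
}
```

Note that fields matched by a plain `\S+`, such as error or version, still come back as a literal `-` when empty.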

You can replace the empty values (dashes) with empty strings and filter the duplicate numeric indices out of the $matches array like so:

$matches = array_map(
    function($val) { return $val === '-' ? '' : $val; },
    array_filter(
        $matches,
        function($key) { return !is_numeric($key); },
        ARRAY_FILTER_USE_KEY
    )
);
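
After this cleanup every value is still a string. If typed values are more convenient, something along these lines could follow. It is only a sketch: `cast_s3_fields` is a hypothetical helper name, and treating status, bytes, size, totaltime and turnaround as integers is an assumption based on the sample row. The time format string matches the `10/Apr/2014:19:32:23 +0000` style shown in the log.

```php
<?php
// Hypothetical helper: converts selected string fields of a parsed row.
function cast_s3_fields(array $row): array
{
    // 'd/M/Y:H:i:s O' corresponds to e.g. "10/Apr/2014:19:32:23 +0000".
    $time = DateTimeImmutable::createFromFormat('d/M/Y:H:i:s O', $row['time'] ?? '');
    $row['time'] = $time === false ? null : $time;

    // Assumption: these fields hold non-negative integers when present.
    foreach (['status', 'bytes', 'size', 'totaltime', 'turnaround'] as $field) {
        $value = $row[$field] ?? '';
        // Empty or non-numeric values (such as '-') become null.
        $row[$field] = ctype_digit($value) ? (int) $value : null;
    }
    return $row;
}

// Usage with a few fields from the sample row:
$row = cast_s3_fields([
    'time'       => '10/Apr/2014:19:32:23 +0000',
    'status'     => '206',
    'bytes'      => '5016183',
    'size'       => '16149754',
    'totaltime'  => '216682',
    'turnaround' => '',
]);
echo $row['time']->format('Y-m-d H:i:s'), "\n"; // 2014-04-10 19:32:23
var_dump($row['status']);                       // int(206)
var_dump($row['turnaround']);                   // NULL
```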


 