
Parsing Amazon S3 log files (PHP)

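I want to parse Amazon S3 log files, which are space-delimited. The only problem is that some of the space-delimited fields themselves contain spaces. How would I parse such a file?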

450227804f8fd31c931036b020ddd0003a03b421d8c669d8858c7c15d72c renderd [10/Apr/2014:19:32:23 +0000] 75.256.56.200 450227804f8fd31c931036b020000343afa03b421d8c669d8858c7c15d72c 0231400AA3D3533C REST.GET.OBJECT Trailer.mp4 "GET /Trailer.mp4?AWSAccessKeyId=AKIAJFV33YRQMN63AQCQ&Expires=1397159234&Signature=8ipN9ymsB5gCzxChTu9lD6ZMrdA%3D HTTP/1.1" 206 - 5016183 16149754 216682 39 "http://example.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36" -
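To make the problem concrete, here is a minimal sketch of what a naive split on spaces does. The shortened `$line` is made up for illustration and is not a full S3 log line.

```php
<?php
// Shortened, made-up fragment: four logical fields, two of which contain spaces.
$line = 'renderd [10/Apr/2014:19:32:23 +0000] "GET /Trailer.mp4 HTTP/1.1" 206';

// Splitting on every space breaks the bracketed time and the quoted request apart.
$pieces = explode(' ', $line);

echo count($pieces), "\n"; // 7 pieces instead of 4 fields
echo $pieces[1], "\n";     // [10/Apr/2014:19:32:23
```

The time and request fields end up scattered over several array elements, which is why a delimiter-aware approach is needed.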

You can probably use a regular expression to split each log line into its individual parts.

Here is an example in PHP that does this:

<?php 
$string ='450227804f8fd31c931036b020ddd0003a03b421d8c669d8858c7c15d72c renderd [10/Apr/2014:19:32:23 +0000] 75.256.56.200 450227804f8fd31c931036b020000343afa03b421d8c669d8858c7c15d72c 0231400AA3D3533C REST.GET.OBJECT Trailer.mp4 "GET /Trailer.mp4?AWSAccessKeyId=AKIAJFV33YRQMN63AQCQ&Expires=1397159234&Signature=8ipN9ymsB5gCzxChTu9lD6ZMrdA%3D HTTP/1.1" 206 - 5016183 16149754 216682 39 "http://example.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36" -';

// One named group per log field. The bracketed time and the quoted fields may contain spaces.
$pattern = '/(?P<owner>\S+) (?P<bucket>\S+) (?P<time>\[[^]]*\]) (?P<ip>\S+) (?P<requester>\S+) (?P<reqid>\S+) (?P<operation>\S+) (?P<key>\S+) (?P<request>"[^"]*") (?P<status>\S+) (?P<error>\S+) (?P<bytes>\S+) (?P<size>\S+) (?P<totaltime>\S+) (?P<turnaround>\S+) (?P<referrer>"[^"]*") (?P<useragent>"[^"]*") (?P<version>\S+)/';

preg_match($pattern, $string, $matches);
print_r($matches);
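To apply such a pattern to a whole log file rather than a single string, something along these lines might work. This is only a sketch: `parseLines()` is a hypothetical helper, the pattern is shortened to the first four fields to keep the example brief, and the input lines are made up.

```php
<?php
// Shortened pattern covering only the first four fields.
// In practice, substitute the full pattern from the answer.
$pattern = '/^(?P<owner>\S+) (?P<bucket>\S+) (?P<time>\[[^]]*\]) (?P<ip>\S+)/';

// Hypothetical helper: returns the matches for every parsable line, plus the
// indexes of lines that did not match so they are not silently dropped.
function parseLines(iterable $lines, string $pattern): array
{
    $rows = [];
    $skipped = [];
    foreach ($lines as $number => $line) {
        $line = rtrim($line, "\r\n");
        if ($line === '') {
            continue; // ignore blank lines, e.g. at the end of a file
        }
        if (preg_match($pattern, $line, $m) === 1) {
            $rows[] = $m;
        } else {
            $skipped[] = $number;
        }
    }
    return [$rows, $skipped];
}

// Made-up input. For a real file, an SplFileObject opened on the log path
// can be iterated line by line in the same way.
$lines = [
    "owner1 renderd [10/Apr/2014:19:32:23 +0000] 192.0.2.10 rest-of-line\n",
    "this line is malformed\n",
];
[$rows, $skipped] = parseLines($lines, $pattern);

echo $rows[0]['bucket'], "\n"; // renderd
echo count($skipped), "\n";    // 1
```

Keeping track of the skipped lines makes it easier to notice when the log format changes and the pattern stops matching.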

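I modified Jeremy Quinton's answer slightly so that it matches better: the captured values no longer include the surrounding brackets and quotes, and the request is split into a method and a path.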

  <?php 
  $string ='450227804f8fd31c931036b020ddd0003a03b421d8c669d8858c7c15d72c renderd [10/Apr/2014:19:32:23 +0000] 75.256.56.200 450227804f8fd31c931036b020000343afa03b421d8c669d8858c7c15d72c 0231400AA3D3533C REST.GET.OBJECT Trailer.mp4 "GET /Trailer.mp4?AWSAccessKeyId=AKIAJFV33YRQMN63AQCQ&Expires=1397159234&Signature=8ipN9ymsB5gCzxChTu9lD6ZMrdA%3D HTTP/1.1" 206 - 5016183 16149754 216682 39 "http://example.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36" -';

  $pattern = '/(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^]]*)\] (?P<ip>\S+) (?P<requester>\S+) (?P<reqid>\S+) (?P<operation>\S+) (?P<key>\S+) "(?P<method>[^ ]*) (?P<path>[^"]*)" (?P<status>\S+) (?P<error>\S+) (?P<bytes>\S+) (?P<size>\S+) (?P<totaltime>\S+) (?P<turnaround>\S+) "(?P<referrer>[^"]*)" "(?P<useragent>[^"]*)" (?P<version>\S+)/';

  preg_match($pattern, $string, $matches);
  print_r($matches);

  ?>

  Result:
  Array
  (
      [0] => 450227804f8fd31c931036b020ddd0003a03b421d8c669d8858c7c15d72c renderd [10/Apr/2014:19:32:23 +0000] 75.256.56.200 450227804f8fd31c931036b020000343afa03b421d8c669d8858c7c15d72c 0231400AA3D3533C REST.GET.OBJECT Trailer.mp4 "GET /Trailer.mp4?AWSAccessKeyId=AKIAJFV33YRQMN63AQCQ&Expires=1397159234&Signature=8ipN9ymsB5gCzxChTu9lD6ZMrdA%3D HTTP/1.1" 206 - 5016183 16149754 216682 39 "http://example.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36" -
      [owner] => 450227804f8fd31c931036b020ddd0003a03b421d8c669d8858c7c15d72c
      [1] => 450227804f8fd31c931036b020ddd0003a03b421d8c669d8858c7c15d72c
      [bucket] => renderd
      [2] => renderd
      [time] => 10/Apr/2014:19:32:23 +0000
      [3] => 10/Apr/2014:19:32:23 +0000
      [ip] => 75.256.56.200
      [4] => 75.256.56.200
      [requester] => 450227804f8fd31c931036b020000343afa03b421d8c669d8858c7c15d72c
      [5] => 450227804f8fd31c931036b020000343afa03b421d8c669d8858c7c15d72c
      [reqid] => 0231400AA3D3533C
      [6] => 0231400AA3D3533C
      [operation] => REST.GET.OBJECT
      [7] => REST.GET.OBJECT
      [key] => Trailer.mp4
      [8] => Trailer.mp4
      [method] => GET
      [9] => GET
      [path] => /Trailer.mp4?AWSAccessKeyId=AKIAJFV33YRQMN63AQCQ&Expires=1397159234&Signature=8ipN9ymsB5gCzxChTu9lD6ZMrdA%3D HTTP/1.1
      [10] => /Trailer.mp4?AWSAccessKeyId=AKIAJFV33YRQMN63AQCQ&Expires=1397159234&Signature=8ipN9ymsB5gCzxChTu9lD6ZMrdA%3D HTTP/1.1
      [status] => 206
      [11] => 206
      [error] => -
      [12] => -
      [bytes] => 5016183
      [13] => 5016183
      [size] => 16149754
      [14] => 16149754
      [totaltime] => 216682
      [15] => 216682
      [turnaround] => 39
      [16] => 39
      [referrer] => http://example.com
      [17] => http://example.com
      [useragent] => Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36
      [18] => Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36
      [version] => -
      [19] => -
  )
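All captured values are strings. If you want typed values, a minimal sketch of the conversion could look like this. The `$matches` array below just repeats a few values from the result above, and `$toInt` is a hypothetical helper.

```php
<?php
// A few values copied from the result above.
$matches = [
    'time'      => '10/Apr/2014:19:32:23 +0000',
    'status'    => '206',
    'bytes'     => '5016183',
    'totaltime' => '216682',
];

// "d/M/Y:H:i:s O" mirrors the timestamp: day/month abbreviation/year:time, then the UTC offset.
$time = DateTimeImmutable::createFromFormat('d/M/Y:H:i:s O', $matches['time']);
if ($time === false) {
    throw new RuntimeException('Unexpected time format: ' . $matches['time']);
}

// Empty values are logged as a single dash, so map those to null instead of 0.
$toInt = function (string $value): ?int {
    return $value === '-' ? null : (int) $value;
};

echo $time->format('Y-m-d H:i:s'), "\n"; // 2014-04-10 19:32:23
echo $toInt($matches['bytes']), "\n";     // 5016183
var_dump($toInt('-'));                    // NULL
```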

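Amazon has since appended more fields to the log, so here is a new regular expression that includes the new fields:

  • Host Id
  • Signature Version
  • Cipher Suite
  • Authentication Type
  • Host Header
  • TLS version

There are a few other changes as well:

  • If the method + path, referrer, or useragent values are not surrounded by quotes, Yi's regex above does not match the log line at all. This is usually the case when a value is empty (logged as a single dash).
  • The HTTP protocol version is appended to the end of the path, so I split it out into a new protocol value.

Updated regex: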

$pattern = '/(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^]]*)\] (?P<ip>\S+) (?P<requester>\S+) (?P<reqid>\S+) (?P<operation>\S+) (?P<key>\S+) (-|"-"|"(?P<method>[^ ]*) (?P<path>\S+) (?P<protocol>[^"]*)") (?P<status>\S+) (?P<error>\S+) (?P<bytes>\S+) (?P<size>\S+) (?P<totaltime>\S+) (?P<turnaround>\S+) (-|"(?P<referrer>[^"]*)") (-|"(?P<useragent>[^"]*)") (?P<version>\S+) (?P<hostid>\S+) (?P<sigversion>\S+) (?P<ciphersuite>\S+) (?P<authtype>\S+) (?P<hostheader>\S+) (?P<tlsversion>\S+)/';
preg_match($pattern, $string, $matches);
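As a quick check of how the updated pattern behaves, here is a sketch that runs it against a made-up log line in the newer format. The line, including its IDs and host names, is invented for illustration, and its user agent is deliberately an unquoted dash.

```php
<?php
// Made-up log line with the extra trailing fields and an unquoted "-" user agent.
$string = 'ownerid examplebucket [06/Feb/2019:00:00:38 +0000] 192.0.2.3 requesterid 3E57427F3EXAMPLE REST.GET.VERSIONING - "GET /examplebucket?versioning HTTP/1.1" 200 - 113 - 7 - "-" - - hostidEXAMPLE= SigV4 ECDHE-RSA-AES128-GCM-SHA256 AuthHeader examplebucket.s3.us-west-1.amazonaws.com TLSV1.2';

// Same pattern as in the answer.
$pattern = '/(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^]]*)\] (?P<ip>\S+) (?P<requester>\S+) (?P<reqid>\S+) (?P<operation>\S+) (?P<key>\S+) (-|"-"|"(?P<method>[^ ]*) (?P<path>\S+) (?P<protocol>[^"]*)") (?P<status>\S+) (?P<error>\S+) (?P<bytes>\S+) (?P<size>\S+) (?P<totaltime>\S+) (?P<turnaround>\S+) (-|"(?P<referrer>[^"]*)") (-|"(?P<useragent>[^"]*)") (?P<version>\S+) (?P<hostid>\S+) (?P<sigversion>\S+) (?P<ciphersuite>\S+) (?P<authtype>\S+) (?P<hostheader>\S+) (?P<tlsversion>\S+)/';

if (preg_match($pattern, $string, $matches) !== 1) {
    throw new RuntimeException('Line did not match');
}

echo $matches['method'], ' ', $matches['path'], ' ', $matches['protocol'], "\n";
var_dump($matches['referrer']);  // quoted dash: captured as "-"
var_dump($matches['useragent']); // unquoted dash: group did not take part, so ""
echo $matches['tlsversion'], "\n";
```

Note that a quoted dash still comes back as "-" while an unquoted one leaves the named group empty, so some clean-up of the dashes is still useful afterwards.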

You can replace the empty values (dashes) and filter the duplicate numeric indexes out of the $matches array like this:

$matches = array_map(
    function($val) { return $val === '-' ? '' : $val; },
    array_filter(
        $matches,
        function($key) { return !is_numeric($key); },
        ARRAY_FILTER_USE_KEY
    )
);
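For illustration, here is the same clean-up wrapped in a function and applied to a small array. `cleanMatches()` is a hypothetical name, and the input is a made-up, abbreviated preg_match() result.

```php
<?php
// Hypothetical wrapper around the clean-up snippet above.
function cleanMatches(array $matches): array
{
    return array_map(
        // Turn the "empty" marker into an empty string.
        function ($val) { return $val === '-' ? '' : $val; },
        // Keep only the named groups and drop the numeric duplicates.
        array_filter(
            $matches,
            function ($key) { return !is_numeric($key); },
            ARRAY_FILTER_USE_KEY
        )
    );
}

// Made-up, abbreviated preg_match() result.
$matches = [0 => 'whole line', 'status' => '206', 1 => '206', 'error' => '-', 2 => '-'];

print_r(cleanMatches($matches));
// Array
// (
//     [status] => 206
//     [error] =>
// )
```

Both array_filter() and array_map() with a single input array preserve the string keys, so the field names survive the clean-up.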
