Parsing Amazon S3 log files (PHP)
I'm looking to parse Amazon S3 log files, which are space-delimited. The only problem is that some of the space-delimited fields themselves contain spaces. How would I go about parsing a file like this?
450227804f8fd31c931036b020ddd0003a03b421d8c669d8858c7c15d72c renderd [10/Apr/2014:19:32:23 +0000] 75.256.56.200 450227804f8fd31c931036b020000343afa03b421d8c669d8858c7c15d72c 0231400AA3D3533C REST.GET.OBJECT Trailer.mp4 "GET /Trailer.mp4?AWSAccessKeyId=AKIAJFV33YRQMN63AQCQ&Expires=1397159234&Signature=8ipN9ymsB5gCzxChTu9lD6ZMrdA%3D HTTP/1.1" 206 - 5016183 16149754 216682 39 "http://example.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36" -
You could probably use a regular expression to parse each log line into its various parts.
Here is an example in PHP that does that:
<?php
$string ='450227804f8fd31c931036b020ddd0003a03b421d8c669d8858c7c15d72c renderd [10/Apr/2014:19:32:23 +0000] 75.256.56.200 450227804f8fd31c931036b020000343afa03b421d8c669d8858c7c15d72c 0231400AA3D3533C REST.GET.OBJECT Trailer.mp4 "GET /Trailer.mp4?AWSAccessKeyId=AKIAJFV33YRQMN63AQCQ&Expires=1397159234&Signature=8ipN9ymsB5gCzxChTu9lD6ZMrdA%3D HTTP/1.1" 206 - 5016183 16149754 216682 39 "http://example.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36" -';
$pattern = '/(?P<owner>\S+) (?P<bucket>\S+) (?P<time>\[[^]]*\]) (?P<ip>\S+) (?P<requester>\S+) (?P<reqid>\S+) (?P<operation>\S+) (?P<key>\S+) (?P<request>"[^"]*") (?P<status>\S+) (?P<error>\S+) (?P<bytes>\S+) (?P<size>\S+) (?P<totaltime>\S+) (?P<turnaround>\S+) (?P<referrer>"[^"]*") (?P<useragent>"[^"]*") (?P<version>\S)/';
preg_match($pattern, $string, $matches);
print_r($matches);
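The example above matches a single hard-coded line. A minimal sketch of applying the same idea to many lines might look like the following. The function name `parseS3LogLines` is hypothetical, the pattern is passed in by the caller, and the `iterable` type hint assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper: run a log pattern over many lines and collect the matches.
// $lines can be any iterable of strings, for example the array returned by file().
function parseS3LogLines(iterable $lines, string $pattern): array
{
    $rows = [];
    foreach ($lines as $line) {
        $line = trim($line);
        if ($line === '') {
            continue; // skip blank lines
        }
        // Keep only lines that match; anything else is silently ignored here.
        if (preg_match($pattern, $line, $matches) === 1) {
            $rows[] = $matches;
        }
    }
    return $rows;
}
```

Assuming the log has been downloaded to a local file (the name `access.log` is made up), it could be called as `parseS3LogLines(file('access.log', FILE_IGNORE_NEW_LINES), $pattern)`. Whether unmatched lines should be skipped, logged, or treated as errors depends on your use case.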
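I slightly modified Jeremy Quinton's answer to match more cleanly: the captured values no longer include the surrounding brackets and quotes, and the request is split into a method and the rest of the request line (path):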
<?php
$string ='450227804f8fd31c931036b020ddd0003a03b421d8c669d8858c7c15d72c renderd [10/Apr/2014:19:32:23 +0000] 75.256.56.200 450227804f8fd31c931036b020000343afa03b421d8c669d8858c7c15d72c 0231400AA3D3533C REST.GET.OBJECT Trailer.mp4 "GET /Trailer.mp4?AWSAccessKeyId=AKIAJFV33YRQMN63AQCQ&Expires=1397159234&Signature=8ipN9ymsB5gCzxChTu9lD6ZMrdA%3D HTTP/1.1" 206 - 5016183 16149754 216682 39 "http://example.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36" -';
$pattern = '/(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^]]*)\] (?P<ip>\S+) (?P<requester>\S+) (?P<reqid>\S+) (?P<operation>\S+) (?P<key>\S+) "(?P<method>[^ ]*) (?P<path>[^"]*)" (?P<status>\S+) (?P<error>\S+) (?P<bytes>\S+) (?P<size>\S+) (?P<totaltime>\S+) (?P<turnaround>\S+) "(?P<referrer>[^"]*)" "(?P<useragent>[^"]*)" (?P<version>\S)/';
preg_match($pattern, $string, $matches);
print_r($matches);
?>
Result:
Array
(
[0] => 450227804f8fd31c931036b020ddd0003a03b421d8c669d8858c7c15d72c renderd [10/Apr/2014:19:32:23 +0000] 75.256.56.200 450227804f8fd31c931036b020000343afa03b421d8c669d8858c7c15d72c 0231400AA3D3533C REST.GET.OBJECT Trailer.mp4 "GET /Trailer.mp4?AWSAccessKeyId=AKIAJFV33YRQMN63AQCQ&Expires=1397159234&Signature=8ipN9ymsB5gCzxChTu9lD6ZMrdA%3D HTTP/1.1" 206 - 5016183 16149754 216682 39 "http://example.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36" -
[owner] => 450227804f8fd31c931036b020ddd0003a03b421d8c669d8858c7c15d72c
[1] => 450227804f8fd31c931036b020ddd0003a03b421d8c669d8858c7c15d72c
[bucket] => renderd
[2] => renderd
[time] => 10/Apr/2014:19:32:23 +0000
[3] => 10/Apr/2014:19:32:23 +0000
[ip] => 75.256.56.200
[4] => 75.256.56.200
[requester] => 450227804f8fd31c931036b020000343afa03b421d8c669d8858c7c15d72c
[5] => 450227804f8fd31c931036b020000343afa03b421d8c669d8858c7c15d72c
[reqid] => 0231400AA3D3533C
[6] => 0231400AA3D3533C
[operation] => REST.GET.OBJECT
[7] => REST.GET.OBJECT
[key] => Trailer.mp4
[8] => Trailer.mp4
[method] => GET
[9] => GET
[path] => /Trailer.mp4?AWSAccessKeyId=AKIAJFV33YRQMN63AQCQ&Expires=1397159234&Signature=8ipN9ymsB5gCzxChTu9lD6ZMrdA%3D HTTP/1.1
[10] => /Trailer.mp4?AWSAccessKeyId=AKIAJFV33YRQMN63AQCQ&Expires=1397159234&Signature=8ipN9ymsB5gCzxChTu9lD6ZMrdA%3D HTTP/1.1
[status] => 206
[11] => 206
[error] => -
[12] => -
[bytes] => 5016183
[13] => 5016183
[size] => 16149754
[14] => 16149754
[totaltime] => 216682
[15] => 216682
[turnaround] => 39
[16] => 39
[referrer] => http://example.com
[17] => http://example.com
[useragent] => Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36
[18] => Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36
[version] => -
[19] => -
)
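Because the `time` value is now captured without its brackets, it can be turned into a date object directly. The sketch below is one way to do that, assuming the timestamp always has the shape shown in the result above; the function name `parseS3Time` is hypothetical.

```php
<?php
// Hypothetical helper: convert the bracket-stripped S3 timestamp into a date object.
// Returns null when the string does not match the expected format.
function parseS3Time(string $time): ?DateTimeImmutable
{
    // d = day, M = short month name, Y = year, O = UTC offset such as +0000
    $dt = DateTimeImmutable::createFromFormat('d/M/Y:H:i:s O', $time);
    return $dt === false ? null : $dt;
}

$dt = parseS3Time('10/Apr/2014:19:32:23 +0000');
echo $dt->format(DATE_ATOM), "\n"; // 2014-04-10T19:32:23+00:00
```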
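Amazon has since appended more fields to the log, so here is a new regex that includes them.
There are some other changes too: the request, referrer, and user agent fields may now be a bare dash, the protocol is captured separately from the path, and version may be longer than one character.
Updated regex: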
$pattern = '/(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^]]*)\] (?P<ip>\S+) (?P<requester>\S+) (?P<reqid>\S+) (?P<operation>\S+) (?P<key>\S+) (-|"-"|"(?P<method>[^ ]*) (?P<path>\S+) (?P<protocol>[^"]*)") (?P<status>\S+) (?P<error>\S+) (?P<bytes>\S+) (?P<size>\S+) (?P<totaltime>\S+) (?P<turnaround>\S+) (-|"(?P<referrer>[^"]*)") (-|"(?P<useragent>[^"]*)") (?P<version>\S+) (?P<hostid>\S+) (?P<sigversion>\S+) (?P<ciphersuite>\S+) (?P<authtype>\S+) (?P<hostheader>\S+) (?P<tlsversion>\S+)/';
preg_match($pattern, $string, $matches);
You can replace the empty values (dashes) with empty strings and filter the duplicate numeric indices out of the $matches array like so:
$matches = array_map(
function($val) { return $val === '-' ? '' : $val; },
array_filter(
$matches,
function($key) { return !is_numeric($key); },
ARRAY_FILTER_USE_KEY
)
);
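Putting the updated regex and the cleanup step together, a self-contained sketch could look like this. The function name `parseS3LogLine` is hypothetical, the pattern is the one given above (split across string pieces only for readability), and the sample line is invented for illustration rather than taken from a real log.

```php
<?php
// Hypothetical wrapper: parse one S3 access log line into a clean associative array.
// Returns null when the line does not match the pattern.
function parseS3LogLine(string $line): ?array
{
    // Same pattern as the updated regex, concatenated from shorter pieces.
    $pattern = '/(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^]]*)\] (?P<ip>\S+) '
        . '(?P<requester>\S+) (?P<reqid>\S+) (?P<operation>\S+) (?P<key>\S+) '
        . '(-|"-"|"(?P<method>[^ ]*) (?P<path>\S+) (?P<protocol>[^"]*)") '
        . '(?P<status>\S+) (?P<error>\S+) (?P<bytes>\S+) (?P<size>\S+) '
        . '(?P<totaltime>\S+) (?P<turnaround>\S+) '
        . '(-|"(?P<referrer>[^"]*)") (-|"(?P<useragent>[^"]*)") '
        . '(?P<version>\S+) (?P<hostid>\S+) (?P<sigversion>\S+) '
        . '(?P<ciphersuite>\S+) (?P<authtype>\S+) (?P<hostheader>\S+) (?P<tlsversion>\S+)/';

    if (preg_match($pattern, $line, $matches) !== 1) {
        return null;
    }

    // Drop the numeric duplicates, then turn "-" placeholders into empty strings.
    $named = array_filter(
        $matches,
        function ($key) { return !is_numeric($key); },
        ARRAY_FILTER_USE_KEY
    );
    return array_map(
        function ($val) { return $val === '-' ? '' : $val; },
        $named
    );
}

// Invented sample line in the newer format (all values are placeholders).
$line = 'ownerid mybucket [10/Apr/2014:19:32:23 +0000] 192.0.2.1 - 0231400AA3D3533C '
    . 'REST.GET.OBJECT Trailer.mp4 "GET /Trailer.mp4 HTTP/1.1" 206 - 5016183 16149754 '
    . '216682 39 "http://example.com" "Mozilla/5.0 (Macintosh)" - hostid123= SigV4 '
    . 'ECDHE-RSA-AES128-GCM-SHA256 AuthHeader mybucket.s3.amazonaws.com TLSv1.2';
print_r(parseS3LogLine($line));
```

Note that this pattern requires the appended fields to be present, so lines in the older format, such as the sample in the question, will not match it and the function will return null for them.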