I'm looking to parse the amazon s3 log files which are space delimited. Only problem is, some of the space delimited fields contain spaces. How would I go about parsing a file like this?
450227804f8fd31c931036b020ddd0003a03b421d8c669d8858c7c15d72c renderd [10/Apr/2014:19:32:23 +0000] 75.256.56.200 450227804f8fd31c931036b020000343afa03b421d8c669d8858c7c15d72c 0231400AA3D3533C REST.GET.OBJECT Trailer.mp4 "GET /Trailer.mp4?AWSAccessKeyId=AKIAJFV33YRQMN63AQCQ&Expires=1397159234&Signature=8ipN9ymsB5gCzxChTu9lD6ZMrdA%3D HTTP/1.1" 206 - 5016183 16149754 216682 39 "http://example.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36" -
You could probably use a regular expression to parse the log file to get the various parts
Here is an example in PHP to do that
<?php
$string ='450227804f8fd31c931036b020ddd0003a03b421d8c669d8858c7c15d72c renderd [10/Apr/2014:19:32:23 +0000] 75.256.56.200 450227804f8fd31c931036b020000343afa03b421d8c669d8858c7c15d72c 0231400AA3D3533C REST.GET.OBJECT Trailer.mp4 "GET /Trailer.mp4?AWSAccessKeyId=AKIAJFV33YRQMN63AQCQ&Expires=1397159234&Signature=8ipN9ymsB5gCzxChTu9lD6ZMrdA%3D HTTP/1.1" 206 - 5016183 16149754 216682 39 "http://example.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36" -';
$pattern = '/(?P<owner>\S+) (?P<bucket>\S+) (?P<time>\[[^]]*\]) (?P<ip>\S+) (?P<requester>\S+) (?P<reqid>\S+) (?P<operation>\S+) (?P<key>\S+) (?P<request>"[^"]*") (?P<status>\S+) (?P<error>\S+) (?P<bytes>\S+) (?P<size>\S+) (?P<totaltime>\S+) (?P<turnaround>\S+) (?P<referrer>"[^"]*") (?P<useragent>"[^"]*") (?P<version>\S)/';
preg_match($pattern, $string, $matches);
print_r($matches);
I slightly modified the answer to Jeremy Quinton to make matched better
<?php
$string ='450227804f8fd31c931036b020ddd0003a03b421d8c669d8858c7c15d72c renderd [10/Apr/2014:19:32:23 +0000] 75.256.56.200 450227804f8fd31c931036b020000343afa03b421d8c669d8858c7c15d72c 0231400AA3D3533C REST.GET.OBJECT Trailer.mp4 "GET /Trailer.mp4?AWSAccessKeyId=AKIAJFV33YRQMN63AQCQ&Expires=1397159234&Signature=8ipN9ymsB5gCzxChTu9lD6ZMrdA%3D HTTP/1.1" 206 - 5016183 16149754 216682 39 "http://example.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36" -';
$pattern = '/(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^]]*)\] (?P<ip>\S+) (?P<requester>\S+) (?P<reqid>\S+) (?P<operation>\S+) (?P<key>\S+) "(?P<method>[^ ]*) (?P<path>[^"]*)" (?P<status>\S+) (?P<error>\S+) (?P<bytes>\S+) (?P<size>\S+) (?P<totaltime>\S+) (?P<turnaround>\S+) "(?P<referrer>[^"]*)" "(?P<useragent>[^"]*)" (?P<version>\S)/';
preg_match($pattern, $string, $matches);
print_r($matches);
?>
result :
Array
(
[0] => 450227804f8fd31c931036b020ddd0003a03b421d8c669d8858c7c15d72c renderd [10/Apr/2014:19:32:23 +0000] 75.256.56.200 450227804f8fd31c931036b020000343afa03b421d8c669d8858c7c15d72c 0231400AA3D3533C REST.GET.OBJECT Trailer.mp4 "GET /Trailer.mp4?AWSAccessKeyId=AKIAJFV33YRQMN63AQCQ&Expires=1397159234&Signature=8ipN9ymsB5gCzxChTu9lD6ZMrdA%3D HTTP/1.1" 206 - 5016183 16149754 216682 39 "http://example.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36" -
[owner] => 450227804f8fd31c931036b020ddd0003a03b421d8c669d8858c7c15d72c
[1] => 450227804f8fd31c931036b020ddd0003a03b421d8c669d8858c7c15d72c
[bucket] => renderd
[2] => renderd
[time] => 10/Apr/2014:19:32:23 +0000
[3] => 10/Apr/2014:19:32:23 +0000
[ip] => 75.256.56.200
[4] => 75.256.56.200
[requester] => 450227804f8fd31c931036b020000343afa03b421d8c669d8858c7c15d72c
[5] => 450227804f8fd31c931036b020000343afa03b421d8c669d8858c7c15d72c
[reqid] => 0231400AA3D3533C
[6] => 0231400AA3D3533C
[operation] => REST.GET.OBJECT
[7] => REST.GET.OBJECT
[key] => Trailer.mp4
[8] => Trailer.mp4
[method] => GET
[9] => GET
[path] => /Trailer.mp4?AWSAccessKeyId=AKIAJFV33YRQMN63AQCQ&Expires=1397159234&Signature=8ipN9ymsB5gCzxChTu9lD6ZMrdA%3D HTTP/1.1
[10] => /Trailer.mp4?AWSAccessKeyId=AKIAJFV33YRQMN63AQCQ&Expires=1397159234&Signature=8ipN9ymsB5gCzxChTu9lD6ZMrdA%3D HTTP/1.1
[status] => 206
[11] => 206
[error] => -
[12] => -
[bytes] => 5016183
[13] => 5016183
[size] => 16149754
[14] => 16149754
[totaltime] => 216682
[15] => 216682
[turnaround] => 39
[16] => 39
[referrer] => http://example.com
[17] => http://example.com
[useragent] => Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36
[18] => Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36
[version] => -
[19] => -
)
Amazon have appended more fields to the log now so this is a new Regex that includes the new fields:
There are some other changes too:
Updated Regex
$pattern = '/(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^]]*)\] (?P<ip>\S+) (?P<requester>\S+) (?P<reqid>\S+) (?P<operation>\S+) (?P<key>\S+) (-|"-"|"(?P<method>[^ ]*) (?P<path>\S+) (?P<protocol>[^"]*)") (?P<status>\S+) (?P<error>\S+) (?P<bytes>\S+) (?P<size>\S+) (?P<totaltime>\S+) (?P<turnaround>\S+) (-|"(?P<referrer>[^"]*)") (-|"(?P<useragent>[^"]*)") (?P<version>\S+) (?P<hostid>\S+) (?P<sigversion>\S+) (?P<ciphersuite>\S+) (?P<authtype>\S+) (?P<hostheader>\S+) (?P<tlsversion>\S+)/';
preg_match($pattern, $string, $matches);
You can replace the empty values (dashes) and filter out the duplicate numeric indices from the $matches array like so:
$matches = array_map(
function($val) { return $val === '-' ? '' : $val; },
array_filter(
$matches,
function($key) { return !is_numeric($key); },
ARRAY_FILTER_USE_KEY
)
);
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.