Regex for splitting Amazon S3 bucket log columns?

Question

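I'm setting up an ETL process for my company's S3 buckets so that we can track our usage, and I'm having some trouble breaking up the columns of the S3 log files, because Amazon uses spaces, double quotes, and square brackets to delimit columns.

I found this regex: [^\\s\"']+|\"([^\"]*)\"|'([^']*)' in this SO post: "Regex for splitting a string using space when not surrounded by single or double quotes", which gets me pretty close. I just need help adjusting it to ignore single quotes and to ignore spaces between a "[" and a "]".

Here's an example line from one of our files: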

dd8d30dd085515d73b318a83f4946b26d49294a95030e4a7919de0ba6654c362 ourbucket.name.config [31/Oct/2011:17:00:04 +0000] 184.191.213.218 - 013259AC1A20DF37 REST.GET.OBJECT ourbucket.name.config.txt "GET /ourbucket.name.config.txt HTTP/1.1" 200 - 325 325 16 16 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6" -

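Here is the format definition: http://s3browser.com/amazon-s3-bucket-logging-server-access-logs.php

Any help would be appreciated!

Edit: In response to FailDev, the output should be any string contained between two square brackets, e.g. [foo bar], between two quotes, e.g. "foo bar", or separated by spaces, e.g. foo bar (where foo and bar would each match individually). I've broken each match from the example line I provided onto its own line in the following block: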

dd8d30dd085515d73b318a83f4946b26d49294a95030e4a7919de0ba6654c362 
ourbucket.name.config 
[31/Oct/2011:17:00:04 +0000] 
184.191.213.218 
- 
013259AC1A20DF37 
REST.GET.OBJECT 
ourbucket.name.config.txt 
"GET /ourbucket.name.config.txt HTTP/1.1" 
200 
- 
325 
325 
16 
16 
"-" 
"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6" 
-

Answer 1

Here's a silly regex I wrote to parse S3 log files in Node:

/^(.*?)\s(.*?)\s(\[.*?\])\s(.*?)\s(.*?)\s(.*?)\s(.*?)\s(.*?)\s(\".*?\")\s(.*?)\s(.*?)\s(.*?)\s(.*?)\s(.*?)\s(.*?)\s(\".*?\")\s(\".*?\")\s(.*?)$/

As I said, it's "silly": it relies heavily on Amazon not changing the log format, and on no field containing any unusual characters.
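For anyone who wants to try the same positional approach outside Node, here is a rough Python port of the pattern above, run against the sample line from the question. It is only a sketch: it assumes exactly 18 space-separated fields, which is what the question's sample contains, and the variable names are mine.

```python
import re

# Python port of the positional pattern above: 18 fields separated by single
# whitespace characters, with a bracketed timestamp and three quoted fields.
S3_LINE = re.compile(
    r'^(.*?)\s(.*?)\s(\[.*?\])\s(.*?)\s(.*?)\s(.*?)\s(.*?)\s(.*?)\s'
    r'(".*?")\s(.*?)\s(.*?)\s(.*?)\s(.*?)\s(.*?)\s(.*?)\s(".*?")\s(".*?")\s(.*?)$'
)

# Sample line from the question, split across literals for readability.
line = (
    'dd8d30dd085515d73b318a83f4946b26d49294a95030e4a7919de0ba6654c362 '
    'ourbucket.name.config [31/Oct/2011:17:00:04 +0000] 184.191.213.218 - '
    '013259AC1A20DF37 REST.GET.OBJECT ourbucket.name.config.txt '
    '"GET /ourbucket.name.config.txt HTTP/1.1" 200 - 325 325 16 16 "-" '
    '"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) '
    'Gecko/20070725 Firefox/2.0.0.6" -'
)

match = S3_LINE.match(line)
# An empty list signals that the line did not fit the fixed layout.
fields = list(match.groups()) if match else []
for field in fields:
    print(field)
```

Brackets and quotes are kept in the captured values, as in the block shown in the question. Lines with more fields than this layout expects would need a longer pattern.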

Answer 2

You can't do this with string.Split; you need to loop through all the captures of the "column" group (if you're using C#).

This matches a non-quoted, non-bracketed field: [^\s\"\[\]]+
This matches a bracketed field: \[[^\]\[]+\] 
This matches a quoted field: \"[^\"]+\"

It's easiest to keep the quotes and brackets during matching, then strip them off afterwards using Trim('[', ']', '"').

@"^((?<column>[^\s""\[\]]+|\[[^\]\[]+\]|""[^""]+"")\s+)+$"
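The same three field shapes can be tried quickly in Python. This is a minimal sketch rather than a translation of the C# capture loop: Python's re module has no capture collections, so it uses findall on an alternation instead, and the helper name split_fields is hypothetical.

```python
import re

# The three field shapes described above, combined into one alternation:
# bracketed, quoted, or a plain run without whitespace, quotes, or brackets.
FIELD = re.compile(r'\[[^\]\[]+\]|"[^"]+"|[^\s"\[\]]+')


def split_fields(line):
    """Return the fields of one log line with surrounding [] or "" removed."""
    return [token.strip('[]"') for token in FIELD.findall(line)]


sample = (
    'mybucket [06/Feb/2014:00:00:38 +0000] 192.0.2.3 '
    '"GET /mybucket?policy HTTP/1.1" 404 "-"'
)
print(split_fields(sample))
```

Note that strip removes those characters from both ends of every token, so a plain field that legitimately starts or ends with a quote or bracket would be trimmed as well.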

Answer 3

Here's a Python solution that may help someone. It also removes the quotes and square brackets for you:

import re
log = '79a59df900b949e55d96a1e698fbacedfd6e09d98eacf8f8d5218e7cd47ef2be mybucket [06/Feb/2014:00:00:38 +0000] 192.0.2.3 79a59df900b949e55d96a1e698fbacedfd6e09d98eacf8f8d5218e7cd47ef2be A1206F460EXAMPLE REST.GET.BUCKETPOLICY - "GET /mybucket?policy HTTP/1.1" 404 NoSuchBucketPolicy 297 - 38 - "-" "S3Console/0.4" -'

# Raw string so that \[ and \] reach the regex engine without invalid-escape warnings
regex = r'(?:"([^"]+)")|(?:\[([^\]]+)\])|([^ ]+)'

# Result is a list of triples, with only one having a value
# (due to the three group types: '""' or '[]' or '')
result = re.compile(regex).findall(log)
for a, b, c in result:
    print(a or b or c)

Output:

79a59df900b949e55d96a1e698fbacedfd6e09d98eacf8f8d5218e7cd47ef2be
mybucket
06/Feb/2014:00:00:38 +0000
192.0.2.3
79a59df900b949e55d96a1e698fbacedfd6e09d98eacf8f8d5218e7cd47ef2be
A1206F460EXAMPLE
REST.GET.BUCKETPOLICY
-
GET /mybucket?policy HTTP/1.1
404
NoSuchBucketPolicy
297
-
38
-
-
S3Console/0.4
-
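If you want named fields rather than a flat list, the tokens can be zipped with a list of column names. A minimal sketch, assuming the 18-column layout of the sample line above; the names in FIELD_NAMES and the helper parse_line are my own labels, not official ones.

```python
import re

# Illustrative names for the 18 columns in the sample line (not official names).
FIELD_NAMES = [
    'bucket_owner', 'bucket', 'time', 'remote_ip', 'requester', 'request_id',
    'operation', 'key', 'request_uri', 'http_status', 'error_code',
    'bytes_sent', 'object_size', 'total_time', 'turn_around_time',
    'referrer', 'user_agent', 'version_id',
]

# Same pattern as above: quoted field, bracketed field, or a run of non-spaces.
TOKEN = re.compile(r'(?:"([^"]+)")|(?:\[([^\]]+)\])|([^ ]+)')


def parse_line(line):
    """Map the tokens of one log line onto FIELD_NAMES."""
    values = [a or b or c for a, b, c in TOKEN.findall(line)]
    return dict(zip(FIELD_NAMES, values))


log = (
    '79a59df900b949e55d96a1e698fbacedfd6e09d98eacf8f8d5218e7cd47ef2be mybucket '
    '[06/Feb/2014:00:00:38 +0000] 192.0.2.3 '
    '79a59df900b949e55d96a1e698fbacedfd6e09d98eacf8f8d5218e7cd47ef2be '
    'A1206F460EXAMPLE REST.GET.BUCKETPOLICY - "GET /mybucket?policy HTTP/1.1" '
    '404 NoSuchBucketPolicy 297 - 38 - "-" "S3Console/0.4" -'
)
entry = parse_line(log)
print(entry['operation'], entry['http_status'], entry['error_code'])
```

Because zip stops at the shorter sequence, extra trailing columns are silently dropped and short lines yield fewer keys, so check the length if that matters.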

Answer 4

I agree with @andy! Considering how long S3 access logs have been around, I can't believe more people haven't dealt with them.

Here's the regex I use:

/(?:([a-z0-9]+)|-) (?:([a-z0-9\.-_]+)|-) (?:\[([^\]]+)\]|-) (?:([0-9\.]+)|-) (?:([a-z0-9]+)|-) (?:([a-z0-9.-_]+)|-) (?:([a-z\.]+)|-) (?:([a-z0-9\.-_\/]+)|-) (?:"-"|"([^"]+)"|-) (?:(\d+)|-) (?:([a-z]+)|-) (?:(\d+)|-) (?:(\d+)|-) (?:(\d+)|-) (?:(\d+)|-) (?:"-"|"([^"]+)"|-) (?:"-"|"([^"]+)"|-) (?:([a-z0-9]+)|-)/i

If you're using node.js, you can use my module to make this easier to deal with, or port it to C#; the basic ideas are all there.

https://github.com/icodeforlove/s3-access-log-parser

Answer 5

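I tried using this in C# but found that the answer above contains some incorrect characters, and that the regex for the non-quoted, non-bracketed field has to come last, otherwise it matches everything (tested with http://regexstorm.net/tester):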

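The complete regex, with bracketed fields first, quoted fields second, and non-quoted, non-bracketed fields last:

(\[[^\]\[]+\])|("[^"]+")|([^\s"\[\]]+)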

A simple C# implementation:

    MatchCollection matches = Regex.Matches(contents, @"(\[[^\]\[]+\])|(""[^""]+"")|([^\s""\[\]]+)");
    for (int i = 0; i < matches.Count; i++)
    {
        Console.WriteLine(i + ": " + matches[i].ToString().Trim('[', ']', '"'));
    }
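To see why the order of alternatives can matter, here is a small Python sketch. It assumes a catch-all `\S+` branch, which is not the exact pattern from this answer, and is only meant to show how a permissive first branch splits bracketed and quoted fields on spaces.

```python
import re

line = 'mybucket [06/Feb/2014:00:00:38 +0000] "GET /mybucket?policy HTTP/1.1" 404'

# Catch-all branch first: \S+ succeeds at every position, so the bracketed
# and quoted branches are never tried and those fields break apart on spaces.
catch_all_first = re.compile(r'\S+|\[[^\]]+\]|"[^"]+"')

# Specific branches first: bracketed and quoted fields are consumed whole.
catch_all_last = re.compile(r'\[[^\]]+\]|"[^"]+"|\S+')

broken = catch_all_first.findall(line)
whole = catch_all_last.findall(line)
print(broken)
print(whole)
```

With the catch-all first the line yields seven fragments; with it last, four complete fields.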

Answer 6

This is a regex I copied from the AWS Knowledge Center and modified slightly to make it work in ASP.NET Core.

private static readonly Regex ACCESS_LOG_REGEX = new Regex("([^ ]*) ([^ ]*) \\[(.*?)\\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\"|-) (-|[0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\"|-) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*)");

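It works fine for us. If anyone wants a C# class to hold the access logs, below is the code that parses each line of the log file and creates an S3ServerAccessLog object for it.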

private List<S3ServerAccessLog> ParseLogs(string accessLogs)
{
    // split log file per new line since each log will be on a single line.
    var splittedLogs = accessLogs.Split("\r\n", StringSplitOptions.RemoveEmptyEntries);
    var parsedLogs = new List<S3ServerAccessLog>();

    foreach (var logLine in splittedLogs)
    {
        var parsedLog = ACCESS_LOG_REGEX.Split(logLine).Where(s => s.Length > 0).ToList();
                
        // construct 
        var logModel = new S3ServerAccessLog
        {
            BucketOwner = parsedLog[0],
            BucketName = parsedLog[1],
            RequestDateTime = DateTimeOffset.ParseExact(parsedLog[2], "dd/MMM/yyyy:HH:mm:ss K", CultureInfo.InvariantCulture),
            RemoteIP = parsedLog[3],
            Requester = parsedLog[4],
            RequestId = parsedLog[5],
            Operation = parsedLog[6],
            Key = parsedLog[7],
            RequestUri = parsedLog[8].Replace("\"", ""),
            HttpStatus = int.Parse(parsedLog[9]),
            ErrorCode = parsedLog[10],
            BytesSent = parsedLog[11],
            ObjectSize = parsedLog[12],
            TotalTime = parsedLog[13],
            TurnAroundTime = parsedLog[14],
            Referrer = parsedLog[15].Replace("\"", ""),
            UserAgent = parsedLog[16].Replace("\"", ""),
            VersionId = parsedLog[17],
            HostId = parsedLog[18],
            Sigv = parsedLog[19],
            CipherSuite = parsedLog[20],
            AuthType = parsedLog[21],
            EndPoint = parsedLog[22],
            TlsVersion = parsedLog[23]
        };

        parsedLogs.Add(logModel);
    }

    return parsedLogs;
}
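For comparison, the timestamp handling done by DateTimeOffset.ParseExact above looks roughly like this in Python. A small sketch: the helper name parse_s3_time is hypothetical, and %b relies on English month abbreviations, so it assumes the default C/English locale.

```python
from datetime import datetime


def parse_s3_time(raw):
    """Parse an S3 access-log timestamp such as '31/Oct/2011:17:00:04 +0000'.

    Surrounding square brackets, if present, are stripped first.
    """
    return datetime.strptime(raw.strip('[]'), '%d/%b/%Y:%H:%M:%S %z')


stamp = parse_s3_time('[31/Oct/2011:17:00:04 +0000]')
print(stamp.isoformat())
```

The result is a timezone-aware datetime, so it can be compared or converted without guessing the offset.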

Answer 7

I couldn't get any of the posted solutions to parse log file entries whose request URI contains a double quote, so this is what I came up with in Python:

import json
import re
from collections import namedtuple

FILENAME = '/tmp/2022-11/2022-11-01-20-21-34-AB64DC3459FF2F2B'

# define a named tuple to represent each log entry
LogEntry = namedtuple(
    'LogEntry',
    [
        'bucket_owner',
        'bucket',
        'timestamp',
        'remote_ip',
        'requester',
        'request_id',
        'operation',
        's3_key',
        'request_uri',
        'http_version',
        'status_code',
        'error_code',
        'bytes_sent',
        'object_size',
        'total_time',
        'turn_around_time',
        'referrer',
        'user_agent',
        'version_id',
        'host_id',
        'sigv',
        'cipher_suite',
        'auth_type',
        'endpoint',
        'tls_version',
        'access_point_arn'
    ]
)

# compile the regular expression for parsing log entries
LOG_ENTRY_PATTERN = re.compile(
    r'(\S+) (\S+) \[(.+)\] (\S+) (\S+) (\S+) (\S+) (\S+) "(.*) HTTP\/(\d\.\d)" (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) "(\S+)" "(.*)" (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+) (\S+)'
)

# open the access log file
with open(FILENAME, 'r') as f:
    # iterate over each line in the file
    for line in f:
        # ignore certain types of operations
        if 'BATCH.DELETE.OBJECT' not in line \
                and 'S3.TRANSITION_SIA.OBJECT' not in line \
                and 'REST.COPY.OBJECT_GET' not in line:
            # parse the log entry using the regular expression
            match = LOG_ENTRY_PATTERN.match(line)

            if match:
                # create a LogEntry named tuple from the parsed log entry
                log_entry = LogEntry(*match.groups())
                log_entry = dict(log_entry._asdict())

                for key in log_entry:
                    if log_entry[key] == '-':
                        log_entry[key] = None

                print(json.dumps(log_entry, indent=4, default=str))

Personally, I find a namedtuple cleaner to work with than a list; I then convert it to a dict for easy insertion into a MySQL database.
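To show the double-quote case in isolation, here is a small sketch that runs only the request-line part of the pattern above against a made-up fragment. The fragment is hypothetical, not a real log entry, and the QUOTED pattern stands in for the simpler quoted-field patterns used in other answers.

```python
import re

# Request-line part of the pattern above: the greedy (.*) runs up to the
# last occurrence of ' HTTP/x.y"', so quotes inside the URI are kept.
REQUEST = re.compile(r'"(.*) HTTP/(\d\.\d)"')

# Simpler quoted-field pattern: stops at the first closing quote.
QUOTED = re.compile(r'"([^"]*)"')

# Hypothetical fragment whose request URI contains a double quote.
fragment = '- "GET /photos/a"b.jpg HTTP/1.1" 200 -'

request_uri, http_version = REQUEST.search(fragment).groups()
truncated = QUOTED.search(fragment).group(1)
print(request_uri, http_version)
print(truncated)
```

The first pattern recovers the whole request, while the second cuts it off at the embedded quote. The trade-off is that the greedy match depends on the HTTP version marker being present in the request field.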

Solution 1
3 2013-11-20 16:26:58

Solution 2
2 Accepted 2011-11-01 01:16:11

Solution 3
1 2015-08-04 13:13:35

Solution 4
1 2015-09-30 12:38:59

Solution 5
0 2016-12-30 18:47:03

Solution 6
0 2021-04-17 10:37:00

Solution 7
0 2022-12-06 16:40:08
