Creating multiple dictionaries from 4 lists using regular expressions

Question

I have the following txt file:

197.109.77.178 - kertzmann3129 [21/Jun/2019:15:45:25 -0700] "DELETE /virtual/solutions/target/web+services HTTP/2.0" 203 26554
156.127.178.177 - okuneva5222 [21/Jun/2019:15:45:27 -0700] "DELETE /interactive/transparent/niches/revolutionize HTTP/1.1" 416 14701
100.32.205.59 - ortiz8891 [21/Jun/2019:15:45:28 -0700] "PATCH /architectures HTTP/1.0" 204 6048
168.95.156.240 - stark2413 [21/Jun/2019:15:45:31 -0700] "GET /engage HTTP/2.0" 201 9645
71.172.239.195 - dooley1853 [21/Jun/2019:15:45:32 -0700] "PUT /cutting-edge HTTP/2.0" 406 24498
180.95.121.94 - mohr6893 [21/Jun/2019:15:45:34 -0700] "PATCH /extensible/reinvent HTTP/1.1" 201 27330

I want to write a function that converts these lines into multiple dictionaries, one dictionary per line:

example_dict = {"host":"146.204.224.152", "user_name":"feest6811", "time":"21/Jun/2019:15:45:24 -0700", "request":"POST /incentivize HTTP/1.1"}

So far I have managed to build 4 lists covering all the entries, but I don't know how to create one dict for each line:

import re
def logs():
    with open("assets/logdata.txt", "r") as file:
        logdata = file.read()
        host = (re.findall('(.*?)\-',logdata))
        username = re.findall('\-(.*?)\[',logdata)
        time = re.findall('\[(.*?)\]', logdata)
        request = re.findall('\"(.*?)\"',logdata)
        #for line in range(len(logdata)):
            #dc = {'host':host[line], 'user_name':user_name[line], 'time':time[line], 'request':request[line]}

Answer 1

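Once you have fixed the regex problems you are running into, the code below will work for you. Note that re.findall returns a list, so until the patterns are tightened each value in the dictionary is a list of matches rather than a single string.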

import re

result = []
with open('data.txt') as f:
    lines = [l.strip() for l in f.readlines()]
    for logdata in lines:
        host = re.findall(r'(.*?)\-', logdata)
        username = re.findall(r'\-(.*?)\[', logdata)
        _time = re.findall(r'\[(.*?)\]', logdata)
        request = re.findall(r'"(.*?)"', logdata)
        result.append({'host': host, 'user_name': username,
                       'time': _time, 'request': request})
print(result)
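As one possible way of fixing the regex, here is a minimal sketch that matches each line against a single anchored pattern, so every field comes back as a plain string. The names LINE_RE and parse_line are hypothetical and not part of the original answer, and the pattern assumes every line follows the layout shown in the question.

```python
import re

# Hypothetical pattern: one capture group per field, anchored at the line start.
# Assumes host and user name contain no spaces, as in the sample data.
LINE_RE = re.compile(r'^(\S+) - (\S+) \[([^\]]+)\] "([^"]*)"')


def parse_line(line):
    """Return a dict for one log line, or None if the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    host, user_name, time_, request = m.groups()
    return {"host": host, "user_name": user_name, "time": time_, "request": request}


sample = '168.95.156.240 - stark2413 [21/Jun/2019:15:45:31 -0700] "GET /engage HTTP/2.0" 201 9645'
print(parse_line(sample))
```

Returning None for lines that do not match leaves it to the caller to decide whether to skip or report them.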

Answer 2

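Using str.split() and str.index() also works and avoids the need for regular expressions. You can also iterate directly over the file handle, which yields one line at a time, so you don't have to load the entire file into memory: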

result = []

with open('logdata.txt') as f:
    for line in f:
        # Isolate host and user_name, discarding the dash in between
        host, _, user_name, remaining = line.split(maxsplit=3)

        # Find the end of the datetime and isolate it
        end_bracket = remaining.index(']')
        time_ = remaining[1:end_bracket]

        # Take the request from between the double quotes that follow the time,
        # leaving out the status code and size at the end of the line
        request = remaining[end_bracket + 1:].split('"')[1]

        # Create the dictionary
        result.append({
            'host': host,
            'user_name': user_name,
            'time': time_,
            'request': request
        })

from pprint import pprint
pprint(result)
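To make the intermediate values easier to follow, here is a small walk-through of the same split-and-slice steps on a single line taken from the question. Extracting the request from between the double quotes is an assumption about the desired output, based on the example dictionary in the question.

```python
# One sample line from the question, with the trailing newline a file would yield.
line = '100.32.205.59 - ortiz8891 [21/Jun/2019:15:45:28 -0700] "PATCH /architectures HTTP/1.0" 204 6048\n'

# maxsplit=3 splits off the first three whitespace-separated tokens
# (host, the dash, user name) and leaves the rest of the line in one piece.
host, _, user_name, remaining = line.split(maxsplit=3)

# The timestamp sits between "[" (index 0 of remaining) and the first "]".
end_bracket = remaining.index(']')
time_ = remaining[1:end_bracket]

# Assumption: the request is the text between the first pair of double quotes.
request = remaining[end_bracket + 1:].split('"')[1]

record = {'host': host, 'user_name': user_name, 'time': time_, 'request': request}
print(record)
```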

Answer 3

The following snippet produces a list of dictionaries, one for each line in the log file.

import re


def parse_log(log_file):
    # The quotes sit outside the last group so that "request" is captured without them
    regex = re.compile(r'^([0-9\.]+) - (.*) \[(.*)\] "(.*)"')
    
    def _extract_field(match_object, tag, index, result):
        if match_object[index]:
            result[tag] = match_object[index]

    result = []
    with open(log_file) as fh:
        for line in fh:
            match = regex.search(line)
            if match:
                fields = {}
                _extract_field(match, 'host'     , 1, fields)
                _extract_field(match, 'user_name', 2, fields)
                _extract_field(match, 'time'     , 3, fields)
                _extract_field(match, 'request'  , 4, fields)
                # Append only when the line matched; otherwise fields is stale or undefined
                result.append(fields)

    return result


def main():
    result = parse_log('log.txt')

    for line in result:
        print(line)


if __name__ == '__main__':
    main()

Answer 4

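The function below returns a list of dictionaries, with the required keys and values matched from each line of assets/logdata.txt, as in your original question.

Note that proper error handling should be added on top of this, because there are obvious edge cases that could make the code stop unexpectedly. For example, if one of the patterns matches fewer times than the host pattern, indexing into its list raises an IndexError.

Please note the change to the host pattern, which is important. The original pattern in your example matched more than just the host part of each line. Adding the ^ anchor at the start of the pattern, together with re.MULTILINE, stops the false-positive matches against the rest of each line that occurred in your original example.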

import re
def logs():
    with open("assets/logdata.txt", "r") as file:
        logdata = file.read()
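    host = re.findall(r'^(.*?)\-', logdata, re.MULTILINE)
    username = re.findall(r'\-(.*?)\[', logdata)
    time = re.findall(r'\[(.*?)\]', logdata)
    request = re.findall(r'"(.*?)"', logdata)
    # The key is "user_name" to match the format requested in the question;
    # host and user name are stripped because the patterns capture surrounding spaces
    return [{"host": host[i].strip(), "user_name": username[i].strip(),
             "time": time[i], "request": request[i]}
            for i in range(len(host))]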

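The above is a simple, minimal solution based on your original post. There are many cleaner and more efficient ways to solve this, but I think it is more useful to start from your existing code and show you how to correct it, rather than just handing you a better-optimized solution that would mean relatively little to you.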
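The effect of the ^ anchor with re.MULTILINE can be checked on two lines from the question. This is only an illustrative sketch; the variable names unanchored and anchored are made up for the comparison.

```python
import re

# Two lines from the question; the first one has an extra dash in "/cutting-edge".
logdata = (
    '71.172.239.195 - dooley1853 [21/Jun/2019:15:45:32 -0700] "PUT /cutting-edge HTTP/2.0" 406 24498\n'
    '180.95.121.94 - mohr6893 [21/Jun/2019:15:45:34 -0700] "PATCH /extensible/reinvent HTTP/1.1" 201 27330\n'
)

# Without an anchor, every dash on a line ends a match, so the timezone offset
# and the dash in the URL produce extra, unwanted "hosts".
unanchored = re.findall(r'(.*?)-', logdata)

# With ^ and re.MULTILINE, a match can only start at the beginning of a line,
# so there is exactly one host per line.
anchored = re.findall(r'^(.*?)-', logdata, re.MULTILINE)

print(len(unanchored), len(anchored))
print([h.strip() for h in anchored])
```

Because the unanchored list is longer than the other three lists, pairing its entries by index with user names, times, and requests would misalign the fields.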

Answer 5

I am taking this course right now, and the answer I got is:

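import re
def logs():
    with open("assets/logdata.txt", "r") as file:
        logdata = file.read()

    # YOUR CODE HERE

    # Raw string so the backslashes reach the regex engine unchanged;
    # with re.VERBOSE, unescaped whitespace in the pattern is ignored
    pattern = r'''
    (?P<host>[\w.]*)
    (\ -\ )
    (?P<user_name>([a-z\-]*[\d]*))
    (\ \[)
    (?P<time>\w.*?)
    (\]\ \")
    (?P<request>\w.*)
    (\")
    '''

    lst = []

    # groupdict() returns only the named groups, one dict per matched line
    for item in re.finditer(pattern, logdata, re.VERBOSE):
        lst.append(item.groupdict())
    print(lst)
    return lst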
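For readers who want to try the named-group approach without the course file, here is a self-contained sketch that runs a similar verbose pattern over two lines from the question held in a string. The pattern is a simplified variant written for this illustration, not the course's reference solution, and it assumes the host and user name contain no spaces.

```python
import re

# Hypothetical verbose pattern. Under re.VERBOSE, whitespace and "#" comments
# are ignored, so literal spaces are written as [ ].
PATTERN = re.compile(r'''
    (?P<host>[\w.]+)        # host: word characters and dots
    [ ]-[ ]                 # separator: space, dash, space
    (?P<user_name>\S+)      # user name: any run of non-space characters
    [ ]\[                   # space and opening bracket
    (?P<time>[^\]]+)        # timestamp: everything up to the closing bracket
    \][ ]"                  # closing bracket, space, opening quote
    (?P<request>[^"]*)      # request: everything up to the closing quote
    "                       # closing quote
''', re.VERBOSE)

logdata = (
    '197.109.77.178 - kertzmann3129 [21/Jun/2019:15:45:25 -0700] "DELETE /virtual/solutions/target/web+services HTTP/2.0" 203 26554\n'
    '156.127.178.177 - okuneva5222 [21/Jun/2019:15:45:27 -0700] "DELETE /interactive/transparent/niches/revolutionize HTTP/1.1" 416 14701\n'
)

# groupdict() keeps only the named groups, giving one dictionary per log line.
records = [m.groupdict() for m in PATTERN.finditer(logdata)]
for record in records:
    print(record)
```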

Answer 1
1 2020-09-21 14:20:07

Answer 2
1 2020-09-21 14:25:42

Answer 3
1 2020-09-21 14:25:50

Answer 4
1 2020-09-21 14:29:18

Answer 5
1 2020-10-14 18:49:53
