Python script to read a text file and parse it into CSV format

Question

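I have looked through all the related questions but could not find a solution. I am very new to Python. This is what I have.

I set up a honeypot on an Ubuntu VM that monitors access attempts to my server, blocks them, and then writes the details of each attempt to a plain-text file. Each entry has the following format: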

INTRUSION ATTEMPT DETECTED! from 10.0.0.1:80 (2022-06-06 13:17:24)
--------------------------
GET / HTTP/1.1 
HOST: 10.0.0.1 
X-FORWARDED-SCHEME http 
X-FORWARDED-PROTO: http 
x-FORWARDED-For: 139.162.191.89 
X-Real-IP: 139.162.191.89 
Connection: close 
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X)
Accept: */*
Accept-Encoding: gzip

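The text file keeps growing as access attempts come in, but it is not in a format I can feed into other programs, such as CSV. What I want to do is take this file, read it, parse the information, write it to a separate file in CSV format, and then delete the contents of the original file to avoid duplicates.

I was thinking that deleting the contents after every read may not be necessary, and that duplicates could instead be found and omitted when writing the CSV file. However, I have noticed multiple attempts and log entries containing the same IP address, which means a single host is trying to gain access several times, so clearing the original file each time is probably best.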
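
As a rough illustration of the read-then-clear idea described above, the sketch below reads the whole log and then truncates it in place. The helper name `read_and_clear` and the throwaway demo file are hypothetical. The sketch assumes nothing writes to the log between the read and the truncate; if the honeypot appends during that window, those entries would be lost.

```python
import os
import tempfile


def read_and_clear(path):
    """Return the whole log as text, then empty the file (hypothetical helper)."""
    # 'r+' opens the file for reading and writing without truncating it first
    with open(path, 'r+') as f:
        text = f.read()
        f.seek(0)
        f.truncate()  # cut the file at position 0, i.e. remove all content
    return text


# small demonstration on a throwaway file, not on a real honeypot log
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'log_file.txt')
    with open(demo_path, 'w') as f:
        f.write('INTRUSION ATTEMPT DETECTED! from 10.0.0.1:80 (2022-06-06 13:17:24)\n')
    demo_text = read_and_clear(demo_path)
    demo_size_after = os.path.getsize(demo_path)
```

Clearing the file only after the CSV has been written successfully would be the safer order, since a failed parse would otherwise discard the raw entries.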

Answer 1

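This is rough code that needs to be adjusted and tested on your log file.

It reads the log file and parses the data, adds it to a DataFrame, and finally writes it to a CSV file.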

import re
# NOTE: pandas must be installed; if it is not, run "python -m pip install pandas -U"
import pandas as pd

# open and read the log file
# NOTE: replace 'log_file.txt' with the path to your log file
with open('log_file.txt', 'r') as f:
    log_txt = f.read()

# list that collects one dict per intrusion attempt
df_list = []

# split the log into attempts on this marker text
for msg in log_txt.split('INTRUSION ATTEMPT DETECTED! '):
    # skip empty or whitespace-only chunks (e.g. the text before the first marker)
    if not msg.strip():
        continue

    # counter used to name header lines that have no recognizable key
    unnamed_count = 0
    
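    # split on the dashed line to separate the "from ip (timestamp)" line from the headers
    # partition() does not raise if the separator is missing
    from_when, _, headers = msg.partition('\n--------------------------\n')

    # regex to extract the ip and timestamp
    # NOTE: you can rename the columns by changing the names inside the <>
    match = re.match(r'^from (?P<ip>\S+) \((?P<timestamp>.+)\)$', from_when.strip())
    if match is None:
        # skip chunks that do not start with the expected "from ip (timestamp)" line
        continue
    row_dict = match.groupdict()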

    # split the headers on the newline character
    for head in headers.split('\n'):
        # a line containing ":" is treated as a "key: value" header
        if ':' in head:
            # split on the first ":" only, so values may contain ":" themselves
            key, val = head.split(':', 1)
            row_dict[key.strip()] = val.strip()
        
        # known header without the ":"
        # NOTE: any other known header key can be handled the same way
        elif head.strip().startswith('X-FORWARDED-SCHEME '):
            # remove the key and keep the value
            row_dict['X-FORWARDED-SCHEME'] = head.replace('X-FORWARDED-SCHEME ', '').strip()
        
        # unknown header without the ":"
        elif head.strip():
            row_dict[f'unnamed:{unnamed_count}'] = head.strip()
            unnamed_count += 1
    
    # add the row to the list, with its keys sorted so the unnamed columns come first and the rest follow alphabetically
    df_list.append(dict(sorted(row_dict.items(), key=lambda item: (not item[0].startswith('unnamed'), item[0]))))

# convert the list to a DataFrame, then write it to a CSV file
df = pd.DataFrame(df_list)
# NOTE: replace 'out.csv' with the path of the output file
df.to_csv('out.csv', index=False)

Sample output

unnamed:0	Accept	Accept-Encoding	Connection	HOST	User-Agent	X-FORWARDED-PROTO	X-FORWARDED-SCHEME	X-Real-IP	ip	timestamp	x-FORWARDED-For
GET / HTTP/1.1	*/*	gzip	close	10.0.0.1	Mozilla/5.0 (Macintosh; Intel Mac OS X)	http	http	139.162.191.89	10.0.0.1:80	2022-06-06 13:17:24	139.162.191.89
GET / HTTP/1.1	*/*	gzip	close	10.0.0.1	Mozilla/5.0 (Macintosh; Intel Mac OS X)	http	http	139.162.191.89	10.0.0.1:80	2022-06-06 13:17:24	139.162.191.89
GET / HTTP/1.1	*/*	gzip	close	10.0.0.1	Mozilla/5.0 (Macintosh; Intel Mac OS X)	http	http	139.162.191.89	10.0.0.1:80	2022-06-06 13:17:24	139.162.191.89
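
Because the script is meant to be run repeatedly as the log grows, one possible follow-up is to append each batch to the existing CSV instead of overwriting it. The sketch below swaps pandas for the standard-library `csv` module. `append_rows` and the `COLUMNS` list are hypothetical names, and the sketch assumes a fixed set of columns is chosen up front, so headers outside that list are dropped and missing ones are left empty.

```python
import csv
import os
import tempfile

# hypothetical fixed column order; adjust to the headers you care about
COLUMNS = ['ip', 'timestamp', 'unnamed:0', 'HOST', 'User-Agent']


def append_rows(rows, csv_path, columns=COLUMNS):
    """Append parsed rows (dicts) to csv_path, writing the header only once."""
    # the header is needed only when the file is missing or still empty
    write_header = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, 'a', newline='') as f:
        # restval fills missing columns; extrasaction='ignore' drops unknown keys
        writer = csv.DictWriter(f, fieldnames=columns, restval='',
                                extrasaction='ignore')
        if write_header:
            writer.writeheader()
        writer.writerows(rows)


# demonstration: two runs appending to the same throwaway file
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, 'out.csv')
    row = {'ip': '10.0.0.1:80', 'timestamp': '2022-06-06 13:17:24',
           'unnamed:0': 'GET / HTTP/1.1', 'HOST': '10.0.0.1', 'Accept': '*/*'}
    append_rows([row], out_path)
    append_rows([row], out_path)
    with open(out_path, newline='') as f:
        result = list(csv.DictReader(f))
```

A fixed column list keeps every run aligned with the header written by the first run, which a per-run sorted column order would not guarantee if new headers appear later.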

Answer 1
1 2022-06-06 14:51:32