
Python script to read and parse a text file into CSV format

I looked through all the related questions and could not find a solution. I'm pretty new to Python. Here's what I've got.

I set up a honeypot on an Ubuntu VM that watches for access attempts to my server, blocks them, then writes the details of each attempt to a text file. Each entry looks like this:

INTRUSION ATTEMPT DETECTED! from 10.0.0.1:80 (2022-06-06 13:17:24)
--------------------------
GET / HTTP/1.1 
HOST: 10.0.0.1 
X-FORWARDED-SCHEME http 
X-FORWARDED-PROTO: http 
x-FORWARDED-For: 139.162.191.89 
X-Real-IP: 139.162.191.89 
Connection: close 
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X)
Accept: */*
Accept-Encoding: gzip

The text file just keeps growing with access attempts, but it's not in a format such as CSV that I can use with other programs. What I'd like to do is take this file, read it, parse the information and write it in CSV format to a separate file, then delete the contents of the original file to avoid duplicates.
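The read, process, then empty-the-log workflow described above could be sketched roughly as follows. The function name `drain_log` and the `handle_text` callback are hypothetical, chosen only for illustration. The sketch also assumes the honeypot does not write to the file between the read and the truncation; any lines written in that window would be lost.

```python
from pathlib import Path


def drain_log(log_path, handle_text):
    """Read the whole log, pass it to handle_text, then empty the log.

    The log is truncated only if handle_text returns without raising,
    so a failed CSV write does not throw the data away.
    Returns the number of characters that were processed.
    """
    path = Path(log_path)
    text = path.read_text(encoding="utf-8", errors="replace")
    if not text.strip():
        return 0  # nothing new to process

    handle_text(text)  # e.g. parse the text and append rows to the CSV

    # Assumption: nothing was appended since the read above.
    path.write_text("", encoding="utf-8")  # truncate in place
    return len(text)
```

Running this from a scheduled job would keep the log small, at the cost of the small race window mentioned above.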

I'm thinking that removing the contents after each read may not be needed, and that duplicates could instead be detected and omitted when writing the CSV file. However, I'm seeing multiple log entries with the same IP address, meaning one host is attempting access multiple times, so deleting the original each time may be best.
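If duplicates are filtered instead of deleting the log, keying on the IP address alone would discard genuine repeat attempts from the same host. A minimal sketch, assuming each parsed entry is a dict with hypothetical `ip` and `timestamp` fields, treats two entries as duplicates only when both fields match:

```python
def drop_duplicate_rows(rows, key_fields=("ip", "timestamp")):
    """Return rows with later repeats of the same key removed.

    Two rows count as duplicates only if every field in key_fields
    matches, so repeated attempts from one IP at different times
    are all kept.
    """
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row.get(field) for field in key_fields)
        if key in seen:
            continue  # same entry was already read on an earlier pass
        seen.add(key)
        unique.append(row)
    return unique
```

Two attempts logged by the same host within the same second would still collapse into one row with this key, so adding more fields to `key_fields` may be worthwhile.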

This is rough code that needs to be tweaked and tested on your log file.

It reads the log file, parses the data, adds it to a DataFrame and finally writes a CSV file.

import re
# NOTE: pandas must be installed; if it is not, run "python -m pip install -U pandas"
import pandas as pd

# open and read the log file
# NOTE: replace 'log_file.txt' with the path to your log file
with open('log_file.txt', 'r') as f:
    log_txt = f.read()

# rows collected for the DataFrame
df_list = []

# split the log into attempts on the banner text
for msg in log_txt.split('INTRUSION ATTEMPT DETECTED! '):
    # skip empty chunks (e.g. the text before the first banner)
    if not msg.strip():
        continue

    # counter for header lines that have no recognisable name
    unnamed_count = 0

    # split on the dashed line to separate the "from ... (timestamp)" line from the headers
    from_when, headers = msg.split('\n--------------------------\n', 1)

    # regex to extract the ip and timestamp
    # NOTE: you can rename the columns by changing the names inside the <>
    row_dict = re.match(r'^from (?P<ip>\S+) \((?P<timestamp>.+)\)$', from_when.strip()).groupdict()

    # split the headers on the newline character
    for head in headers.split('\n'):
        # a line containing ":" is a regular "name: value" header
        if ':' in head:
            # split on the first ":" and add the key and value to the dict
            key, val = head.split(':', 1)
            row_dict[key.strip()] = val.strip()

        # known header written without the ":"
        # NOTE: any other header you know of can be handled the same way
        elif head.strip().startswith('X-FORWARDED-SCHEME '):
            # clean and add
            row_dict['X-FORWARDED-SCHEME'] = head.replace('X-FORWARDED-SCHEME ', '').strip()

        # unknown line without a ":" (for example the request line)
        elif head.strip():
            row_dict[f'unnamed:{unnamed_count}'] = head.strip()
            unnamed_count += 1
    
    # add the row to the list, with its keys sorted so the unnamed columns come first, then the rest alphabetically
    df_list.append(dict(sorted(row_dict.items(), key=lambda kv: (not kv[0].startswith('unnamed'), kv[0]))))

# convert the list of rows to a DataFrame, then write it to a CSV file
df = pd.DataFrame(df_list)
# NOTE: replace 'out.csv' with the path of the output file
df.to_csv('out.csv', index=False)
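If installing pandas is not an option, the same idea can be sketched with only the standard library. This is not the code above rewritten line for line. It makes two assumptions of its own: the first line after the dashed separator is always the request line (stored under a hypothetical `request` column instead of `unnamed:0`), and any header without a ":" is split on its first space. The names `parse_log` and `rows_to_csv` are illustrative.

```python
import csv
import io
import re

MARKER = "INTRUSION ATTEMPT DETECTED! "
FIRST_LINE = re.compile(r"from (?P<ip>\S+) \((?P<timestamp>[^)]+)\)")


def parse_log(text):
    """Return one dict per intrusion block found in text."""
    rows = []
    for block in text.split(MARKER):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        match = FIRST_LINE.match(lines[0]) if lines else None
        if match is None:
            continue  # empty chunk, or text that is not a block
        row = match.groupdict()
        # Drop the dashed separator; assume the next line is the request line.
        body = [ln for ln in lines[1:] if set(ln) != {"-"}]
        if body:
            row["request"] = body[0]
        for line in body[1:]:
            # Headers normally look like "Name: value"; otherwise use the first space.
            sep = ":" if ":" in line else " "
            key, _, value = line.partition(sep)
            row[key.strip()] = value.strip()
        rows.append(row)
    return rows


def rows_to_csv(rows):
    """Serialise rows to CSV text, using the union of all keys as columns."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Writing the result is then a matter of `open('out.csv', 'w', newline='').write(rows_to_csv(parse_log(log_txt)))`. Columns appear in the order they are first seen, not sorted as in the pandas version.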

Sample output

unnamed:0 | Accept | Accept-Encoding | Connection | HOST | User-Agent | X-FORWARDED-PROTO | X-FORWARDED-SCHEME | X-Real-IP | ip | timestamp | x-FORWARDED-For
GET / HTTP/1.1 | */* | gzip | close | 10.0.0.1 | Mozilla/5.0 (Macintosh; Intel Mac OS X) | http | http | 139.162.191.89 | 10.0.0.1:80 | 2022-06-06 13:17:24 | 139.162.191.89
GET / HTTP/1.1 | */* | gzip | close | 10.0.0.1 | Mozilla/5.0 (Macintosh; Intel Mac OS X) | http | http | 139.162.191.89 | 10.0.0.1:80 | 2022-06-06 13:17:24 | 139.162.191.89
GET / HTTP/1.1 | */* | gzip | close | 10.0.0.1 | Mozilla/5.0 (Macintosh; Intel Mac OS X) | http | http | 139.162.191.89 | 10.0.0.1:80 | 2022-06-06 13:17:24 | 139.162.191.89

