Python script to read and parse a text file into CSV format
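I have looked at all the related questions but couldn't find a solution. I'm very new to Python. Here is what I have.
I set up a honeypot on an Ubuntu VM that monitors access attempts to my server, blocks them, and then writes the details of each attempt to a text file. Each entry is formatted like this: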
INTRUSION ATTEMPT DETECTED! from 10.0.0.1:80 (2022-06-06 13:17:24)
--------------------------
GET / HTTP/1.1
HOST: 10.0.0.1
X-FORWARDED-SCHEME http
X-FORWARDED-PROTO: http
x-FORWARDED-For: 139.162.191.89
X-Real-IP: 139.162.191.89
Connection: close
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X)
Accept: */*
Accept-Encoding: gzip
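The text file keeps growing as access attempts come in, but it isn't in a format I can use with other programs, such as CSV. What I want to do is take this file, read it, parse the information, write it in CSV format to a separate file, and then delete the contents of the original file to avoid duplicates.
I was thinking that it might not be necessary to delete the contents after every read, and that this could be handled by finding duplicates and leaving them out of the CSV file. However, I've noticed multiple attempts and log entries containing the same IP address, which means one host is trying to get in several times, so deleting the original file's contents each time is probably best.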
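The two ideas above can be reconciled by deduplicating on the whole entry rather than on the IP address alone, so that repeated attempts from one host are kept while an entry that was read twice is dropped. The following is only a minimal sketch under stated assumptions: the helper name `drop_duplicates`, the field names `ip` and `timestamp`, and the second timestamp in the sample data are all hypothetical and not part of the honeypot's output.

```python
def drop_duplicates(rows, key_fields=('ip', 'timestamp')):
    """Keep the first row for each distinct combination of key_fields.

    Assumption: each row is a dict that has already been parsed from one
    log entry, and 'ip' plus 'timestamp' identify an entry well enough.
    """
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row.get(field) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [
    {'ip': '10.0.0.1:80', 'timestamp': '2022-06-06 13:17:24'},
    {'ip': '10.0.0.1:80', 'timestamp': '2022-06-06 13:17:24'},  # same entry read twice
    {'ip': '10.0.0.1:80', 'timestamp': '2022-06-06 13:18:02'},  # same host, later attempt (made-up time)
]
print(len(drop_duplicates(rows)))  # 2
```

Because the timestamps in the log have one-second resolution, two genuine attempts from the same address within the same second would be merged by this key, which is one more argument for clearing the file after each read instead.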
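Here is rough code that will need to be adjusted and tested against your log file.
It reads the log file, parses the data, adds it to a DataFrame, and finally writes it to a CSV file.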
import re
# NOTE: make sure pandas is installed; otherwise run "python -m pip install pandas -U"
import pandas as pd

# open and read the log file
# NOTE: replace 'log_file.txt' with the path to your log file
with open('log_file.txt', 'r') as f:
    log_txt = f.read()

# list that collects one dict per intrusion attempt
df_list = []

# split the log into attempts on this marker
for msg in log_txt.split('INTRUSION ATTEMPT DETECTED! '):
    # skip empty chunks (e.g. the text before the first marker)
    if not msg.strip():
        continue
    # counter for lines that have no recognizable key
    unnamed_count = 0
    # split on the dashed line to separate the ip/timestamp line from the headers
    from_when, headers = msg.split('\n--------------------------\n', 1)
    # regex to extract the ip and timestamp
    # NOTE: you can rename the columns by changing the names inside the <>
    row_dict = re.match(r'^from (?P<ip>\S+) \((?P<timestamp>.+)\)$', from_when.strip()).groupdict()
    # go through the headers line by line
    for head in headers.split('\n'):
        # a line containing ":" is a regular "key: value" header
        if ':' in head:
            # split on the first ":" and add the key and value to the dict
            key, val = head.split(':', 1)
            row_dict[key.strip()] = val.strip()
        # known header without the ":"
        # NOTE: you can handle any other header key you know of in the same way
        elif 'X-FORWARDED-SCHEME ' in head.strip():
            # clean and add
            row_dict['X-FORWARDED-SCHEME'] = head.replace('X-FORWARDED-SCHEME ', '').strip()
        # unknown line without a ":" (e.g. the request line)
        elif head.strip():
            row_dict[f'unnamed:{unnamed_count}'] = head.strip()
            unnamed_count += 1
    # add the row to the list after sorting its keys: unnamed first,
    # then alphabetically (uppercase sorts before lowercase)
    df_list.append(dict(sorted(row_dict.items(), key=lambda x: (-x[0].startswith('unnamed'), x))))

# convert the list to a DataFrame, then write it to a CSV file
df = pd.DataFrame(df_list)
# NOTE: replace 'out.csv' with the output file name/path
df.to_csv('out.csv', index=False)
Sample output
unnamed:0 | Accept | Accept-Encoding | Connection | HOST | User-Agent | X-FORWARDED-PROTO | X-FORWARDED-SCHEME | X-Real-IP | ip | timestamp | x-FORWARDED-For |
---|---|---|---|---|---|---|---|---|---|---|---|
GET / HTTP/1.1 | */* | gzip | close | 10.0.0.1 | Mozilla/5.0 (Macintosh; Intel Mac OS X) | http | http | 139.162.191.89 | 10.0.0.1:80 | 2022-06-06 13:17:24 | 139.162.191.89 |
GET / HTTP/1.1 | */* | gzip | close | 10.0.0.1 | Mozilla/5.0 (Macintosh; Intel Mac OS X) | http | http | 139.162.191.89 | 10.0.0.1:80 | 2022-06-06 13:17:24 | 139.162.191.89 |
GET / HTTP/1.1 | */* | gzip | close | 10.0.0.1 | Mozilla/5.0 (Macintosh; Intel Mac OS X) | http | http | 139.162.191.89 | 10.0.0.1:80 | 2022-06-06 13:17:24 | 139.162.191.89 |
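The question also asks for two steps after parsing: adding the new rows to an existing CSV file and emptying the log so the same entries are not read again. A minimal sketch of those two steps, using only the standard library, might look like the following. The helper names `append_rows` and `clear_file` are hypothetical, the column list is fixed up front (headers outside it are silently ignored), and it assumes the honeypot appends to the log file on each write rather than holding a write position of its own.

```python
import csv
import os
import tempfile


def append_rows(csv_path, rows, fieldnames):
    """Append dict rows to csv_path, writing the header only for a new or empty file."""
    write_header = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, 'a', newline='') as f:
        # restval fills in missing columns; extrasaction='ignore' drops unknown ones
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval='', extrasaction='ignore')
        if write_header:
            writer.writeheader()
        writer.writerows(rows)


def clear_file(path):
    """Truncate path to zero bytes while keeping the file itself in place."""
    with open(path, 'w'):
        pass


# Usage example in a temporary directory, so no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, 'log_file.txt')
    csv_path = os.path.join(tmp, 'out.csv')
    with open(log_path, 'w') as f:
        f.write('INTRUSION ATTEMPT DETECTED! from 10.0.0.1:80 (2022-06-06 13:17:24)\n')

    # rows would normally come from the parsing step; one is hard-coded here
    rows = [{'ip': '10.0.0.1:80', 'timestamp': '2022-06-06 13:17:24'}]
    append_rows(csv_path, rows, ['ip', 'timestamp'])
    clear_file(log_path)
    print(os.path.getsize(log_path))  # 0
```

Anything the honeypot writes between reading the log and truncating it would be lost with this approach, so on a busy server it may be safer to rename the log first and parse the renamed copy.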