How do I parse a .log file with Python and pandas and save it to a DataFrame?

Question

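I'm working with a sample log file from a vending machine (I'm very new to pandas). The machine generates one .log file per day.

Q: How can I use Python and pandas to extract the information from the .log file and save it to a DataFrame for further analysis? (Sample input and output are provided below.)

Below you can find my sample code and the sample.log file: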

import os

filePath = os.path.expanduser("~/sample.log")  # open() does not expand "~" by itself
with open(filePath) as fp:
    lines = fp.read()
    print(lines)

The sample log file, with part of it shown as text:

[1] Tues Jan2019 11:03:33 - (1000000) sample \FILES\testing1.LOG
TestID: ACCD2
001 2019/02/14 00:00:20 [EVENT_NONE(0)] test1Voltage=13.8V test1Current=2.1A
test2Voltage=11.8V test2Current=12.1A currentstate=NORMAL_RUN(0) BatteryHealth=100% Battylife(hr)=1hour  Battery test speed (mph)=15 BatteryLoading=OFF
002 2019/02/14 00:00:23 [EVENT_NONE(0)] test1Voltage=13.8V test1Current=2.1A
test2Voltage=11.8V test2Current=12.1A currentstate=NORMAL_RUN(0) BatteryHealth=100% Battylife(hr)=1hour  Battery test speed (mph)=15 BatteryLoading=OFF
003 2019/02/14 00:00:32 [EVENT_NONE(0)] test1Voltage=13.8V test1Current=2.1A
test2Voltage=11.8V test2Current=12.1A currentstate=NORMAL_RUN(0) BatteryHealth=100% Battylife(hr)=1hour Battery test speed (mph)=15 BatteryLoading=OFF

[Image: sample log file]

The desired sample output is shown below:

[Image: desired output]

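From the sample above we can clearly see that the log file contains many records that follow a similar pattern. Here are some of my thoughts and notes: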

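The output should ignore the first two lines, because we only want to extract the records from 001 to 000; alternatively, we can ignore 000 and extract only the records from 001 to 245.
There are no blank lines (the blank lines shown above are only there for readability); each record ends with 'BatteryLoading=OFF', after which a new event record starts with a new number such as 010, 011, and so on.
Within each record, all fields are separated by spaces, for example 001 2019/02/14 00:00:20 [EVENT_NONE(0)], and a field may contain characters such as "[", "]", "(", ")" or "%".
There may be cases like: Battery test speed (mph)=15 ... BatteryLoading=OFF. Here the left side of "=" contains spaces (the words of "Battery test speed (mph)" are separated by spaces, but we want to store "Battery test speed (mph)" as a single name in the final result); the right side of "=" never contains spaces.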
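
The notes above can be turned into a small regular-expression sketch. It assumes each record sits on one line and has exactly the layout of the pasted sample (three-digit id, date, time, a bracketed event, then key=value pairs); `parse_record`, `HEADER` and `PAIR` are hypothetical names introduced only for illustration.

```python
import re

# Assumed record layout, taken from the sample above:
# "<3-digit id> <date> <time> [<event>] key=value key=value ..."
HEADER = re.compile(
    r"^(?P<id>\d{3})\s+(?P<date>\d{4}/\d{2}/\d{2})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2})\s+\[(?P<event>[^\]]*)\]\s*(?P<rest>.*)$"
)
# A key is any run of characters without "=" (so it may contain spaces);
# a value is everything up to the next whitespace.
PAIR = re.compile(r"([^=]+?)=(\S+)")


def parse_record(record):
    """Return one record as a dict, or None if it does not look like a record."""
    m = HEADER.match(record.strip())
    if m is None:
        return None
    row = m.groupdict()
    rest = row.pop("rest")
    for key, value in PAIR.findall(rest):
        row[key.strip()] = value  # strip the space that separated the pairs
    return row


sample = (
    "001 2019/02/14 00:00:20 [EVENT_NONE(0)] test1Voltage=13.8V test1Current=2.1A "
    "test2Voltage=11.8V test2Current=12.1A currentstate=NORMAL_RUN(0) "
    "BatteryHealth=100% Battylife(hr)=1hour  Battery test speed (mph)=15 "
    "BatteryLoading=OFF"
)
row = parse_record(sample)
print(row["Battery test speed (mph)"], row["BatteryLoading"])
```

A list of such dicts, one per record, can be passed to pandas.DataFrame to get one column per key. Header lines such as "TestID: ACCD2" return None and can be filtered out before building the frame.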

I'm not sure how to handle this case. Could someone share some code for processing the log file above? Thank you.

Answer 1

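Welcome to Python!

You took a correct first step by reading the whole file at once, but what I'm going to show uses fp.readline() to read one line at a time. From section 7.2.1 of the docs:

if f.readline() returns an empty string, the end of the file has been reached

We will implement an end-of-file check based on that.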

import pandas as pd

with open(filePath, 'r', encoding='utf16') as fp:

    fp.readline()  # first line skipped
    fp.readline()  # second line skipped

    data = []  # make a list to collect data

    while True:

        ln = fp.readline()

        if ln == '':
            break  # end-of-file check

        entities = ln.rstrip('\n').split(' ')  # split the line on the space character; this answer assumes each line yields 12 entities

        entities = [entity.split('=')[-1] for entity in entities]  # further split each entity on `=` and keep only the last piece. Check for yourself how split works on a string with or without `=`.

        data.append(entities)  # collected by the list

    data_df = pd.DataFrame(data)  # optionally pass columns=[...] with one name per entity to set the column headers

If you had pasted the data as text I could have tested my code, but as it stands you will need to help with that part.
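
The end-of-file check in the readline() loop above can be tried without the real log file. This minimal sketch uses io.StringIO as a stand-in for the opened file (an assumption made only so the example is self-contained) and shows that a blank line comes back as '\n', so only a true end of file yields the empty string.

```python
import io

# Stand-in for the opened log file; the middle line is intentionally blank.
fp = io.StringIO("first\n\nlast\n")

lines = []
while True:
    ln = fp.readline()
    if ln == "":  # empty string: end of file reached
        break
    lines.append(ln)  # a blank line arrives as "\n", not ""

print(lines)
```

Because a blank line is never mistaken for the end of the file, the loop keeps going even if the real log turns out to contain empty lines.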

Answer 2

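The question itself is full of problems and ambiguity. The line-number handling seems odd. Your question implies that the first 000 line should be ignored. Nevertheless, this may help you get started: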

from collections import defaultdict
from pandas import DataFrame
import sys

DATA = defaultdict(dict)
SKIP = 2
# List of columns of interest
COLUMNS = ['test1Voltage', 'test1Current', 'test2Voltage', 'test2Current', 'currentstate',
           'BatteryHealth', 'Battylife(hr)', 'Battery test speed (mph)', 'BatteryLoading']


with open('testing1.log') as log:
    for _ in range(SKIP):
        next(log)
    for line in log:
        try:
            if (lineno := int(line.split()[0])) == 0 and len(DATA) != 0:
                break
            for c in COLUMNS:
                try:
                    i = line.index(c)
                    DATA[lineno][c] = line[i+len(c)+1:].split()[0]
                except ValueError:
                    pass
        except Exception as e:
            print(f'Unable to process:-\n{line}...due to {e}', file=sys.stderr)

df = DataFrame.from_dict(DATA, orient='index')

print(df)

Output:

    test1Voltage test1Current test2Voltage test2Current   currentstate BatteryHealth Battylife(hr) BatteryLoading
0          13.8V         2.1A        11.8V        12.1A  NORMAL_RUN(0)          100%         1hour            OFF
1          13.8V         2.1A        11.8V        12.1A  NORMAL_RUN(0)          100%         1hour            OFF
2          13.8V         2.1A        11.8V        12.1A  NORMAL_RUN(0)          100%         1hour            OFF
3          13.8V         2.1A        11.8V        12.1A  NORMAL_RUN(0)          100%         1hour            OFF
4          13.8V         2.1A        11.8V        12.1A  NORMAL_RUN(0)          100%         1hour            OFF
245        13.8V         2.1A        11.8V        12.1A  NORMAL_RUN(0)          100%         1hour            OFF
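
One caveat: in the sample pasted in the question, each record appears wrapped over two physical lines, and the question does not make clear whether the real file looks like that. If it does, a pre-processing step along these lines could glue the pieces back together before either answer's parsing runs. This is only a sketch; `join_wrapped` and `RECORD_START` are hypothetical names, and the rule "a record starts with a three-digit number" is an assumption taken from the sample.

```python
import re

# Assumption: a new record always begins with a three-digit number and a space.
RECORD_START = re.compile(r"^\d{3}\s")


def join_wrapped(lines):
    """Yield one string per record, appending continuation lines to their record."""
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if RECORD_START.match(line):
            if current is not None:
                yield current
            current = line
        elif current is not None:
            current += " " + line
        # lines seen before the first record (the two header lines) are dropped
    if current is not None:
        yield current


sample = [
    "[1] Tues Jan2019 11:03:33 - (1000000) sample \\FILES\\testing1.LOG\n",
    "TestID: ACCD2\n",
    "001 2019/02/14 00:00:20 [EVENT_NONE(0)] test1Voltage=13.8V test1Current=2.1A\n",
    "test2Voltage=11.8V test2Current=12.1A BatteryLoading=OFF\n",
    "002 2019/02/14 00:00:23 [EVENT_NONE(0)] test1Voltage=13.8V test1Current=2.1A\n",
    "test2Voltage=11.8V test2Current=12.1A BatteryLoading=OFF\n",
]
records = list(join_wrapped(sample))
print(len(records))
```

Since an open file object is itself an iterable of lines, join_wrapped(log) could replace the plain `for line in log` iteration, with the header skipping then handled by the generator.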
