Is there a faster way to parse this text file?

Question

I am parsing date/time/measurement information from text files that look like this:

[Sun Jul 15 09:05:56.724 2018] *000129.32347
[Sun Jul 15 09:05:57.722 2018] *000129.32352
[Sun Jul 15 09:05:58.721 2018] *000129.32342
[Sun Jul 15 09:05:59.719 2018] *000129.32338
[Sun Jul 15 09:06:00.733 2018] *000129.32338
[Sun Jul 15 09:06:01.732 2018] *000129.32352

The results go into an output file like this:

07-15-2018 09:05:56.724, 29.32347
07-15-2018 09:05:57.722, 29.32352
07-15-2018 09:05:58.721, 29.32342
07-15-2018 09:05:59.719, 29.32338
07-15-2018 09:06:00.733, 29.32338
07-15-2018 09:06:01.732, 29.32352

The code I am using looks like this:

import os
import datetime

with open('dq_barorun_20180715_calibtest.log', 'r') as fh, open('output.txt' , 'w') as fh2:
    for line in fh:
        line = line.split()
        monthalpha = line[1]
        month = datetime.datetime.strptime(monthalpha, '%b').strftime('%m')
        day = line[2]
        time = line[3]
        yearbracket = line[4]
        year = yearbracket[0:4]
        pressfull = line[5]
        press = pressfull[5:13]
        timestamp = month+"-"+day+"-"+year+" "+time
        fh2.write(timestamp + ", " + press + "\n")

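This code works and does what I need, but I am trying to learn more efficient ways of parsing files in Python. Processing a 100 MB file takes about 30 seconds, and I have several files of 1-2 GB each. Is there a faster way to parse this file?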

Answer 1

You can declare a months dict instead of using the datetime module, which should be somewhat faster.

months = {"Jan": "01", "Feb": "02", "Mar": "03", "Apr": "04", "May": "05", "Jun": "06",
          "Jul": "07", "Aug": "08", "Sep": "09", "Oct": "10", "Nov": "11", "Dec": "12"}

You can also use unpacking to make your code simpler:

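for line in fh:
    # Unpack the six whitespace-separated fields; the weekday is not needed.
    _, month, day, time, year, last = line.split()
    res = months[month] + "-" + day + "-" + year[:4] + " " + time + ", " + last[5:]
    # split() drops the trailing newline, so add it back when writing.
    fh2.write(res + "\n")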

PS: timeit shows it is about 10 times faster.
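A rough way to check the speed claim yourself is a small timeit comparison of the two month conversions. The names `month_via_strptime`, `month_via_dict` and `compare` are made up for this sketch, and it times only the month conversion, not the whole line, so the ratio you see will depend on your machine and Python version.

```python
import datetime
import timeit

MONTHS = {"Jan": "01", "Feb": "02", "Mar": "03", "Apr": "04", "May": "05", "Jun": "06",
          "Jul": "07", "Aug": "08", "Sep": "09", "Oct": "10", "Nov": "11", "Dec": "12"}


def month_via_strptime(name):
    """Month conversion as in the question: parse the name, then re-format it."""
    return datetime.datetime.strptime(name, "%b").strftime("%m")


def month_via_dict(name):
    """Month conversion as in this answer: a plain dict lookup."""
    return MONTHS[name]


def compare(repeats=10_000):
    """Return (strptime_seconds, dict_seconds) for `repeats` conversions each."""
    slow = timeit.timeit(lambda: month_via_strptime("Jul"), number=repeats)
    fast = timeit.timeit(lambda: month_via_dict("Jul"), number=repeats)
    return slow, fast
```

Calling `slow, fast = compare()` and printing `slow / fast` gives the speed-up factor for the month lookup alone.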

Answer 2

You should use Pandas DataFrames to load and manipulate large data quickly:

import pandas as pd, datetime

df = pd.read_csv('dq_barorun_20180715_calibtest.log',header=None,sep=' ')
df[0] = df.apply(lambda x: x[0][1:],axis=1)
df[1] = df.apply(lambda x: datetime.datetime.strptime(x[1], '%b').strftime('%m'), axis=1)
df[4] = df.apply(lambda x: x[4][:-1],axis=1)
df[5] = df.apply(lambda x: ' ' + x[5][5:],axis=1)
df['timestamp'] = df.apply(lambda x: x[1]+"-"+str(x[2])+"-"+x[4]+" "+x[3], axis = 1)
df.to_csv('output.txt',columns=['timestamp',5],header=False, index=False)

Answer 3

Since the fields sit at fixed positions and the month name converts simply to a number, this should work:

#! /usr/bin/env python3

m = {'Jan':'01', 'Feb':'02', 'Mar':'03', 'Apr':'04', 'May':'05', 'Jun':'06', 'Jul':'07', 'Aug':'08', 'Sep':'09', 'Oct':'10', 'Nov':'11', 'Dec':'12'}
with open('dq_barorun_20180715_calibtest.log', 'r') as fh, open('output.txt' , 'w') as fh2:
    for line in fh:
        day = line[9:11]
        month = m[line[5:8]]
        year = line[25:29]
        time = line[12:24]
        val = line[36:44]
        print('{}-{}-{} {}, {}'.format(month, day, year, time, val), file=fh2)
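Fixed slice offsets are easy to get wrong by one, so it can help to wrap them in a small function and check it against a sample line from the question. The name `convert_line` is hypothetical, and the sketch assumes every line has exactly the layout shown in the question, including a two-digit day.

```python
MONTHS = {'Jan': '01', 'Feb': '02', 'Mar': '03', 'Apr': '04', 'May': '05', 'Jun': '06',
          'Jul': '07', 'Aug': '08', 'Sep': '09', 'Oct': '10', 'Nov': '11', 'Dec': '12'}


def convert_line(line):
    """Convert one log line to 'MM-DD-YYYY HH:MM:SS.mmm, value' using fixed slices."""
    month = MONTHS[line[5:8]]   # three-letter month name
    day = line[9:11]
    time = line[12:24]          # HH:MM:SS.mmm
    year = line[25:29]
    val = line[36:44]           # measurement without the '*0001' prefix
    return '{}-{}-{} {}, {}'.format(month, day, year, time, val)


sample = '[Sun Jul 15 09:05:56.724 2018] *000129.32347\n'
print(convert_line(sample))  # prints: 07-15-2018 09:05:56.724, 29.32347
```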

Answer 4

Here is another approach using Pandas read_csv. If you have many large files, you can use dask, which supports this as well.

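import pandas as pd

# The separator is a regex, so use a raw string and the Python parsing engine;
# header=None keeps the first data line from being consumed as column names.
df = pd.read_csv('D:\\test.txt', sep=r'\*0001', header=None, engine='python')

df.columns = ['dates','val']
df.dates = pd.to_datetime(df.dates.str[1:-2])
df.to_csv("output.csv",header=None,index=None)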

You can then use whichever method you prefer to convert the dates to the format you want.

Answer 5

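I may have a completely different idea, though it does not speed up the parsing itself.
If this is not what you are looking for, since you asked for faster parsing, please leave a comment and I will delete this answer.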

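You should split the large input file into smaller segments. For example, you could get the line count and divide it by a suitable number. Say there are 40,000 lines: you could split them into 4 segments of roughly 10,000 lines each. Record the start index of each segment and the number of lines to process. (Note that the last segment may contain fewer than 10,000 lines.)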

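Then hand the input file to multiple threads, each of which reads only its own part of the file, from its start index through its offset, and parses that part. Each thread writes its parsed part into a shared folder under an indexed file name. Once every thread has finished, you can merge all the files in the shared folder into a single output.txt.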
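The splitting step might look roughly like the sketch below. It splits by byte offsets aligned to line ends rather than by counting lines, which is a swapped-in variation on the idea above, and it assumes an ASCII file with no blank lines. `chunk_ranges` and `parse_range` are hypothetical names. Each range could be handed to a worker, for example through `concurrent.futures.ProcessPoolExecutor`; in CPython, processes rather than threads are the usual choice for CPU-bound parsing because of the global interpreter lock.

```python
import os

MONTHS = {"Jan": "01", "Feb": "02", "Mar": "03", "Apr": "04", "May": "05", "Jun": "06",
          "Jul": "07", "Aug": "08", "Sep": "09", "Oct": "10", "Nov": "11", "Dec": "12"}


def chunk_ranges(path, n_chunks):
    """Split a file into up to n_chunks byte ranges that each end on a line end."""
    size = os.path.getsize(path)
    ranges = []
    start = 0
    with open(path, 'rb') as fh:
        for i in range(1, n_chunks + 1):
            if start >= size:
                break
            if i == n_chunks:
                end = size
            else:
                # Jump near the ideal cut point, then read to the end of that line.
                fh.seek(max(start, size * i // n_chunks))
                fh.readline()
                end = fh.tell()
            if end > start:
                ranges.append((start, end))
                start = end
    return ranges


def parse_range(path, start, end):
    """Parse the lines stored in bytes [start, end) and return the converted lines."""
    out = []
    with open(path, 'rb') as fh:
        fh.seek(start)
        while fh.tell() < end:
            line = fh.readline().decode('ascii')
            _, month, day, time, year, last = line.split()
            out.append(MONTHS[month] + "-" + day + "-" + year[:4]
                       + " " + time + ", " + last[5:] + "\n")
    return out
```

Because the ranges are contiguous and ordered, concatenating the per-range results in range order reproduces the output of a single sequential pass.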

Here are some linked resources on how to implement parts of this:

Answer 6

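To avoid opening an overly large file and using too much memory, this code uses a Python generator (yield) together with regular expressions. You can compare its actual performance against your own code.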

The following code has been run locally!

import datetime
import re

# File path to be cleaned
CONTENT_PATH = './test03.txt'
# File path of cleaning results
RESULT_PATH = './test03.log'


def read(file):
    with open(file) as obj:
        while True:
            line = obj.readline()
            if line:
                yield line
            else:
                return


def tsplit(line):
    _m, _d, _y = line[5:8], line[9:11], line[25:29]
    _m = datetime.datetime.strptime(_m, '%b').strftime('%m')
    rt = "%s-%s-%s" % (_m, _d, _y)
    return rt


# Escape the dot and match all three millisecond digits (e.g. 09:05:56.724).
hour_min_sen = re.compile(r'(\d{2}:\d{2}:\d{2}\.\d{3})')
end = re.compile(r'(\d{2}\.\d{5})')


with open(RESULT_PATH, 'a+') as obj:
    for line in read(CONTENT_PATH):
        line = line.strip()
        """
        [Sun Jul 15 09:06:01.732 2018] *000129.32352
        """
        group = hour_min_sen.findall(line)
        end_group = end.findall(line)
        """
        07-15-2018 09:05:56.724, 29.32347
        """
        obj.write("%s %s, %s\n" % (tsplit(line), group[0], end_group[0]))
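As a variation on the regex approach, a single compiled pattern with named groups can pull out every field in one match instead of two findall calls plus slicing. This is only a sketch: `LINE_RE` and `convert` are hypothetical names, the pattern assumes the exact line layout shown in the question, and it is not claimed to be faster than the slicing or split-based answers.

```python
import re

MONTHS = {"Jan": "01", "Feb": "02", "Mar": "03", "Apr": "04", "May": "05", "Jun": "06",
          "Jul": "07", "Aug": "08", "Sep": "09", "Oct": "10", "Nov": "11", "Dec": "12"}

# One pattern for the whole line: weekday (ignored), month, day, time, year, value.
LINE_RE = re.compile(
    r'\[\w{3} (?P<mon>\w{3}) +(?P<day>\d{1,2}) '
    r'(?P<time>\d{2}:\d{2}:\d{2}\.\d{3}) (?P<year>\d{4})\] '
    r'\*\d{4}(?P<val>\d{2}\.\d{5})'
)


def convert(line):
    """Return the reformatted line, or None if the line does not match the layout."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return "%s-%s-%s %s, %s" % (
        MONTHS[m.group('mon')],
        m.group('day').zfill(2),   # pad a single-digit day to two digits
        m.group('year'),
        m.group('time'),
        m.group('val'),
    )
```

Returning None for non-matching lines lets the caller skip or log malformed input instead of failing on an index error.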

6 solutions

Solution 1
3 accepted 2020-11-24 07:27:18

Solution 2
0 2020-11-24 07:14:12

Solution 3
0 2020-11-24 07:26:33

Solution 4
0 2020-11-24 07:36:43

Solution 5
0 2020-11-24 08:41:06

Solution 6
-1 2020-11-24 08:25:51
