
How to remove/ignore invalid formatted data while reading a huge CSV file and creating a DataFrame using chunks in Python

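I have a huge CSV log file (more than 200,000 entries). I read the file in chunks and then concatenate the chunks to get the whole file as a DataFrame. Sometimes odd values or invalidly formatted rows show up in the log file. I want to discard/ignore the badly formatted data, keep only the rows in the correct format, and then process the DataFrame. Here is a sample of the file: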

ddmmyyyy,hh:mm:ss,FileName,Function,Bytes,MsgText
17Jul2021,14:21:46,StatFile,Upload,1,"copy success"
17Jul2021,14:22:42,AuditFile,Download,1,"Download success"
17Jul2021,15:21:46,ReactFile,Upload,1,"copy success"

17Jul2021,15:23:46,StatFile,Upload,1,"copy success"
17Jul2021,16:30:46,StatFile,Upload,0,"copy success"
-,-,-,
17Jul2021,17:21:42,StatFile,Upload,1,"copy success"
ep success.",TLSV12,2546,-,1,25648 
17Jul2021,17:50:46,StatFile,Upload,1,"copy success"
1,-32,328,280,Extend,s
17Jul2021,18:19:06,AuditFile,Download,2,"Download success"

Below is the sample code I am using:

import pandas as pd

MyList = []
Chunksize = 10
# g_log is assumed to hold the path to the CSV log file
for chunk in pd.read_csv(g_log, delimiter=',', usecols=["ddmmyyyy", "hh:mm:ss", "Function"], index_col=None, low_memory=False, chunksize=Chunksize):
    MyList.append(chunk)

df1 = pd.concat(MyList, axis=0)
latest_Update = df1["hh:mm:ss"].max()
print(type(latest_Update))
print(latest_Update)

Output:

<class 'str'>
TLSV12

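I only want values in the time format, i.e. the exact "hh:mm:ss" format, as strings in df1["hh:mm:ss"], so that I can compute the time difference between the current time and latest_Update. How can I filter the invalid values out of these columns? I also tried date_parser, but the output was the same.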

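custom_date_parser = lambda x: pd.to_datetime(x, errors='coerce', infer_datetime_format=True)
# note: date_parser is only applied to columns that are also listed in parse_dates
for chunk in pd.read_csv(g_log, delimiter=',', usecols=["ddmmyyyy", "hh:mm:ss", "Function"], index_col=None, date_parser=custom_date_parser, low_memory=False, chunksize=Chunksize):
    MyList.append(chunk)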

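I also tried the following. It gives me the data as timestamps, which meets the requirement, but it takes too long to execute, so it needs to be optimized: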

df1["hh:mm:ss"] = df1["hh:mm:ss"].apply(lambda x: pd.to_datetime(x, errors='coerce', infer_datetime_format=True))
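
For comparison, here is a minimal sketch of the vectorized form of that conversion. The `apply` version calls `pd.to_datetime` once per row, which is a likely cause of the long run time. Passing the whole column with an explicit `format` parses it in one call. The sample values are made up from the log excerpt above, and the `%H:%M:%S` format is an assumption about what counts as a valid time.

```python
import pandas as pd

# Hypothetical sample of the "hh:mm:ss" column, including bad values from the log
times = pd.Series(["14:21:46", "-", "TLSV12", "-32", "18:19:06"], name="hh:mm:ss")

# Parse the whole column at once; values that do not match the format become NaT
parsed = pd.to_datetime(times, format="%H:%M:%S", errors="coerce")

# Boolean mask of rows whose time string is valid
valid = parsed.notna()

# Keep the original strings, but only for valid rows
valid_times = times[valid]
print(valid_times.tolist())
print(valid_times.max())
```

Because every remaining string has the same zero-padded layout, `max()` on the strings gives the latest time of day.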

This may not be a complete solution, but it offers some food for thought...

import pandas as pd
import re
from dateutil import parser

'''
test_text.txt contains:
ddmmyyyy,hh:mm:ss,FileName,Function,Bytes,MsgText
17Jul2021,14:21:46,StatFile,Upload,1,"copy success"
17Jul2021,14:22:42,AuditFile,Download,1,"Download success"
17Jul2021,15:21:46,ReactFile,Upload,1,"copy success"

17Jul2021,15:23:46,StatFile,Upload,1,"copy success"
17Jul2021,16:30:46,StatFile,Upload,0,"copy success"
-,-,-,
17Jul2021,17:21:42,StatFile,Upload,1,"copy success"
ep success.",TLSV12,2546,-,1,25648 
17Jul2021,17:50:46,StatFile,Upload,1,"copy success"
1,-32,328,280,Extend,s
17Jul2021,18:19:06,AuditFile,Download,2,"Download success"

'''

# regex pattern to match only full date strings like 17Jul2021
pattern = r'^[0-9]{2}[A-Za-z]{3}[0-9]{4}$'

# open test CSV and split on commas
df = pd.read_csv('test_text.txt', sep=",")

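# build a boolean mask with the regex to identify valid lines (missing values count as invalid)
valid_rows = df['ddmmyyyy'].str.contains(pattern, na=False)

# keep only the valid rows; copy so that new columns can be added safely
df = df[valid_rows].copy()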

# combine the date and time columns together
df['Timestamp'] = df['ddmmyyyy'] + ' ' + df['hh:mm:ss']

# MAYBE this is the only line of interest?
# using from dateutil import parser ...format the new Timestamp column to datetime
df['Timestamp'] = [parser.parse(row) for row in df['Timestamp']]

# set column order in list and apply to dataframe
cols = ['Timestamp', 'ddmmyyyy', 'hh:mm:ss', 'FileName', 'Function', 'Bytes', 'MsgText']
df = df[cols]

# display dataframe
print(df)

Output:

[image: enter image description here]
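
The same idea can be carried over to the chunked reading from the question. The following is only a sketch under a few assumptions: the helper name `load_valid_rows` is made up, the log is simulated with an in-memory string instead of the real file, and a row counts as valid only if date plus time match `%d%b%Y %H:%M:%S` (which assumes English month abbreviations). Each chunk is filtered before it is stored, so bad rows never reach the final DataFrame.

```python
import io
import pandas as pd

# In-memory copy of the sample log from the question (stands in for the real file)
SAMPLE = """\
ddmmyyyy,hh:mm:ss,FileName,Function,Bytes,MsgText
17Jul2021,14:21:46,StatFile,Upload,1,"copy success"
17Jul2021,14:22:42,AuditFile,Download,1,"Download success"
17Jul2021,15:21:46,ReactFile,Upload,1,"copy success"

17Jul2021,15:23:46,StatFile,Upload,1,"copy success"
17Jul2021,16:30:46,StatFile,Upload,0,"copy success"
-,-,-,
17Jul2021,17:21:42,StatFile,Upload,1,"copy success"
ep success.",TLSV12,2546,-,1,25648
17Jul2021,17:50:46,StatFile,Upload,1,"copy success"
1,-32,328,280,Extend,s
17Jul2021,18:19:06,AuditFile,Download,2,"Download success"
"""


def load_valid_rows(source, chunksize=10):
    """Read the log in chunks and keep only rows with a parseable date and time."""
    kept = []
    reader = pd.read_csv(
        source,
        usecols=["ddmmyyyy", "hh:mm:ss", "Function"],
        dtype=str,            # read everything as text; parsing happens below
        chunksize=chunksize,
    )
    for chunk in reader:
        # Vectorized parse of the whole chunk; malformed rows become NaT
        stamp = pd.to_datetime(
            chunk["ddmmyyyy"] + " " + chunk["hh:mm:ss"],
            format="%d%b%Y %H:%M:%S",
            errors="coerce",
        )
        chunk = chunk.assign(Timestamp=stamp)
        kept.append(chunk[chunk["Timestamp"].notna()])
    return pd.concat(kept, ignore_index=True)


df1 = load_valid_rows(io.StringIO(SAMPLE))
latest_update = df1["Timestamp"].max()
print(latest_update)

# Time elapsed between now and the latest valid log entry
print(pd.Timestamp.now() - latest_update)
```

For the real log, a file path can be passed instead of the `io.StringIO` object. The original "hh:mm:ss" strings stay in the DataFrame next to the parsed `Timestamp` column.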

