[英]How to remove/ignore invalid formatted data while reading a huge csv file and creating a Dataframe using chunks in python
I have a huge CSV log file (over 200,000 entries). I read the file in chunks and then append the chunks to build the whole file as a DataFrame. Occasionally strange values / invalidly formatted lines show up in the log file. I want to discard/ignore the badly formatted data, keep only the correctly formatted rows, and then process the DataFrame. Here is a sample of the file:
ddmmyyyy,hh:mm:ss,FileName,Function,Bytes,MsgText
17Jul2021,14:21:46,StatFile,Upload,1,"copy success"
17Jul2021,14:22:42,AuditFile,Download,1,"Download success"
17Jul2021,15:21:46,ReactFile,Upload,1,"copy success"
17Jul2021,15:23:46,StatFile,Upload,1,"copy success"
17Jul2021,16:30:46,StatFile,Upload,0,"copy success"
-,-,-,
17Jul2021,17:21:42,StatFile,Upload,1,"copy success"
ep success.",TLSV12,2546,-,1,25648
17Jul2021,17:50:46,StatFile,Upload,1,"copy success"
1,-32,328,280,Extend,s
17Jul2021,18:19:06,AuditFile,Download,2,"Download success"
Here is the sample code I am using:
import pandas as pd

MyList = []
Chunksize = 10
# read the log in chunks, keeping only the columns of interest
for chunk in pd.read_csv(g_log, delimiter=',', usecols=["ddmmyyyy", "hh:mm:ss", "Function"], index_col=None, low_memory=False, chunksize=Chunksize):
    MyList.append(chunk)
len(MyList)
df1=pd.concat(MyList,axis=0)
latest_Update=df1["hh:mm:ss"].max()
print(type(latest_Update))
print(latest_Update)
Output:
<class 'str'>
TLSV12
I only want strings in the exact "hh:mm:ss" time format in df1["hh:mm:ss"], so that I can compute the time difference between the current time and latest_Update. How can I filter the invalid values out of these columns? I also tried date_parser, but got the same output.
custom_date_parser = lambda x: pd.to_datetime(x, errors='coerce', infer_datetime_format=True)
for chunk in pd.read_csv(g_log, delimiter=',', usecols=["ddmmyyyy", "hh:mm:ss", "Function"], index_col=None, parse_dates=["hh:mm:ss"], date_parser=custom_date_parser, low_memory=False, chunksize=Chunksize):
I also tried the following; it gives me the data as timestamps and meets the requirement, but the execution time is far too long, so it needs optimization:
df1["hh:mm:ss"] = df1["hh:mm:ss"].apply(lambda x: pd.to_datetime(x, errors='coerce', infer_datetime_format=True))
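As a point of comparison, the per-row apply can be replaced by a single vectorized pd.to_datetime call over the whole column; supplying an explicit format skips per-row format inference, which is where most of the time goes. A minimal sketch (the sample values are illustrative, mimicking the log data):

```python
import pandas as pd

# Sample column containing one malformed value, mimicking the log data
times = pd.Series(["14:21:46", "TLSV12", "17:50:46"])

# Vectorized parse: a fixed format avoids per-row inference,
# and errors='coerce' turns invalid entries into NaT
parsed = pd.to_datetime(times, format="%H:%M:%S", errors="coerce")

# Invalid rows become NaT and can be dropped before taking the max
latest = parsed.dropna().max()
print(latest.strftime("%H:%M:%S"))
```

Because the parse happens in one call instead of one call per row, this tends to be dramatically faster than the apply version on large frames.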
This may not be the full solution, but it should provide some food for thought...
import pandas as pd
import re
from dateutil import parser
'''
test_text.txt contains:
ddmmyyyy,hh:mm:ss,FileName,Function,Bytes,MsgText
17Jul2021,14:21:46,StatFile,Upload,1,"copy success"
17Jul2021,14:22:42,AuditFile,Download,1,"Download success"
17Jul2021,15:21:46,ReactFile,Upload,1,"copy success"
17Jul2021,15:23:46,StatFile,Upload,1,"copy success"
17Jul2021,16:30:46,StatFile,Upload,0,"copy success"
-,-,-,
17Jul2021,17:21:42,StatFile,Upload,1,"copy success"
ep success.",TLSV12,2546,-,1,25648
17Jul2021,17:50:46,StatFile,Upload,1,"copy success"
1,-32,328,280,Extend,s
17Jul2021,18:19:06,AuditFile,Download,2,"Download success"
'''
# regex pattern to find just date strings like 17Jul2021
pattern = r'^[0-9]{1,2}[A-Za-z]{3}[0-9]{4}'
# open test CSV and split on commas
df = pd.read_csv('test_text.txt', sep=",")
# boolean mask: True where the date field matches the pattern;
# na=False treats missing values as non-matches instead of NaN
mask = df['ddmmyyyy'].str.contains(pattern, na=False)
# drop all rows apart from valid ones (copy to avoid chained-assignment warnings)
df = df[mask].copy()
# combine the date and time columns together
df['Timestamp'] = df['ddmmyyyy'] + ' ' + df['hh:mm:ss']
# MAYBE this is the only line of interest?
# using from dateutil import parser ...format the new Timestamp column to datetime
df['Timestamp'] = [parser.parse(row) for row in df['Timestamp']]
# set column order in list and apply to dataframe
cols = ['Timestamp', 'ddmmyyyy', 'hh:mm:ss', 'FileName', 'Function', 'Bytes', 'MsgText']
df = df[cols]
# display dataframe
print(df)
Output:
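Tying this back to the chunked reading in the question, the same regex mask can be applied per chunk before appending, so invalid rows never enter the concatenated frame. A sketch using an inline sample in place of the real log file (the chunk size and sample rows are illustrative):

```python
import io
import pandas as pd

# Inline sample standing in for the real log file (illustrative)
raw = """ddmmyyyy,hh:mm:ss,FileName,Function,Bytes,MsgText
17Jul2021,14:21:46,StatFile,Upload,1,"copy success"
-,-,-,
17Jul2021,17:50:46,StatFile,Upload,1,"copy success"
1,-32,328,280,Extend,s
"""

pattern = r'^\d{1,2}[A-Za-z]{3}\d{4}$'  # e.g. 17Jul2021

chunks = []
for chunk in pd.read_csv(io.StringIO(raw), chunksize=2):
    # astype(str) guards against chunks where the column was inferred
    # as non-string; rows whose date field doesn't match are dropped
    mask = chunk['ddmmyyyy'].astype(str).str.contains(pattern, na=False)
    chunks.append(chunk[mask])

df = pd.concat(chunks, ignore_index=True)
print(len(df))  # only the two valid rows remain
```

Filtering inside the loop also keeps memory usage down, since malformed rows are discarded before concatenation rather than after.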