
How to remove/ignore invalid formatted data while reading a huge csv file and creating a Dataframe using chunks in python

I have a huge CSV log file (200,000+ entries). I am reading the file in chunks and then concatenating the chunks to get the entire file as a DataFrame. Sometimes weird values or invalidly formatted rows arrive in the log file. I want to discard/ignore the badly formatted data, keep only the correctly formatted rows, and then work on the DataFrame. Below is a sample of the file:

ddmmyyyy,hh:mm:ss,FileName,Function,Bytes,MsgText
17Jul2021,14:21:46,StatFile,Upload,1,"copy success"
17Jul2021,14:22:42,AuditFile,Download,1,"Download success"
17Jul2021,15:21:46,ReactFile,Upload,1,"copy success"

17Jul2021,15:23:46,StatFile,Upload,1,"copy success"
17Jul2021,16:30:46,StatFile,Upload,0,"copy success"
-,-,-,
17Jul2021,17:21:42,StatFile,Upload,1,"copy success"
ep success.",TLSV12,2546,-,1,25648 
17Jul2021,17:50:46,StatFile,Upload,1,"copy success"
1,-32,328,280,Extend,s
17Jul2021,18:19:06,AuditFile,Download,2,"Download success"
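One way to drop the malformed rows while still reading in chunks is to validate the time column per chunk with a regex, so only well-formed rows ever reach the concatenated frame. This is a sketch using the sample above; the inline CSV stands in for the real `g_log` path:

```python
import io

import pandas as pd

# Sample log taken from the question, including the malformed lines.
raw = """ddmmyyyy,hh:mm:ss,FileName,Function,Bytes,MsgText
17Jul2021,14:21:46,StatFile,Upload,1,"copy success"
-,-,-,
ep success.",TLSV12,2546,-,1,25648
1,-32,328,280,Extend,s
17Jul2021,18:19:06,AuditFile,Download,2,"Download success"
"""

chunks = []
for chunk in pd.read_csv(
    io.StringIO(raw),                 # replace with the real g_log path
    usecols=["ddmmyyyy", "hh:mm:ss", "Function"],
    chunksize=2,
):
    # Keep only rows whose time column looks like hh:mm:ss.
    mask = chunk["hh:mm:ss"].astype(str).str.match(r"^\d{2}:\d{2}:\d{2}$", na=False)
    chunks.append(chunk[mask])

df1 = pd.concat(chunks, ignore_index=True)
print(df1["hh:mm:ss"].max())  # → 18:19:06, since only real times remain
```

Because the filtering happens inside the loop, memory holds only the valid rows, and the later `max()` can no longer pick up junk like `TLSV12`.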

Below is the sample code I am using:

import pandas as pd

MyList = []
Chunksize = 10
for chunk in pd.read_csv(g_log, delimiter=',', usecols=["ddmmyyyy", "hh:mm:ss", "Function"],
                         index_col=None, low_memory=False, chunksize=Chunksize):
    MyList.append(chunk)
    
df1=pd.concat(MyList,axis=0)
latest_Update=df1["hh:mm:ss"].max()
print(type(latest_Update))
print(latest_Update)

Output

<class 'str'>
TLSV12

I want the `df1["hh:mm:ss"]` column to contain only strings in exactly "hh:mm:ss" format, so that I can calculate the time difference between the current time and latest_Update. How can I filter the invalid values out of these columns? I tried date_parser as well, but got the same output:

custom_date_parser = lambda x: pd.to_datetime(x, errors='coerce', infer_datetime_format=True)
for chunk in pd.read_csv(g_log, delimiter=',', usecols=["ddmmyyyy", "hh:mm:ss", "Function"],
                         index_col=None, date_parser=custom_date_parser,
                         low_memory=False, chunksize=Chunksize):

I also tried the below, which gives the data as Timestamps and fulfills the requirement, but it takes too long to execute, so it needs to be optimized:

df1["hh:mm:ss"] = df1["hh:mm:ss"].apply(lambda x: pd.to_datetime(x, errors='coerce', infer_datetime_format=True))
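The per-row `apply` is slow because it calls `pd.to_datetime` once per value and re-infers the format each time. A likely faster alternative is a single vectorized call with an explicit `format`; invalid strings become `NaT` and are ignored by `max()`. A minimal sketch with a toy stand-in for the real frame:

```python
import pandas as pd

# Toy stand-in for the real df1, including one invalid value from the log.
df1 = pd.DataFrame({"hh:mm:ss": ["14:21:46", "TLSV12", "18:19:06"]})

# One vectorized call; strings that don't match %H:%M:%S become NaT.
df1["hh:mm:ss"] = pd.to_datetime(df1["hh:mm:ss"], format="%H:%M:%S", errors="coerce")

latest_update = df1["hh:mm:ss"].max()  # NaT values are skipped by max()
print(latest_update.time())            # → 18:19:06
```

Passing an explicit `format` also sidesteps `infer_datetime_format`, which re-detects the layout and can dominate the runtime on large columns.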

This is probably not the solution, but to offer some food for thought...

import pandas as pd
import re
from dateutil import parser

'''
test_text.txt contains:
ddmmyyyy,hh:mm:ss,FileName,Function,Bytes,MsgText
17Jul2021,14:21:46,StatFile,Upload,1,"copy success"
17Jul2021,14:22:42,AuditFile,Download,1,"Download success"
17Jul2021,15:21:46,ReactFile,Upload,1,"copy success"

17Jul2021,15:23:46,StatFile,Upload,1,"copy success"
17Jul2021,16:30:46,StatFile,Upload,0,"copy success"
-,-,-,
17Jul2021,17:21:42,StatFile,Upload,1,"copy success"
ep success.",TLSV12,2546,-,1,25648 
17Jul2021,17:50:46,StatFile,Upload,1,"copy success"
1,-32,328,280,Extend,s
17Jul2021,18:19:06,AuditFile,Download,2,"Download success"

'''

# regex pattern to find just date strings like 17Jul2021
pattern = r'^[0-9]{1,2}[A-Za-z]{3}[0-9]{4}'

# open test CSV and split on commas
df = pd.read_csv('test_text.txt', sep=",")

# create a boolean mask using the regex to identify valid lines
# (na=False treats missing values as invalid; avoid shadowing the builtin `filter`)
mask = df['ddmmyyyy'].str.contains(pattern, na=False)

# drop all rows apart from valid ones; copy() avoids SettingWithCopyWarning below
df = df[mask].copy()

# combine the date and time columns together
df['Timestamp'] = df['ddmmyyyy'] + ' ' + df['hh:mm:ss']

# MAYBE this is the only line of interest?
# using from dateutil import parser ...format the new Timestamp column to datetime
df['Timestamp'] = [parser.parse(row) for row in df['Timestamp']]

# set column order in list and apply to dataframe
cols = ['Timestamp', 'ddmmyyyy', 'hh:mm:ss', 'FileName', 'Function', 'Bytes', 'MsgText']
df = df[cols]

# display dataframe
print(df)
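The `dateutil` list comprehension above parses one row at a time. On a large frame, a single vectorized `pd.to_datetime` with an explicit format should be substantially faster; the sketch below assumes the combined `ddmmyyyy hh:mm:ss` layout from the sample, i.e. `%d%b%Y %H:%M:%S`:

```python
import pandas as pd

# Toy frame with rows already filtered down to valid dates.
df = pd.DataFrame({
    "ddmmyyyy": ["17Jul2021", "17Jul2021"],
    "hh:mm:ss": ["14:21:46", "18:19:06"],
})

# Vectorized parse of the combined column; %d%b%Y matches strings like 17Jul2021.
df["Timestamp"] = pd.to_datetime(
    df["ddmmyyyy"] + " " + df["hh:mm:ss"], format="%d%b%Y %H:%M:%S"
)
print(df["Timestamp"].max())  # → 2021-07-17 18:19:06
```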

Outputs:

[screenshot of the resulting DataFrame: only the valid rows remain, with the new Timestamp column first]
