
How to remove/ignore invalid formatted data while reading a huge CSV file and creating a DataFrame using chunks in Python

I have a huge CSV log file (200,000+ entries). I read the file in chunks and then concatenate the chunks to get the entire file as a DataFrame. Sometimes weird values or invalid formats show up in the log file. I want to discard the badly formatted rows, keep only the correctly formatted ones, and then work on the DataFrame. Below is a sample of the file:

ddmmyyyy,hh:mm:ss,FileName,Function,Bytes,MsgText
17Jul2021,14:21:46,StatFile,Upload,1,"copy success"
17Jul2021,14:22:42,AuditFile,Download,1,"Download success"
17Jul2021,15:21:46,ReactFile,Upload,1,"copy success"

17Jul2021,15:23:46,StatFile,Upload,1,"copy success"
17Jul2021,16:30:46,StatFile,Upload,0,"copy success"
-,-,-,
17Jul2021,17:21:42,StatFile,Upload,1,"copy success"
ep success.",TLSV12,2546,-,1,25648 
17Jul2021,17:50:46,StatFile,Upload,1,"copy success"
1,-32,328,280,Extend,s
17Jul2021,18:19:06,AuditFile,Download,2,"Download success"

Below is the sample code I am using:

import pandas as pd

MyList=[]
Chunksize=10
# g_log holds the location of the CSV log file (defined elsewhere)
for chunk in pd.read_csv(g_log,delimiter=',',usecols=["ddmmyyyy","hh:mm:ss","Function"],index_col=None,low_memory=False,chunksize=Chunksize):
    MyList.append(chunk)

df1=pd.concat(MyList,axis=0)
latest_Update=df1["hh:mm:ss"].max()
print(type(latest_Update))
print(latest_Update)

Output:

<class 'str'>
TLSV12
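
For context on why `TLSV12` comes out as the maximum: the column holds plain strings, and strings are compared character by character, so a value starting with a letter sorts above one starting with a digit. A minimal sketch of that behaviour using values from the sample file (the names `values`, `time_re` and `valid` are only illustrative):

```python
import re

# Values as they would appear in the "hh:mm:ss" column of the sample file.
values = ["14:21:46", "18:19:06", "-", "TLSV12", "-32"]

# String comparison is lexicographic: 'T' sorts above every digit and '-'.
print(max(values))

# Keeping only well-formed hh:mm:ss strings first gives a meaningful maximum.
time_re = re.compile(r"\d{2}:\d{2}:\d{2}")
valid = [v for v in values if time_re.fullmatch(v)]
print(max(valid))
```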

I want df1["hh:mm:ss"] to contain only strings in the exact "hh:mm:ss" time format, so that I can calculate the time difference between the current time and latest_Update. How can I filter the invalid values out of these columns? I tried date_parser as well, but got the same output:

custom_date_parser=lambda x:pd.to_datetime(x,errors='coerce',infer_datetime_format=True)
for chunk in pd.read_csv(g_log,delimiter=',',usecols=["ddmmyyyy","hh:mm:ss","Function"],index_col=None,date_parser=custom_date_parser,low_memory=False,chunksize=Chunksize):
    MyList.append(chunk)

I also tried the code below, which gave me the data as Timestamp values. It fulfils the requirement, but it takes too long to execute, so it needs to be optimized:

df1["hh:mm:ss"]=df1["hh:mm:ss"].apply(lambda x:pd.to_datetime(x,errors='coerce',infer_datetime_format=True))
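
One likely reason the `apply` version is slow is that it calls `pd.to_datetime` once per row. A possible alternative, sketched here under the assumption that every valid entry is exactly `%H:%M:%S`, is a single vectorized call on the whole column with an explicit `format`. The Series `times` below is a made-up stand-in for `df1["hh:mm:ss"]`:

```python
import pandas as pd

# Stand-in for df1["hh:mm:ss"], built from values in the sample log.
times = pd.Series(["14:21:46", "-", "TLSV12", "-32", "18:19:06"])

# One call for the whole column; entries that do not match the format become NaT.
parsed = pd.to_datetime(times, format="%H:%M:%S", errors="coerce")

# Keep only the original strings that parsed, then take the latest one.
valid_times = times[parsed.notna()]
latest_update = valid_times.max()
print(latest_update)
```

Because the surviving strings are all zero-padded `hh:mm:ss`, their string maximum is also the latest time of day.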

This is probably not the solution, but it may offer some food for thought...

import pandas as pd
import re
from dateutil import parser

'''
test_text.txt contains:
ddmmyyyy,hh:mm:ss,FileName,Function,Bytes,MsgText
17Jul2021,14:21:46,StatFile,Upload,1,"copy success"
17Jul2021,14:22:42,AuditFile,Download,1,"Download success"
17Jul2021,15:21:46,ReactFile,Upload,1,"copy success"

17Jul2021,15:23:46,StatFile,Upload,1,"copy success"
17Jul2021,16:30:46,StatFile,Upload,0,"copy success"
-,-,-,
17Jul2021,17:21:42,StatFile,Upload,1,"copy success"
ep success.",TLSV12,2546,-,1,25648 
17Jul2021,17:50:46,StatFile,Upload,1,"copy success"
1,-32,328,280,Extend,s
17Jul2021,18:19:06,AuditFile,Download,2,"Download success"

'''

# regex pattern to match whole date strings like 17Jul2021
pattern = r'^[0-9]{1,2}[A-Za-z]{3}[0-9]{4}$'

# open test CSV and split on commas
df = pd.read_csv('test_text.txt', sep=",")

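# build a boolean mask from the regex to identify valid rows
# (na=False treats missing values as invalid)
valid_rows = df['ddmmyyyy'].str.contains(pattern, na=False)

# drop all rows apart from valid ones; copy so new columns can be added safely
df = df[valid_rows].copy()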

# combine the date and time columns together
df['Timestamp'] = df['ddmmyyyy'] + ' ' + df['hh:mm:ss']

# MAYBE this is the only line of interest?
# using from dateutil import parser ...format the new Timestamp column to datetime
df['Timestamp'] = [parser.parse(row) for row in df['Timestamp']]

# set column order in list and apply to dataframe
cols = ['Timestamp', 'ddmmyyyy', 'hh:mm:ss', 'FileName', 'Function', 'Bytes', 'MsgText']
df = df[cols]

# display dataframe
print(df)

Outputs: (image of the printed DataFrame, not available)
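
The same idea can be applied while reading in chunks, as the question does. The following is only a sketch under stated assumptions: the patterns `DATE_RE` and `TIME_RE`, the in-memory `sample`, and the chunk size of 3 are illustrative choices, and parsing `%b` assumes English month abbreviations in the current locale.

```python
import io
import pandas as pd

# A few rows from the sample log, including the malformed ones.
sample = io.StringIO(
    'ddmmyyyy,hh:mm:ss,FileName,Function,Bytes,MsgText\n'
    '17Jul2021,14:21:46,StatFile,Upload,1,"copy success"\n'
    '-,-,-,\n'
    '17Jul2021,17:21:42,StatFile,Upload,1,"copy success"\n'
    'ep success.",TLSV12,2546,-,1,25648 \n'
    '1,-32,328,280,Extend,s\n'
    '17Jul2021,18:19:06,AuditFile,Download,2,"Download success"\n'
)

DATE_RE = r"\d{1,2}[A-Za-z]{3}\d{4}"  # e.g. 17Jul2021
TIME_RE = r"\d{2}:\d{2}:\d{2}"        # e.g. 14:21:46

good_chunks = []
for chunk in pd.read_csv(sample, usecols=["ddmmyyyy", "hh:mm:ss", "Function"],
                         dtype=str, chunksize=3):
    # Keep a row only if both columns match their full pattern;
    # na=False makes missing values count as invalid.
    ok = (chunk["ddmmyyyy"].str.fullmatch(DATE_RE, na=False)
          & chunk["hh:mm:ss"].str.fullmatch(TIME_RE, na=False))
    good_chunks.append(chunk[ok])

df1 = pd.concat(good_chunks, ignore_index=True)

# Build one datetime column from the two validated string columns.
df1["Timestamp"] = pd.to_datetime(df1["ddmmyyyy"] + " " + df1["hh:mm:ss"],
                                  format="%d%b%Y %H:%M:%S")
latest_update = df1["Timestamp"].max()

# Time elapsed between now and the most recent valid log entry.
age = pd.Timestamp.now() - latest_update
print(latest_update, age)
```

Filtering each chunk before appending keeps the bad rows out of memory, and the final `pd.to_datetime` call runs once on the cleaned data rather than once per row.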
