
Fastest way to read comma separated files (including datetimes) in python

I have data stored in comma-delimited txt files. One of the columns represents a datetime.

I need to load each column into separate numpy arrays (and decode the date into a python datetime object).

What is the fastest way to do this (in terms of run time)?

NB: the files are several hundred MB of data and currently take several minutes to load.

e.g. mydata.txt

15,3,0,2003-01-01 00:00:00,12.2
15,4.5,0,2003-01-01 00:00:00,13.7
15,6,0,2003-01-01 00:00:00,18.4
15,7.5,0,2003-01-01 00:00:00,17.9
15,9,0,2003-01-01 00:00:00,17.7
15,10.5,0,2003-01-01 00:00:00,16.3
15,12,0,2003-01-01 00:00:00,17.2

Here is my current code (it works, but is slow):

import csv
import datetime
import time
import numpy

a=[]
b=[]
c=[]
d=[]
timestmp=[]

myfile = open('mydata.txt',"r")

# Read in the data
csv_reader = csv.reader(myfile)
for row in csv_reader:
  a.append(row[0])
  b.append(row[1])
  c.append(row[2])
  timestmp.append(row[3])
  d.append(row[4])

a = numpy.array(a)
b = numpy.array(b)
c = numpy.array(c)
d = numpy.array(d)

# Convert Time string list into list of Python datetime objects
times = []
time_format = "%Y-%m-%d %H:%M:%S"
for i in range(len(timestmp)):
  times.append(datetime.datetime.fromtimestamp(time.mktime(time.strptime(timestmp[i], time_format))))

Is there a more efficient way to do this?

Any help is very much appreciated - thanks!

(edit: In the end the bottleneck turned out to be the datetime conversion, not reading the file as I originally assumed.)

First, you should run your sample script with Python's built-in profiler to see where the problem actually might be. You can do this from the command-line:

python -m cProfile myscript.py

Secondly, what jumps out at me at least: why is that loop at the bottom necessary? Is there a technical reason it can't be done while reading mydata.txt, inside the loop you already have above the instantiation of the numpy arrays?

Thirdly, you should create the datetime objects directly, since datetime also supports strptime. You don't need to parse into a time struct, make a timestamp from it, and then build a datetime from that timestamp. Your loop at the bottom can be rewritten like this:

times = []
timestamps = []
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"
for t in timestmp:
    parsed_time = datetime.datetime.strptime(t, TIME_FORMAT)
    times.append(parsed_time)
    timestamps.append(time.mktime(parsed_time.timetuple()))

I also took the liberty of PEP-8ing your code a bit, such as changing your constant to all caps. Also, you can iterate over a list directly using the in operator.

Try numpy.loadtxt() - the doc string has a good example.
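
For example, a minimal sketch of that approach: the datetime column is converted to a POSIX timestamp so loadtxt can build purely numeric arrays. The parse_time helper and the unpack=True call are illustrative, not from the original answer:

import datetime
import numpy

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"

def parse_time(s):
    # loadtxt may hand the converter bytes or str depending on numpy version
    if isinstance(s, bytes):
        s = s.decode()
    # Represent the datetime as a POSIX timestamp (a float)
    return datetime.datetime.strptime(s, TIME_FORMAT).timestamp()

# unpack=True returns one array per column instead of a single 2-D array
a, b, c, t, d = numpy.loadtxt('mydata.txt', delimiter=',',
                              converters={3: parse_time}, unpack=True)

# If datetime objects are needed, convert back from the timestamps
times = [datetime.datetime.fromtimestamp(ts) for ts in t]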

You can also try passing copy=False when calling numpy.array, since the default behavior is to copy the data; this can speed up the script (especially since you said it processes a lot of data).

npa = numpy.array(ar, copy=False)

If you follow Mahmoud Abdelkader's advice and use the profiler, and find out that the bottleneck is in the csv loader, you could always try replacing your csv_reader with this:

for line in open("mydata.txt"):
  row = line.split(',')
  a.append(int(row[0]))
  b.append(float(row[1]))  # this column holds values like 4.5, so float, not int
  c.append(int(row[2]))
  timestmp.append(row[3])
  d.append(float(row[4]))

But more probable, I think, is that you have a lot of data conversions. Especially the last loop, the time conversion, will take a long time if you have millions of conversions! If you succeed in doing it all in one step (read + convert), plus take Terseus's advice on not copying the arrays into numpy dittos, you will reduce execution times.
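
For illustration, a minimal sketch of that one-pass approach, combining the split-based reader above with the direct strptime call from the first answer (the float conversions are assumptions based on the sample data):

import datetime
import numpy

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"
a, b, c, d, times = [], [], [], [], []

# Read and convert in a single pass over the file
with open('mydata.txt') as myfile:
    for line in myfile:
        row = line.rstrip('\n').split(',')
        a.append(float(row[0]))
        b.append(float(row[1]))
        c.append(float(row[2]))
        times.append(datetime.datetime.strptime(row[3], TIME_FORMAT))
        d.append(float(row[4]))

a = numpy.array(a)
b = numpy.array(b)
c = numpy.array(c)
d = numpy.array(d)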

I'm not completely sure if this will help, but you may be able to speed up the reading of the file by using ast.literal_eval. For example:

from ast import literal_eval

myfile = open('mydata.txt', "r")
mylist = []
for line in myfile:
    line = line.strip()
    # The datetime is always 19 characters wide and sits just before the
    # last comma, so quote it in place and let literal_eval parse the
    # whole line as a Python list.
    e = line.rindex(",")
    row = literal_eval('[%s"%s"%s]' % (line[:e-19], line[e-19:e], line[e:]))
    mylist.append(row)

a, b, c, timestamp, d = zip(*mylist)
# a, b, c, timestamp, and d are what they were after your csv_reader loop
