I have a large number of CSV data files, and each data file contains several days' worth of tick data for one ticker in the following form:
ticker DD/MM/YYYY time bid ask
XXX, 19122014, 08:00:08.325, 9929.00,9933.00
XXX, 19122014, 08:00:08.523, 9924.00,9931.00
XXX, 19122014, 08:00:08.722, 9925.00,9930.50
XXX, 19122014, 08:00:08.921, 9924.00,9928.00
XXX, 19122014, 08:00:09.125, 9924.00,9928.00
…
XXX, 30122014, 21:56:25.181, 9795.50,9796.50
XXX, 30122014, 21:56:26.398, 9795.50,9796.50
XXX, 30122014, 21:56:26.598, 9795.50,9796.50
XXX, 30122014, 21:56:26.798, 9795.50,9796.50
XXX, 30122014, 21:56:28.896, 9795.50,9796.00
XXX, 30122014, 21:56:29.096, 9795.50,9796.50
XXX, 30122014, 21:56:29.296, 9795.50,9796.00
…
I need to extract any lines of data whose time is within a certain range, say 09:00:00 to 09:15:00. My current solution is to read each data file into a data frame, sort it by time, and then use searchsorted to find 09:00:00 to 09:15:00. It works fine when performance isn't an issue, but I have 1000 files waiting to be processed. Any suggestions on how to boost the speed? Thanks in advance for any help!
Short answer: put your data in an SQL database and give the "time" column an index. You can't beat that with CSV files - whether you use Pandas or not.
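For instance, with the standard-library sqlite3 module, the import-once-query-many approach could look roughly like this (the database name, table name and file pattern are made up for illustration; this is a sketch, not a tuned schema):

import csv
import glob
import sqlite3

conn = sqlite3.connect('ticks.db')
conn.execute('CREATE TABLE IF NOT EXISTS ticks '
             '(ticker TEXT, date TEXT, time TEXT, bid REAL, ask REAL)')

# one-off import; afterwards every query hits the index, not the files
for path in glob.glob('*.csv'):
    with open(path) as f:
        reader = csv.reader(f)
        next(reader)  # skip the header line (drop this if your files have none)
        conn.executemany('INSERT INTO ticks VALUES (?, ?, ?, ?, ?)',
                         ([field.strip() for field in row] for row in reader))

conn.execute('CREATE INDEX IF NOT EXISTS time_idx ON ticks (time)')
conn.commit()

# BETWEEN is inclusive on both ends
rows = conn.execute('SELECT * FROM ticks WHERE time BETWEEN ? AND ?',
                    ('09:00:00', '09:15:00')).fetchall()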
Without changing your CSV files, one thing that would be a little faster, but not by much, is to filter the rows as you read them, keeping in memory just the ones that interest you.
So instead of reading the whole CSV into memory, a function like this could do the job:
import csv

def filter_time(filename, mintime, maxtime):
    # time is the third field (index 2) and carries a leading space;
    # zero-padded 'HH:MM:SS.mmm' strings compare chronologically as strings
    timecol = 2
    with open(filename) as f:
        reader = csv.reader(f)
        next(reader)  # skip the header line
        return [row for row in reader
                if mintime <= row[timecol].strip() <= maxtime]
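Called once per file, e.g. with the range from the question (the filename is illustrative):

matches = filter_time('XXX.csv', '09:00:00', '09:15:00')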
This task can be easily parallelized - you could get several instances of this function running concurrently before maxing out the I/O on your device, I'd guess. One painless way to do that is the lelo Python package - it just provides you a @parallel decorator that makes the given function run in another process when called, and returns a lazy proxy for the results.
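If you'd rather stick to the standard library, concurrent.futures gives you the same kind of fan-out. A minimal sketch reusing the filter_time() function above (the worker count and file pattern are placeholders):

import glob
from concurrent.futures import ProcessPoolExecutor

def one_file(path):
    # runs in a worker process
    return path, filter_time(path, '09:00:00', '09:15:00')

if __name__ == '__main__':
    with ProcessPoolExecutor(max_workers=4) as pool:
        for path, rows in pool.map(one_file, glob.glob('*.csv')):
            print(path, len(rows), 'rows in range')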
But that will still have to read everything in - I think the SQL solution should be at least an order of magnitude faster.
My solution would be to read line by line, saving only lines that pass your filter:
a, b = '09:00:00', '09:15:00'  # the time range to keep

with open('filename.csv') as fin:
    with open('fileout.csv', 'w') as fout:
        next(fin)  # skip the header line
        for line in fin:
            # time is the third comma-separated field; strip the leading space
            time_x = line.split(',')[2].strip()
            # zero-padded HH:MM:SS.mmm strings compare chronologically,
            # so no datetime parsing is needed
            if a <= time_x <= b:
                fout.write(line)