
Efficient way to extract few lines of data from a large CSV data file in Python

I have a large number of csv data files, and each data file contains several days' worth of tick data for one ticker in the following form:

 ticker  DD/MM/YYYY    time         bid      ask
  XXX,   19122014,  08:00:08.325,  9929.00,9933.00
  XXX,   19122014,  08:00:08.523,  9924.00,9931.00
  XXX,   19122014,  08:00:08.722,  9925.00,9930.50
  XXX,   19122014,  08:00:08.921,  9924.00,9928.00
  XXX,   19122014,  08:00:09.125,  9924.00,9928.00
  …
  XXX,   30122014,  21:56:25.181,  9795.50,9796.50
  XXX,   30122014,  21:56:26.398,  9795.50,9796.50
  XXX,   30122014,  21:56:26.598,  9795.50,9796.50
  XXX,   30122014,  21:56:26.798,  9795.50,9796.50
  XXX,   30122014,  21:56:28.896,  9795.50,9796.00
  XXX,   30122014,  21:56:29.096,  9795.50,9796.50
  XXX,   30122014,  21:56:29.296,  9795.50,9796.00
  …

I need to extract any lines of data whose time is within a certain range, say 09:00:00 to 09:15:00. My current solution is simply to read each data file into a data frame, sort it by time, and then use searchsorted to find the range 09:00:00 to 09:15:00. It works fine if performance isn't an issue and I don't have 1000 files waiting to be processed. Any suggestions on how to boost the speed? Thanks in advance for the help!
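For reference, the current per-file approach described above might look roughly like this (a sketch assuming pandas, with an illustrative file name and column names):

import pandas as pd

# rough sketch of the read / sort / searchsorted approach (illustrative names)
df = pd.read_csv('ticks.csv', header=0, skipinitialspace=True,
                 names=['ticker', 'date', 'time', 'bid', 'ask'])
df = df.sort_values('time').reset_index(drop=True)
times = df['time'].values
lo = times.searchsorted('09:00:00.000', side='left')
hi = times.searchsorted('09:15:00.000', side='right')
window = df.iloc[lo:hi]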

Short answer: put your data in an SQL database and give the "time" column an index. You can't beat that with CSV files - with or without Pandas.
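A minimal sketch of that idea with the standard library's sqlite3 (database, table and column names are just illustrative):

import csv
import sqlite3

conn = sqlite3.connect('ticks.db')
conn.execute('CREATE TABLE IF NOT EXISTS ticks '
             '(ticker TEXT, date TEXT, time TEXT, bid REAL, ask REAL)')

# load one CSV file; repeat for the others
with open('filename.csv') as f:
    reader = csv.reader(f, skipinitialspace=True)
    next(reader)  # skip the header row
    conn.executemany('INSERT INTO ticks VALUES (?, ?, ?, ?, ?)', reader)

# the index on "time" is what makes the range query cheap
conn.execute('CREATE INDEX IF NOT EXISTS idx_time ON ticks (time)')
conn.commit()

rows = conn.execute('SELECT * FROM ticks WHERE time BETWEEN ? AND ?',
                    ('09:00:00.000', '09:15:00.000')).fetchall()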

Without changing your CSV files, one thing that would be a little faster, though not by much, is to filter the rows as you read them, keeping in memory only the ones that are interesting for you.

So instead of just getting the whole CSV into memory, a function like this could do the job:

import csv

def filter_time(filename, mintime, maxtime):
    timecol = 2  # time is the third column: ticker, date, time, bid, ask
    with open(filename) as f:
        reader = csv.reader(f, skipinitialspace=True)
        next(reader)  # skip the header row
        # fixed-width "HH:MM:SS.mmm" strings compare correctly as text
        return [line for line in reader
                if mintime <= line[timecol] <= maxtime]

This task can be easily parallelized - you could have several instances of this running concurrently before maxing out the I/O on your device, I'd guess. One painless way to do that would be to use the lelo Python package - it just provides you with a @paralel decorator that makes the given function run in another process when called, and returns a lazy proxy for the results.
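If lelo isn't available, a rough equivalent with the standard library's multiprocessing (reusing the filter_time function above; file names are hypothetical):

from multiprocessing import Pool

def extract(filename):
    # fixed time window applied to one file, using filter_time from above
    return filter_time(filename, '09:00:00.000', '09:15:00.000')

if __name__ == '__main__':
    filenames = ['ticks_001.csv', 'ticks_002.csv']  # hypothetical list of files
    with Pool() as pool:
        results = pool.map(extract, filenames)  # one result list per file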

But that will still have to read everything in - I think the SQL solution should be at least one order of magnitude faster.

My solution would be to read line by line, saving only the lines that pass your filter:

a, b = '09:00:00.000', '09:15:00.000'  # time window; zero-padded so the strings compare correctly

with open('filename.csv') as fin, open('fileout.csv', 'w') as fout:
    next(fin)  # skip the header row
    for line in fin:
        # time is the third comma-separated field; strip the padding spaces
        time_x = line.rstrip('\n').split(',')[2].strip()
        if a <= time_x <= b:
            fout.write(line)
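If the time strings cannot safely be compared as text (e.g. inconsistent padding), a hedged alternative is to parse them into datetime.time objects first; a small sketch:

from datetime import datetime

def parse_time(s):
    # turn "HH:MM:SS.mmm" into a datetime.time for a robust comparison
    return datetime.strptime(s.strip(), '%H:%M:%S.%f').time()

a = parse_time('09:00:00.000')
b = parse_time('09:15:00.000')
# inside the loop above: if a <= parse_time(time_x) <= b: fout.write(line)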
