Can't open a too-big .csv file in Python

Question

I tried to open a big .csv file in Python, split it into rows, and append the values from the last x lines to new lists.

btcDatear = []
btcPricear = []
btcVolumear = []
howfarback = 20000
try:
    sourceCode = open('.btceUSD.csv', 'r')
    splitSource = sourceCode.split('\n')

        for eachline in splitSource[-howfarback:]:
            splitLine = eachline.split(',')
            btcDate = splitLine[0]
            btcPrice = splitLine[1]
            btcVolume = splitLine[2]

            btcDatear.append(float(btcDate))
            btcPricear.append(float(btcPrice))
            btcVolumear.append(float(btcVolume))


except Exception, e:
    print "failed raw data", str(e)

This worked with a smaller 20 MB file, but this one is 700 MB, so I think there is nothing wrong with my code. Is there a better way to build three separate lists from the three columns? I only need the last x rows. Or could I remove the first 200,000 lines so that the file is small enough to pass through my code?

Whichever approach is used, it should take under roughly 3 minutes if possible.

Answer 1

You can't "split a file", but you can read it line by line no matter how big it is. E.g.:

import collections

btcDatear = []
btcPricear = []
btcVolumear = []
howfarback = 20000
try:
    with open('.btceUSD.csv', 'r') as sourceCode:
        lastNlines = collections.deque(sourceCode, howfarback)
    for eachline in lastNlines:
        splitLine = eachline.split(',')
        btcDate = splitLine[0]
        btcPrice = splitLine[1]
        btcVolume = splitLine[2]

        btcDatear.append(float(btcDate))
        btcPricear.append(float(btcPrice))
        btcVolumear.append(float(btcVolume))
except Exception as e:
    print "failed raw data", str(e)

Building a deque with a maximum length of howfarback is the best way to keep the last N lines of a file that you can only read line by line from the start. The with statement ensures the file is properly closed no matter what; the rest of the logic is like in your code. It would be better to use the standard library csv module, but, one bit of learning at a time :-).
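To see why the deque keeps only the tail, here is a small sketch on an in-memory list rather than a file. The sample lines are made up for illustration; any iterable, including an open file, behaves the same way.

```python
import collections

# Hypothetical stand-in for the lines of a file: any iterable works.
lines = ["line %d\n" % i for i in range(10)]

# With maxlen=3, every append beyond the third drops the oldest item,
# so only the last three lines survive and memory use stays bounded.
last_three = collections.deque(lines, 3)

print(list(last_three))  # ['line 7\n', 'line 8\n', 'line 9\n']
```

The whole input is still iterated once, so the cost in time is a full pass over the file; only the memory footprint is limited to N lines.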

There may be tricks (subtly exploiting the fact that the CSV file is likely to be seekable) to get "the last N lines" faster -- in Unixy systems, the tail system command is very good at that. If the performance of this straightforward approach is too slow for you, ask again and we'll discuss that:-) [and/or how the csv module is best used...]

Added: come to think of it, no need to belabor "tail tricks", as they're well explained at "Get last n lines of a file with Python, similar to tail" -- the question is by a Python guru, Armin Ronacher, so you can be pretty confident of the quality of his code, and the answers and long discussion are interesting.

So if this simple approach takes too long, study Armin's and his respondents'... very tricky but can be truly useful.
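For a rough idea of what such a seek-based trick can look like, here is a minimal sketch. It is not Armin's code; the function name tail_lines and the block_size default are my own choices for illustration. It opens the file in binary mode, reads fixed-size blocks backwards from the end, and stops once the buffer holds more than n newlines.

```python
import os


def tail_lines(path, n, block_size=64 * 1024):
    """Return the last n lines of the file at path as a list of bytes objects."""
    if n <= 0:
        return []
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        position = f.tell()
        data = b""
        # Read blocks backwards until the buffer holds more than n newlines
        # (so the first wanted line is complete) or the start is reached.
        while position > 0 and data.count(b"\n") <= n:
            step = min(block_size, position)
            position -= step
            f.seek(position)
            data = f.read(step) + data
    # splitlines() drops the line endings; keep only the last n lines.
    return data.splitlines()[-n:]
```

Only the end of the file is read, so the running time depends on n rather than on the file size. The lines come back as bytes; decode them (for example with line.decode("ascii"), assuming the data is plain ASCII) before handing them to code that expects text.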

So we might as well focus on the use of the csv module, after an import csv at the start to be sure -- rewriting only the changing part...:

    for fields in csv.reader(iter(lastNlines)):
        btcDate, btcPrice, btcVolume = fields[:3]

all the rest as before. csv.reader takes care of CSV parsing (you may not need the subtleties such as dealing with quoted/escaped commas but you pay no extra there!-) and leaves your code more concise and elegant.
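Put together, the deque and csv.reader pieces might look like the sketch below. It is written for Python 3 (hence newline="" in open), and wrapping the logic in a function named load_last_rows is my own choice, not something from the question. It assumes every non-empty row has at least three numeric columns.

```python
import collections
import csv


def load_last_rows(path, howfarback):
    """Parse the last `howfarback` rows of a CSV into three lists of floats."""
    dates, prices, volumes = [], [], []
    with open(path, "r", newline="") as source:
        # The deque consumes the whole file but retains only the tail.
        last_lines = collections.deque(source, howfarback)
    for fields in csv.reader(last_lines):
        if not fields:
            continue  # skip blank lines, e.g. an empty line at the end
        date, price, volume = fields[:3]
        dates.append(float(date))
        prices.append(float(price))
        volumes.append(float(volume))
    return dates, prices, volumes
```

Calling load_last_rows('.btceUSD.csv', 20000) would then return the three lists in one go; a row with a non-numeric field raises ValueError, which the caller can catch as in the code above.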

solution1
2 ACCEPTED 2014-12-29 16:05:04
