
large csv file makes segmentation fault for numpy.genfromtxt

I'd really like to create a NumPy array from a CSV file; however, I'm having issues when the file is ~50k lines long (like the MNIST training set). The file I'm trying to import looks something like this:

0.0,0.0,0.0,0.5,0.34,0.24,0.0,0.0,0.0
0.0,0.0,0.0,0.4,0.34,0.2,0.34,0.0,0.0
0.0,0.0,0.0,0.34,0.43,0.44,0.0,0.0,0.0
0.0,0.0,0.0,0.23,0.64,0.4,0.0,0.0,0.0

It works fine for something that's 10k lines long, like the validation set:

import numpy as np
csv = np.genfromtxt("MNIST_valid_set_data.csv", delimiter=",")

If I do the same with the training data (the larger file), I get a C-level segmentation fault. Does anyone know any better ways besides breaking the file up and then piecing it together?

The end result is that I'd like to pickle the arrays into a similar mnist.pkl.gz file but I can't do that if I can't read in the data.
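The write-out step itself is the easy part; roughly something like this sketch (the helper name and the exact tuple layout are just placeholders for whatever splits I end up with):

import gzip
import pickle

def save_sets(path, *datasets):
    # Write the arrays as one gzip-compressed pickle, in the spirit of mnist.pkl.gz.
    with gzip.open(path, 'wb') as f:
        pickle.dump(datasets, f)

# e.g. save_sets("my_mnist.pkl.gz", train_set, valid_set)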

Any help would be greatly appreciated.

I think you really want to track down the actual problem and solve it, rather than just work around it, because I'll bet you have other problems with your NumPy installation that you're going to have to deal with eventually.

But, since you asked for a workaround that's better than manually splitting the files, reading them, and merging them, here are two:


First, you can split the files programmatically and dynamically, instead of manually. This avoids wasting a lot of your own human effort, and also saves the disk space needed for those copies, even though it's conceptually the same thing you already know how to do.

As the genfromtxt docs make clear, the fname argument can be a pathname, a file object (opened in 'rb' mode), or just a generator of lines (as bytes). Of course a file object is itself a generator of lines, but so is, say, an islice of a file object, or a group from a grouper. So:

import numpy as np
from more_itertools import grouper

def getfrombigtxt(fname, *args, **kwargs):
    with open(fname, 'rb') as f:
        # Parse the file 5000 lines at a time, then stack the partial arrays.
        # A real list (not a generator) is built for vstack, and fillvalue is
        # passed by keyword to stay compatible with current more_itertools.
        return np.vstack([np.genfromtxt(group, *args, **kwargs)
                          for group in grouper(f, 5000, fillvalue=b'')])
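
It's used just like genfromtxt itself; for example (assuming the training file is named analogously to the validation one):

train_data = getfrombigtxt('MNIST_train_set_data.csv', delimiter=',')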

If you don't want to install more_itertools, you can also just copy the two-line grouper implementation from the Recipes section of the itertools docs (reproduced below), or even inline the zip_longest call straight into your code.
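
For reference, the recipe version is essentially this (a two-line core, shown here for convenience):

from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    # Collect data into fixed-length chunks, padding the last chunk with fillvalue.
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)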


Alternatively, you can parse the CSV file with the stdlib's csv module instead of with NumPy:

import csv
import numpy as np

def getfrombigtxt(fname, delimiter=','):
    with open(fname, 'r') as f:  # note text mode, not binary
        # Pass the delimiter through to csv.reader and build a list of float rows.
        rows = [list(map(float, row)) for row in csv.reader(f, delimiter=delimiter)]
        return np.vstack(rows)

This is obviously going to be a lot slower… but if we're talking about turning 50ms of processing into 1000ms, and you only do it once, who cares?
