
numpy Loadtxt function seems to be consuming too much memory

When I load an array using numpy.loadtxt, it seems to take too much memory. For example,

a = numpy.zeros(int(1e6))

causes an increase of about 8 MB in memory (as seen in htop, and as expected: 8 bytes × 1 million ≈ 8 MB). On the other hand, if I save and then load this array

numpy.savetxt('a.csv', a)
b = numpy.loadtxt('a.csv')

my memory usage increases by about 100 MB! Again, I observed this with htop, both in the IPython shell and while stepping through code using Pdb++.
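
For reference, the array's own in-memory footprint can be checked directly (a small sketch; the 100 MB figure above is the process growth seen in htop, not something numpy reports):

import numpy

a = numpy.zeros(int(1e6))
print(a.dtype)     # float64 by default
print(a.nbytes)    # 8000000 bytes, i.e. the ~8 MB increase seen in htop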

Any idea what's going on here?

After reading jozzas's answer, I realized that if I know the array size ahead of time, there is a much more memory-efficient way to do things. Say 'a' is an m×n array:

import csv
import numpy

b = numpy.zeros((m, n))                    # preallocate the full result array
with open('a.csv', 'r') as f:
    reader = csv.reader(f)                 # assumes a comma-delimited file
    for i, row in enumerate(reader):
        b[i, :] = numpy.array(row)         # parse and store one row at a time

Saving this array of floats to a text file creates a roughly 24 MB text file. When you re-load it, numpy goes through the file line by line, parsing the text and re-creating the value objects.

I would expect memory usage to spike during this time, as numpy doesn't know how big the resultant array needs to be until it gets to the end of the file, so I'd expect there to be at least 24 MB + 8 MB + other temporary memory in use.
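
As a rough sanity check of that 24 MB figure (not part of the original answer): numpy.savetxt's default format is '%.18e', which writes each float as about 24 characters plus a newline, so a million values come to roughly 25 MB (about 24 MiB) of text:

import os
import numpy

a = numpy.zeros(int(1e6))
numpy.savetxt('a.csv', a)                    # default fmt='%.18e'
print(len('%.18e\n' % 0.0))                  # 25 bytes per value
print(os.path.getsize('a.csv'))              # ~25,000,000 bytes on disk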

Here's the relevant bit of the numpy code, from numpy/lib/npyio.py:

    # Parse each line, including the first
    for i, line in enumerate(itertools.chain([first_line], fh)):
        vals = split_line(line)
        if len(vals) == 0:
            continue
        if usecols:
            vals = [vals[i] for i in usecols]
        # Convert each value according to its column and store
        items = [conv(val) for (conv, val) in zip(converters, vals)]
        # Then pack it according to the dtype's nesting
        items = pack_items(items, packing)
        X.append(items)

    #...A bit further on
    X = np.array(X, dtype)
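
Before that final np.array(X, dtype) call, every parsed value sits in the Python list X as a boxed Python float, which costs far more than 8 bytes per value. A rough back-of-the-envelope sketch (typical CPython object sizes, not exact figures):

import sys

n = int(1e6)
per_float = sys.getsizeof(0.0)    # ~24 bytes for each boxed Python float
per_slot = 8                      # ~8 bytes for each pointer held in the list X
print(n * (per_float + per_slot) / 1e6, 'MB')   # ~32 MB for the list alone, before np.array() runs

Add the temporary strings produced while splitting each line and the final 8 MB array, and a transient spike on the order of the 100 MB observed is plausible.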

This additional memory usage shouldn't be a concern, as this is just the way Python works: while your Python process appears to be using 100 MB of memory, internally it keeps track of which items are no longer used and will re-use that memory. For example, if you were to re-run this save-load procedure in the same program (save, load, save, load), your memory usage would not increase to 200 MB.
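
One way to see this is to watch the peak resident set size across repeated runs (a sketch using the standard library's resource module, whose ru_maxrss is reported in kB on Linux; this is an illustration, not from the original answer):

import resource
import numpy

def peak_rss_mb():
    # peak resident set size of this process, reported in kB on Linux
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

a = numpy.zeros(int(1e6))
for _ in range(3):
    numpy.savetxt('a.csv', a)
    b = numpy.loadtxt('a.csv')
    print(peak_rss_mb())   # jumps after the first load, then stays roughly flat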

Here is what I ended up doing to solve this problem. It works even if you don't know the shape ahead of time. It performs the conversion to float first and then combines the arrays (as opposed to @JohnLyon's answer, which combines the arrays of strings and then converts to float). This used an order of magnitude less memory for me, although it was perhaps a bit slower. However, I literally did not have the memory required to use np.loadtxt, so if you don't have sufficient memory, this will be better:

import numpy as np

def numpy_loadtxt_memory_friendly(the_file, max_bytes=1000000, **loadtxt_kwargs):
    numpy_arrs = []
    with open(the_file, 'rb') as f:
        i = 0
        while True:
            print(i)  # progress: number of lines processed so far
            # read roughly max_bytes worth of complete lines
            some_lines = f.readlines(max_bytes)
            if len(some_lines) == 0:
                break
            # loadtxt accepts any iterable of lines, so parse just this chunk
            vec = np.loadtxt(some_lines, **loadtxt_kwargs)
            if len(vec.shape) < 2:
                # a single-line chunk comes back 1-D; make it one row
                vec = vec.reshape(1, -1)
            numpy_arrs.append(vec)
            i += len(some_lines)
    return np.concatenate(numpy_arrs, axis=0)
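
For reference, here's a quick self-contained check of the function above (the array, file name, and sizes are illustrative, not from the original answer); it assumes a 2-D whitespace-delimited file like the one np.savetxt writes:

import numpy as np

data = np.random.rand(10000, 10)
np.savetxt('data.txt', data)                # ~2.5 MB text file, 10 values per line

b = numpy_loadtxt_memory_friendly('data.txt', max_bytes=1000000)
print(b.shape)                              # (10000, 10)
print(np.allclose(b, data))                 # True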
