
Fast data reading from text file in numpy

How can I speed up reading data and converting types with numpy? I also get numpy.void objects instead of ndarrays, which as far as I know is because the array is heterogeneous. I wrote a simple test showing that numpy.genfromtxt is slower than pure Python code, but I am sure there must be a better way. I couldn't get numpy.loadtxt to work.

How can I improve the performance, and how can I get ndarray sub-arrays as the result?

import timeit
import numpy as np

line = "QUAD4   1       123456  123456781.2345671.2345671.234567        "
text = [line + "\n" for x in range(1000000)]
with open("testQUADs","w") as f:
    f.writelines(text)


setup="""
import numpy as np
"""

st="""
with open("testQUADs", "r") as f:
    fn = f.readlines()
for i, line in enumerate(fn):
    l = [line[0:8], line[8:16], line[16:24], line[24:32], line[32:40], line[40:48], line[48:56], line[56:64], line[64:72], line[72:80]]
    fn[i] = [l[0].strip(), int(l[1]), int(l[2]), int(l[3]), float(l[4]), float(l[5]), float(l[6]), l[7].strip()]
fn = np.array(fn)
"""

stnp="""
array = np.genfromtxt("testQUADs", delimiter=8, dtype="|S8, i4, i4, i4, f8, f8, f8, |S8")
print(array[0])
print(type(array[0]))
"""


print(timeit.timeit(st, setup=setup, number=1))
print(timeit.timeit(stnp, setup=setup, number=1))

Output:

4.560215269000764
(b'QUAD4   ', 1, 123456, 12345678, 1.234567, 1.234567, 1.234567, b'        ')
<class 'numpy.void'>
6.360823633000109

What you get from

array = np.genfromtxt("testQUADs", delimiter=8, dtype="|S8, i4, i4, i4, f8, f8, f8, |S8")

is a structured array.

array.dtype

will look like

np.dtype("|S8, i4, i4, i4, f8, f8, f8, |S8")

array.shape is (n,), where n is the number of rows; it's a 1d array with 8 fields.

array[0] is one element, or record, of this array; look at its dtype. Don't worry about its type (numpy.void is just the type of a record with a compound dtype).

array['f0'] is the first field, all rows, in this case an array of strings.
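A small sketch of these points, using a list of three copies of the sample line instead of the file (genfromtxt accepts either). The variable name lines is only for illustration.

```python
import numpy as np

line = "QUAD4   1       123456  123456781.2345671.2345671.234567        "
lines = [line, line, line]  # a list of lines stands in for the file

array = np.genfromtxt(lines, delimiter=8,
                      dtype="|S8, i4, i4, i4, f8, f8, f8, |S8")

print(array.shape)         # 1d: one entry per row
print(array.dtype.names)   # default field names f0 ... f7
print(type(array[0]))      # a single record is a numpy.void
print(type(array['f1']))   # a single field is an ordinary ndarray
print(array['f1'])         # first integer column, all rows
print(array['f4'].dtype)   # float64
```

So the ndarrays are there; they are reached by field name rather than by column index.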

You may need to read the dtype and structured array docs in more depth. Many SO posters have been confused about the 1d structured array that genfromtxt produces.

genfromtxt reads the file just like your code does, and splits each line into strings. Then it converts those strings according to the dtype, and collects the results in a list. At the end it assembles that list into an array: the 1d array of the specified dtype. Since it is doing more than your code, it's not surprising that it is a bit slower.
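The steps just described can be sketched in plain Python. This is only an illustration of the split / convert / assemble idea, not genfromtxt's actual implementation; the converters list and the width variable are hypothetical helpers introduced for this sketch.

```python
import numpy as np

line = "QUAD4   1       123456  123456781.2345671.2345671.234567        "
lines = [line + "\n" for _ in range(3)]

dt = np.dtype("S8, i4, i4, i4, f8, f8, f8, S8")
# One converter per column, matching the dtype above (hypothetical helper list).
converters = [str.encode, int, int, int, float, float, float, str.encode]
width = 8

rows = []
for text in lines:
    text = text.rstrip("\r\n")
    # 1. split the line into fixed-width strings
    pieces = [text[i:i + width] for i in range(0, len(text), width)]
    # 2. convert each string according to its column type
    rows.append(tuple(conv(p) for conv, p in zip(converters, pieces)))

# 3. assemble the list of tuples into a 1d structured array
result = np.array(rows, dtype=dt)
print(result[0])
```

Note that the rows are tuples, not lists; np.array needs tuples to fill the records of a structured dtype.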

loadtxt does much the same, with less power in certain areas.

pandas has a csv reader that is faster because it uses more compiled code. But a dataframe isn't any easier to understand than a structured array.


Your 2 methods don't produce the same thing:

In [105]: line = "QUAD4   1       123456  123456781.2345671.2345671.234567        "

In [106]: txt=[line,line,line]    # a list of lines instead of a file

In [107]: A = np.genfromtxt(txt, delimiter=8, dtype="|S8, i4, i4, i4, f8, f8, f8, |S8")

In [108]: A
Out[108]: 
array([ ('QUAD4   ', 1, 123456, 12345678, 1.234567, 1.234567, 1.234567, '        '),
       ('QUAD4   ', 1, 123456, 12345678, 1.234567, 1.234567, 1.234567, '        '),
       ('QUAD4   ', 1, 123456, 12345678, 1.234567, 1.234567, 1.234567, '        ')], 
      dtype=[('f0', 'S8'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4'), ('f4', '<f8'), ('f5', '<f8'), ('f6', '<f8'), ('f7', 'S8')])

Note the dtype, and that the array has 3 elements.

Your line parser:

In [109]: fn=txt[:]    
In [110]: for i, line in enumerate(fn):
   .....:     l = [line[0:8], line[8:16], line[16:24], line[24:32], line[32:40], line[40:48], line[48:56], line[56:64], line[64:72], line[72:80]]
   .....:     fn[i] = [l[0].strip(), int(l[1]), int(l[2]), int(l[3]), float(l[4]), float(l[5]), float(l[6]), l[7].strip()]
   .....:

In [111]: fn
Out[111]: 
[['QUAD4', 1, 123456, 12345678, 1.234567, 1.234567, 1.234567, ''],
 ['QUAD4', 1, 123456, 12345678, 1.234567, 1.234567, 1.234567, ''],
 ['QUAD4', 1, 123456, 12345678, 1.234567, 1.234567, 1.234567, '']]

In [112]: A1=np.array(fn)

In [113]: A1
Out[113]: 
array([['QUAD4', '1', '123456', '12345678', '1.234567', '1.234567',
        '1.234567', ''],
       ['QUAD4', '1', '123456', '12345678', '1.234567', '1.234567',
        '1.234567', ''],
       ['QUAD4', '1', '123456', '12345678', '1.234567', '1.234567',
        '1.234567', '']], 
      dtype='|S8')

fn is a list of lists, which can hold values of different types. But when you put it into an array, it turns everything into strings.

I could turn your fn list into a structured array with:

In [120]: np.array([tuple(l) for l in fn],dtype=A.dtype)
Out[120]: 
array([('QUAD4', 1, 123456, 12345678, 1.234567, 1.234567, 1.234567, ''),
       ('QUAD4', 1, 123456, 12345678, 1.234567, 1.234567, 1.234567, ''),
       ('QUAD4', 1, 123456, 12345678, 1.234567, 1.234567, 1.234567, '')], 
      dtype=[('f0', 'S8'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4'), ('f4', '<f8'), ('f5', '<f8'), ('f6', '<f8'), ('f7', 'S8')])

That's the same as A from genfromtxt except for the padding of the strings.
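The question also asks for plain ndarray sub-arrays. One possible way to get them from a structured array like this is to stack the fields of one type side by side. np.column_stack is just one option here, and the names ids and coords are made up for the example.

```python
import numpy as np

dt = np.dtype("S8, i4, i4, i4, f8, f8, f8, S8")
record = (b'QUAD4', 1, 123456, 12345678, 1.234567, 1.234567, 1.234567, b'')
A = np.array([record] * 3, dtype=dt)   # same layout as the array above

# Each field is a 1d ndarray; stacking them gives ordinary 2d arrays.
ids = np.column_stack([A['f1'], A['f2'], A['f3']])      # (n, 3) integers
coords = np.column_stack([A['f4'], A['f5'], A['f6']])   # (n, 3) floats

print(ids.shape, ids.dtype)
print(coords.shape, coords.dtype)
```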


Here's a variation that might be useful, though it might also stretch your knowledge of structured arrays:

In [132]: dt=np.dtype('S8,(3)i,(3)f,S8')    # 'S8' is the current spelling of the older 'a8' alias
In [133]: A = np.genfromtxt(txt, delimiter=8, dtype=dt)

A now has 4 fields, two of which hold multiple values.

A['f1'] will return a (n,3) array of ints.
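A short check of that variation, again on a list of three sample lines. It uses the 'S8' spelling of the string type; everything else follows the dtype string above.

```python
import numpy as np

line = "QUAD4   1       123456  123456781.2345671.2345671.234567        "
txt = [line, line, line]

dt = np.dtype('S8,(3)i,(3)f,S8')
A = np.genfromtxt(txt, delimiter=8, dtype=dt)

print(A.dtype.names)    # ('f0', 'f1', 'f2', 'f3')
print(A['f1'].shape)    # (3, 3): three rows, three ints per row
print(A['f1'][0])       # the three integer columns of the first row
print(A['f2'].dtype)    # float32, because 'f' is single precision
```

If double precision matters, writing the float part as (3)f8 instead of (3)f should keep the values as float64, as in the original dtype.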

There is also:

np.loadtxt

You can use it if you're sure that every row has the same number of values. Beyond that, the previous answer has already said it all ;)
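A minimal sketch of loadtxt with a structured dtype. The data here is hypothetical and whitespace-separated: as far as I know, loadtxt splits on a delimiter string (whitespace by default) and has no fixed-width mode like genfromtxt's delimiter=8, which may be why it did not work on the original file, where some columns touch each other. The field names are made up for the example.

```python
import io
import numpy as np

# Hypothetical whitespace-separated data; every row has the same number of values.
data = io.StringIO("QUAD4 1 123456 1.5\nQUAD4 2 123457 2.5\n")

dt = np.dtype([("name", "U8"), ("id", "i4"), ("node", "i4"), ("x", "f8")])
arr = np.loadtxt(data, dtype=dt)

print(arr.shape)    # (2,)
print(arr["id"])    # [1 2]
print(arr["x"])     # the float column as a plain ndarray
```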
