How can I speed up reading data and converting types with numpy? I also get numpy.void objects instead of ndarrays, which as far as I know is because the arrays are heterogeneous. I have created a simple test that shows numpy.genfromtxt is slower than pure Python code, but I am sure there must be a better way. I couldn't get numpy.loadtxt to work.
How can I improve the performance? And how can I get ndarray sub-arrays as the result?
import timeit
import numpy as np
line = "QUAD4 1 123456 123456781.2345671.2345671.234567 "
text = [line + "\n" for x in range(1000000)]
with open("testQUADs","w") as f:
    f.writelines(text)
setup="""
import numpy as np
"""
st="""
with open("testQUADs", "r") as f:
    fn = f.readlines()
for i, line in enumerate(fn):
    l = [line[0:8], line[8:16], line[16:24], line[24:32], line[32:40], line[40:48], line[48:56], line[56:64], line[64:72], line[72:80]]
    fn[i] = [l[0].strip(), int(l[1]), int(l[2]), int(l[3]), float(l[4]), float(l[5]), float(l[6]), l[7].strip()]
fn = np.array(fn)
"""
stnp="""
array = np.genfromtxt("testQUADs", delimiter=8, dtype="|S8, i4, i4, i4, f8, f8, f8, |S8")
print(array[0])
print(type(array[0]))
"""
print(timeit.timeit(st, setup=setup, number=1))
print(timeit.timeit(stnp, setup=setup, number=1))
Output:
4.560215269000764
(b'QUAD4 ', 1, 123456, 12345678, 1.234567, 1.234567, 1.234567, b' ')
<class 'numpy.void'>
6.360823633000109
What you get from
array = np.genfromtxt("testQUADs", delimiter=8, dtype="|S8, i4, i4, i4, f8, f8, f8, |S8")
is a structured array.
array.dtype will look like
np.dtype("|S8, i4, i4, i4, f8, f8, f8, |S8")
array.shape is (number of rows,): it is a 1d array in which each element has 8 fields.
array[0] is one element, or record, of this array; look at its dtype. Don't worry about its type (numpy.void is just the type of a record with a compound dtype).
array['f0'] is the first field for all rows, in this case an array of strings.
You may need to read the dtype and structured array docs in more depth. Many SO posters have been confused by the 1d structured array that genfromtxt produces.
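As a minimal sketch of the points above, the following builds a small structured array from in-memory lines and pulls ordinary ndarrays out of it. The sample record is an assumption: it uses eight 8-character fields with a guessed alignment, and the names lines, ids and coords are only illustrative.

```python
import numpy as np

# Hypothetical fixed-width record: eight fields of 8 characters each
# (the exact alignment inside each field is an assumption).
line = "QUAD4   " + "       1" + "  123456" + "12345678" + "1.234567" * 3 + " " * 8
lines = [line] * 3  # genfromtxt also accepts a list of lines

arr = np.genfromtxt(lines, delimiter=8,
                    dtype="|S8, i4, i4, i4, f8, f8, f8, |S8")

names = arr.dtype.names   # default field names: 'f0' ... 'f7'
record = arr[0]           # one record; its Python type is numpy.void
ids = arr["f1"]           # a plain 1d ndarray with one int per row

# Stack the three float fields into an ordinary (n, 3) float ndarray.
coords = np.column_stack([arr["f4"], arr["f5"], arr["f6"]])
```

Indexing by field name is what gives back regular ndarrays; indexing by row gives back single records.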
genfromtxt reads the file just like your code does, and splits each line into strings. Then it converts those strings according to the dtype and collects the results in a list. At the end it assembles that list into an array: the 1d array of the specified dtype. Since it is doing more than your code, it's not surprising that it is a bit slower.
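The steps described above can be sketched roughly as follows. This is not genfromtxt's actual implementation, only an illustration of split, convert, collect, assemble; the sample record, the converters list and the parse_line helper are assumptions made for the example.

```python
import numpy as np

WIDTH = 8
dt = np.dtype("|S8, i4, i4, i4, f8, f8, f8, |S8")

# Hypothetical fixed-width record (field alignment is an assumption).
line = "QUAD4   " + "       1" + "  123456" + "12345678" + "1.234567" * 3 + " " * 8
lines = [line + "\n", line + "\n"]

def to_bytes(s):
    # Byte-string fields are stored as ASCII bytes.
    return s.encode("ascii")

# One converter per column, matching the field order of dt.
converters = [to_bytes, int, int, int, float, float, float, to_bytes]

def parse_line(text):
    """Split one line into WIDTH-character strings and convert each one."""
    text = text.rstrip("\r\n")
    fields = [text[i:i + WIDTH] for i in range(0, len(text), WIDTH)]
    return tuple(conv(f) for conv, f in zip(converters, fields))

rows = [parse_line(text) for text in lines]  # list of tuples, one per line
arr = np.array(rows, dtype=dt)               # 1d structured array
```

Each record has to be a tuple, not a list, for numpy to treat it as one element of the structured dtype.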
loadtxt does much the same, with less power in certain areas.
pandas has a csv reader that is faster because it uses more compiled code. But a dataframe isn't any easier to understand than a structured array.
Your 2 methods don't produce the same thing:
In [105]: line = "QUAD4 1 123456 123456781.2345671.2345671.234567 "
In [106]: txt=[line,line,line] # a list of lines instead of a file
In [107]: A = np.genfromtxt(txt, delimiter=8, dtype="|S8, i4, i4, i4, f8, f8, f8, |S8")
In [108]: A
Out[108]:
array([ ('QUAD4 ', 1, 123456, 12345678, 1.234567, 1.234567, 1.234567, ' '),
('QUAD4 ', 1, 123456, 12345678, 1.234567, 1.234567, 1.234567, ' '),
('QUAD4 ', 1, 123456, 12345678, 1.234567, 1.234567, 1.234567, ' ')],
dtype=[('f0', 'S8'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4'), ('f4', '<f8'), ('f5', '<f8'), ('f6', '<f8'), ('f7', 'S8')])
Note the dtype, and that there are 3 elements.
Your line parser:
In [109]: fn=txt[:]
In [110]: for i, line in enumerate(fn):
   .....:     l = [line[0:8], line[8:16], line[16:24], line[24:32], line[32:40], line[40:48], line[48:56], line[56:64], line[64:72], line[72:80]]
   .....:     fn[i] = [l[0].strip(), int(l[1]), int(l[2]), int(l[3]), float(l[4]), float(l[5]), float(l[6]), l[7].strip()]
   .....:
In [111]: fn
Out[111]:
[['QUAD4', 1, 123456, 12345678, 1.234567, 1.234567, 1.234567, ''],
['QUAD4', 1, 123456, 12345678, 1.234567, 1.234567, 1.234567, ''],
['QUAD4', 1, 123456, 12345678, 1.234567, 1.234567, 1.234567, '']]
In [112]: A1=np.array(fn)
In [113]: A1
Out[113]:
array([['QUAD4', '1', '123456', '12345678', '1.234567', '1.234567',
'1.234567', ''],
['QUAD4', '1', '123456', '12345678', '1.234567', '1.234567',
'1.234567', ''],
['QUAD4', '1', '123456', '12345678', '1.234567', '1.234567',
'1.234567', '']],
dtype='|S8')
fn is a list of lists, which can hold values of different types. But when you put it into an array, numpy converts everything to strings so that all elements share one dtype.
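A small sketch of that conversion, using made-up rows (the values and the names as_strings and as_objects are only illustrative):

```python
import numpy as np

# Mixed Python values: a string, two ints and a float per row.
rows = [["QUAD4", 1, 123456, 1.234567],
        ["QUAD4", 2, 123457, 2.5]]

# With no dtype given, numpy picks one common dtype: a string type.
as_strings = np.array(rows)

# dtype=object keeps the original Python objects, at the cost of
# losing fast numeric operations on the array.
as_objects = np.array(rows, dtype=object)
```

Neither result is a numeric array, which is why a structured dtype (or separate arrays per column) is needed to keep the ints and floats as numbers.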
I could turn your fn list into a structured array with:
In [120]: np.array([tuple(l) for l in fn],dtype=A.dtype)
Out[120]:
array([('QUAD4', 1, 123456, 12345678, 1.234567, 1.234567, 1.234567, ''),
('QUAD4', 1, 123456, 12345678, 1.234567, 1.234567, 1.234567, ''),
('QUAD4', 1, 123456, 12345678, 1.234567, 1.234567, 1.234567, '')],
dtype=[('f0', 'S8'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4'), ('f4', '<f8'), ('f5', '<f8'), ('f6', '<f8'), ('f7', 'S8')])
That's the same as A from genfromtxt, except for the padding of the strings.
Here's a variation that might be useful, though it might also stretch your knowledge of structured arrays:
In [132]: dt=np.dtype('S8,(3)i,(3)f,S8')
In [133]: A = np.genfromtxt(txt, delimiter=8, dtype=dt)
A now has 4 fields, two of which have multiple values.
A['f1'] will return a (n,3) array of ints.
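A minimal sketch of that variation on in-memory lines, assuming a made-up record with eight 8-character fields (the alignment is a guess) and using the S8 spelling of the byte-string type:

```python
import numpy as np

# Hypothetical fixed-width record: eight fields of 8 characters each.
line = "QUAD4   " + "       1" + "  123456" + "12345678" + "1.234567" * 3 + " " * 8
lines = [line] * 3

# Four fields: a name, three ints, three floats, and a trailing string.
dt = np.dtype("S8,(3)i,(3)f,S8")
A = np.genfromtxt(lines, delimiter=8, dtype=dt)

ints = A["f1"]     # 2d ndarray of ints, one row per record
floats = A["f2"]   # 2d ndarray of floats ('f' is single precision)
```

Note that 'f' gives float32; if the f8 precision of the original dtype matters, '(3)f8' can be used instead.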
You also have np.loadtxt. You can use it if you're sure that each row has the same number of values. But everything has already been said in the previous answer ;)