
Converting a 2D numpy array to a structured array

I'm trying to convert a two-dimensional array into a structured array with named fields. I want each row in the 2D array to be a new record in the structured array. Unfortunately, nothing I've tried is working the way I expect.

I'm starting with:

>>> myarray = numpy.array([("Hello",2.5,3),("World",3.6,2)])
>>> print myarray
[['Hello' '2.5' '3']
 ['World' '3.6' '2']]

I want to convert to something that looks like this:

>>> newarray = numpy.array([("Hello",2.5,3),("World",3.6,2)], dtype=[("Col1","S8"),("Col2","f8"),("Col3","i8")])
>>> print newarray
[('Hello', 2.5, 3L) ('World', 3.6000000000000001, 2L)]

What I've tried:

>>> newarray = myarray.astype([("Col1","S8"),("Col2","f8"),("Col3","i8")])
>>> print newarray
[[('Hello', 0.0, 0L) ('2.5', 0.0, 0L) ('3', 0.0, 0L)]
 [('World', 0.0, 0L) ('3.6', 0.0, 0L) ('2', 0.0, 0L)]]

>>> newarray = numpy.array(myarray, dtype=[("Col1","S8"),("Col2","f8"),("Col3","i8")])
>>> print newarray
[[('Hello', 0.0, 0L) ('2.5', 0.0, 0L) ('3', 0.0, 0L)]
 [('World', 0.0, 0L) ('3.6', 0.0, 0L) ('2', 0.0, 0L)]]

Both of these approaches attempt to convert each entry in myarray into a record with the given dtype, so the extra zeros are inserted. I can't figure out how to get it to convert each row into a record.

Another attempt:

>>> newarray = myarray.copy()
>>> newarray.dtype = [("Col1","S8"),("Col2","f8"),("Col3","i8")]
>>> print newarray
[[('Hello', 1.7219343871178711e-317, 51L)]
 [('World', 1.7543139673493688e-317, 50L)]]

This time no actual conversion is performed. The existing data in memory is just re-interpreted as the new data type.
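For contrast, here is a minimal sketch of a case where reinterpreting memory does give sensible records: a homogeneous float array whose columns already have the same binary type as the target fields. The array `points` and the field names `x` and `y` are made up for illustration. The string array above does not meet this condition, which is why its reinterpreted values are garbage.

```python
import numpy as np

# Homogeneous 2-D array: every column is already an 8-byte float.
points = np.array([[1.0, 2.0], [3.0, 4.0]])

# Reinterpret each 16-byte row as one (x, y) record. No values are converted;
# the view has shape (2, 1), so flatten it to one record per original row.
xy = points.view([("x", "f8"), ("y", "f8")]).reshape(-1)

print(xy["x"], xy["y"])
```

Because this is only a view, `xy` shares memory with `points`; it works here solely because the bytes already match the field layout.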

The array that I'm starting with is being read in from a text file. The data types are not known ahead of time, so I can't set the dtype at the time of creation. I need a high-performance and elegant solution that will work well for general cases since I will be doing this type of conversion many, many times for a large variety of applications.


You can "create a record array from a (flat) list of arrays" using numpy.core.records.fromarrays as follows:

>>> import numpy as np
>>> myarray = np.array([("Hello",2.5,3),("World",3.6,2)])
>>> print myarray
[['Hello' '2.5' '3']
 ['World' '3.6' '2']]

>>> newrecarray = np.core.records.fromarrays(myarray.transpose(), 
                                             names='col1, col2, col3',
                                             formats = 'S8, f8, i8')

>>> print newrecarray
[('Hello', 2.5, 3) ('World', 3.5999999046325684, 2)]

I was trying to do something similar. I found that when numpy creates a structured array from an existing 2D array (using np.core.records.fromarrays), it treats each row of the 2-D array, rather than each column, as one field. So you have to transpose the array so that each column becomes a field and each original row becomes a record. This behavior does not seem very intuitive at first, but it follows from fromarrays expecting a flat list of arrays, one per field.
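A small sketch of why the transpose matters, using np.rec.fromarrays (the same function reached through the shorter np.rec name) and the dtype from the question with lower-case field names. The flag `untransposed_failed` is only there for illustration.

```python
import numpy as np

arr = np.array([("Hello", 2.5, 3), ("World", 3.6, 2)])  # 2-D array of strings
dt = [("col1", "S8"), ("col2", "f8"), ("col3", "i8")]

# Transposed: iterating arr.T yields the three columns, one array per field.
rec = np.rec.fromarrays(arr.T, dtype=dt)

# Untransposed: iterating arr yields two rows, which cannot fill three fields,
# so fromarrays rejects the input.
try:
    np.rec.fromarrays(arr, dtype=dt)
    untransposed_failed = False
except ValueError:
    untransposed_failed = True

print(rec)
print(untransposed_failed)
```

Note that with a square input (as many rows as fields) the untransposed call would not raise, and would silently mix up rows and columns, so the transpose is worth double-checking.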

If the data starts as a list of tuples, then creating a structured array is straightforward:

In [228]: alist = [("Hello",2.5,3),("World",3.6,2)]
In [229]: dt = [("Col1","S8"),("Col2","f8"),("Col3","i8")]
In [230]: np.array(alist, dtype=dt)
array([(b'Hello',  2.5, 3), (b'World',  3.6, 2)], 
      dtype=[('Col1', 'S8'), ('Col2', '<f8'), ('Col3', '<i8')])

The complication here is that the list of tuples has been turned into a 2d string array:

In [231]: arr = np.array(alist)
In [232]: arr
array([['Hello', '2.5', '3'],
       ['World', '3.6', '2']], 

We could use the well-known zip(*...) idiom for 'transposing' this array. Since zip(*arr) alone would give us columns, we actually want a double transpose:

In [234]: list(zip(*arr.T))
Out[234]: [('Hello', '2.5', '3'), ('World', '3.6', '2')]

zip has conveniently given us a list of tuples. Now we can recreate the array with desired dtype:

In [235]: np.array(_, dtype=dt)
array([(b'Hello',  2.5, 3), (b'World',  3.6, 2)], 
      dtype=[('Col1', 'S8'), ('Col2', '<f8'), ('Col3', '<i8')])
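As a possible alternative to the double transpose, the rows can be turned into tuples directly. This is only a sketch under the same assumptions as above (a 2-D string array and a known dtype); the names `rows` and `out` are mine. The point it illustrates is that np.array reads each tuple as one record.

```python
import numpy as np

arr = np.array([("Hello", 2.5, 3), ("World", 3.6, 2)])   # 2-D string array
dt = [("Col1", "S8"), ("Col2", "f8"), ("Col3", "i8")]

# tolist() gives nested lists of Python strings; each row must become a
# tuple so that numpy treats it as a single record.
rows = [tuple(row) for row in arr.tolist()]

# The string values are parsed into the field types while the array is built.
out = np.array(rows, dtype=dt)
print(out)
```

Both this and the zip version go through Python objects, so they are convenient rather than fast for large arrays.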

The accepted answer uses fromarrays:

In [236]: np.rec.fromarrays(arr.T, dtype=dt)
rec.array([(b'Hello',  2.5, 3), (b'World',  3.6, 2)], 
          dtype=[('Col1', 'S8'), ('Col2', '<f8'), ('Col3', '<i8')])

Internally, fromarrays takes a common recfunctions approach: create target array, and copy values by field name. Effectively it does:

In [237]: newarr = np.empty(arr.shape[0], dtype=dt)
In [238]: for n, v in zip(newarr.dtype.names, arr.T):
     ...:     newarr[n] = v
In [239]: newarr
array([(b'Hello',  2.5, 3), (b'World',  3.6, 2)], 
      dtype=[('Col1', 'S8'), ('Col2', '<f8'), ('Col3', '<i8')])
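The copy-by-field loop above can be wrapped in a small reusable function. This is a sketch rather than a library routine: the name `structured_from_rows` and the shape check are my own additions.

```python
import numpy as np

def structured_from_rows(rows, dtype):
    """Convert a 2-D array (one row per record) into a 1-D structured array."""
    rows = np.asarray(rows)
    dtype = np.dtype(dtype)
    if rows.ndim != 2 or rows.shape[1] != len(dtype.names):
        raise ValueError("expected a 2-D array with one column per field")
    out = np.empty(rows.shape[0], dtype=dtype)
    # Assigning a whole column to a field casts it to that field's type.
    for name, column in zip(dtype.names, rows.T):
        out[name] = column
    return out

arr = np.array([("Hello", 2.5, 3), ("World", 3.6, 2)])
dt = [("Col1", "S8"), ("Col2", "f8"), ("Col3", "i8")]
newarr = structured_from_rows(arr, dt)
print(newarr)
```

The loop runs once per field, not once per row, so the per-element conversion work stays inside numpy.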

I guess

new_array = np.core.records.fromrecords([("Hello",2.5,3),("World",3.6,2)],

is what you want.

There's a lot of confusion here between "record array" and "structured array". Here's my short solution for a structured array.

dtype = np.dtype([("Col1","S8"),("Col2","f8"),("Col3","i8")])
myarray = np.array([("Hello",2.5,3),("World",3.6,2)], dtype=dtype)
np.array(np.rec.fromarrays(myarray.transpose(), names=dtype.names).astype(dtype=dtype).tolist(), dtype=dtype)

So, with the assumption that dtype is defined, this is a one-liner.
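To make the record array versus structured array distinction concrete, here is a short sketch using the dtype from the question. The variable names are illustrative only.

```python
import numpy as np

dt = np.dtype([("Col1", "S8"), ("Col2", "f8"), ("Col3", "i8")])
structured = np.array([("Hello", 2.5, 3), ("World", 3.6, 2)], dtype=dt)

# A record array is the same data seen through the np.recarray subclass.
records = structured.view(np.recarray)

col2_by_attribute = records.Col2   # attribute access: record arrays only
col2_by_key = structured["Col2"]   # key access: works on both kinds

# Viewing as np.ndarray drops the subclass again without copying the data.
plain = records.view(np.ndarray)
print(type(records).__name__, type(plain).__name__)
```

So any of the record-array answers in this thread can be turned back into an ordinary array with a view, if attribute access is not needed.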

Okay, I have been struggling with this for a while now but I have found a way to do this that doesn't take too much effort. I apologise if this code is "dirty"....

Let's start with a 2D array:

mydata = numpy.array([['text1', 1, 'longertext1', 0.1111],
                     ['text2', 2, 'longertext2', 0.2222],
                     ['text3', 3, 'longertext3', 0.3333],
                     ['text4', 4, 'longertext4', 0.4444],
                     ['text5', 5, 'longertext5', 0.5555]])

So we end up with a 2D array with 4 columns and 5 rows:

Out[30]: (5L, 4L)

To use numpy.core.records.array, we need to supply the input argument as a list of arrays, so:

(array(['text1', '1', 'longertext1', '0.1111'], 
 array(['text2', '2', 'longertext2', '0.2222'], 
 array(['text3', '3', 'longertext3', '0.3333'], 
 array(['text4', '4', 'longertext4', '0.4444'], 
 array(['text5', '5', 'longertext5', '0.5555'], 

This produces a separate array per row of data BUT, we need the input arrays to be by column so what we will need is:

(array(['text1', 'text2', 'text3', 'text4', 'text5'], 
 array(['1', '2', '3', '4', '5'], 
 array(['longertext1', 'longertext2', 'longertext3', 'longertext4',
 array(['0.1111', '0.2222', '0.3333', '0.4444', '0.5555'], 

Finally, it needs to be a list of arrays, not a tuple, so we wrap the above in list() as below:

list(tuple(mydata.transpose()))

That is our data input argument sorted.... next is the dtype:

mydtype = numpy.dtype([('My short text Column', 'S5'),
                       ('My integer Column', numpy.int16),
                       ('My long text Column', 'S11'),
                       ('My float Column', numpy.float32)])
Out[37]: dtype([('My short text Column', '|S5'), ('My integer Column', '<i2'), ('My long text Column', '|S11'), ('My float Column', '<f4')])

Okay, so now we can pass that to the numpy.core.records.array():

myRecord = numpy.core.records.array(list(tuple(mydata.transpose())), dtype=mydtype)

... and fingers crossed:

rec.array([('text1', 1, 'longertext1', 0.11110000312328339),
       ('text2', 2, 'longertext2', 0.22220000624656677),
       ('text3', 3, 'longertext3', 0.33329999446868896),
       ('text4', 4, 'longertext4', 0.44440001249313354),
       ('text5', 5, 'longertext5', 0.5554999709129333)], 
      dtype=[('My short text Column', '|S5'), ('My integer Column', '<i2'), ('My long text Column', '|S11'), ('My float Column', '<f4')])

Voila! You can index by column name as in:

myRecord['My float Column']
Out[39]: array([ 0.1111    ,  0.22220001,  0.33329999,  0.44440001,  0.55549997], dtype=float32)
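As far as I can tell, the intermediate tuple() is not required, because iterating over the transposed array already yields one array per column. Here is a shortened sketch with only two of the rows, using np.rec.array (the same function reached through the np.rec name); `columns` is a name I made up.

```python
import numpy as np

mydata = np.array([["text1", 1, "longertext1", 0.1111],
                   ["text2", 2, "longertext2", 0.2222]])

mydtype = np.dtype([("My short text Column", "S5"),
                    ("My integer Column", np.int16),
                    ("My long text Column", "S11"),
                    ("My float Column", np.float32)])

# list() over the transpose gives a list of 1-D arrays, one per column.
columns = list(mydata.T)

# Given a list of arrays, np.rec.array fills one field from each array.
myRecord = np.rec.array(columns, dtype=mydtype)
print(myRecord["My integer Column"])
```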

I hope this helps as I wasted so much time with numpy.asarray and mydata.astype etc trying to get this to work before finally working out this method.

