
Converting numpy structured array subset to numpy array without copy

Suppose I have the following numpy structured array:

In [250]: x
Out[250]: 
array([(22, 2, -1000000000, 2000), (22, 2, 400, 2000),
       (22, 2, 804846, 2000), (44, 2, 800, 4000), (55, 5, 900, 5000),
       (55, 5, 1000, 5000), (55, 5, 8900, 5000), (55, 5, 11400, 5000),
       (33, 3, 14500, 3000), (33, 3, 40550, 3000), (33, 3, 40990, 3000),
       (33, 3, 44400, 3000)], 
       dtype=[('f1', '<i4'), ('f2', '<f4'), ('f3', '<f4'), ('f4', '<i4')])

I am trying to convert a subset of the fields of the above array to a regular numpy array. It is essential for my application that no copies are created (only views).

Fields are retrieved from the above structured array by using the following function:

def fields_view(array, fields):
    return array.getfield(numpy.dtype(
        {name: array.dtype.fields[name] for name in fields}
    ))

If I am interested in fields 'f2' and 'f3', I would do the following:

In [251]: y=fields_view(x,['f2','f3'])
In [252]: y
Out[252]:
array([(2.0, -1000000000.0), (2.0, 400.0), (2.0, 804846.0), (2.0, 800.0),
       (5.0, 900.0), (5.0, 1000.0), (5.0, 8900.0), (5.0, 11400.0),
       (3.0, 14500.0), (3.0, 40550.0), (3.0, 40990.0), (3.0, 44400.0)], 
       dtype={'names':['f2','f3'], 'formats':['<f4','<f4'], 'offsets':[4,8], 'itemsize':12})

There is a way to directly get an ndarray from the 'f2' and 'f3' fields of the original structured array. However, for my application, it is necessary to build this intermediary structured array, as this data subset is an attribute of a class.

I can't convert the intermediary structured array to a regular numpy array without doing a copy.

In [253]: y.view(('<f4', len(y.dtype.names)))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-54-f8fc3a40fd1b> in <module>()
----> 1 y.view(('<f4', len(y.dtype.names)))

ValueError: new type not compatible with array.

This function can also be used to convert a record array to an ndarray:

def recarr_to_ndarr(x,typ):

    fields = x.dtype.names
    shape = x.shape + (len(fields),)
    offsets = [x.dtype.fields[name][1] for name in fields]
    assert not any(np.diff(offsets, n=2))
    strides = x.strides + (offsets[1] - offsets[0],)
    y = np.ndarray(shape=shape, dtype=typ, buffer=x,
               offset=offsets[0], strides=strides)
    return y

However, I get the following error:

In [254]: recarr_to_ndarr(y,'<f4')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-65-2ebda2a39e9f> in <module>()
----> 1 recarr_to_ndarr(y,'<f4')

<ipython-input-62-8a9eea8e7512> in recarr_to_ndarr(x, typ)
      8     strides = x.strides + (offsets[1] - offsets[0],)
      9     y = np.ndarray(shape=shape, dtype=typ, buffer=x,
---> 10                offset=offsets[0], strides=strides)
     11     return y
     12 

TypeError: expected a single-segment buffer object

The function works fine if I create a copy:

In [255]: recarr_to_ndarr(np.array(y),'<f4')
Out[255]: 
array([[  2.00000000e+00,  -1.00000000e+09],
       [  2.00000000e+00,   4.00000000e+02],
       [  2.00000000e+00,   8.04846000e+05],
       [  2.00000000e+00,   8.00000000e+02],
       [  5.00000000e+00,   9.00000000e+02],
       [  5.00000000e+00,   1.00000000e+03],
       [  5.00000000e+00,   8.90000000e+03],
       [  5.00000000e+00,   1.14000000e+04],
       [  3.00000000e+00,   1.45000000e+04],
       [  3.00000000e+00,   4.05500000e+04],
       [  3.00000000e+00,   4.09900000e+04],
       [  3.00000000e+00,   4.44000000e+04]], dtype=float32)

There seems to be no difference between the two arrays:

In [66]: y
Out[66]: 
array([(2.0, -1000000000.0), (2.0, 400.0), (2.0, 804846.0), (2.0, 800.0),
       (5.0, 900.0), (5.0, 1000.0), (5.0, 8900.0), (5.0, 11400.0),
       (3.0, 14500.0), (3.0, 40550.0), (3.0, 40990.0), (3.0, 44400.0)], 
      dtype={'names':['f2','f3'], 'formats':['<f4','<f4'], 'offsets':[4,8], 'itemsize':12})

In [67]: np.array(y)
Out[67]: 
array([(2.0, -1000000000.0), (2.0, 400.0), (2.0, 804846.0), (2.0, 800.0),
       (5.0, 900.0), (5.0, 1000.0), (5.0, 8900.0), (5.0, 11400.0),
       (3.0, 14500.0), (3.0, 40550.0), (3.0, 40990.0), (3.0, 44400.0)], 
      dtype={'names':['f2','f3'], 'formats':['<f4','<f4'], 'offsets':[4,8], 'itemsize':12})

This answer is a bit long and rambling. I started with what I knew from previous work on taking array views, and then tried to relate that to your functions.

================

In your case, all fields are 4 bytes long, both floats and ints. I can then view it as all ints or all floats:

In [1431]: x
Out[1431]: 
array([(22, 2.0, -1000000000.0, 2000), (22, 2.0, 400.0, 2000),
       (22, 2.0, 804846.0, 2000), (44, 2.0, 800.0, 4000),
       (55, 5.0, 900.0, 5000), (55, 5.0, 1000.0, 5000),
       (55, 5.0, 8900.0, 5000), (55, 5.0, 11400.0, 5000),
       (33, 3.0, 14500.0, 3000), (33, 3.0, 40550.0, 3000),
       (33, 3.0, 40990.0, 3000), (33, 3.0, 44400.0, 3000)], 
      dtype=[('f1', '<i4'), ('f2', '<f4'), ('f3', '<f4'), ('f4', '<i4')])
In [1432]: x.view('i4')
Out[1432]: 
array([        22, 1073741824, -831624408,       2000,         22,
       1073741824, 1137180672,       2000,         22, 1073741824,
       1229225696,       2000,         44, 1073741824, 1145569280,
      ....     3000])
In [1433]: x.view('f4')
Out[1433]: 
array([  3.08285662e-44,   2.00000000e+00,  -1.00000000e+09,
         2.80259693e-42,   3.08285662e-44,   2.00000000e+00,
  ....   4.20389539e-42], dtype=float32)

This view is 1d. I can reshape it and slice out the 2 float columns:

In [1434]: x.shape
Out[1434]: (12,)
In [1435]: x.view('f4').reshape(12,-1)
Out[1435]: 
array([[  3.08285662e-44,   2.00000000e+00,  -1.00000000e+09,
          2.80259693e-42],
       [  3.08285662e-44,   2.00000000e+00,   4.00000000e+02,
          2.80259693e-42],
         ...
       [  4.62428493e-44,   3.00000000e+00,   4.44000000e+04,
          4.20389539e-42]], dtype=float32)

In [1437]: x.view('f4').reshape(12,-1)[:,1:3]
Out[1437]: 
array([[  2.00000000e+00,  -1.00000000e+09],
       [  2.00000000e+00,   4.00000000e+02],
       [  2.00000000e+00,   8.04846000e+05],
       [  2.00000000e+00,   8.00000000e+02],
       ...
       [  3.00000000e+00,   4.44000000e+04]], dtype=float32)

That this is a view can be verified by doing a bit of in-place math and seeing the results in x:

In [1439]: y=x.view('f4').reshape(12,-1)[:,1:3]
In [1440]: y[:,0] += .5
In [1441]: y
Out[1441]: 
array([[  2.50000000e+00,  -1.00000000e+09],
       [  2.50000000e+00,   4.00000000e+02],
       ...
       [  3.50000000e+00,   4.44000000e+04]], dtype=float32)
In [1442]: x
Out[1442]: 
array([(22, 2.5, -1000000000.0, 2000), (22, 2.5, 400.0, 2000),
       (22, 2.5, 804846.0, 2000), (44, 2.5, 800.0, 4000),
       (55, 5.5, 900.0, 5000), (55, 5.5, 1000.0, 5000),
       (55, 5.5, 8900.0, 5000), (55, 5.5, 11400.0, 5000),
       (33, 3.5, 14500.0, 3000), (33, 3.5, 40550.0, 3000),
       (33, 3.5, 40990.0, 3000), (33, 3.5, 44400.0, 3000)], 
      dtype=[('f1', '<i4'), ('f2', '<f4'), ('f3', '<f4'), ('f4', '<i4')])

If the field sizes differed, this might be impossible, for example if the floats were 8 bytes. The key is picturing how the structured data is stored, and imagining whether that can be viewed as a simple dtype of multiple columns. The field choice also has to be equivalent to a basic slice. Working with ['f1','f4'] would be equivalent to advanced indexing with [:,[0,3]], which has to be a copy.
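The distinction between a basic slice and advanced indexing can be checked directly. The following is a minimal sketch on a shortened version of x (the three sample rows are made up for illustration). It also shows that, for this particular layout only, a strided basic slice such as [:, ::3] happens to reach columns 0 and 3 without copying; that is an observation about this dtype, not a general rule.

```python
import numpy as np

dt = np.dtype([('f1', '<i4'), ('f2', '<f4'), ('f3', '<f4'), ('f4', '<i4')])
# Shortened, illustrative stand-in for the x in the question.
x = np.array([(22, 2, 400, 2000), (44, 2, 800, 4000), (55, 5, 900, 5000)],
             dtype=dt)

# Every field is 4 bytes wide, so each record can be seen as 4 columns.
floats = x.view('<f4').reshape(len(x), -1)
ints = x.view('<i4').reshape(len(x), -1)

basic = floats[:, 1:3]    # basic slice over adjacent columns: a view
fancy = ints[:, [0, 3]]   # advanced (list) indexing: a copy
strided = ints[:, ::3]    # basic slice with a step: still a view

print(np.shares_memory(x, basic))
print(np.shares_memory(x, fancy))
print(np.shares_memory(x, strided))
```

Only the list-indexed result owns separate memory; the two slices write through to x.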

==========

The 'direct' field indexing is:

z = x[['f2','f3']].view('f4').reshape(12,-1)
z -= .5

This modifies z, but with a FutureWarning. Also, it does not modify x; z has become a copy. I can also see this by looking at z.__array_interface__['data'], the data buffer location (and comparing it with that of x and y).
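The copy behaviour described here depends on the NumPy version: from NumPy 1.16 onward, indexing with a list of field names returns a view that keeps the original offsets and itemsize. On those versions, numpy.lib.recfunctions.structured_to_unstructured is one way to get a plain 2-D array from such a subset. The sketch below uses made-up sample rows and deliberately only prints whether the result shares memory with x, since whether that function returns a view or a copy varies between NumPy releases.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

dt = np.dtype([('f1', '<i4'), ('f2', '<f4'), ('f3', '<f4'), ('f4', '<i4')])
# Illustrative sample rows, not the full array from the question.
x = np.array([(22, 2, 400, 2000), (44, 2, 800, 4000), (55, 5, 900, 5000)],
             dtype=dt)

subset = x[['f2', 'f3']]                      # multi-field index
z = rfn.structured_to_unstructured(subset)    # plain (n, 2) float32 array

print(z.shape, z.dtype)
# Version dependent: do not rely on this being True.
print(np.shares_memory(x, z))
```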

=================

Your fields_view does create a structured view:

In [1480]: w=fields_view(x,['f2','f3'])
In [1481]: w.__array_interface__['data']
Out[1481]: (151950184, False)
In [1482]: x.__array_interface__['data']
Out[1482]: (151950184, False)

which can be used to modify x, e.g. w['f2'] -= .5. So it is more versatile than the 'direct' x[['f2','f3']].

The w dtype is

dtype({'names':['f2','f3'], 'formats':['<f4','<f4'], 'offsets':[4,8], 'itemsize':12})

Adding print(shape, typ, offsets, strides) to your recarr_to_ndarr, I get (py3):

In [1499]: recarr_to_ndarr(w,'<f4')
(12, 2) <f4 [4, 8] (16, 4)
....
ValueError: ndarray is not contiguous

In [1500]: np.ndarray(shape=(12,2), dtype='<f4', buffer=w.data, offset=4, strides=(16,4))
...
BufferError: memoryview: underlying buffer is not contiguous

That contiguity problem must be referring to the values shown in w.flags:

In [1502]: w.flags
Out[1502]: 
  C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  ....

It's interesting that w.dtype.descr converts the 'offsets' into an unnamed field:

In [1506]: w.__array_interface__
Out[1506]: 
{'data': (151950184, False),
 'descr': [('', '|V4'), ('f2', '<f4'), ('f3', '<f4')],
 'shape': (12,),
 'strides': (16,),
 'typestr': '|V12',
 'version': 3}

One way or another, w has a non-contiguous data buffer, which can't be used to create a new array. Flattened, the data buffer looks something like

xoox|xoox|xoox|...   
# x 4 bytes we want to skip
# o 4 bytes we want to use
# | invisible boundary between records in x
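The same picture can be derived from the dtype itself. This is a small sketch that reads the byte offsets of the wanted fields and marks each 4-byte word of one record as used (o) or skipped (x); the variable names are illustrative only.

```python
import numpy as np

dt = np.dtype([('f1', '<i4'), ('f2', '<f4'), ('f3', '<f4'), ('f4', '<i4')])
wanted = ['f2', 'f3']

# dt.fields maps each name to (field dtype, byte offset within the record).
offsets = [dt.fields[name][1] for name in wanted]
sizes = [dt.fields[name][0].itemsize for name in wanted]
record_stride = dt.itemsize   # bytes from the start of one record to the next

# Mark each 4-byte word of a record: 'o' if a wanted field covers it, else 'x'.
layout = ''.join(
    'o' if any(off <= byte < off + size for off, size in zip(offsets, sizes))
    else 'x'
    for byte in range(0, dt.itemsize, 4)
)
print(offsets, record_stride, layout)
```

The offsets and the record stride are exactly the numbers that end up as the offset and strides arguments in the ndarray calls that follow.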

The y I constructed above has:

In [1511]: y.__array_interface__
Out[1511]: 
{'data': (151950188, False),
 'descr': [('', '<f4')],
 'shape': (12, 2),
 'strides': (16, 4),
 'typestr': '<f4',
 'version': 3}

So it accesses the o bytes with a 4 byte offset, (16,4) strides, and a (12,2) shape.

If I modify your ndarray call to use the original x.data, it works:

In [1514]: xx=np.ndarray(shape=(12,2), dtype='<f4', buffer=x.data, offset=4, strides=(16,4))
In [1515]: xx
Out[1515]: 
array([[  2.00000000e+00,  -1.00000000e+09],
       [  2.00000000e+00,   4.00000000e+02],
           ....
       [  3.00000000e+00,   4.44000000e+04]], dtype=float32)

with the same __array_interface__ as my y:

In [1516]: xx.__array_interface__
Out[1516]: 
{'data': (151950188, False),
 'descr': [('', '<f4')],
 'shape': (12, 2),
 'strides': (16, 4),
 'typestr': '<f4',
 'version': 3}
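The hard-coded shape, offset and strides above can be computed from the parent array's dtype. The following is a sketch of that idea, not a drop-in replacement for recarr_to_ndarr: fields_as_ndarray is a hypothetical helper name, it assumes a 1-D C-contiguous parent whose selected fields all share one dtype and are evenly spaced, and it passes a uint8 view of the parent as the buffer rather than x.data.

```python
import numpy as np

def fields_as_ndarray(parent, fields, typ):
    """Hypothetical helper: view evenly spaced, same-typed fields as 2-D."""
    typ = np.dtype(typ)
    if parent.ndim != 1 or not parent.flags['C_CONTIGUOUS']:
        raise ValueError("expected a 1-D, C-contiguous structured array")
    if any(parent.dtype.fields[name][0] != typ for name in fields):
        raise ValueError("every selected field must have dtype %s" % typ)
    offsets = [parent.dtype.fields[name][1] for name in fields]
    steps = set(np.diff(offsets).tolist())
    if len(steps) > 1:
        raise ValueError("fields are not evenly spaced in memory")
    step = steps.pop() if steps else typ.itemsize
    raw = parent.view(np.uint8)   # contiguous byte view of the whole buffer
    return np.ndarray(shape=parent.shape + (len(fields),), dtype=typ,
                      buffer=raw, offset=offsets[0],
                      strides=parent.strides + (step,))

dt = np.dtype([('f1', '<i4'), ('f2', '<f4'), ('f3', '<f4'), ('f4', '<i4')])
# Illustrative sample rows, not the full array from the question.
x = np.array([(22, 2, 400, 2000), (44, 2, 800, 4000), (55, 5, 900, 5000)],
             dtype=dt)

xx = fields_as_ndarray(x, ['f2', 'f3'], '<f4')
xx[:, 0] += 0.5       # writes through to x['f2']
print(xx.strides, x['f2'])
```

Building from the parent's full buffer sidesteps the non-contiguous buffer of w, at the cost of needing access to the parent array.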

hpaulj was right in saying that the problem is that the subset of the structured array is not contiguous. Interestingly, I figured out a way to make the array subset contiguous with the following function:

def view_fields(a, fields):
    """
    `a` must be a numpy structured array.
    `fields` is the collection of field names to keep.

    Returns a view of the array `a` (not a copy).
    """
    dt = a.dtype
    formats = [dt.fields[name][0] for name in fields]
    offsets = [dt.fields[name][1] for name in fields]
    # Keep the itemsize of `a`, so each record still spans the full stride.
    itemsize = a.dtype.itemsize
    newdt = np.dtype(dict(names=fields,
                          formats=formats,
                          offsets=offsets,
                          itemsize=itemsize))
    b = a.view(newdt)
    return b

In [5]: view_fields(x,['f2','f3']).flags
Out[5]: 
  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False

The old function:

In [10]: fields_view(x,['f2','f3']).flags
Out[10]: 
  C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  UPDATEIFCOPY : False
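Because the subset returned by view_fields keeps the 16-byte itemsize and stays contiguous, it can be turned into a plain 2-D array using only its own dtype, which fits the case where the subset is stored as a class attribute. The sketch below assumes all kept fields are 4-byte floats in adjacent columns; the sample rows and the names width, cols and plain are illustrative.

```python
import numpy as np

def view_fields(a, fields):
    """View of structured array `a` exposing only `fields` (no copy)."""
    dt = a.dtype
    newdt = np.dtype(dict(names=list(fields),
                          formats=[dt.fields[n][0] for n in fields],
                          offsets=[dt.fields[n][1] for n in fields],
                          itemsize=dt.itemsize))
    return a.view(newdt)

dt = np.dtype([('f1', '<i4'), ('f2', '<f4'), ('f3', '<f4'), ('f4', '<i4')])
# Illustrative sample rows, not the full array from the question.
x = np.array([(22, 2, 400, 2000), (44, 2, 800, 4000), (55, 5, 900, 5000)],
             dtype=dt)

b = view_fields(x, ['f2', 'f3'])
print(b.flags['C_CONTIGUOUS'], b.dtype.itemsize)

# Column index of each kept field when a record is read as 4-byte words.
width = np.dtype('<f4').itemsize
cols = [b.dtype.fields[n][1] // width for n in b.dtype.names]

# View the padded records as float32 words, then slice the kept columns.
plain = b.view('<f4').reshape(len(b), -1)[:, cols[0]:cols[-1] + 1]
plain[:, 1] *= 2      # writes through to x['f3']
print(plain.shape, x['f3'])
```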
