How to pass numpy logic functions to Cython correctly?

What declarations should I be incorporating with a logic function / index operation so that Cython does the heavy lifting?

I have two large rasters in the form of numpy arrays of equal size. The first array contains vegetation index values and the second array contains field IDs. The goal is to average vegetation index values by field. Both arrays have pesky nodata values (-9999) that I would like to ignore.

Currently the function takes over 60 seconds to execute, which normally I wouldn't mind so much, but I'll be processing potentially hundreds of images. Even a 30 second improvement would be significant. So I've been exploring Cython as a way to help speed things up. I've been using the Cython numpy tutorial as a guide.

Example data
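The example data itself is not reproduced here; a hypothetical stand-in with the same structure (equal-sized int16 rasters with -9999 as the nodata value; sizes and file names are illustrative) could be generated like this:

import numpy as np

rng = np.random.default_rng(0)
shape = (5000, 5000)                                              # illustrative raster size
ndvi_array = rng.integers(0, 10000, size=shape, dtype=np.int16)   # fake NDVI values
field_array = rng.integers(0, 300, size=shape, dtype=np.int16)    # fake field IDs
ndvi_array[rng.random(shape) < 0.05] = -9999                      # scatter some nodata pixels
field_array[rng.random(shape) < 0.05] = -9999
np.save("ndvi.npy", ndvi_array)
np.save("field_array.npy", field_array)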

test_cy.pyx code:

import numpy as np
cimport numpy as np
cimport cython
@cython.boundscheck(False)  # turn off bounds-checking for entire function
@cython.wraparound(False)   # turn off negative index wrapping for entire function
cpdef test():
    cdef np.ndarray[np.int16_t, ndim=2] ndvi_array = np.load("Z:cython_test/data/ndvi.npy")
    cdef np.ndarray[np.int16_t, ndim=2] field_array = np.load("Z:cython_test/data/field_array.npy")

    # unique field IDs, with the nodata value removed
    cdef np.ndarray[np.int16_t, ndim=1] unique_field = np.unique(field_array)
    unique_field = unique_field[unique_field != -9999]

    cdef int field_id
    cdef np.ndarray[np.int16_t, ndim=1] f_ndvi_values
    cdef double f_avg

    for field_id in unique_field:
        # valid NDVI pixels belonging to this field, then their mean
        f_ndvi_values = ndvi_array[np.logical_and(field_array == field_id, ndvi_array != -9999)]
        f_avg = np.mean(f_ndvi_values)

Setup.py code:

try:
    from setuptools import setup
    from setuptools import Extension
except ImportError:
    from distutils.core import setup
    from distutils.extension import Extension

from Cython.Build import cythonize
import numpy

setup(ext_modules = cythonize('test_cy.pyx'),
      include_dirs=[numpy.get_include()])
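
For completeness, a minimal sketch of the usual build-and-run step (standard Cython workflow; the timing wrapper is illustrative and not part of the original post):

# Build the extension in place first:
#   python setup.py build_ext --inplace
import time
import test_cy

start = time.time()
test_cy.test()
print("test() took", time.time() - start, "seconds")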

After some research and running:

cython -a test_cy.pyx

It seems the index operation ndvi_array[np.logical_and(field_array == field_id, ndvi_array != -9999)] is the bottleneck and is still relying on Python. I suspect I'm missing some vital declarations here. Including ndim didn't have any effect.
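For reference, a minimal sketch (not from the original post) of what pushing this work into typed loops could look like: one explicit pass over both rasters with indexed access, so the per-pixel work compiles to C instead of building a boolean mask per field. The function name is illustrative, and it assumes field IDs are nonnegative int16 values with -9999 as nodata:

import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef dict field_means(np.ndarray[np.int16_t, ndim=2] ndvi,
                       np.ndarray[np.int16_t, ndim=2] fields):
    # Accumulate per-field sums and counts into arrays indexed by field ID
    # (2**15 slots cover every nonnegative int16 value).
    cdef np.ndarray[np.float64_t, ndim=1] sums = np.zeros(2**15, dtype=np.float64)
    cdef np.ndarray[np.int64_t, ndim=1] counts = np.zeros(2**15, dtype=np.int64)
    cdef Py_ssize_t i, j
    cdef int fid, v
    for i in range(ndvi.shape[0]):
        for j in range(ndvi.shape[1]):
            fid = fields[i, j]
            v = ndvi[i, j]
            if fid != -9999 and v != -9999:
                sums[fid] += v
                counts[fid] += 1
    result = {}
    for fid in range(2**15):
        if counts[fid] > 0:
            result[fid] = sums[fid] / counts[fid]   # mean NDVI for this field
    return result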

I'm fairly new to numpy as well, so I'm probably missing something obvious.

Your problem looks fairly vectorizable to me, so Cython might not be the best approach. (Cython shines when there are unavoidable fine-grained loops.) As your dtype is int16, there is only a limited range of possible labels, so using np.bincount should be fairly efficient. Try something like the following. This assumes all your valid values are >= 0; if that is not the case you'd have to shift them, or (cheaper) view-cast to uint16, before using bincount (since we are not doing any arithmetic on the labels, that should be safe):

mask = (ndvi_array != -9999) & (field_array != -9999)   # keep pixels valid in both rasters
nd = ndvi_array[mask]
fi = field_array[mask]
counts = np.bincount(fi, minlength=2**15)        # number of valid pixels per field ID
sums = np.bincount(fi, nd, minlength=2**15)      # sum of NDVI values per field ID (weights)
valid = counts != 0                              # field IDs that actually occur
avgs = sums[valid] / counts[valid]               # per-field mean NDVI
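
If the averages need to stay paired with their field IDs, a small follow-up along these lines would do it (variable names are illustrative, reusing valid and avgs from the snippet above):

import numpy as np

# Assuming `valid` and `avgs` from the snippet above:
field_ids = np.nonzero(valid)[0]                          # field IDs that actually occur
field_avgs = dict(zip(field_ids.tolist(), avgs.tolist()))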
