Different results for NumPy percentile and TensorFlow percentile with the "nearest" interpolation method

Question

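I've noticed that even though NumPy's numpy.percentile and TensorFlow Probability's tfp.stats.percentile give the same docstring explanation for their "nearest" interpolation method

This optional parameter specifies the interpolation method to use when the desired percentile lies between two data points i < j:

...

'nearest': i or j, whichever is nearest.

they give different results. Below is a minimal working example of what I mean.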

Environment

$ "$(which python3)" --version
Python 3.7.5
$ python3 -m venv "${HOME}/.venvs/question"
$ . "${HOME}/.venvs/question/bin/activate"
(question) $ cat requirements.txt
numpy~=1.18
tensorflow~=2.1
tensorflow-probability~=0.9
black
(question) $ python -m pip install -r requirements.txt

Code

# question.py
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp


def main():
    a = np.array([[10.0, 7.0, 4.0], [3.0, 2.0, 1.0]])
    q = 50
    print(f"Flattened array: {a.flatten()}")
    print("NumPy:")
    print(f"\t{q}th percentile (linear): {np.percentile(a, q, interpolation='linear')}")
    print(
        f"\t{q}th percentile (nearest): {np.percentile(a, q, interpolation='nearest')}"
    )

    b = tf.convert_to_tensor(a)
    print("TensorFlow:")
    print(
        f"\t{q}th percentile (linear): {tfp.stats.percentile(b, q, interpolation='linear')}"
    )
    print(
        f"\t{q}th percentile (nearest): {tfp.stats.percentile(b, q, interpolation='nearest')}"
    )


if __name__ == '__main__':
    main()

When run, the "nearest" interpolation method gives different results

(question) $ python question.py
Flattened array: [10.  7.  4.  3.  2.  1.]
NumPy:
    50th percentile (linear): 3.5
    50th percentile (nearest): 3.0
TensorFlow:
    50th percentile (linear): 3.5
    50th percentile (nearest): 4.0
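
To make the ambiguity concrete, here is a minimal sketch in plain Python (no NumPy or TensorFlow) of where the 50th percentile falls in this data. The helper name neighbours is hypothetical and the position formula (N - 1) * q / 100 is an assumption about how both libraries locate the percentile. The point is only that the requested position sits exactly halfway between two data points, so "whichever is nearest" is a tie.

```python
import math


def neighbours(values, q):
    """Hypothetical helper: fractional position of the q-th percentile
    in an ascending copy of values, plus the data points on either side."""
    ordered = sorted(values)  # ascending: [1, 2, 3, 4, 7, 10]
    position = (len(ordered) - 1) * q / 100  # fractional index
    lower = ordered[math.floor(position)]  # data point i
    upper = ordered[math.ceil(position)]  # data point j
    return position, lower, upper


print(neighbours([10.0, 7.0, 4.0, 3.0, 2.0, 1.0], 50))  # (2.5, 3.0, 4.0)
```

Both 3.0 and 4.0 are equally near position 2.5, so either answer is consistent with the docstring.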

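After looking through the NumPy v1.18.2 source for the function that numpy.percentile is calling, I'm still confused as to why this is. It seems to be due to rounding decisions (given that NumPy uses numpy.around and TFP uses tf.round).

Can someone explain to me what is causing the difference? I'd like to write a shim for these functions, but I need to understand the return behavior.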

Answer 1

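Stepping through the source of both, it seems that it isn't a rounding issue as I first thought. Instead, numpy.percentile does the final evaluation on an ndarray sorted in ascending order, while tfp.stats.percentile does it on a tensor sorted in descending order.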

# answer.py
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp
from tensorflow_probability.python.internal import tensorshape_util
from tensorflow_probability.python.internal import distribution_util


def numpy_src(input, q, axis=0, out=None):
    a = input
    q = np.true_divide(q, 100)  # 0.5
    q = np.asanyarray(q)  # array(0.5)
    q = q[None]  # array([0.5])
    ap = a.flatten()  # array([10.,  7.,  4.,  3.,  2.,  1.])
    Nx = ap.shape[axis]  # 6
    indices = q * (Nx - 1)  # array([2.5])
    indices = np.around(indices).astype(np.intp)  # array([2])
    ap.partition(indices, axis=axis)  # array([ 1.,  2.,  3.,  4.,  7., 10.])
    indices = indices[0]  # 2
    r = np.take(ap, indices, axis=axis, out=out)  # 3.0
    print(f"Result of np.percentile source: {r}")


def tensorflow_src(input, q=50, axis=None):
    x = input
    name = "percentile"
    interpolation = "nearest"
    q = tf.cast(q, tf.float64)  # tf.Tensor(50.0, shape=(), dtype=float64)
    if axis is None:
        y = tf.reshape(
            x, [-1]
        )  # tf.Tensor([10.  7.  4.  3.  2.  1.], shape=(6,), dtype=float64)
    frac_at_q_or_above = 1.0 - q / 100.0  # tf.Tensor(0.5, shape=(), dtype=float64)
    # _sort_tensor(y)
    # N.B. Here is the difference. Note the sort order is never changed
    sorted_y, _ = tf.math.top_k(
        y, k=tf.shape(y)[-1]
    )  # tf.Tensor([10.  7.  4.  3.  2.  1.], shape=(6,), dtype=float64), _
    tensorshape_util.set_shape(
        sorted_y, y.shape
    )  # tf.Tensor([10.  7.  4.  3.  2.  1.], shape=(6,), dtype=float64)
    d = tf.cast(tf.shape(y)[-1], tf.float64)  # tf.Tensor(6.0, shape=(), dtype=float64)
    # _get_indices(interpolation)
    indices = tf.round(
        (d - 1) * frac_at_q_or_above
    )  # tf.Tensor(2.0, shape=(), dtype=float64)
    indices = tf.clip_by_value(
        tf.cast(indices, tf.int32), 0, tf.shape(y)[-1] - 1
    )  # tf.Tensor(2, shape=(), dtype=int32)
    # N.B. The sort order here is descending, causing a difference
    gathered_y = tf.gather(
        sorted_y, indices, axis=-1
    )  # tf.Tensor(4.0, shape=(), dtype=float64)
    result = distribution_util.rotate_transpose(gathered_y, tf.rank(q))  # 4.0
    print(f"Result of tf.percentile source: {result}")


def main():
    np_in = np.array([[10.0, 7.0, 4.0], [3.0, 2.0, 1.0]])
    numpy_src(np_in, q=50)
    tf_in = tf.convert_to_tensor(np_in)
    tensorflow_src(tf_in, q=50)


if __name__ == "__main__":
    main()

which when run gives

$ python answer.py 
Result of np.percentile source: 3.0
Result of tf.percentile source: 4.0

If, instead, the following is added to TensorFlow Probability's percentile so that the evaluation uses an ascending sort order

sorted_y = tf.reverse(
    sorted_y, [-1]
)  # tf.Tensor([ 1.  2.  3.  4.  7. 10.], shape=(6,), dtype=float64)

then the two results are the same

$ python answer.py 
Result of np.percentile source: 3.0
Result of tf.percentile source: 3.0
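
The effect of the sort order can also be reproduced without either library. The following is a minimal sketch, not the code of NumPy or TFP: nearest_percentile and its descending flag are hypothetical names, and the index formulas are copied from the traces in answer.py. Python's built-in round() rounds halves to the nearest even integer, as numpy.around and tf.round do, so the rounding step is identical in both branches and only the sort order differs.

```python
def nearest_percentile(values, q, descending=False):
    """Hypothetical helper: 'nearest' percentile picked from a sorted copy."""
    ordered = sorted(values, reverse=descending)
    n = len(ordered)
    # Ascending order walks q/100 of the way from the start; the descending
    # variant walks 1 - q/100 of the way, mirroring the traced TFP code.
    frac = (1.0 - q / 100.0) if descending else q / 100.0
    # round() rounds halves to even, so 2.5 becomes 2 in both branches.
    index = int(round((n - 1) * frac))
    index = max(0, min(index, n - 1))  # clip to the valid index range
    return ordered[index]


data = [10.0, 7.0, 4.0, 3.0, 2.0, 1.0]
print(nearest_percentile(data, 50))                   # 3.0
print(nearest_percentile(data, 50, descending=True))  # 4.0
```

In both cases the rounded index is 2. It lands on 3.0 in the ascending copy and on 4.0 in the descending copy.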

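Given that TensorFlow Probability's docstring says

Given a vector x, the q-th percentile of x is the value q / 100 of the way from the minimum to the maximum in a sorted copy of x.

this seems to be wrong, as it does the opposite. I've opened TensorFlow Probability Issue 864 to discuss this.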
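
For anyone writing a shim, it may help to know when the two index rules can disagree at all. The sketch below is a model under stated assumptions, not a statement about either library: disagreeing_percentiles is a hypothetical helper, it only considers integer q, and it uses exact fractions, whereas the real implementations work in floating point and may differ in edge cases. Under those assumptions the two rules differ only when (N - 1) * q / 100 is an exact tie (x.5) and N - 1 is odd.

```python
from fractions import Fraction


def disagreeing_percentiles(n):
    """Hypothetical helper: integer q in [0, 100] for which rounding the
    index from the ascending end and from the descending end of a
    length-n array selects different elements."""
    result = []
    for q in range(101):
        position = Fraction((n - 1) * q, 100)  # exact fractional index
        # round() on a Fraction rounds halves to the nearest even integer.
        ascending_index = round(position)
        # Index counted from the largest element, mapped back to ascending order.
        descending_index = (n - 1) - round((n - 1) - position)
        if ascending_index != descending_index:
            result.append(q)
    return result


print(disagreeing_percentiles(6))  # [10, 30, 50, 70, 90]
print(disagreeing_percentiles(5))  # []
```

For the six-element array used here, the model predicts a mismatch at q = 50, which matches the observed 3.0 versus 4.0.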

Answer 1: accepted, 2020-04-04 22:00:40
