How to create colormap of confidence estimates for k-Nearest Neighbor Classification

What I want:

I want to display the results of my simple classification algorithm (see below) as a colormap in Python (the data is 2D), where each class is assigned a color and the confidence of a prediction anywhere on the 2D map is proportional to the saturation of the color associated with the predicted class. The image below roughly illustrates what I want for a binary (two-class) problem: the red parts suggest strong confidence in class 1, whereas the blue parts speak for class 2. The intermediate colors suggest uncertainty about either. Obviously I want the color scheme to generalize to multiple classes, so I would need many colors, and the scale would then go from white (uncertainty) to the fully saturated color associated with a class.

illustration http://www.nicolacarlon.it/out.png

Some Sample Code:

My sample code just uses a simple kNN algorithm where the nearest k data points are allowed to 'vote' on the class of a new point on the map. The confidence of the prediction is simply the relative frequency of the winning class among the k votes. I haven't dealt with ties, and I know there are better probabilistic versions of this method, but all I want is to visualize my data to show a viewer the chances of a class being in a particular part of the 2D plane.

import numpy as np
import matplotlib.pyplot as plt


# Generate some training data from three classes
n = 100 # Number of covariates (sample points) for each class in training set. 
mean1, mean2, mean3 = [-1.5,0], [1.5, 0], [0,1.5]
cov1, cov2, cov3 = [[1,0],[0,1]], [[1,0],[0,1]], [[1,0],[0,1]]
X1 = np.asarray(np.random.multivariate_normal(mean1,cov1,n))
X2 = np.asarray(np.random.multivariate_normal(mean2,cov2,n))
X3 = np.asarray(np.random.multivariate_normal(mean3,cov3,n))


plt.plot(X1[:,0], X1[:,1], 'ro', X2[:,0], X2[:,1], 'bo', X3[:,0], X3[:,1], 'go' )

plt.axis('equal'); plt.show() #Display training data


# Prepare the data set as a 3n*3 array where each row is a data point and its associated class
D = np.zeros((3*n,3))
D[0:n,0:2] = X1; D[0:n,2] = 1
D[n:2*n,0:2] = X2; D[n:2*n,2] = 2
D[2*n:3*n,0:2] = X3; D[2*n:3*n,2] = 3

def kNN(x, D, k=3):
    x = np.asarray(x)
    dist = np.linalg.norm(x-D[:,0:2], axis=1)
    i = dist.argsort()[:k] #Return k indices of smallest to highest entries
    counts = np.bincount(D[i,2].astype(int))
    predicted_class = np.argmax(counts) 
    confidence = float(np.max(counts))/k
    return predicted_class, confidence 

print(kNN([-2,0], D, 20))

So, you can calculate two numbers for each point in the 2D plane:

  • confidence (0 .. 1)
  • class (an integer)

One possibility is to calculate your own RGB map and show it with imshow. Like this:

import numpy as np
import matplotlib.pyplot as plt

# color vector with N x 3 colors, where N is the maximum number of classes and the colors are in RGB
mycolors = np.array([
  [ 0, 0, 1],
  [ 0, 1, 0],
  [ 1, 0, 1],
  [ 1, 1, 0],
  [ 0, 1, 1],
  [ 0, 0, 0],
  [ 0, .5, 1]])

# negate the colors
mycolors = 1 - mycolors 

# extents of the area
x0 = -2
x1 = 2
y0 = -2
y1 = 2

# grid over the area
X, Y = np.meshgrid(np.linspace(x0, x1, 1000), np.linspace(y0, y1, 1000))

# calculate the classification and probabilities
classes = classify_func(X, Y)
probabilities = prob_func(X, Y)

# create the basic color map by the class
img = mycolors[classes]

# fade the color by the probability (black for zero prob)
img *= probabilities[:,:,None]

# reverse the negative image back
img = 1 - img

# draw it
plt.imshow(img, extent=[x0,x1,y0,y1], origin='lower')
plt.axis('equal')

# save it
plt.savefig("mymap.png")

The trick of making negative colors is there just to make the maths a bit easier to understand. The code can of course be written much more densely.
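As a rough sketch of the denser form, the negate/multiply/negate steps collapse into a single blend between white and the class color. The function name blend_to_white and the tiny input arrays are made up for illustration; only NumPy is assumed.

```python
import numpy as np

def blend_to_white(class_colors, classes, probabilities):
    """Hypothetical helper: white at probability 0, full class color at 1."""
    class_colors = np.asarray(class_colors, dtype=float)
    # Same arithmetic as: negate the colors, scale by probability, negate back
    return 1.0 - (1.0 - class_colors[classes]) * probabilities[..., None]

# Tiny illustrative inputs (assumed, not taken from the post)
colors = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 1]])
classes = np.array([[0, 1], [2, 0]])
probabilities = np.array([[1.0, 0.5], [0.0, 0.25]])
img = blend_to_white(colors, classes, probabilities)
```

The result has shape (rows, columns, 3) and can be passed to imshow in the same way as img in the longer version.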

I created two very simple functions to mimic the classification and probabilities:

def classify_func(X, Y):
    return np.round(abs(X+Y)).astype('int')

def prob_func(X,Y):
    return 1 - 2*abs(abs(X+Y)-classify_func(X,Y))

For the given area, the former gives integer values from 0 to 4, and the latter gives smoothly changing probabilities.

The result:

[image]

If you do not like the way the colors fade towards zero probability, you can always create some non-linearity which is then applied when multiplying with the probabilities.
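One possible non-linearity is a simple power law. This is only a sketch: the helper name shape_probabilities and the default exponent are assumptions, and any monotonic curve that maps 0 to 0 and 1 to 1 would serve.

```python
import numpy as np

def shape_probabilities(probabilities, gamma=0.5):
    """Hypothetical helper: power-law curve on values in 0..1.
    gamma < 1 keeps the colors saturated longer, gamma > 1 fades them sooner."""
    p = np.clip(np.asarray(probabilities, dtype=float), 0.0, 1.0)
    return p ** gamma

probabilities = np.array([[0.0, 0.25], [0.64, 1.0]])
shaped = shape_probabilities(probabilities, gamma=0.5)
# shaped would then take the place of probabilities in:
#   img *= shaped[:, :, None]
```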


Here the functions classify_func and prob_func are given two arrays as arguments, the first being the X coordinates where the values are to be calculated, and the second the Y coordinates. This works well if the underlying calculations are fully vectorized. With the code in the question this is not the case, as it only calculates single values.

In that case the code changes slightly:

x = np.linspace(x0, x1, 1000)
y = np.linspace(y0, y1, 1000)
classes = np.empty((len(y), len(x)), dtype='int')
probabilities = np.empty((len(y), len(x)))
# evaluate the single-point classifier at every grid node (rows follow y, columns follow x)
for yi, yv in enumerate(y):
    for xi, xv in enumerate(x):
        classes[yi, xi], probabilities[yi, xi] = kNN((xv, yv), D)

Also, as your confidence estimates do not span the full range 0..1 (the winning class always receives at least one vote, so the confidence never reaches zero), they need to be scaled:

probabilities -= np.amin(probabilities)
probabilities /= np.amax(probabilities)
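A minimal sketch of the same rescaling wrapped in a function, with a guard for the degenerate case where all confidences are equal (the two lines above would divide zero by zero there). The name rescale_unit and the choice to return all ones in that case are assumptions for illustration.

```python
import numpy as np

def rescale_unit(values):
    """Hypothetical helper: stretch values linearly so they span 0..1."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:
        # Constant input has no spread to stretch; treating it as
        # "fully confident" everywhere is an assumption of this sketch
        return np.ones_like(values)
    return (values - lo) / (hi - lo)

# With k=3 the kNN confidence can only be 1/3, 2/3 or 1
scaled = rescale_unit(np.array([[1/3, 2/3], [1.0, 2/3]]))
```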

After this is done, your map should look like this with extents -4,-4..4,4 (as per the color map: green=1, magenta=2, yellow=3):

[image: kNN map]


To vectorize or not to vectorize - that is the question

This question pops up from time to time. There is a lot of information about vectorization on the web, but as a quick search did not reveal any short summaries, I'll give some thoughts here. This is quite a subjective matter, so everything here just represents my humble opinion. Other people may have different opinions.

There are three factors to consider:

  • performance
  • legibility
  • memory use

Usually (but not always) vectorization makes code faster, more difficult to understand, and more memory-hungry. Memory use is not usually a big problem, but with large arrays it is something to think about (hundreds of megabytes are usually OK, gigabytes are troublesome).

Trivial cases aside (simple element-wise operations, simple matrix operations), my approach is:

  • write the code without vectorization and check that it works
  • profile the code
  • vectorize the inner loops if needed and possible (1D vectorization)
  • create a 2D vectorization if it is simple

For example, a pixel-by-pixel image processing operation may lead to a situation where I end up with one-dimensional vectorizations (one for each row). Then the inner loop (over the pixels) is fast, and the outer loop (over the rows) does not really matter. The code may look much simpler if it does not try to be usable with all possible input dimensions.

I am such a lousy algorithmist that in more complex cases I like to verify my vectorized code against the non-vectorized version. Hence I almost invariably create the non-vectorized code first, before optimizing it at all.
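Such a check can be as small as the sketch below, which compares a plain loop with its vectorized counterpart on random input. The function names and the test data are invented for illustration; the distance computation merely stands in for whatever is being optimized.

```python
import numpy as np

def distances_loop(points, target):
    """Reference version: one point at a time, easy to read and trust."""
    out = np.empty(len(points))
    for i, p in enumerate(points):
        out[i] = np.sqrt((p[0] - target[0]) ** 2 + (p[1] - target[1]) ** 2)
    return out

def distances_vectorized(points, target):
    """Vectorized version: all points in a single NumPy call."""
    return np.linalg.norm(points - np.asarray(target), axis=1)

rng = np.random.default_rng(0)          # fixed seed so the check is repeatable
points = rng.normal(size=(50, 2))
target = (0.5, -1.0)

reference = distances_loop(points, target)
fast = distances_vectorized(points, target)
matches = np.allclose(reference, fast)   # compare with a floating-point tolerance
```

Comparing with np.allclose rather than == avoids false alarms from rounding differences between the two code paths.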

Sometimes vectorization does not offer any performance benefit. For example, the handy function numpy.vectorize can be used to vectorize practically any function, but its documentation states:

The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.

(This function could have been used in the code above as well. I chose the loop version for legibility for people not very familiar with numpy.)
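For comparison, a sketch of the numpy.vectorize variant. The kNN function is the one from the question; the seeded training data, the small grid, and the wrapper name knn_on_grid are assumptions made to keep the example self-contained and quick to run.

```python
import numpy as np

# Small stand-in for the training set in the question (assumed values)
rng = np.random.default_rng(0)
n = 20
D = np.zeros((3 * n, 3))
for c, mean in enumerate([[-1.5, 0], [1.5, 0], [0, 1.5]]):
    D[c * n:(c + 1) * n, 0:2] = rng.multivariate_normal(mean, np.eye(2), n)
    D[c * n:(c + 1) * n, 2] = c + 1

def kNN(x, D, k=3):
    """Single-point classifier, as in the question."""
    x = np.asarray(x)
    dist = np.linalg.norm(x - D[:, 0:2], axis=1)
    i = dist.argsort()[:k]
    counts = np.bincount(D[i, 2].astype(int))
    return np.argmax(counts), float(np.max(counts)) / k

# Wrap the scalar function; otypes declares one output array per return value
knn_on_grid = np.vectorize(lambda xv, yv: kNN((xv, yv), D), otypes=[int, float])

x = np.linspace(-2, 2, 25)
y = np.linspace(-2, 2, 20)
X, Y = np.meshgrid(x, y)
classes, probabilities = knn_on_grid(X, Y)   # both have shape (len(y), len(x))
```

This only replaces the two explicit loops; kNN is still called once per grid point, so no speed-up should be expected.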

Vectorization gives more performance only if the underlying vectorized functions are faster. Sometimes they are, sometimes they aren't. Only profiling and experience will tell. Also, it is not always necessary to vectorize everything. You may have an image processing algorithm which has both vectorized and pixel-by-pixel operations. There numpy.vectorize is very useful.

I would try to vectorize the kNN search algorithm above to at least one dimension. There is no conditional code (it wouldn't be a show-stopper, but it would complicate things), and the algorithm is rather straightforward. The memory consumption will go up, but with one-dimensional vectorization it does not matter.

And it may happen that along the way you notice that an n-dimensional generalization is not much more complicated. Then do that if memory allows.
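A one-dimensional vectorization of the kNN search could look roughly like the sketch below, which classifies a whole grid row per call. The function name kNN_row, the seeded training data, and the small grid are assumptions; the scalar kNN from the question is repeated only so that the result can be checked against it.

```python
import numpy as np

def kNN(x, D, k=3):
    """Single-point classifier from the question, kept as the reference."""
    x = np.asarray(x)
    dist = np.linalg.norm(x - D[:, 0:2], axis=1)
    i = dist.argsort()[:k]
    counts = np.bincount(D[i, 2].astype(int))
    return np.argmax(counts), float(np.max(counts)) / k

def kNN_row(xs, yv, D, k=3):
    """Hypothetical row-wise version: classify all points (xs[i], yv) at once."""
    Q = np.column_stack([xs, np.full(len(xs), yv)])                   # (m, 2)
    # Distance from every query point to every training point: (m, N)
    dist = np.linalg.norm(Q[:, None, :] - D[None, :, 0:2], axis=2)
    nearest = np.argsort(dist, axis=1)[:, :k]                         # (m, k)
    labels = D[nearest, 2].astype(int)                                # (m, k)
    # Column c of votes counts how many of the k neighbours carry label c
    n_labels = int(D[:, 2].max()) + 1
    votes = (labels[:, :, None] == np.arange(n_labels)).sum(axis=1)   # (m, n_labels)
    return votes.argmax(axis=1), votes.max(axis=1) / k

# Assumed training data: three Gaussian blobs labelled 1, 2, 3
rng = np.random.default_rng(1)
n = 30
D = np.vstack([
    np.column_stack([rng.normal(loc=m, size=(n, 2)), np.full(n, c)])
    for c, m in enumerate([[-1.5, 0], [1.5, 0], [0, 1.5]], start=1)
])

x = np.linspace(-2, 2, 12)
y = np.linspace(-2, 2, 8)
classes = np.empty((len(y), len(x)), dtype=int)
probabilities = np.empty((len(y), len(x)))
for yi, yv in enumerate(y):              # only the outer loop remains
    classes[yi], probabilities[yi] = kNN_row(x, yv, D, k=5)
```

The intermediate distance array has one row per query point and one column per training point, so memory grows with the row length times the training set size, which stays small for a single row.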
