
Create 2D Hist with same amount of points in each cell

I am looking for a way to create a 2D histogram with irregular bin sizes, with the option of plotting a z variable as heat (color).

The data: I have one billion objects. Every object has the features x and y and an anomaly score z.

The plot: All objects are plotted with y against x. The histogram should have irregular (adaptive) bin sizes, so that every bin contains the same number of objects. Initially this should produce a histogram without any visible features, in just one color (with color representing the count of objects).

To create the bin edges, I first use np.percentile to split the objects into percentiles based on the x feature. Second, I take the first x bin edge, find all the points within it, and bin them in the y direction based on percentiles. That would look something like this (pseudocode):

for i, key_x in enumerate(np.percentile(x, np.arange(0,101, 10))):
    xedges[i] = key_x
    objects = find_all_objects_within_binedge(key_x)

    for j, key_y in enumerate(np.percentile(objects["y"], np.arange(0,101, 10))):
        yedges[i, j] = key_y

So xedges is an array with the bin edges in the x direction, and yedges is a matrix giving the y bin edges for every x bin edge. If this is not clear, please let me know.
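A minimal runnable sketch of the edge computation described above, assuming x and y are 1-D NumPy arrays of equal length (random placeholder data here) and ten bins per direction. The helper name `adaptive_edges` is made up for illustration. Unlike the pseudocode, it loops over x bins (pairs of consecutive edges) rather than over single edges, so `yedges` gets one row per x bin.

```python
import numpy as np

def adaptive_edges(x, y, n_bins=10):
    """Return x edges (n_bins+1,), per-x-bin y edges (n_bins, n_bins+1),
    and the x-bin index of every point."""
    q = np.linspace(0, 100, n_bins + 1)
    xedges = np.percentile(x, q)
    # Assign each point to an x bin; clip so x.max() falls in the last bin.
    xbin = np.clip(np.searchsorted(xedges, x, side="right") - 1, 0, n_bins - 1)
    yedges = np.empty((n_bins, n_bins + 1))
    for i in range(n_bins):
        # Percentiles of y, computed only from the points inside x bin i.
        yedges[i] = np.percentile(y[xbin == i], q)
    return xedges, yedges, xbin

# Placeholder data standing in for the real objects (assumption).
rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = 0.1 * rng.normal(size=10_000) + 1
xedges, yedges, xbin = adaptive_edges(x, y)
```

For a billion objects the per-bin boolean masks would likely be too slow or too memory hungry; sorting by x once and slicing is one possible alternative, but that is outside this sketch.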

So if we imagine the resulting histogram, the bin boundaries would be straight lines in x, but in the y direction these lines would be broken up. See here to get an idea of what I mean by the y bins being split irregularly.

And this is where I am stuck. I have no idea how to create a histogram or plot from my x bin edges and y bin edges with these irregular bins.

The goal (for better understanding): Once that is accomplished, I would like to color each bin by the mean or std of the z values of all the points within that cell (I have the code for that ready). Ideally this will look very smooth as well, with a few minor exceptions, which would be the anomalies I am looking for. But this should be feasible with plt.pcolormesh.
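For readers who do not have such code at hand, here is one possible sketch of the per-cell mean and std of z, not the asker's actual code. It assumes x, y and z are 1-D arrays (random placeholders here) and uses np.bincount on a flat cell index. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_bins = 5_000, 5
x, y, z = rng.normal(size=(3, n))      # placeholder features and scores

q = np.linspace(0, 100, n_bins + 1)
xedges = np.percentile(x, q)
xbin = np.clip(np.searchsorted(xedges, x, side="right") - 1, 0, n_bins - 1)

# y bin of every point, using y edges computed separately inside each x bin
ybin = np.empty(n, dtype=int)
for i in range(n_bins):
    sel = xbin == i
    yedges_i = np.percentile(y[sel], q)
    ybin[sel] = np.clip(np.searchsorted(yedges_i, y[sel], side="right") - 1,
                        0, n_bins - 1)

cell = xbin * n_bins + ybin            # flat cell index, x bin major
count = np.bincount(cell, minlength=n_bins**2)
z_sum = np.bincount(cell, weights=z, minlength=n_bins**2)
z_sqsum = np.bincount(cell, weights=z**2, minlength=n_bins**2)

# Per-cell statistics, indexed [x_bin, y_bin]
z_mean = (z_sum / count).reshape(n_bins, n_bins)
z_var = np.maximum(z_sqsum / count - (z_sum / count) ** 2, 0)
z_std = np.sqrt(z_var).reshape(n_bins, n_bins)
```

The std here is the population standard deviation (ddof=0), computed from sums of squares. For scores with a large mean relative to their spread this formula can lose precision, so a two-pass computation may be preferable.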

English is not my native language and I tried my best to describe the problem. If something is unclear, please let me know and I'll try to clarify as best I can. Thank you guys in advance :)

From what I understand, you want the data to be binned so that each bin holds an equal amount of data. Percentiles can indeed be used for this purpose. With numpy you can do this along d dimensions. Here is an example for 2D binning:

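import matplotlib.pyplot as plt
import numpy as np

data = np.random.randn(1000, 2)
data[:, 1] = data[:, 1] * .1 + 1  # shift and narrow the y gaussian

# 0, 10, ..., 100: include the 100th percentile so the top 10% is not dropped
percentiles = np.percentile(data, range(0, 101, 10), axis=0)  # shape (11, 2)

fig, ax = plt.subplots()
# column 0 holds the x edges, column 1 the y edges
ax.hist2d(data[:, 0], data[:, 1], bins=[percentiles[:, 0], percentiles[:, 1]])
plt.show()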

Is this what you were looking for?
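One caveat worth checking: edges taken from the marginal percentiles give equal counts per x bin and per y bin, but not necessarily per cell. The sketch below counts points per cell with np.histogram2d on placeholder data. The correlation between x and y is an assumption added to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(size=(10_000, 2))
# Make y strongly correlated with x (assumption, for illustration only)
data[:, 1] = 0.5 * data[:, 0] + 0.1 * data[:, 1]

q = np.arange(0, 101, 10)
xedges = np.percentile(data[:, 0], q)
yedges = np.percentile(data[:, 1], q)
counts, _, _ = np.histogram2d(data[:, 0], data[:, 1], bins=[xedges, yedges])

per_x_bin = counts.sum(axis=1)   # roughly equal by construction
per_y_bin = counts.sum(axis=0)   # roughly equal by construction
print(per_x_bin, per_y_bin)
print(counts.min(), counts.max())  # individual cells can differ widely
```

This is why the question computes the y percentiles separately inside each x bin.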

Edit: non-uniform grid example

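import matplotlib.pyplot as plt
import numpy as np

data = np.random.randn(1000, 2)
data[:, 1] = data[:, 1] * .1 + 1  # shift and narrow the y gaussian

pct = np.arange(0, 101, 10)
xper = np.percentile(data[:, 0], pct)   # 11 x edges -> 10 x bins
nbins = xper.size - 1
yper = np.zeros((nbins, xper.size))     # y edges for each x bin
binnedData = np.zeros((nbins, nbins))   # one value per cell, [x bin, y bin]

for index, (binstart, binend) in enumerate(zip(xper[:-1], xper[1:])):
    idx = np.where((data[:, 0] >= binstart) & (data[:, 0] <= binend))[0]  # expensive
    yper[index] = np.percentile(data[idx, 1], pct)
    # digitize returns 1-based bin numbers; clip so the maximum lands in the last bin
    jbins = np.clip(np.digitize(data[idx, 1], yper[index]) - 1, 0, nbins - 1)
    # generate dummy values: accumulate (x + y) of every point into its cell
    np.add.at(binnedData[index], jbins, data[idx].sum(axis=1) / xper.size)

fig, ax = plt.subplots()
vmin, vmax = binnedData.min(), binnedData.max()
for index in range(nbins):
    # each x bin has its own y edges, so draw one column of cells at a time
    ax.pcolormesh(xper[index:index + 2], yper[index], binnedData[index][:, None],
                  vmin=vmin, vmax=vmax)
plt.show()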


It seems the question asks for a way to plot values on a grid that is regular in one dimension but irregular in the other.
As I understand it, such a grid would be defined by a 1D array in e.g. the x direction and a 2D array in the y direction. Both arrays denote the edges of the grid cells in the respective dimension.

For an M x N grid, x_edges would hence have N+1 elements, and y_edges would be of shape (M+1, N). The following would be a 4 x 3 grid.

x_edges = np.array([0,1,2,3])
y_edges = np.array([[0.,0.,0.],
                    [.3,.2,.2],
                    [.5,.6,.4],
                    [.8,.9,.7],
                    [1.,1.,1.]])

The usual matplotlib tools like imshow or pcolor do not, as far as I can see, allow plotting such grids. An alternative is hence to use a PolyCollection and draw the respective rectangles with it.

An array of values to be mapped to color can be set on that collection. This array should have one element fewer than the edges in each dimension and be flat, i.e. have M*N elements.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.collections import PolyCollection

# Starting data: A grid, regular in x-direction and irregular in y direction.
x_edges = np.array([0,1,2,3])
y_edges = np.array([[0.,0.,0.],
                    [.3,.2,.2],
                    [.5,.6,.4],
                    [.8,.9,.7],
                    [1.,1.,1.]])

######## Grid creation ################
#y_edges = np.concatenate((y_edges, np.zeros(len(y_edges))))
s = np.array(y_edges.shape)
# make x_edges 2D as well.
x_edges = np.tile(x_edges, s[0]-1).reshape((s[0]-1, s[1]+1))

# you may also have an array of values. 
# This should be of shape one less than the edges and flattened.
values = np.arange(np.prod(s+np.array((-1,0))))

# Produce a vertices array of the edges of rectangles that form each pixel.
x = np.c_[x_edges[:,:-1].flatten(), x_edges[:,:-1].flatten(),
          x_edges[:,1: ].flatten(), x_edges[:,1: ].flatten()]
y = np.c_[y_edges[:-1,:].flatten(), y_edges[1: ,:].flatten(),
          y_edges[1: ,:].flatten(), y_edges[:-1,:].flatten()]
xy = np.stack((x,y), axis=2)

# Create collection of rectangles.
pc = PolyCollection(xy, closed=True, edgecolors="k", linewidth=0.72, cmap="inferno")
pc.set_array(values)

######## Plotting ################
fig, ax = plt.subplots()
ax.add_collection(pc)
fig.colorbar(pc, ax=ax)

ax.margins(0)
ax.autoscale()
plt.show()


This grid uses a small number of cells to show the principle. If you want to have more cells, make sure not to draw the edges of the rectangles, by removing the edgecolors and linewidth arguments.
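To connect this with the question's percentile edges, here is a hedged sketch of how per-x-bin y edges could be brought into the (M+1, N) layout used above. It assumes the edges were computed with one row per x bin and that the per-cell values are indexed [x_bin, y_bin]. The names `yedges_per_xbin` and `stat` are placeholders, and the data is random.

```python
import numpy as np

n_bins = 4
rng = np.random.default_rng(3)
x = rng.normal(size=2_000)
y = rng.normal(size=2_000)

q = np.linspace(0, 100, n_bins + 1)
x_edges = np.percentile(x, q)                       # shape (N+1,)
xbin = np.clip(np.searchsorted(x_edges, x, side="right") - 1, 0, n_bins - 1)
# One row of y edges per x bin: shape (N, M+1)
yedges_per_xbin = np.array([np.percentile(y[xbin == i], q)
                            for i in range(n_bins)])

# Placeholder per-cell value (e.g. mean of z), indexed [x_bin, y_bin]
stat = rng.random((n_bins, n_bins))

# The grid code above expects y edges as (M+1, N): one column per x bin.
y_edges = yedges_per_xbin.T
# Its rectangles are ordered y bin major (k = m * N + n), so transpose
# before flattening to keep each value with its own cell.
values = stat.T.ravel()
```

With `x_edges`, `y_edges` and `values` in this form, the grid creation and PolyCollection steps from the example should apply unchanged, with `values` passed to `pc.set_array`.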
