
How to normalize a histogram in MATLAB?

How can I normalize a histogram so that the area under the probability density function is equal to 1?

My answer to this is the same as in an answer to your earlier question. For a probability density function, the integral over the entire space is 1. Dividing by the sum will not give you the correct density. To get the right density, you must divide by the area. To illustrate my point, try the following example.

[f, x] = hist(randn(10000, 1), 50); % Create histogram from a normal distribution.
g = 1 / sqrt(2 * pi) * exp(-0.5 * x .^ 2); % pdf of the normal distribution

% METHOD 1: DIVIDE BY SUM
figure(1)
bar(x, f / sum(f)); hold on
plot(x, g, 'r'); hold off

% METHOD 2: DIVIDE BY AREA
figure(2)
bar(x, f / trapz(x, f)); hold on
plot(x, g, 'r'); hold off

You can see for yourself which method agrees with the correct answer (red curve).
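
The same point can be checked numerically rather than by eye. The following is a minimal sketch, not part of the original answer: it integrates each normalized histogram with trapz and prints the result. The names area1 and area2 are introduced here for illustration.

```matlab
% Sketch: compare the area under each normalized histogram.
[f, x] = hist(randn(10000, 1), 50);   % counts f at bin centers x

area1 = trapz(x, f / sum(f));         % METHOD 1: roughly the bin width, not 1
area2 = trapz(x, f / trapz(x, f));    % METHOD 2: 1 by construction

fprintf('area after dividing by sum:  %.4f\n', area1);
fprintf('area after dividing by area: %.4f\n', area2);
```

Only the second area comes out as 1, which is why only Method 2 lines up with the red curve.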


Another method (more straightforward than method 2) to normalize the histogram is to divide by sum(f * dx), which expresses the integral of the probability density function, i.e.

% METHOD 3: DIVIDE BY AREA USING sum()
figure(3)
dx = diff(x(1:2)); % bin width (centers are equally spaced)
bar(x, f / sum(f * dx)); hold on
plot(x, g, 'r'); hold off
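
Methods 2 and 3 are close but not identical in general. For equally spaced centers, the trapezoidal rule gives the two end bins half weight, so trapz(x, f) equals sum(f * dx) minus dx * (f(1) + f(end)) / 2. The gap is negligible when the outermost bins are nearly empty, as with the normal sample above. A small sketch with made-up counts (all values here are illustrative, not taken from the answer):

```matlab
% Sketch with made-up counts on equally spaced centers.
x  = 1:5;                 % bin centers
f  = [5 10 20 10 5];      % bin counts (illustrative values)
dx = diff(x(1:2));        % bin width

area_sum   = sum(f * dx);     % METHOD 3 denominator
area_trapz = trapz(x, f);     % METHOD 2 denominator

% The gap is exactly half the weight of the two end bins.
gap          = area_sum - area_trapz;
expected_gap = dx * (f(1) + f(end)) / 2;
```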

Since R2014b, MATLAB has had these normalization routines built into the histogram function (see the help file for the 6 routines this function offers). Here is an example using the PDF normalization (the total area of all the bins is 1).

data = 2*randn(5000,1) + 5;             % generate normal random (m=5, std=2)
h = histogram(data,'Normalization','pdf')   % PDF normalization

The corresponding PDF is

Nbins = h.NumBins;
edges = h.BinEdges; 
x = zeros(1,Nbins);
for counter=1:Nbins
    midPointShift = abs(edges(counter)-edges(counter+1))/2;
    x(counter) = edges(counter)+midPointShift;
end

mu = mean(data);
sigma = std(data);

f = exp(-(x-mu).^2./(2*sigma^2))./(sigma*sqrt(2*pi));

The two together give

hold on;
plot(x,f,'LineWidth',1.5)
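
The loop that computes the bin centers can also be written without a loop. A minimal sketch, using a made-up edges vector in place of h.BinEdges so that it runs on its own:

```matlab
% Sketch: bin centers from bin edges, vectorized.
edges = [0 1 2 4 8];                     % stand-in for h.BinEdges
x = edges(1:end-1) + diff(edges) / 2;    % midpoint of each bin
```

This form also handles bins of unequal width, since each midpoint uses its own bin's width.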


An improvement that might very well be due to the success of the actual question and accepted answer!


EDIT - The use of hist and histc is no longer recommended; histogram should be used instead. Beware that none of the 6 ways of creating bins with this new function will produce the same bins that hist and histc produce. There is a MATLAB script to update former code to fit the way histogram is called (bin edges instead of bin centers - link). By doing so, one can compare the pdf normalization methods of @abcd (trapz and sum) and MATLAB (pdf).

The 3 pdf normalization methods give nearly identical results (within the range of eps).

TEST:

A = randn(10000,1);
centers = -6:0.5:6;
d = diff(centers)/2;
edges = [centers(1)-d(1), centers(1:end-1)+d, centers(end)+d(end)];
edges(2:end) = edges(2:end)+eps(edges(2:end));

figure;
subplot(2,2,1);
hist(A,centers);
title('HIST not normalized');

subplot(2,2,2);
h = histogram(A,edges);
title('HISTOGRAM not normalized');

subplot(2,2,3)
[counts, centers] = hist(A,centers); %get the count with hist
bar(centers,counts/trapz(centers,counts))
title('HIST with PDF normalization');


subplot(2,2,4)
h = histogram(A,edges,'Normalization','pdf')
title('HISTOGRAM with PDF normalization');

dx = diff(centers(1:2))
normalization_difference_trapz = abs(counts/trapz(centers,counts) - h.Values);
normalization_difference_sum = abs(counts/sum(counts*dx) - h.Values);

max(normalization_difference_trapz)
max(normalization_difference_sum)


The maximum difference between the new PDF normalization and the former one is 5.5511e-17.

hist can not only plot a histogram but also return the count of elements in each bin, so you can get that count, normalize it by dividing each bin by the total, and plot the result using bar. Example:

Y = rand(10,1);
C = hist(Y);
C = C ./ sum(C);
bar(C)

or if you want a one-liner:

bar(hist(Y) ./ sum(hist(Y)))

Documentation: see the MATLAB reference pages for hist and bar.

Edit: This solution answers the question "How to have the sum of all bins equal to 1". This approximation is valid only if your bin size is small relative to the variance of your data. The sum used here corresponds to a simple quadrature formula; more complex ones can be used, like trapz as proposed by RM.
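
One way to read the caveat about bin size: a bar normalized by the sum holds the probability mass of its bin, and that mass tracks "density times bin width" only when the density is nearly constant across the bin. The sketch below checks this for a standard normal distribution, which is an assumption made purely for illustration; Phi, phi and widths are names introduced here.

```matlab
% Sketch: exact bin probability vs. midpoint approximation for N(0,1).
Phi = @(t) 0.5 * (1 + erf(t / sqrt(2)));      % standard normal CDF
phi = @(t) exp(-0.5 * t.^2) / sqrt(2 * pi);   % standard normal pdf

widths = [0.1 2];                             % a narrow and a wide bin, both centered at 0
exact  = Phi(widths / 2) - Phi(-widths / 2);  % true probability mass in each bin
approx = phi(0) * widths;                     % density at the center times bin width

rel_err = abs(approx - exact) ./ exact;       % small for width 0.1, large for width 2
```

With the narrow bin the two numbers agree closely; with a bin as wide as two standard deviations they do not.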

[f,x]=hist(data)

The area of each individual bar is height*width. Since MATLAB chooses equidistant points for the bars, the width is:

delta_x = x(2) - x(1)

Now, if we sum up all the individual bars, the total area comes out as

A=sum(f)*delta_x

So the correctly scaled plot is obtained by

bar(x, f/sum(f)/(x(2)-x(1)))
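
Written out, with f_i the count in bin i, \Delta x the common bin width (delta_x above) and A the total area, the scaled bar heights \hat{p}_i (a symbol introduced here) enclose unit area:

```latex
A = \Delta x \sum_j f_j, \qquad
\hat{p}_i = \frac{f_i}{A}, \qquad
\sum_i \hat{p}_i \, \Delta x = \frac{\Delta x \sum_i f_i}{A} = 1 .
```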

The area of abcd's PDF is not one, which is impossible, as pointed out in many comments. Assumptions made in many answers here:

  1. Assume a constant distance between consecutive edges.
  2. The probability under the pdf should be 1. The normalization should be done with Normalization set to probability, not with Normalization set to pdf, in histogram() and hist().

Fig. 1 Output of hist() approach, Fig. 2 Output of histogram() approach


The maximum amplitude differs between the two approaches, which suggests that there is a mistake in hist()'s approach, because histogram()'s approach uses the standard normalization. I assume the mistake in hist()'s approach here is that the normalization is done partially as pdf, not completely as probability.

Code with hist() [deprecated]

Some remarks

  1. First check: sum(f)/N gives 1 if Nbins is set manually.
  2. pdf requires the width of the bin (dx) in the graph g

Code

%http://stackoverflow.com/a/5321546/54964
N=10000;
Nbins=50;
[f,x]=hist(randn(N,1),Nbins); % create histogram from ND

%METHOD 4: Count Densities, not Sums!
figure(3)
dx=diff(x(1:2)); % width of bin
g=1/sqrt(2*pi)*exp(-0.5*x.^2) .* dx; % pdf of ND with dx
% 1.0000
bar(x, f/sum(f));hold on
plot(x,g,'r');hold off

Output is in Fig. 1.

Code with histogram()

Some remarks

  1. First check: a) sum(f) is 1 if Nbins is chosen by histogram() with Normalization set to probability, b) sum(f)/N is 1 if Nbins is set manually without normalization.
  2. pdf requires the width of the bin (dx) in the graph g

Code

%%METHOD 5: with histogram()
% http://stackoverflow.com/a/38809232/54964
N=10000;

figure(4);
h = histogram(randn(N,1), 'Normalization', 'probability') % hist() deprecated!
Nbins=h.NumBins;
edges=h.BinEdges; 
x=zeros(1,Nbins);
f=h.Values;
for counter=1:Nbins
    midPointShift=abs(edges(counter)-edges(counter+1))/2; % same constant for all
    x(counter)=edges(counter)+midPointShift;
end
dx=diff(x(1:2)); % constant for all
g=1/sqrt(2*pi)*exp(-0.5*x.^2) .* dx; % pdf of ND
% Use if Nbins manually set
%new_area=sum(f)/N % diff of consecutive edges constant
% Use if histogram() Normalization probability
new_area=sum(f)
% 1.0000
% No bar() needed here with histogram() Normalization probability
hold on;
plot(x,g,'r');hold off

The output is in Fig. 2, and the expected output is met: area 1.0000.
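
For equal-width bins, the probability and pdf normalizations differ only by the constant factor dx, which is why g is multiplied by dx in the code above. A sketch with made-up counts (the values and the names p_prob and p_pdf are introduced here for illustration):

```matlab
% Sketch: the two normalizations applied to the same made-up counts.
counts = [2 6 10 6 2];                  % illustrative bin counts
dx     = 0.5;                           % common bin width (assumed)

p_prob = counts / sum(counts);          % heights sum to 1 ('probability')
p_pdf  = counts / (sum(counts) * dx);   % bar areas sum to 1 ('pdf')

total_prob = sum(p_prob);               % sum of the heights
total_area = sum(p_pdf * dx);           % sum of height * width
```

Both totals come out as 1, but they measure different things: the first is a sum of bar heights, the second is an area.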

Matlab: 2016a
System: Linux Ubuntu 16.04 64 bit
Linux kernel 4.6

There is an excellent three-part guide to histogram adjustments in MATLAB (broken original link, archive.org link); the first part covers histogram stretching.

For some distributions (Cauchy, I think), I have found that trapz will overestimate the area, and so the pdf will change depending on the number of bins you select. In that case I do

[N,h]=hist(q_f./theta,30000); % there is a large range, but most of the bins will be empty
plot(h,N/(sum(N)*mean(diff(h))),'+r')
