
Calculating peaks in histograms or density functions

There seem to be a lot of "peaks in density function" threads already, but I don't see one addressing this point specifically. Apologies for the duplicate if I missed it.

My problem: given a vector of 1000 values (sample attached), I would like to identify the peaks in the histogram or density function of the data. In the image of the sample data below, I can see peaks in the histogram at about 0, 6200, and 8400. But I need to obtain the exact values of these peaks, preferably with a simple procedure, as I have several thousand of these vectors to process.

[Figure: histogram and density function of the sample data]

I originally started working with the histogram outputs themselves, but couldn't get any peak-finding command to work at all. I'm not even sure how to get the peaks() command from the splus2R package to work on a histogram object or on a density object. Working from the histogram would still be my preference, as I would like to identify the exact data value of the maximum frequency of each peak (as opposed to the density function value, which is slightly different), but I can't figure that out either.

I would post the sample data themselves, but I can't see a way to do that here (sorry if I'm just missing it).

If your y values are smooth (as in your sample plot), this should find the peaks pretty repeatably:

peakx <- x[which(diff(sign(diff(y)))==-2)]
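
A minimal sketch of how this one-liner behaves on a toy curve. The `x` and `y` values are made up for illustration and are not the asker's data. Each `diff()` shortens the vector by one, so the index returned by `which()` points at the sample just before the maximum; the sketch adds 1 to land on the peak itself.

```r
x <- 1:7
y <- c(0, 1, 3, 1, 0, 2, 0)        # toy curve with maxima at x = 3 and x = 6

# -2 marks a change from rising (+1) to falling (-1)
turn <- diff(sign(diff(y)))

raw   <- x[which(turn == -2)]      # one sample to the left of each maximum
peakx <- x[which(turn == -2) + 1]  # shifted by one to sit on each maximum
```

On dense, smooth data the one-sample shift is small, which may be why the unshifted version still looks right on a plot.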

Since you are thinking about histograms, maybe you should use the histogram output directly?

data <- c(rnorm(100,mean=20),rnorm(100,mean=12))

peakfinder <- function(d){
  dh <- hist(d,plot=FALSE)
  ins <- dh[["density"]] ## "intensities" was a deprecated alias of "density"
  nbins <- length(ins)
  ss <- which(rank(ins)%in%seq(from=nbins-2,to=nbins)) ## pick the top 3 intensities
  dh[["mids"]][ss]
}

peaks <- peakfinder(data)

hist(data)
sapply(peaks,function(x) abline(v=x,col="red"))

This isn't perfect: for example, it will find just the top bins, even if they are adjacent. Maybe you could define 'peak' more precisely? Hope that helps.
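
One possible way to make 'peak' more precise, offered as a sketch rather than as the answerer's method: treat a bin as a peak when its count is strictly higher than both of its neighbours. The function name `local_peak_bins` and the zero-padding at both ends are assumptions made for illustration.

```r
local_peak_bins <- function(d, breaks = "Sturges") {
  dh <- hist(d, breaks = breaks, plot = FALSE)
  # Pad with zeros so the first and last bins can also qualify as peaks
  cnt <- c(0, dh$counts, 0)
  # turn[i] == -2 means bin i is higher than the bin on either side
  turn <- diff(sign(diff(cnt)))
  dh$mids[turn == -2]
}

set.seed(1)
data <- c(rnorm(100, mean = 20), rnorm(100, mean = 12))
local_peak_bins(data)
```

Adjacent top bins are no longer reported together, but a flat-topped peak (two equal neighbouring bins) is missed by the strict comparison, and noisy histograms will produce many small local peaks unless the bins are wide enough.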


Finding peaks in density functions is, as already noted in the comments, related to finding local maxima and minima, where you can find more solutions. The answer from chthonicdaemon lands close to the peak, but each diff shortens the vector by one, so the resulting index is shifted.

#Create Dataset
x <- c(1,1,4,4,9)

#Estimate Density
d <- density(x)

#Two ways to get highest Peak
d$x[d$y==max(d$y)]  #Gives you all highest Peaks
d$x[which.max(d$y)] #Gives you the first highest Peak

#3 ways to get all Peaks
d$x[c(F, diff(diff(d$y)>=0)<0)] #This detects also a plateau
d$x[c(F, diff(sign(diff(d$y)))<0)]
d$x[which(diff(sign(diff(d$y)))<0)+1]

#In case you also want the height of the peaks
data.frame(d[c("x", "y")])[c(F, diff(diff(d$y)>=0)<0),]

#In case you need a higher "precision"
d <- density(x, n=1e4)
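
Since the question mentions several thousand vectors, the pieces above could be wrapped in a function. This is a sketch under assumptions: the name `density_peaks`, the `min_rel_height` cutoff, and the synthetic sample are all illustrative and not part of the original answer.

```r
density_peaks <- function(v, n = 512, min_rel_height = 0.05) {
  d <- density(v, n = n)
  # +1 compensates for the element dropped by the inner diff()
  idx <- which(diff(sign(diff(d$y))) < 0) + 1
  # Keep only maxima at least min_rel_height times the tallest one
  # (an arbitrary cutoff that discards tiny bumps in the tails)
  idx <- idx[d$y[idx] >= min_rel_height * max(d$y)]
  data.frame(x = d$x[idx], y = d$y[idx])
}

# Synthetic stand-in loosely shaped like the data described in the question
set.seed(42)
v <- c(rnorm(300, 0, 50), rnorm(400, 6200, 200), rnorm(300, 8400, 200))
density_peaks(v)
```

With a list of vectors, `lapply(list_of_vectors, density_peaks)` would return one data frame per vector. How many peaks are reported depends on the bandwidth chosen by `density()`, so nearby modes may merge unless `bw` or `adjust` is tuned.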

More than 8 years later, this is still a valid and classic question. Here's a complete answer, built on the excellent clue given by @chthonicdaemon.

library(ggplot2)
library(data.table)
library(magrittr) ### provides %>%, which ggplot2 and data.table do not export
### I use a preloaded data.table. You can use any data.table with one numeric column x.
### Extract counts & breaks of the histogram bins. 
### I have taken breaks as 40 but you can take any number as needed.
### But do keep a large number of breaks so that you get multiple peaks.
h <- hist(dt1$x, breaks = 40, plot = FALSE) ### compute once, without drawing
counts <- h$counts
breaks <- h$breaks
### Note: the data.table `dt1` should contain at least one numeric column, x

### now name the counts vector with the corresponding breaks 
### note: the length of counts is 1 less than the breaks
names(counts) <- breaks[-length(breaks)]

### Find index for those counts that are the peaks 
### (see previous classic clue to take a double diff)
### note: the double diff shortens the vector by 2, hence
### I have added a FALSE before and after the result
### to align the T/F vector with the counts vector

peak_indx <- c(F,diff(sign(c(diff(counts))))==-2,F) %>% which()
topcounts <- counts[peak_indx]
topbreaks <- names(topcounts) %>% as.numeric()

### Now let's use ggplot to plot the histogram along with visualised peaks

dt1 %>%     
ggplot() + 
geom_histogram(aes(x),bins = 40,col="grey51",na.rm = T) + 
geom_vline(xintercept = topbreaks + 50,lty = 2) 
# adjust the value 50 to bring the lines in the centre
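
The manual offset of 50 could be avoided by marking peaks at the bin midpoints, which `hist()` returns as `mids`. A base-R sketch, assuming a synthetic vector `x` as a stand-in for `dt1$x`:

```r
set.seed(7)
x <- c(rnorm(500, 0, 300), rnorm(500, 6000, 300))  # synthetic stand-in for dt1$x

h <- hist(x, breaks = 40, plot = FALSE)

# Same double-diff test as above, padded so it lines up with the bins
is_peak <- c(FALSE, diff(sign(diff(h$counts))) == -2, FALSE)
peak_mids <- h$mids[is_peak]

plot(h)                         # draw the precomputed histogram
abline(v = peak_mids, lty = 2)  # dashed lines through the centre of each peak bin
```

Because the lines are drawn on the same `hist` object that produced the counts, the bins and the markers are guaranteed to line up, which is not automatic when `geom_histogram(bins = 40)` picks its own bin edges.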

[Figure: output histogram with the peaks marked]
