简体   繁体   中英

How can I extract data from a kernel density function in R for many samples at one time

I have a very large data file (>300k rows) and each row is part of a unique sample (>3000 samples). I want to generate a kernel density estimator for each individual sample and extract relevant information (minimum value, maximum value, maximum probability of density estimator, median of density estimator, mean of density estimator) into a separate table along with the sample name.

I have tried to extract information from the ggplot function stat_density_ridges() using the approaches laid out here Adding a mean to geom_density_ridges and here draw line on geom_density_ridges which extract data from stat_density_ridges and ggplot_build with purrr::pluck but it doesn't provide all the information that I want.

The following generates some synthetic data similar to what I want:

set.seed(1)
x = runif( 50, max = 40, min = 20 )
set.seed(2)
y = runif( 50, max = 300, min = 100 )
sample.number = c( rep( 1, 20 ), rep( 2, 15 ), rep( 3, 5 ), rep( 4, 10 ) )
d <- data.frame( x, y , sample.number ) 

And the plot in ggplot that shows the distribution:

ggplot( data = d, aes( x = x, y = as.factor( samples ) ) ) +
  labs( x = expression( paste( "x" ) ), 
    y = expression( paste( "sample number" ) ) ) +
  stat_density_ridges() 

I would like to end up with a data table with the following information: sample.name , max(x) , min(x) , maximum height of the kernel density estimator and it's x location, median height of the kernel density estimator and it's x location, etc.

The only thing I can think to do is to create a long and arduous loop

sample.numbers <- rep( NA, times = max( d$sample.number ) )
max.x <- rep( NA, times = max( d$sample.number ) )
min.x <- rep( NA, times = max( d$sample.number ) )

for( i in 1:max( d$sample.number ) ) {
  temp.d = d[ d$sample.number == i, ]
  sample.numbers[ i ] = i
  max.x[ i ] = max( temp.d$x )
  min.x[ i ] = min( temp.d$x )
}

and then somehow adding in a bit that creates a density estimator and extracts the information from that. I'm guessing the indexing in R presents an easier way to get through this for the many thousands of samples I have while using group_by , but I cannot figure it out. Note, I'm still having trouble getting my head around piping in R, so some simple explaining might be required if solutions have that in them.

There are different ways to do this. Using dplyr and the pipe operator is, in my opinion, the easiest way. I tried adding comments in the code to make it easier to understand. Take a look at this dplyr cheat sheet .

Basically, you use group_by to divide your data frame in groups according to sample.number . Then you use summarise to compute summary metrics of the column x inside each group.

To compute the density you can use density() from base R inside the summarise . This will return a list with a sample of (x,y) values of the density function. To extract quantiles from this density function, you can use the package spatstat .

One observation: density() computes a value for the bandwidth that depends on the data set. Since we are separating the different groups, each group can end up having a different bandwidth value. I used the function bw.nrd to estimate a single value of the bandwidth using the complete dataset. Then I use this single bandwidth value for all computations.

# needed to extract quantile from a pdf computed with density()
library(spatstat)
# packages for data wrangling
library(plyr)
library(dplyr)
# ploting
library(ggplot2)
library(ggridges)

# creata data set
set.seed(1)
x = runif( 50, max = 40, min = 20 )
set.seed(2)
y = runif( 50, max = 300, min = 100 )
sample.number = c( rep( 1, 20 ), rep( 2, 15 ), rep( 3, 5 ), rep( 4, 10 ) )
d <- data.frame( x, y , sample.number )

# first compute bandwidth over all samples
# if you don't do this, each pdf in the table will have a different bandwidth
# bw.nrd is a function that computes bandwidth for a kernel density using a "rule of thumb" formula
# there are other functions that you can use to estimate bw
bw <- bw.nrd(d$x)

# create the table using the pipe operator and dplyr
# the pipe operator '%>%' takes what is on the left side and puts inside the function
# on the right side as an argument
d %>% 
  # group rows of 'd' by sample number (this is equivalent to your for loop)
  group_by(sample.number) %>%
  # before computing the summaries for each group, create a new column with the 
  # number of elements in each sample (the resulting DF still has 50 rows)
  mutate(n=n()) %>%
  # now remove rows that belong to groups with less than 5 elements (you can change the threshold value here)
  filter(n > 5) %>%
  # for each group in 'd' compute these summary metrics
  summarise(max.x=max(x),
            min.x=min(x), 
            max.density=max(density(x, bw = bw)$y),
            x.mode=density(x, bw = bw)$x[which(density(x, bw = bw)$y == max.density)],
            x.median=quantile(density(x, bw = bw), 0.5),
            median.density=density(x, bw = bw)$y[which(density(x, bw = bw)$x == x.median)])

# OUTPUT (note that sample.number == 3 was removed from the table)
#># A tibble: 3 x 7
#>  sample.number max.x min.x max.density x.mode x.median median.density
#>          <dbl> <dbl> <dbl>       <dbl>  <dbl>    <dbl>          <dbl>
#>1             1  39.8  21.2      0.0568   34.3     31.4         0.0503
#>2             2  38.7  20.3      0.0653   26.9     28.4         0.0628
#>3             4  36.4  20.5      0.0965   33.9     33.0         0.0939
#

# see the pdfs using stat_density_ridges
# (note that i am fixing the bandwidth)
ggplot( data = d, aes( x = x, y = as.factor( sample.number ) ) ) +
  labs( x = expression( paste( "x" ) ), 
        y = expression( paste( "sample number" ) ) ) +
  stat_density_ridges(bandwidth = bw) 

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM