
Subset a data frame based on density

I'm looking for a function that would allow subsetting a data frame based on the density of bivariate observations. For example:

ggplot(iris, aes(x = Petal.Length, y = Sepal.Width, color = Species)) + 
  stat_density2d(geom = 'polygon', aes(fill = ..level..), n = 8) + 
  geom_point()

Here, I would like to display only the points that are outliers based on the density of points within a Species (i.e. show only the 3 points from setosa and 4 points from virginica that lie outside the contours).

This is a rather hacky solution, but you could write a function that extracts the points outside the density contours and returns a data frame with just those points:

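plot_outliers_only <- function (original_plot) {
  require(ggplot2)
  require(dplyr)  # filter(), select(), slice() and %>%
  require(sp)     # point.in.polygon()
  pb <- ggplot_build(original_plot)
  # Layer 1 holds the contour polygons; the piece numbered "001" in each group
  # is taken here to be that group's outermost contour
  group_labels <- grep("001", levels(pb$data[[1]]$group), value=TRUE)
  outlier_points <- lapply(group_labels, function (gl) {
    contour_data <- filter(pb$data[[1]], as.character(group)==gl)
    group_id <- as.numeric(strsplit(gl, "-")[[1]][1])
    # Layer 2 holds the points; keep the row indices of those outside the contour
    outlier_id <- pb$data[[2]] %>%
      filter(group==group_id) %>%
      select(c(x, y)) %>%
      apply(1, function (point) {
        point.in.polygon(point[1], point[2], contour_data$x, contour_data$y)==0
      }) %>%
      which()
    if (length(outlier_id)==0) return (outlier_id)
    # Name of the column mapped to colour; as_name() also unwraps the quosures
    # that recent ggplot2 versions store in the mapping
    grouping_name <- rlang::as_name(original_plot$mapping$colour)
    # Map the indices back to rows of the original data for this group
    as.numeric(original_plot$data[, grouping_name]) %>%
      `==`(group_id) %>%
      which() %>%
      slice(original_plot$data, .) %>%
      `[`(., outlier_id, )
  })
  do.call(what=rbind, outlier_points)
}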

P <- ggplot(iris, aes(x = Petal.Length, y = Sepal.Width, color = Species)) + 
    stat_density2d(geom = 'polygon', aes(fill = ..level..), n = 8) + 
    geom_point()

plot_outliers_only(P)
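If digging through the built plot feels too fragile, a similar result can be approximated without ggplot2 by estimating the density directly. The sketch below is only an illustration under stated assumptions: `density_outliers` is a hypothetical helper name, `prob = 0.05` is an arbitrary cutoff, and the density at each point is read from the containing grid cell rather than interpolated, so it will not necessarily select exactly the points that fall outside the plotted contours. It relies on `MASS::kde2d()`, the same estimator `stat_density2d()` uses.

```r
library(MASS)  # kde2d()

# Hypothetical helper: within each group, keep the rows whose estimated
# 2D density is among the lowest `prob` fraction of that group's points.
density_outliers <- function(data, x, y, group, prob = 0.05, n = 100) {
  pieces <- lapply(split(data, data[[group]], drop = TRUE), function(d) {
    # Kernel density estimate on an n-by-n grid spanning this group's range
    est <- MASS::kde2d(d[[x]], d[[y]], n = n)
    # Index of the grid cell containing each point (no interpolation)
    ix <- findInterval(d[[x]], est$x, all.inside = TRUE)
    iy <- findInterval(d[[y]], est$y, all.inside = TRUE)
    dens <- est$z[cbind(ix, iy)]
    # Assumed cutoff: the lowest `prob` quantile of the point densities
    d[dens <= quantile(dens, prob), , drop = FALSE]
  })
  do.call(rbind, pieces)
}

out <- density_outliers(iris, "Petal.Length", "Sepal.Width", "Species")
```

Raising `prob` keeps more points; the returned data frame can be passed straight to `geom_point(data = out)`.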

My methodology is a little convoluted, so bear with me; I'll explain below:

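library(data.table)
# Keep only the plotted columns, plus a row id so each flower can be rebuilt after melting
dt <- as.data.table(iris)[, .(Petal.Length, Sepal.Width, Species)]
dt[, sample := .I]
dt <- melt(dt, id.vars = c("Species", "sample"))
# Mean and SD of each variable within each species
dt[, c("meanval", "sdval") := .(mean(value), sd(value)), .(Species, variable)]
# Flag values more than 2 SD away from their species mean
dt[abs((value - meanval) / sdval) > 2, outlier := TRUE]
# A flower counts as an outlier if any of its variables is flagged
dt[, anyOutliers := sum(outlier, na.rm = TRUE), sample]
dt[anyOutliers != 0, outlier := TRUE]
# Back to one row per flower; outlier is TRUE for flagged flowers and NA otherwise
dt <- dcast(
  dt[, .(Species, variable, value, outlier, sample)],
  sample + outlier + Species ~ variable,
  value.var = "value"
)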

First we assign dt as the data set and keep only the columns we plan to plot. Next, we add a dummy column, sample, holding the row number; for this particular dataset it is needed to tell rows apart later. Then we melt() the dataset to long format so that both variables can be handled in one grouped step. Then, for each species, we calculate the mean and standard deviation of each variable. This allows us to define outliers on the following line (you can change > 2 there to adjust the number of SDs used as the cutoff).
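To make the cutoff concrete, here is a tiny worked example of the same rule on a made-up vector (the numbers are invented purely for illustration):

```r
# Invented values: seven similar observations and one extreme one
x <- c(10, 11, 9, 10, 12, 10, 11, 30)

# Standardise: distance from the mean in units of standard deviation
z <- (x - mean(x)) / sd(x)

# Keep values more than 2 SD from the mean
x[abs(z) > 2]
# 30
```

Only the value 30 is flagged: it sits about 2.45 SD above the mean, while the other values are all within 0.6 SD.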

Then, for each flower, we check whether it is an outlier in any of our chosen metrics (in this case Petal.Length and Sepal.Width). If it is, the whole flower gets labelled an outlier. Then we dcast() the table back into its original form, only now there's an outlier column that shows whether or not the flower was an outlier in any of our metrics.

I won't go into plotting these, as you can figure out how you want to do that on your own, but this should give you the general direction. Hope that helps.
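As one possible starting point for the plotting step, here is a minimal sketch. To stay self-contained it recomputes the same 2 SD flag in base R with `ave()` instead of reusing `dt`; the helper `z` and the object names `flagged` and `p` are assumptions made for this example, not part of the answer above.

```r
library(ggplot2)

# Absolute z-score of v within each level of g (hypothetical helper)
z <- function(v, g) abs(v - ave(v, g)) / ave(v, g, FUN = sd)

# Rows more than 2 SD from their species mean in either variable
flagged <- iris[z(iris$Petal.Length, iris$Species) > 2 |
                z(iris$Sepal.Width, iris$Species) > 2, ]

# Density contours from all the data, points only for the flagged rows
p <- ggplot(iris, aes(x = Petal.Length, y = Sepal.Width, color = Species)) +
  stat_density2d(n = 8) +
  geom_point(data = flagged)
```

Printing `p` draws the contours for every species but only the flagged points. Note that these are outliers by the per-variable SD rule, which need not coincide with the points lying outside the density contours.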
