Displaying relative frequency or density histograms in R and in ggplot2 in a few ways

I would appreciate your help with the following (I should add that I have also sent an email to the ggplot2 mailing list and have not heard from anyone yet).

We have a data frame with a factor called EXP that has 3 levels (DMSO, DMSO1, DMSO2):

 head(pp_ALL)
    VALUE  EXP
1 1639742 DMSO
2 1636822 DMSO
3 1634202 DMSO

My aim is to overlay either the relative frequency histograms or the density histograms for the factor levels.

Could you please let me know why the following two pieces of R code show very different results (in terms of the height of the density histograms, and their interpretation):

ggplot(pp_ALL, aes(x=VALUE, colour=EXP)) + geom_density()

versus

ggplot(data=pp_ALL) +
       geom_histogram(mapping=aes(x=VALUE, y=..density.., colour=EXP),  bins=1000) 

Thanks,

Bogdan

Let's compare the two examples below. With default settings on this data, they look pretty similar in shape and density. They differ slightly because of how they work: the density plot uses a smoothing algorithm, while the histogram uses discrete bins. This can sometimes make the histogram easier to interpret (height = the share of observations in that bin, divided by the bin width), while the density plot may better reflect an underlying smooth distribution. But you'll note that here they're pretty similar.

library(dplyr)    # for %>% and mutate()
library(ggplot2)  # for ggplot() and the diamonds data set

diamonds %>%
  mutate(cut = factor(cut)) %>%
  ggplot() + 
  geom_density(aes(x = carat, color = cut)) +
  facet_wrap(~cut)

diamonds %>%
  mutate(cut = factor(cut)) %>%
  ggplot() + 
  geom_histogram(aes(x = carat, y = ..density.., fill = cut)) +
  facet_wrap(~cut)

With alternative settings, they could look very different. If we made the bandwidth of the density plot 1/10th as big, or used 10x as many bins for the histogram, the spikes would become narrower and higher while the area still integrates to 1, just as before. I presume the main thing you are seeing is that your histogram has much finer granularity (bins = 1000) than your density plot, so any spikes will be narrower and taller than in the density plot.

geom_density(aes(x = carat, color = cut), adjust = 1/10) + ...

geom_histogram(aes(x = carat, y = ..density.., fill = cut), bins = 300) + ...
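
The effect of bin count on bar height can be checked numerically without plotting. The following is a minimal base R sketch, assuming synthetic spiky data in place of the asker's VALUE column; the helper name density_hist and the bin counts 30 and 300 are illustrative choices, not taken from the original code. It shows that both histograms have total area 1 while the finer one has a taller peak.

```r
# Synthetic, spiky data standing in for VALUE (an assumption, not the asker's data).
set.seed(42)
x <- c(rnorm(5000, mean = 0.3, sd = 0.02),
       rnorm(3000, mean = 1.0, sd = 0.03),
       runif(2000, min = 0.2, max = 2.0))

# Hypothetical helper: density-scaled histogram with n_bins equal-width bins
# spanning the full data range.
density_hist <- function(values, n_bins) {
  breaks <- seq(min(values), max(values), length.out = n_bins + 1)
  hist(values, breaks = breaks, plot = FALSE)
}

coarse <- density_hist(x, 30)
fine   <- density_hist(x, 300)

# Area under each histogram: sum of (bar height * bar width).
area_coarse <- sum(coarse$density * diff(coarse$breaks))
area_fine   <- sum(fine$density * diff(fine$breaks))

# Tallest bar in each histogram.
peak_coarse <- max(coarse$density)
peak_fine   <- max(fine$density)

cat("area:", area_coarse, area_fine, "\n")
cat("peak height:", peak_coarse, peak_fine, "\n")
```

Because the 300 fine bins subdivide the 30 coarse ones, each coarse bar's height is the average of the ten fine bars inside it, so the fine histogram's peak can only be as tall or taller.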

One other difference in behavior you might notice in your case, with multiple factor levels, is that the density plot draws unstacked lines while the histogram stacks its bars. This tends to make the histogram taller even with equivalent bandwidth settings, because it stacks the densities for each level.

Same as the original plots, but without facet_wrap:

[Try geom_density(aes(x = carat, color = cut), position = "stack") to make it more like the histogram.]

[Try geom_histogram(aes(x = carat, y = ..density.., fill = cut), position = "dodge") to make it more like the density plot.]
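
The stacking effect can also be measured rather than eyeballed. This is a rough sketch, assuming ggplot2 3.3.0 or later (it uses after_stat(density), the newer spelling of ..density..) and using position = "identity" as the unstacked comparison; the variable names are illustrative. It reads the computed bar coordinates back with layer_data() and compares the tallest bar in each version.

```r
library(ggplot2)

# after_stat(density) is the newer spelling of ..density.. (ggplot2 >= 3.3.0).
base <- ggplot(diamonds, aes(x = carat, fill = cut))

# Default histogram behaviour: bars for each level are stacked.
stacked <- base +
  geom_histogram(aes(y = after_stat(density)), bins = 30, position = "stack")

# Unstacked comparison: bars for each level start at zero and overlap.
overlaid <- base +
  geom_histogram(aes(y = after_stat(density)), bins = 30,
                 position = "identity", alpha = 0.4)

# layer_data() returns the computed bar coordinates for a layer.
bars_stacked  <- layer_data(stacked)
bars_overlaid <- layer_data(overlaid)

peak_stacked  <- max(bars_stacked$ymax)
peak_overlaid <- max(bars_overlaid$ymax)

# Each level's bars integrate to 1, so the stack covers one unit of area per level.
total_area <- sum(bars_stacked$density * (bars_stacked$xmax - bars_stacked$xmin))

cat("peak (stacked):", peak_stacked, " peak (overlaid):", peak_overlaid, "\n")
cat("total area:", total_area, " levels:", nlevels(diamonds$cut), "\n")
```

The density is computed separately for each level, so a stacked histogram of k levels covers a total area of k rather than 1, which is why its y-axis is not directly comparable to that of the overlaid density lines.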
