
Density plot in R (ggplot2), colored by variable, returning very different distribution than histogram and frequency plot?

I've combed through several questions on here already and I can't figure out what's happening with my density plots.

I have a set of radiocarbon dates which are attributed to different cultures. I need to display the frequencies of dates through time, but distinguish the dates by culture. A stacked histogram does the job (Fig. 1), but stacked histograms are generally discouraged, so that's out of the question. At the same time, I want something smoother than a frequency plot (Fig. 2).

Figure 1: Histogram

Figure 2: Frequency plot

When I produce a density plot coloured by culture (Fig. 3), the relative heights of the cultures on the y-axis change drastically compared with their counts. For example, in the density plot, the blue density curve is much higher than the purple one; yet, in the histogram, we can see that there are far more dates attributed to the purple group.

Figure 3: Density plot

Am I doing something wrong with my code (see below)? Or perhaps I need to scale the density curves in some way? Or is there something about density plots I'm not understanding? (Disclaimer: my knowledge of stats is fairly weak.)

Thanks in advance!

library(ggplot2)
library(ggthemes)  # provides theme_tufte()

# One density curve per culture; x-axis reversed so time runs left to right
ggplot(test, aes(x = CalBP)) +
  theme_tufte(base_family = "sans") +
  theme(axis.line = element_line(), axis.text = element_text(color = "black")) +
  theme(legend.position = "none") +
  theme(text = element_text(size = 14)) +
  geom_density(aes(color = factor(Culture), fill = factor(Culture)), alpha = 0.5) +
  scale_x_reverse() +
  labs(x = "Cal. B.P.") +
  ylab(expression("Density")) +
  coord_cartesian(xlim = c(4773, 225)) +
  scale_fill_manual(values = c("#cf9045", "#ebe332", "#5f9388", "#6abeef", "#9d88d6")) +
  scale_color_manual(values = c("#cf9045", "#ebe332", "#5f9388", "#6abeef", "#9d88d6"))

The difference is that a density plot is scaled so that the total area under each curve is 1. Its function is to model a probability density function, which (by definition) has area 1.
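To see the normalization concretely, here is a minimal sketch in base R. The two simulated groups are hypothetical stand-ins, not the radiocarbon data from the question, and `area_under` is a helper name made up for this illustration. It shows that each kernel density estimate integrates to roughly 1 regardless of group size, so a small but tightly clustered group can peak higher than a large, spread-out one.

```r
set.seed(1)

# Hypothetical data: a small, tightly clustered group and a large, spread-out one
small <- rnorm(20,  mean = 1000, sd = 100)
large <- rnorm(200, mean = 3000, sd = 300)

# Approximate the area under a kernel density estimate with the trapezoid rule
area_under <- function(x) {
  d <- density(x)
  sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
}

area_small <- area_under(small)  # close to 1
area_large <- area_under(large)  # also close to 1, despite 10x more observations

# Peak heights depend on how concentrated the values are, not on the group size
peak_small <- max(density(small)$y)
peak_large <- max(density(large)$y)
```

In this sketch the 20-observation group has the taller curve, which is the same effect as the blue curve overtaking the purple one in Fig. 3.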

If every group in your data had the same number of observations, then the only difference between the density plot and the histogram would be the scale of the y-axis. When the groups have different numbers of observations, the density plot normalizes for this (each curve has total area 1), while the bars of the histogram are much higher for the group with more observations.
In base R, you can remove this effect from the histogram by setting freq = FALSE in hist(), but I've not seen density plots scaled up to match histograms. It's usually more interesting to ignore the effects of the relative sample sizes.
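If you do want the curves to reflect group sizes, ggplot2's density stat also computes a `count` variable (density multiplied by the number of observations in the group), which can be mapped to the y-axis. The following is a sketch only, assuming ggplot2 3.3.0 or later for `after_stat()`; the `test` data frame here is simulated and merely reuses the column names from the question.

```r
library(ggplot2)

set.seed(1)

# Hypothetical stand-in for the asker's data (same column names, made-up values)
test <- data.frame(
  CalBP   = c(rnorm(20, mean = 1000, sd = 100), rnorm(200, mean = 3000, sd = 300)),
  Culture = rep(c("A", "B"), times = c(20, 200))
)

# Map y to the computed `count` (density * n) so larger groups get taller curves
p <- ggplot(test, aes(x = CalBP, fill = factor(Culture))) +
  geom_density(aes(y = after_stat(count)), alpha = 0.5) +
  scale_x_reverse() +
  labs(x = "Cal. B.P.", y = "Density x number of dates")

# print(p) draws the plot
```

With this mapping the area under each curve is approximately the number of dates in that group, so the relative heights follow the histogram more closely. Whether that is preferable to the default depends on whether sample size is part of what you want to show.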
