How to specify ggplot2 boxplot fill colour for continuous data?

I want to plot a ggplot2 boxplot using all columns of a data.frame. I want to reorder the columns by the median of each column, rotate the x-axis labels, and fill each box with the colour corresponding to that same median. I can't figure out how to do the last part. There are plenty of examples where the fill colour corresponds to a factor variable, but I haven't seen a clear example of using a continuous variable to control fill colour.

The reason I'm trying to do this is that the resulting plot will provide context for a force-directed network graph whose nodes will be colour-coded in the same way as the boxplot, so the colour provides a mapping between the two plots. It would be nice if I could re-use the value-to-colour mapping for later plots so that colours are consistent between plots. For example, the box for a column variable with a high median value should have a colour that denotes this and exactly matches the colour for the same column variable in other plots (such as the corresponding node in the force-directed network graph).

So far, I have something like this:

library(reshape2)
library(ggplot2)

# Melt the data.frame to long format (columns: variable, value):
DT.m <- melt(results, id.vars = NULL)
# I can now make a boxplot for every column in the data.frame:
g <- ggplot(DT.m, aes(x = reorder(variable, value, FUN = median), y = value)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  stat_summary(fun = mean, colour = "darkred", geom = "point") +  # 'fun.y' before ggplot2 3.3.0
  geom_boxplot(???, alpha = 0.5)

The colour fill information is what I'm stuck on. "value" is a continuous variable in the range [0,1] and there are 55 columns in my data.frame. The various approaches I've tried seem to result in the boxes being split vertically down the middle, and I haven't got any further. Any ideas?

You can do this by adding the median-by-group to your data frame and then mapping the new median variable to the fill aesthetic. Here's an example with the built-in mtcars data frame. By using this same mapping across different plots, you should get the same colors:

library(ggplot2)
library(dplyr)

ggplot(mtcars %>% group_by(carb) %>%
         mutate(medMPG = median(mpg)), 
       aes(x = reorder(carb, mpg, FUN=median), y = mpg)) +
  geom_boxplot(aes(fill=medMPG)) +
  stat_summary(fun=mean, colour="darkred", geom="point") +  # 'fun.y' before ggplot2 3.3.0
  scale_fill_gradient(low=hcl(15,100,75), high=hcl(195,100,75))
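
To see what the added column holds, the per-group medians can be listed on their own. This is a small sketch using the same mtcars grouping; the name carb_medians is only illustrative and not part of the answer above.

```r
library(dplyr)

# One row per carb group, with the median mpg that drives the fill colour.
# 'carb_medians' is an illustrative name, not from the original answer.
carb_medians <- mtcars %>%
  group_by(carb) %>%
  summarise(medMPG = median(mpg)) %>%
  arrange(medMPG)

print(carb_medians)
```

Every row within a carb group gets the same medMPG value after the mutate() call, which is why each box is drawn with a single fill colour.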


If you have various data frames with different ranges of medians, you can still use the method above, but to get a consistent mapping of color to median across all your plots, you'll also need to set the same limits for scale_fill_gradient in each plot. In this example, the median of mpg (by carb grouping) varies from 15.0 to 22.8. But let's say that across all my data sets it varies from 13.3 to 39.8. Then I could add this to all my plots:

scale_fill_gradient(limits=c(13.3, 39.8), 
                    low=hcl(15,100,75), high=hcl(195,100,75))

This is just for illustration. For ease of maintenance if your data might change, you'll want to set the actual limits programmatically.
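
One way to set the limits programmatically is sketched below, assuming all data sets are held in a list and share the carb and mpg columns. The list datasets, the helper group_medians, and the scaled copy of mtcars are hypothetical stand-ins for illustration.

```r
library(dplyr)
library(ggplot2)

# Hypothetical collection of data sets; the second is a scaled copy of mtcars
# used only so that the two ranges of medians differ.
datasets <- list(
  a = mtcars,
  b = transform(mtcars, mpg = mpg * 1.5)
)

# Hypothetical helper: per-group medians of mpg for one data frame.
group_medians <- function(df) {
  df %>%
    group_by(carb) %>%
    summarise(med = median(mpg)) %>%
    pull(med)
}

# Common limits spanning the medians of every data set.
fill_limits <- range(unlist(lapply(datasets, group_medians)))

# One scale object that can be added to each plot.
shared_fill <- scale_fill_gradient(limits = fill_limits,
                                   low = hcl(15, 100, 75),
                                   high = hcl(195, 100, 75))
```

Adding shared_fill to each plot in place of a per-plot scale_fill_gradient call should keep the colour-to-median mapping the same everywhere, and the limits follow the data if it changes.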

I built on eipi10's solution and arrived at the following code, which does what I want:

library(reshape2)
library(dplyr)
library(ggplot2)

# "results" is a 55-column data.frame containing
# bootstrapped estimates of the Gini impurity for each column variable
# (But can synthesize fake data for testing with a bunch of rnorms)
DT.m <- melt(results, id.vars = NULL)
g <- ggplot(DT.m %>% group_by(variable) %>%
              mutate(median.gini = median(value)),
            aes(x = reorder(variable, value, FUN = median), y = value)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  geom_boxplot(aes(fill = median.gini)) +
  stat_summary(fun = mean, colour = "darkred", geom = "point") +  # 'fun.y' before ggplot2 3.3.0
  scale_fill_gradientn(colours = heat.colors(9)) +
  ylab("Gini impurity") +
  xlab("Feature") +
  guides(fill = guide_colourbar(title = "Median\nGini\nimpurity"))
plot(g)
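
To try the code above without the original data, a stand-in results data.frame can be synthesized along the lines the comment suggests. This is only a sketch: the number of bootstrap rows, the spread, and the feature_N column names are assumptions, and the values are clipped to [0,1] to match the range described in the question.

```r
set.seed(42)

n_features <- 55   # number of columns, as stated in the question
n_boot     <- 200  # assumed number of bootstrap estimates per column

# Assumed per-column centres, so that the medians differ between columns.
centres <- runif(n_features, min = 0.2, max = 0.8)

# One column of clipped normal draws per centre.
results <- as.data.frame(sapply(centres, function(m) {
  pmin(pmax(rnorm(n_boot, mean = m, sd = 0.05), 0), 1)
}))
names(results) <- paste0("feature_", seq_len(n_features))
```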

Later, for the second plot:

# Per-column medians, binned into 1000 steps along the heat.colors ramp
medians <- sapply(results, median)
color <- colorRampPalette(colors = heat.colors(9))(1000)[
  cut(medians, 1000, labels = FALSE)]

color is then a character vector containing the colours of the nodes in my subsequent network graph, and these colours match those in the boxplot. Job done!
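
The lookup above bins the medians into 1000 steps and relies on colorRampPalette, which interpolates in RGB by default, so the result may differ very slightly from the fills ggplot2 draws. As an alternative sketch, the scales package (which ggplot2 uses for its colour scales) can compute the colours directly. This assumes the boxplot scale keeps its default limits, i.e. the range of the medians; the small results data.frame here is a synthetic stand-in and node_colours is an illustrative name.

```r
library(scales)

set.seed(1)
# Synthetic stand-in for the real 'results' data.frame (assumption).
results <- as.data.frame(replicate(5, runif(100)))

medians <- sapply(results, median)

# Rescale the medians to [0, 1] over their own range, then interpolate
# along the same heat.colors(9) colours passed to scale_fill_gradientn().
pal <- gradient_n_pal(heat.colors(9))
node_colours <- pal(rescale(medians, from = range(medians)))
```

If explicit limits are set on the boxplot's fill scale, the same limits would need to be passed as the from argument of rescale() for the colours to agree.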
