Using geom_boxplot with quantiles

Question

I'd like to use ggplot's geom_boxplot with my own data columns for the quantile segments, instead of those computed by stat_boxplot.

The data, after some transformations, looks like this:

> allquartile                                                      
      T method       s.0%      s.25%      s.50%      s.75%     s.100%                                                                                                    
1     2    LDA -196.76273 -190.38842 -184.01411 -177.63979 -171.26548                                                                                                    
2     3    LDA -171.53987 -166.16923 -160.79859 -115.28652  -69.77446                                                                                                    
3     4    LDA -161.17590 -157.61372 -149.71026 -124.68926  -69.77446                                                                                                    
4     5    LDA -194.10553 -179.83165 -175.14337 -168.46104 -159.07206 

After a lot of searching and digging, I figured out that my plotting command should look like this:

p <- ggplot(allquartile,aes(x=T, ymin=`s.0%`, lower=`s.25%`,
                            middle=`s.50%`, upper=`s.75%`,
                            ymax=`s.100%`, color=method)) + 
     geom_boxplot(stat="identity")

This should use s.0% as the min, s.25% as the lower, and so on. But when I try to display p, I get the following error:

Error in eval(expr, envir, enclos) : object 's.0%' not found                                                                                                             
Calls: print ... lapply -> is.vector -> lapply -> FUN -> eval -> eval

I've also tried using aes_string in place of aes, and I get this error instead:

Error in aes_string(x = T, ymin = `s.0%`, lower = `s.25%`, middle = `s.50%`,  :                                                                                            
object 's.0%' not found 

I'm fairly new to both R and ggplot2, so I'm not really sure how to interpret this, but I'm assuming it's because of the . in s.0%.

I'd greatly appreciate any suggestions on how to get around this.

Edit: I've dug around more and I think this is due to my misunderstanding of the quantile method. I created allquartile with this command:

allquartile <-aggregate(list(s=topicquality$score), list(T=topicquality$T,method=topicquality$method),FUN=quantile,probs=seq(0, 1, .25)) 

And I realize that there are no columns named score.0%, score.25%, etc. There is just the score column with 5 values. So this boils down to: how do I access those 5 values within score?

SOLUTION

I've found the issue with my dataset. As I mentioned in my edit, the columns score.0%, score.25%, etc. didn't exist because of how I formed the data frame. For example, running colnames(allquartile) returned:

[1] "T"      "method" "score"

It turns out that the score column is a vector of values. Running allquartile$score gives me:

            0%       25%       50%       75%       100%
[1,] -196.7627 -190.3884 -184.0141 -177.6398 -171.26548
[2,] -171.5399 -166.1692 -160.7986 -115.2865  -69.77446
[3,] -161.1759 -157.6137 -149.7103 -124.6893  -69.77446
[4,] -194.1055 -179.8316 -175.1434 -168.4610 -159.07206
[5,] -200.1544 -174.2835 -167.7209 -145.3432 -129.54586

I can then access each individual quantile's values by doing

> allquartile$score[,1]
[1] -196.7627 -171.5399 -161.1759 -194.1055 -200.1544

I'm not familiar enough with R to know what kind of data structure this is, but I would call it a matrix. So, like any good matrix object, m[,column] returns the values of the column, m[row,] returns the values of the row, and m[row, column] gets the cell value.

With that in mind, I've realized that the proper plotting command should be
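A minimal sketch of that structure, using made-up data in place of the real topicquality (the values and the single "LDA" method are assumptions for illustration only). It shows that aggregate with a multi-valued FUN stores the quantiles in one matrix column, and how the three indexing forms behave:

```r
# Made-up data shaped like the question's topicquality (values are illustrative)
set.seed(1)
topicquality <- data.frame(score  = runif(40),
                           T      = rep(2:5, each = 10),
                           method = "LDA")

# quantile() returns five values per group, so aggregate packs them into a matrix
allquartile <- aggregate(list(score = topicquality$score),
                         list(T = topicquality$T, method = topicquality$method),
                         FUN = quantile, probs = seq(0, 1, 0.25))

colnames(allquartile)          # only three names: T, method, score
is.matrix(allquartile$score)   # the third "column" is itself a matrix
dim(allquartile$score)         # one row per group, one column per quantile

allquartile$score[, 1]         # the 0% quantile (minimum) of every group
allquartile$score[1, ]         # all five quantiles of the first group
allquartile$score[1, 3]        # the median of the first group
```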

p <- ggplot(allquartile,
            aes(x=T,
                ymin=score[,1],
                lower=score[,2],
                middle=score[,3],
                upper=score[,4], 
                ymax=score[,5], 
                color=method)) + 
     geom_boxplot(stat="identity") 

And this plots everything out perfectly.

Thanks to everyone for the good suggestions. Even though they didn't fix the problem, they helped a lot in figuring things out.

Here is how to solve it. The issue is with your column names. If you type names(allquartile), you will notice that your column names are s.0., s.25., etc. My recommendation would be to avoid all punctuation in column names except for _ or .

library(stringr)
library(ggplot2)
# Strip the dots so the names become s0, s25, ..., s100
names(allquartile) = str_replace_all(names(allquartile), "\\.", '')
# Plot the same data frame that was just renamed
p <- ggplot(allquartile, aes_string(x = "T", ymin = "s0", lower = "s25", 
      middle = "s50", upper = "s75", ymax = "s100", color = "method")) + 
     geom_boxplot(stat = "identity")
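As a rough illustration of where names like s.0. come from, the sketch below applies base R's make.names to the names printed in the question and then strips the dots with gsub instead of stringr. The raw vector is typed in by hand from the question's output, and whether the asker's data frame actually went through this conversion is an assumption.

```r
# Column names as printed in the question (typed in by hand for illustration)
raw <- c("T", "method", "s.0%", "s.25%", "s.50%", "s.75%", "s.100%")

# make.names() replaces characters that are invalid in R names (here "%") with "."
checked <- make.names(raw)
checked

# Base R alternative to str_replace_all: drop every literal dot
clean <- gsub(".", "", checked, fixed = TRUE)
clean
```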

Actually, based on your edits, I think your real problem is that you shouldn't have been using aggregate. If the function you are applying returns multiple values (like quantile), aggregate by default returns the results in the somewhat inconvenient format you observed.

What's happening is this. A data frame, somewhat confusingly, is actually a list, with each column being an element of the list. The only requirement is that each 'column' has the same number of rows. So you're getting a data frame back with three 'columns': the third column is just a matrix!

Doing this with aggregate is possible, but there are more convenient tools out there. (For instance, you could call cbind(allquartile[,1:2],allquartile[,3]) to create a data frame of the 'correct' dimensions.)

For example, a very popular one is ddply from the plyr package. Here's an example using some made-up data, but following the general structure of your data:
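The cbind route mentioned in parentheses can be sketched as follows, on made-up data. The column names q0 through q100 are hypothetical, chosen only so that they can be used in aes() without backticks.

```r
# Made-up data following the general structure of the question
set.seed(1)
topicquality <- data.frame(score  = runif(20),
                           T      = rep(letters[1:2], each = 10),
                           method = rep(letters[3:4], length.out = 20))

allquartile <- aggregate(list(score = topicquality$score),
                         list(T = topicquality$T, method = topicquality$method),
                         FUN = quantile, probs = seq(0, 1, 0.25))
ncol(allquartile)   # T, method, and one matrix column

# Bind the two grouping columns to the matrix so each quantile gets its own column
flat <- cbind(allquartile[, 1:2], allquartile[, 3])

# Hypothetical syntactic names for the five quantile columns
names(flat)[3:7] <- c("q0", "q25", "q50", "q75", "q100")
str(flat)
```

With names like these, the aesthetics could be mapped directly, for example ymin = q0 and ymax = q100.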

library(plyr)

topicquality <- data.frame(score = runif(20),
                            T = rep(letters[1:2],each = 10),
                            method = rep(letters[3:4],length.out = 20))

# ddply's function argument is named .fun; probs is forwarded to quantile()
ddply(topicquality,.(T,method),.fun = function(x,...){quantile(x$score,...)},probs = seq(0,1,0.25))

You'll note that this returns a data frame of the dimensions you expect, but you still have to deal with the inconvenient column names. That's best dealt with in the function you apply to each piece:

# x is the data frame piece for one T/method group
myQuantile <- function(x,...){
    tmp <- quantile(x$score,...)
    names(tmp) <- NULL #Or something else convenient
    tmp
}
ddply(topicquality,.(T,method),.fun = myQuantile,probs = seq(0,1,0.25))
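One possible reading of "something else convenient" is to give the quantiles syntactic names inside the function. The sketch below is an illustration, not part of the original answer: namedQuantile and the q0 to q100 names are hypothetical. The ddply call runs only if plyr is installed.

```r
# Hypothetical helper: quantiles of piece$score under names q0, q25, ..., q100
namedQuantile <- function(piece, probs = seq(0, 1, 0.25)) {
    tmp <- quantile(piece$score, probs = probs)
    names(tmp) <- paste0("q", probs * 100)
    tmp
}

# Made-up data following the general structure of the question
set.seed(1)
topicquality <- data.frame(score  = runif(20),
                           T      = rep(letters[1:2], each = 10),
                           method = rep(letters[3:4], length.out = 20))

# Check the helper on a single T/method group first
one <- namedQuantile(topicquality[topicquality$T == "a" & topicquality$method == "c", ])
names(one)

# Apply it to every group; grouping variables given as a character vector
if (requireNamespace("plyr", quietly = TRUE)) {
    allquartile <- plyr::ddply(topicquality, c("T", "method"), .fun = namedQuantile)
    print(names(allquartile))
}
```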
