
R boxplot over summary

From the (simplified) data below, which represents users choosing between three options, I want to create a set of boxplots of the percentage of times each user chose a value, grouped by value. So I want three boxplots: the percentage of times users chose 0, 1 and 2.

I'm sure I'm missing something obvious, as I often do with R. I can get the percentages using by(dat, dat$user, function(user) {table(user$value)/length(user$value)*100}), but I don't know how to turn that into boxplots.

Hope that makes sense.

user|value
1|2
1|1
1|0
1|2
1|0
2|2
2|2
2|2
2|0
2|2
3|2
3|0
3|1
3|0
3|1
4|2
4|0
4|1
4|0
4|1
5|2
5|0
5|1
5|0
5|1
6|2
6|0
6|0
6|1
6|2
7|0
7|0
7|1
7|0
7|1
8|2
8|2
8|1
8|1
8|2
9|1
9|0
9|0
9|0
9|0
10|1
10|2
10|0
10|2
10|1
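
For anyone following along, here is a minimal sketch of one way to load the pipe-delimited table above into a data frame called dat, the name the rest of the thread uses. The variable txt is just an illustrative name, and only the first two users are reproduced; the full table would be pasted in the same way.

```r
# Only users 1 and 2 from the table above are included here for brevity.
txt <- "user|value
1|2
1|1
1|0
1|2
1|0
2|2
2|2
2|2
2|0
2|2"

# sep = "|" splits on the pipe; header = TRUE uses the first line as column names
dat <- read.table(text = txt, sep = "|", header = TRUE)
str(dat)
```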

I would approach creating the summary using the plyr package. First, you should convert value to a factor, so that when a user never picked some value, that value still shows up with 0%.

dat$value <- factor(dat$value)

Now, write a summary function that takes a data frame (technically this step can be folded into the next one, but this way it's more legible).

p.by.user <- function(df){
  data.frame(prop.table(table(df$value)))
}

Then, apply this function to every subset of dat defined by user.

library(plyr)
dat.summary <- ddply(dat, .(user), p.by.user)

A base graphics boxplot of this data can be drawn like this.

with(dat.summary, boxplot(Freq ~ Var1, ylim = c(0,1)))
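
If plyr is not available, roughly the same summary can be built with base R alone. The sketch below is an alternative to the ddply step, not the method used above; it runs on a small made-up sample in the same shape as the question's data and renames the columns to match what ddply produced (user, Var1, Freq).

```r
# Made-up sample with the same layout as the question's data (3 users, 5 choices each)
dat <- data.frame(
  user  = rep(1:3, each = 5),
  value = c(2, 1, 0, 2, 0,  2, 2, 2, 0, 2,  2, 0, 1, 0, 1)
)
# Fixed levels so a value a user never chose still appears with proportion 0
dat$value <- factor(dat$value, levels = 0:2)

# Cross-tabulate user by value, then divide each row by its total
props <- prop.table(table(dat$user, dat$value), margin = 1)

# Long format: one row per (user, value) pair
dat.summary <- as.data.frame(props)
names(dat.summary)[1:2] <- c("user", "Var1")

boxplot(Freq ~ Var1, data = dat.summary, ylim = c(0, 1))
```

Because each user's proportions sum to 1, tapply(dat.summary$Freq, dat.summary$user, sum) is a quick sanity check on the result.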

If you don't mind my two cents, I'm not sure that boxplots are the right way to go with this kind of data. This isn't very dense data (if your sample is realistic), and boxplots don't capture the dependency between decisions. That is, if some user chose 1 very frequently, then they must have chosen the other values much less frequently.

You could try a filled bar chart for each user, and it wouldn't require any pre-summarization if you use ggplot2. The code would look like this:

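library(ggplot2)
# factor(value) keeps the fill discrete even if value is still numeric
ggplot(dat, aes(factor(user), fill = factor(value))) + geom_bar()
# or, to scale every bar to the range 0 to 1, use instead:
# ggplot(dat, aes(factor(user), fill = factor(value))) + geom_bar(position = "fill")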

Is something like this what you're looking for?

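# simulated data: 10 users, 5 choices each
user <- rep(1:10, each = 5)
value <- sample(0:2, 50, replace = TRUE)
dat <- data.frame(user, value)

# per-user percentages, flattened into one named vector;
# the names have the form user.value, e.g. "1.0", "1.1", ..., "10.2"
percent <- unlist(
    by(dat, dat$user,
        function(user) {
            # fixed levels so a value the user never chose is counted as 0%
            table(factor(user$value, levels = 0:2))/length(user$value)*100
        }
    )
)

# extract the chosen value (the digit after the dot) from the names
value <- gsub("\\d+\\.(\\d)", "\\1", names(percent))

boxplot(percent ~ value)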

