
Visualize test and training set distribution with ggplot2

I am trying to visualize the distribution of a dataset and its split into test and training data, to check whether the split is stratified.

The minimal example uses the iris dataset. It has a Species column, which is a factor with 3 levels. The following code snippet produces a nice plot with the count for each label. However, I would like to see the percentage (probability) of each label within its respective set, so that I can compare the distributions of the training and test sets.

library("tidyverse")
data(iris)
n <- nrow(iris)
# Draw 70% of the row indices at random for the training set
idxTrain <- sample(1:n, size = round(0.7 * n), replace = FALSE)
train <- iris[idxTrain, ]
test <- iris[-idxTrain, ]

# Tag every row with the set it belongs to
iris$Set <- rep("Train", times = nrow(iris))
iris$Set[-idxTrain] <- "Test"

ggplot(iris, aes(x = Species, fill = Set)) + geom_bar(position = "dodge")

[Figure: training and test sample counts for the iris dataset]

I tried calculating the percentage as shown below, but this does not work: it computes each bar's share of the whole data frame, so the result looks just like the counts.

geom_bar(aes(y = (..count..)/sum(..count..)))

How can I efficiently plot the percentage of each label within each set?

Bonus: include the whole dataset alongside the train and test sets.

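library("tidyverse")
data(iris)
n <- nrow(iris)
# Draw 70% of the row indices at random for the training set
idxTrain <- sample(1:n, size = round(0.7 * n), replace = FALSE)
train <- iris[idxTrain, ]
test <- iris[-idxTrain, ]

# Tag every row with the set it belongs to
iris$Set <- rep("Train", times = nrow(iris))
iris$Set[-idxTrain] <- "Test"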

You need a separate data frame for the labels

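# For each species, compute the share of its rows in each set,
# then keep only the Test rows for labelling
df_labs <-
  iris %>%
  group_by(Species) %>%
  count(Set) %>%
  mutate(pct = n / sum(n)) %>%
  filter(Set == "Test")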

that you use as the data for the label geom (or text)

ggplot(iris, aes(x = Species, fill = Set)) + 
  geom_bar(position = "dodge") +
  geom_label(data = df_labs, aes(label = scales::percent(pct), y = n / 2))

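The labels above give, for each species, the share of its rows that landed in the test set. If what you want is the share of each species within each set, one option is to compute the proportions yourself and plot them with geom_col. The following is only a minimal sketch under some assumptions: the names iris_set, iris_all and df_pct and the fixed seed are illustrative and not from the original post, and the "All" rows are simply a stacked copy of the full data to cover the bonus.

```r
library(dplyr)
library(ggplot2)

set.seed(42)  # assumption: fixed seed so the split is reproducible
data(iris)
n <- nrow(iris)
idxTrain <- sample(seq_len(n), size = round(0.7 * n))

# Label each row with its set (illustrative name: iris_set)
iris_set <- iris %>%
  mutate(Set = if_else(row_number() %in% idxTrain, "Train", "Test"))

# Stack a copy of the full data labelled "All" (covers the bonus)
iris_all <- bind_rows(iris_set, mutate(iris, Set = "All"))

# Proportion of each species within each set; sums to 1 per set
df_pct <- iris_all %>%
  count(Set, Species) %>%
  group_by(Set) %>%
  mutate(pct = n / sum(n)) %>%
  ungroup()

p <- ggplot(df_pct, aes(x = Species, y = pct, fill = Set)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::percent) +
  labs(y = "Share of set")
p
```

With a stratified split, the three bars for each species should sit at roughly the same height. In iris every species makes up one third of the full data, so the "All" bars serve as the reference.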

