
Using geom_boxplot yields different result than base boxplot()

I'm using the gapminder dataset to practice some basic data analysis on a data frame. I want to create a subset of this data containing only Argentina and New Zealand, in order to compare their values.

install.packages("gapminder")
library(gapminder)
data("gapminder")

> gapminder
# A tibble: 1,704 x 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ... with 1,694 more rows

I'm subsetting the information I want like so:

df <- subset(gapminder, country =="Argentina" | country == "New Zealand")

> df
# A tibble: 24 x 6
   country   continent  year lifeExp      pop gdpPercap
   <fct>     <fct>     <int>   <dbl>    <int>     <dbl>
 1 Argentina Americas   1952    62.5 17876956     5911.
 2 Argentina Americas   1957    64.4 19610538     6857.
 3 Argentina Americas   1962    65.1 21283783     7133.
 4 Argentina Americas   1967    65.6 22934225     8053.
 5 Argentina Americas   1972    67.1 24779799     9443.
 6 Argentina Americas   1977    68.5 26983828    10079.
 7 Argentina Americas   1982    69.9 29341374     8998.
 8 Argentina Americas   1987    70.8 31620918     9140.
 9 Argentina Americas   1992    71.9 33958947     9308.
10 Argentina Americas   1997    73.3 36203463    10967.
# ... with 14 more rows

This works well, as you can see (or so it seems).

Now I would like to create a simple boxplot to quickly analyze some values, but when I plot this with boxplot() and geom_boxplot I get two different results:

boxplot(lifeExp ~ country, data = df)


This is what I want, but the x axis also includes all the other countries I did not select. Clearly their data is empty, but it makes the plot unreadable.

If I instead use the same data with ggplot, it works perfectly:

library(ggplot2)
ggplot(data = df, mapping = aes(x = country, y = lifeExp)) + geom_boxplot()


Am I doing something wrong when defining the subset? Using boxplot() gives me the impression that the subset keeps every country but sets the values for the ones I don't want to NULL.

Start with the code posted in the question.

library(gapminder)
data("gapminder")

df <- subset(gapminder, country =="Argentina" | country == "New Zealand")
boxplot(lifeExp ~ country, df)

The plot shows space for all countries because country is a factor and subsetting keeps its original levels. With str, you can see what df contains:

str(df)
#tibble [24 × 6] (S3: tbl_df/tbl/data.frame)
# $ country  : Factor w/ 142 levels "Afghanistan",..: 5 5 5 5 5 5 5 5 5 5 ...
# $ continent: Factor w/ 5 levels "Africa","Americas",..: 2 2 2 2 2 2 2 2 2 2 ...
# $ year     : int [1:24] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
# $ lifeExp  : num [1:24] 62.5 64.4 65.1 65.6 67.1 ...
# $ pop      : int [1:24] 17876956 19610538 21283783 22934225 24779799 26983828 29341374 31620918 33958947 36203463 ...
# $ gdpPercap: num [1:24] 5911 6857 7133 8053 9443 ...
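
The same behaviour can be reproduced without gapminder. The following is a minimal sketch on a made-up data frame (the name toy and its values are hypothetical, chosen only for illustration); it shows that subset() filters the rows but leaves the factor's level set untouched.

```r
# Toy data, made up for illustration (not gapminder values)
toy <- data.frame(
  country = factor(c("A", "A", "B", "B", "C", "C")),
  lifeExp = c(60, 62, 70, 72, 80, 82)
)

# Keep only two of the three countries
toy_sub <- subset(toy, country == "A" | country == "B")

nrow(toy_sub)              # only the rows for "A" and "B" remain
nlevels(toy_sub$country)   # the unused level "C" is still counted
levels(toy_sub$country)
table(toy_sub$country)     # "C" is listed with a count of zero
```

Base boxplot() reserves one slot per level, used or not, which is why the empty countries appear on the x axis.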

The factor country still has 142 levels, even though only two of them occur in df.
The solution is to drop the unused levels.

df2 <- df
df2$country <- droplevels(df2$country)
boxplot(lifeExp ~ country, df2)
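
There are a couple of equivalent ways to drop the unused levels. The sketch below uses a made-up data frame (toy is hypothetical, not part of gapminder) and only base R: calling droplevels() on the whole data frame, or rebuilding a single factor with factor(). Passing plot = FALSE to boxplot() returns the computed statistics without drawing, which is a quick way to check which groups would be plotted.

```r
# Toy data, made up for illustration (not gapminder values)
toy <- data.frame(
  country = factor(c("A", "A", "B", "B", "C", "C")),
  lifeExp = c(60, 62, 70, 72, 80, 82)
)

# Option 1: subset, then drop unused levels of every factor column at once
sub1 <- droplevels(subset(toy, country %in% c("A", "B")))

# Option 2: subset, then rebuild just the one factor
sub2 <- subset(toy, country %in% c("A", "B"))
sub2$country <- factor(sub2$country)

levels(sub1$country)
levels(sub2$country)

# Inspect the groups boxplot() would draw, without opening a plot
stats <- boxplot(lifeExp ~ country, data = sub1, plot = FALSE)
stats$names
```

As for why geom_boxplot looked right on the original df: ggplot2's discrete scales drop unused factor levels by default (the drop argument of scale_x_discrete()), so the empty countries never reach the axis.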

