Using geom_boxplot yields different result than base boxplot()
I'm using the gapminder dataset to practice some basic data analysis on a data frame. I want to create a subset of this data with only Argentina and New Zealand, in order to compare their values.
install.packages("gapminder")
library(gapminder)
data("gapminder")
> gapminder
# A tibble: 1,704 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ... with 1,694 more rows
I'm subsetting the information I want like so:
df <- subset(gapminder, country =="Argentina" | country == "New Zealand")
> df
# A tibble: 24 x 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Argentina Americas 1952 62.5 17876956 5911.
2 Argentina Americas 1957 64.4 19610538 6857.
3 Argentina Americas 1962 65.1 21283783 7133.
4 Argentina Americas 1967 65.6 22934225 8053.
5 Argentina Americas 1972 67.1 24779799 9443.
6 Argentina Americas 1977 68.5 26983828 10079.
7 Argentina Americas 1982 69.9 29341374 8998.
8 Argentina Americas 1987 70.8 31620918 9140.
9 Argentina Americas 1992 71.9 33958947 9308.
10 Argentina Americas 1997 73.3 36203463 10967.
# ... with 14 more rows
This works as expected, as you can see (or so it seems).
Now I would like to create a simple boxplot to quickly analyze some values, but when I plot this with boxplot() and with geom_boxplot I get two different results:
boxplot(lifeExp ~ country, data = df)
This is what I want, but the x axis also includes all the other countries I did not select. Clearly their data is empty, but it makes the plot unreadable.
If I instead use the same data with ggplot, it works perfectly:
library(ggplot2)
ggplot(data = df, mapping = aes(x = country, y = lifeExp)) + geom_boxplot()
Am I doing something wrong when defining the subset? Using boxplot() gives me the impression that the subset keeps everything but sets the values for the things I don't want to NULL.
Start with the code posted in the question.
library(gapminder)
data("gapminder")
df <- subset(gapminder, country =="Argentina" | country == "New Zealand")
boxplot(lifeExp ~ country, df)
The plot shows space for all countries because country is a factor and subsetting keeps its original levels. With str, it can be seen what df is:
str(df)
#tibble [24 × 6] (S3: tbl_df/tbl/data.frame)
# $ country : Factor w/ 142 levels "Afghanistan",..: 5 5 5 5 5 5 5 5 5 5 ...
# $ continent: Factor w/ 5 levels "Africa","Americas",..: 2 2 2 2 2 2 2 2 2 2 ...
# $ year : int [1:24] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
# $ lifeExp : num [1:24] 62.5 64.4 65.1 65.6 67.1 ...
# $ pop : int [1:24] 17876956 19610538 21283783 22934225 24779799 26983828 29341374 31620918 33958947 36203463 ...
# $ gdpPercap: num [1:24] 5911 6857 7133 8053 9443 ...
The factor country has 142 levels.
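The same behaviour can be reproduced without gapminder. The sketch below uses a small made-up data frame (the names toy, group and value are illustrative assumptions, not part of the question) to show that subsetting rows leaves a factor's level set untouched:

```r
# Toy data, made up for illustration only (not taken from gapminder)
toy <- data.frame(
  group = factor(c("a", "a", "b", "b", "c", "c")),
  value = c(1, 2, 3, 4, 5, 6)
)

# Keep only two of the three groups
toy_sub <- subset(toy, group == "a" | group == "c")

# The factor still remembers level "b", even though no row uses it
levels(toy_sub$group)
nlevels(toy_sub$group)

# table() reports a zero count for the unused level
table(toy_sub$group)
```

A base boxplot() of value ~ group on toy_sub would therefore reserve an empty slot for "b", just as the gapminder plot reserves a slot for every unselected country.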
The solution is to drop the extra levels.
df2 <- df
df2$country <- droplevels(df2$country)
boxplot(lifeExp ~ country, df2)
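As a possible variation, droplevels() also accepts a whole data frame, and calling factor() on an existing factor rebuilds it with only the levels in use. The sketch below shows both on a made-up data frame (toy, group and value are illustrative names, not from the question):

```r
# Toy data, made up for illustration only (not taken from gapminder)
toy <- data.frame(
  group = factor(c("a", "a", "b", "b", "c", "c")),
  value = c(1, 2, 3, 4, 5, 6)
)

# Option 1: droplevels() on the data frame drops unused levels
# from every factor column in one step
sub1 <- droplevels(subset(toy, group %in% c("a", "c")))

# Option 2: rebuild the factor so it keeps only the levels still present
sub2 <- subset(toy, group %in% c("a", "c"))
sub2$group <- factor(sub2$group)

levels(sub1$group)
levels(sub2$group)
```

With either result, boxplot(value ~ group, data = sub1) should draw only the two remaining groups. Applied to the question's data, the first option would read droplevels(df), which also trims the continent factor.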