How would you create categorical "bins" for a boxplot over time in R?

I've been working on this and haven't been able to find a decent answer.

Basically, I've got a dataset of NBA player height vs. draft year, and I am trying to create a boxplot to show how player height has changed over time (this is for a homework assignment, so a boxplot is necessary). My dataset (nba_data) looks like the table below, but I have 10k rows ranging from players drafted in the 60s all the way to the 2000s.

player_name  draft_year  height_in
player_a     1998        76
player_b     1972        81
player_c     2012        79

So far the closest I've gotten is:

ggplot(data = nba_data, aes(x = draft_year, 
                            y = height_in, 
                            group = cut(x = draft_year, breaks = 5)))  + 
  geom_boxplot()

And this is the result I get. As far as I understand, setting breaks to 5 should separate my years into 5-year buckets, right?

[image: crappy R boxplot]

I created the same graph in Excel to get an idea of what it should look like:

[image: good R graph]

I also attempted to create categories with cut, but was unable to apply them to my boxplot. I mostly code in Python, but have to learn R for a class at school - any help is greatly appreciated.

Thanks!

Edit: Another question, I guess, would be how the "Undrafted" players would fit into this, since R seems to want to coerce the draft_year column to numeric to fit it into a boxplot.

From the ?cut help page, the breaks argument is:

breaks
either a numeric vector of two or more unique cut points or a single number (greater than or equal to 2) giving the number of intervals into which x is to be cut.

You gave it a single number, so that's interpreted as the number of intervals, not their width.

Instead, you should give it a vector of exact breakpoints, something like breaks = seq(1960, 2020, by = 5).
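To make the difference concrete, here is a minimal sketch in base R. The draft_year vector is made-up sample data, not the asker's, and the 1960 to 2020 range and the right = FALSE choice are assumptions for illustration.

```r
# Made-up draft years for illustration only (not the real nba_data)
draft_year <- c(1961, 1972, 1984, 1998, 2003, 2012, 2019)

# A single number: cut() splits the observed range into 5 equal-width
# intervals, each spanning roughly 12 years here, not 5-year buckets
by_count <- cut(draft_year, breaks = 5)
nlevels(by_count)

# A vector of breakpoints: one interval per 5-year span.
# right = FALSE closes each interval on the left, so 1960 itself is included
by_width <- cut(draft_year, breaks = seq(1960, 2020, by = 5), right = FALSE)
nlevels(by_width)
table(by_width)
```

Passing by_width (or the equivalent cut() call) to group = in aes() should then give one box per 5-year span instead of five wide ones.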

I'm surprised you think your axis is being coerced to numeric. It's definitely a continuous axis, but I've never heard of ggplot doing that to a string or factor input. Check your data frame to make sure the "Undrafted" rows are really there; they might have been dropped or converted to NA at some point. A numeric column is a good thing for cut, though, because cut only works on numeric input. I'd suggest cutting the column as numeric to create a bin column, and then replacing the NAs in the bin column with "Undrafted".
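A sketch of that suggestion follows. It assumes draft_year is stored as character, with "Undrafted" for players who were never drafted (the question does not show those rows, so this is a guess). The nba_data frame here is a tiny made-up stand-in, and the bin column name is hypothetical.

```r
# Tiny made-up stand-in for nba_data; the real data has about 10k rows
nba_data <- data.frame(
  player_name = c("player_a", "player_b", "player_c", "player_d"),
  draft_year  = c("1998", "1972", "2012", "Undrafted"),
  height_in   = c(76, 81, 79, 78),
  stringsAsFactors = FALSE
)

# Non-numeric entries such as "Undrafted" become NA (the warning is expected)
year_num <- suppressWarnings(as.numeric(nba_data$draft_year))

# Bin the numeric years into 5-year intervals; NA years stay NA
bin <- cut(year_num, breaks = seq(1960, 2020, by = 5), right = FALSE)

# A factor only accepts known levels, so add "Undrafted" before assigning it
levels(bin) <- c(levels(bin), "Undrafted")
bin[is.na(year_num)] <- "Undrafted"
nba_data$bin <- bin

# One box per bin, drawn only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(nba_data, aes(x = bin, y = height_in)) +
    geom_boxplot()
  print(p)
}
```

Mapping the bin to x rather than to group puts it on a discrete axis, so "Undrafted" gets its own box next to the 5-year bins. That is a design choice in this sketch, not something the answer prescribes.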

If you don't mind using a package, you can get the effect you want with:

library(santoku)

ggplot(..., aes(..., group = chop_width(draft_year, 5)))
