R data frame organisation

I'd like to analyse a sequence of rowing races in R where boats with 4 rowers each race pairwise against each other. I wonder about the best way to represent this in a data frame. I currently have 12 timed events; each pair of events constitutes a race between two boats.

     time race boat seat1 seat2 seat3 seat4
1  204.98    1    1     2     6     1     5
2  202.49    2    1     4     5     2     7
3  202.27    3    1     2     6     3     7
4  206.48    4    1     1     7     2     8
5  204.85    5    1     4     8     2     6
6  204.93    6    1     2     8     3     5
7  204.91    1    2     3     7     4     8
8  207.40    2    2     1     8     3     6
9  207.62    3    2     1     5     4     8
10 203.41    4    2     3     5     4     6
11 205.04    5    2     3     7     1     5
12 204.96    6    2     4     6     1     7

Here the numbers in the seat columns refer to rowers (so there are 8 of them), but it would be more natural to use names or letters. I need to extract a 12x8 matrix that captures which rower participated in which event.

The code below builds the data frame above:

df <- data.frame ( 
                  time = c(204.98, 202.49, 202.27, 206.48, 204.85, 204.93,
                           204.91, 207.40, 207.62, 203.41, 205.04, 204.96),
                  race = append(1:6, 1:6),
                  boat = append(rep(1,6),rep(2,6)),
                  seat1 = c(2,4,2,1,4,2, 3,1,1,3,3,4),
                  seat2 = c(6,5,6,7,8,8, 7,8,5,5,7,6),
                  seat3 = c(1,2,3,2,2,3, 4,3,4,4,1,1),
                  seat4 = c(5,7,7,8,6,5, 8,6,8,6,5,7))

  1. To extract the relation between rowers and events, would it be better to organise this differently?
  2. Would it be natural to capture additional facts about rowers (like their weight or age) in a separate data frame, or is it better (and how?) to keep everything in one data frame?

It seems there is a tradeoff between redundancy and convenience. Whereas in a relational database one would use several relations, it appears the R community prefers to share data in a single data frame. I am sure there is always a way to make it work, but lacking the experience, I'd be curious how experienced R users would organise the data.

Addendum: Lots of answers highlight the importance of the questions. Here is one that would benefit from bringing the data into matrix form: the total time a rower spent in races. Given a vector of event times and the {0,1}-valued matrix connecting events and rowers mentioned before, the result could be obtained by multiplying them.
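As a minimal base R sketch of the computation described in the addendum (the names seats, rowers, participation and total_time are illustrative and not from the original post), the indicator matrix can be built from the seat columns and multiplied with the time vector:

```r
# Rebuild the question's data frame so the sketch runs on its own.
df <- data.frame(
  time  = c(204.98, 202.49, 202.27, 206.48, 204.85, 204.93,
            204.91, 207.40, 207.62, 203.41, 205.04, 204.96),
  race  = rep(1:6, times = 2),
  boat  = rep(c(1, 2), each = 6),
  seat1 = c(2, 4, 2, 1, 4, 2, 3, 1, 1, 3, 3, 4),
  seat2 = c(6, 5, 6, 7, 8, 8, 7, 8, 5, 5, 7, 6),
  seat3 = c(1, 2, 3, 2, 2, 3, 4, 3, 4, 4, 1, 1),
  seat4 = c(5, 7, 7, 8, 6, 5, 8, 6, 8, 6, 5, 7)
)

seats  <- as.matrix(df[, paste0("seat", 1:4)])  # 12 x 4 matrix of rower ids
rowers <- sort(unique(as.vector(seats)))        # the 8 distinct rower ids

# 12 x 8 indicator matrix: entry [i, j] is 1 if rower j took part in event i.
# apply() returns one column per event, hence the transpose.
participation <- t(apply(seats, 1, function(ids) as.integer(rowers %in% ids)))
dimnames(participation) <- list(event = seq_len(nrow(df)), rower = rowers)

# Total time per rower: (1 x 12 time vector) %*% (12 x 8 indicator matrix).
total_time <- drop(df$time %*% participation)
total_time
```

Each row of participation sums to 4 because every boat carries four rowers, which is a quick sanity check on the matrix.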

This is certainly a matter of opinion (totally agree with @MattB). Data frames are a very convenient structure for many statistical analyses, but you often have to transform them to fit your purpose.

Your case shows a data frame in "wide form". I see no convenient way to add more facts about rowers to it, so I would transform it to "long form". In the long form each rower in each event gets their own row, and since the rowers seem to be your "objects of interest" (your cases), that would probably make things easier. The question "which races did rower 4 take part in?" could be answered easily with that form.
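A rough base R sketch of that long form, and of the rower 4 question, might look as follows. The names long and rower4 are illustrative; a reshaping package would do the same job more compactly.

```r
# Rebuild the question's data frame so the sketch runs on its own.
df <- data.frame(
  time  = c(204.98, 202.49, 202.27, 206.48, 204.85, 204.93,
            204.91, 207.40, 207.62, 203.41, 205.04, 204.96),
  race  = rep(1:6, times = 2),
  boat  = rep(c(1, 2), each = 6),
  seat1 = c(2, 4, 2, 1, 4, 2, 3, 1, 1, 3, 3, 4),
  seat2 = c(6, 5, 6, 7, 8, 8, 7, 8, 5, 5, 7, 6),
  seat3 = c(1, 2, 3, 2, 2, 3, 4, 3, 4, 4, 1, 1),
  seat4 = c(5, 7, 7, 8, 6, 5, 8, 6, 8, 6, 5, 7)
)

seat_cols <- paste0("seat", 1:4)

# Long form: one row per (event, seat), i.e. one row per rower appearance.
# unlist() stacks the seat columns one after another, so the event columns
# are repeated once per seat column to stay aligned.
long <- data.frame(
  time  = rep(df$time, times = length(seat_cols)),
  race  = rep(df$race, times = length(seat_cols)),
  boat  = rep(df$boat, times = length(seat_cols)),
  seat  = rep(seat_cols, each = nrow(df)),
  rower = unlist(df[seat_cols], use.names = FALSE)
)

# "Which races did rower 4 take part in?" becomes a simple row filter.
rower4 <- long[long$rower == 4, c("race", "boat", "seat", "time")]
rower4[order(rower4$race), ]
```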

To create a table of events vs. rowers, melt the data into the long form m and then cast it back into the appropriate wide form. There is no reason you can't keep the data in multiple forms, so it is not really necessary to choose a single best form; you can always regenerate them if new data comes in. The form of interest depends on what you want to do with it, but the code below gives you three forms:

  1. the original wide form df,
  2. the long form m, which could be useful for regression, boxplots, etc., e.g.

    lm(time ~ factor(rower) + 0, m)
    boxplot(time ~ boat, m)

  3. the revised wide form df2.

If there are rower-specific attributes, they could be stored in a separate data frame with one row per rower and one column per attribute. Depending on what you want to do, that data frame could then be combined with m using merge, for example if you want to use those attributes in a regression.

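library(data.table)

# long form: one row per (event, seat), with the rower id in column "rower"
m <- melt(as.data.table(df), id.vars = 1:3, value.name = "rower")
# revised wide form: one column per rower, NA where the rower was absent
df2 <- dcast(data = m, time + race + boat ~ rower, value.var = "rower")
setkey(df2, boat, race) # sort by boat, then race
df2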

giving:

      time race boat  1  2  3  4  5  6  7  8
 1: 204.98    1    1  1  2 NA NA  5  6 NA NA
 2: 202.49    2    1 NA  2 NA  4  5 NA  7 NA
 3: 202.27    3    1 NA  2  3 NA NA  6  7 NA
 4: 206.48    4    1  1  2 NA NA NA NA  7  8
 5: 204.85    5    1 NA  2 NA  4 NA  6 NA  8
 6: 204.93    6    1 NA  2  3 NA  5 NA NA  8
 7: 204.91    1    2 NA NA  3  4 NA NA  7  8
 8: 207.40    2    2  1 NA  3 NA NA  6 NA  8
 9: 207.62    3    2  1 NA NA  4  5 NA NA  8
10: 203.41    4    2 NA NA  3  4  5  6 NA NA
11: 205.04    5    2  1 NA  3 NA  5 NA  7 NA
12: 204.96    6    2  1 NA NA  4 NA  6  7 NA

Alternately, with dplyr/tidyr:

library(dplyr)
library(tidyr)

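# long form: one row per (event, seat)
m <- df %>%
  pivot_longer(-(1:3), names_to = "seat", values_to = "rower")
# revised wide form; id_cols is named explicitly because passing it
# by position triggers a deprecation warning in recent tidyr versions
df2 <- m %>%
  pivot_wider(id_cols = 1:3, names_from = rower, values_from = rower, names_sort = TRUE)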
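The separate rower-attribute table suggested in this answer could be sketched in base R roughly as follows. The rowers table, its name and weight columns, and all weight values are made-up placeholders for illustration; m is rebuilt here without any package so the block runs on its own.

```r
# Rebuild the question's data frame so the sketch runs on its own.
df <- data.frame(
  time  = c(204.98, 202.49, 202.27, 206.48, 204.85, 204.93,
            204.91, 207.40, 207.62, 203.41, 205.04, 204.96),
  race  = rep(1:6, times = 2),
  boat  = rep(c(1, 2), each = 6),
  seat1 = c(2, 4, 2, 1, 4, 2, 3, 1, 1, 3, 3, 4),
  seat2 = c(6, 5, 6, 7, 8, 8, 7, 8, 5, 5, 7, 6),
  seat3 = c(1, 2, 3, 2, 2, 3, 4, 3, 4, 4, 1, 1),
  seat4 = c(5, 7, 7, 8, 6, 5, 8, 6, 8, 6, 5, 7)
)
seat_cols <- paste0("seat", 1:4)

# Base R stand-in for the long form m: the event columns repeated once per
# seat column, plus the stacked rower ids.
m <- cbind(df[rep(seq_len(nrow(df)), times = length(seat_cols)), 1:3],
           rower = unlist(df[seat_cols], use.names = FALSE))

# Hypothetical rower table: one row per rower, one column per attribute.
# Names and weights are invented placeholders, not real data.
rowers <- data.frame(
  rower  = 1:8,
  name   = LETTERS[1:8],
  weight = c(72, 75, 78, 80, 74, 77, 79, 76)
)

# Attach the attributes to every rower appearance.
m_ext <- merge(m, rowers, by = "rower")

# Example use: mean crew weight per event, e.g. as a regression covariate.
crew_weight <- aggregate(weight ~ race + boat + time, data = m_ext, FUN = mean)
crew_weight
```

Keeping the attributes in their own table means a weight is stored once per rower rather than once per appearance, and the merged form can be regenerated whenever it is needed.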

This is going to be a matter of opinion and will depend in part on what sort of questions you want to ask of this dataset. For example, the question "which races did rower 4 take part in?" is not easily answered with the format above.

For that reason I would lean towards:

  • A table of races, much like you have, but without the seat* columns;
  • A table of rowers, where additional details (name, weight, etc.) can be kept; and
  • A table linking the two, with one row per rower per race.

This would avoid most redundancy and allow most questions (that I can think of) to be answered relatively straightforwardly. You can always have a function (using, e.g., dcast) to recreate the form you show above for human readability.
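The three-table layout described above might be sketched in base R like this. The table names races, rowers and crew, as well as the name column, are assumptions made for illustration, and base reshape() stands in for dcast to avoid a package dependency.

```r
# Rebuild the question's data frame so the sketch runs on its own.
df <- data.frame(
  time  = c(204.98, 202.49, 202.27, 206.48, 204.85, 204.93,
            204.91, 207.40, 207.62, 203.41, 205.04, 204.96),
  race  = rep(1:6, times = 2),
  boat  = rep(c(1, 2), each = 6),
  seat1 = c(2, 4, 2, 1, 4, 2, 3, 1, 1, 3, 3, 4),
  seat2 = c(6, 5, 6, 7, 8, 8, 7, 8, 5, 5, 7, 6),
  seat3 = c(1, 2, 3, 2, 2, 3, 4, 3, 4, 4, 1, 1),
  seat4 = c(5, 7, 7, 8, 6, 5, 8, 6, 8, 6, 5, 7)
)
seat_cols <- paste0("seat", 1:4)

# 1. Races table: one row per timed event, keyed by (race, boat).
races <- df[, c("race", "boat", "time")]

# 2. Rowers table: one row per rower; the names are made-up placeholders.
rowers <- data.frame(rower = 1:8, name = LETTERS[1:8])

# 3. Link table: one row per rower per event.
crew <- data.frame(
  race  = rep(df$race, times = length(seat_cols)),
  boat  = rep(df$boat, times = length(seat_cols)),
  seat  = rep(seat_cols, each = nrow(df)),
  rower = unlist(df[seat_cols], use.names = FALSE)
)

# Query via joins: the events rower 4 took part in, with name and time.
rower4 <- merge(merge(crew[crew$rower == 4, ], rowers, by = "rower"),
                races, by = c("race", "boat"))

# Recreate a wide, human-readable layout from the link and races tables.
wide <- reshape(crew, idvar = c("race", "boat"), timevar = "seat",
                direction = "wide")
wide <- merge(races, wide, by = c("race", "boat"))
wide
```

In this sketch the seat columns come back as rower.seat1 to rower.seat4, which is simply how reshape() names the widened columns.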

No disagreement that it depends on the questions. But I suspect that in your case a lot can be answered from the long format, which also makes it easy to attach additional rower information if and when needed.

library(dplyr)
library(tidyr)

my_way <- pivot_longer(df, starts_with("seat"), values_to = "rower", names_to = "seat")
my_way
#> # A tibble: 48 x 5
#>     time  race  boat seat  rower
#>    <dbl> <int> <dbl> <chr> <dbl>
#>  1  205.     1     1 seat1     2
#>  2  205.     1     1 seat2     6
#>  3  205.     1     1 seat3     1
#>  4  205.     1     1 seat4     5
#>  5  202.     2     1 seat1     4
#>  6  202.     2     1 seat2     5
#>  7  202.     2     1 seat3     2
#>  8  202.     2     1 seat4     7
#>  9  202.     3     1 seat1     2
#> 10  202.     3     1 seat2     6
#> # … with 38 more rows

my_way %>% group_by(rower) %>% summarise(mean(time))
#> # A tibble: 8 x 2
#>   rower `mean(time)`
#>   <dbl>        <dbl>
#> 1     1         206.
#> 2     2         204.
#> 3     3         205.
#> 4     4         205.
#> 5     5         205.
#> 6     6         205.
#> 7     7         204.
#> 8     8         206.

my_way %>% group_by(rower, seat) %>% summarise()
#> # A tibble: 16 x 2
#> # Groups:   rower [8]
#>    rower seat 
#>    <dbl> <chr>
#>  1     1 seat1
#>  2     1 seat3
#>  3     2 seat1
#>  4     2 seat3
#>  5     3 seat1
#>  6     3 seat3
#>  7     4 seat1
#>  8     4 seat3
#>  9     5 seat2
#> 10     5 seat4
#> 11     6 seat2
#> 12     6 seat4
#> 13     7 seat2
#> 14     7 seat4
#> 15     8 seat2
#> 16     8 seat4
