For all levels of a factor, return all levels of another factor from the same dataframe - using dplyr?
I have a very large dataset containing historical football results. Here is part of it:
Season  home               visitor           FT
1954    Aston Villa        SHW               0-0
1956    Aston Villa        SHW               5-0
1957    Aston Villa        SHW               2-0
1960    Aston Villa        SHW               4-1
1987    Aston Villa        HUL               5-0
1987    Aston Villa        HUD               1-1
1987    Aston Villa        BLB               1-1
1933    Preston North End  NOT               4-0
1958    Preston North End  NOT               3-5
1960    Preston North End  NOT               0-1
1962    Preston North End  SWA               6-3
1976    Walsall            SHW               5-1
1977    Walsall            SHW               1-1
2002    Walsall            Sheffield United  0-1
2002    Walsall            Gillingham        1-0
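For each home team (a factor), I would like to return the unique levels of another factor (Season) that occur for that team. For the example above, it would return: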
Aston Villa - 1954, 1956, 1957, 1960, 1987
Preston North End - 1933, 1958, 1960, 1962
Walsall - 1976, 1977, 2002
I thought I would try to do this in dplyr, but I am doing something wrong.
I tried this:
library(dplyr)
demodf%>%
group_by(home)%>%
summarize(levels(Season))
#Error: expecting a single value
Out of interest, I then tried the following to see whether I could get the first year returned for each home team:
demodf%>%
group_by(home)%>%
summarize(levels(Season)[1])
That gave me this:
# home levels(Season)[1]
#1 Aston Villa 1933
#2 Preston North End 1933
#3 Walsall 1933
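That isn't right: it just returned the first level of the Season factor for the whole data frame (1933), rather than the first year/season for each team separately, which is what I thought group_by would help with.
I'd appreciate any help with this.
The following should let you reproduce the table above: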
demodf<-structure(list(Season = structure(c(2L, 3L, 4L, 6L, 10L, 10L,
10L, 1L, 5L, 6L, 7L, 8L, 9L, 11L, 11L), .Label = c("1933", "1954",
"1956", "1957", "1958", "1960", "1962", "1976", "1977", "1987",
"2002"), class = "factor"), home = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L), .Label = c("Aston Villa",
"Preston North End", "Walsall"), class = "factor"), visitor = structure(c(7L,
7L, 7L, 7L, 4L, 3L, 1L, 5L, 5L, 5L, 8L, 7L, 7L, 6L, 2L), .Label = c("BLB",
"Gillingham", "HUD", "HUL", "NOT", "Sheffield United", "SHW",
"SWA"), class = "factor"), FT = structure(c(1L, 9L, 5L, 8L, 9L,
4L, 4L, 7L, 6L, 2L, 11L, 10L, 4L, 2L, 3L), .Label = c("0-0",
"0-1", "1-0", "1-1", "2-0", "3-5", "4-0", "4-1", "5-0", "5-1",
"6-3"), class = "factor")), .Names = c("Season", "home", "visitor",
"FT"), row.names = c(NA, -15L), class = "data.frame")
In this case, you could use by:
with(demodf, by(Season, home, unique))
# home: Aston Villa
# [1] 1954 1956 1957 1960 1987
# Levels: 1933 1954 1956 1957 1958 1960 1962 1976 1977 1987 2002
# ------------------------------------------------------------
# home: Preston North End
# [1] 1933 1958 1960 1962
# Levels: 1933 1954 1956 1957 1958 1960 1962 1976 1977 1987 2002
# ------------------------------------------------------------
# home: Walsall
# [1] 1976 1977 2002
# Levels: 1933 1954 1956 1957 1958 1960 1962 1976 1977 1987 2002
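If you would rather not carry the unused factor levels around in the result, a base R sketch along the same lines is to split Season by home and convert each piece to character. The data frame demo below is a made-up toy with a few rows in the shape of demodf, not the full dataset.

```r
# Toy data in the shape of demodf (hypothetical rows, for illustration only)
demo <- data.frame(
  Season = factor(c("1954", "1956", "1987", "1987", "1933", "1958")),
  home   = factor(c("Aston Villa", "Aston Villa", "Aston Villa", "Aston Villa",
                    "Preston North End", "Preston North End"))
)

# Split Season by home team, then keep the unique values as plain character,
# so the result no longer drags along the levels of the whole factor
seasons_by_team <- lapply(split(demo$Season, demo$home),
                          function(x) as.character(unique(x)))
seasons_by_team
```

The result is a named list with one character vector per home team.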
The "data.table" package can also handle lists as data.table columns, like this:
library(data.table)
DT <- as.data.table(demodf)
DT[, list(Season = list(unique(Season))), by = home]
# home Season
# 1: Aston Villa 1954,1956,1957,1960,1987
# 2: Preston North End 1933,1958,1960,1962
# 3: Walsall 1976,1977,2002
Note the structure of the result:
str(.Last.value)
# Classes ‘data.table’ and 'data.frame': 3 obs. of 2 variables:
# $ home : Factor w/ 3 levels "Aston Villa",..: 1 2 3
# $ Season:List of 3
# ..$ : Factor w/ 11 levels "1933","1954",..: 2 3 4 6 10
# ..$ : Factor w/ 11 levels "1933","1954",..: 1 5 6 7
# ..$ : Factor w/ 11 levels "1933","1954",..: 8 9 11
# - attr(*, ".internal.selfref")=<externalptr>
Having Season as a factor complicates things a little, but the following will work:
demodf %>% group_by(home) %>% do(data.frame(Seasons = unique(.$Season)))
Note that it is simpler to use unique rather than levels.
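To see why levels gives the same answer for every group, here is a small base R sketch. The vectors season and home are a hypothetical toy example, not the original data. Subsetting a factor keeps its full set of levels, so levels() on one group's values still reports every level of the whole factor.

```r
# Hypothetical toy data, not the original dataset
season <- factor(c("1954", "1956", "1933", "1958"))
home   <- c("Aston Villa", "Aston Villa",
            "Preston North End", "Preston North End")

# The Season values for a single team
villa <- season[home == "Aston Villa"]

levels(villa)                # all four levels, including seasons Villa never appears in
as.character(unique(villa))  # only the values actually present: "1954" "1956"
levels(droplevels(villa))    # same two values, after dropping the unused levels
```

So unique() looks at the values in the group, while levels() only reads the attribute shared by the whole column.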
I used paste to mimic your desired output:
demodf%>%
group_by(home)%>%
summarise( summary = paste(unique(Season),collapse=","))
This gives:
home summary
1 Aston Villa 1954,1956,1957,1960,1987
2 Preston North End 1933,1958,1960,1962
3 Walsall 1976,1977,2002
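For comparison, roughly the same pasted summary can be sketched without dplyr by using aggregate from base R. The data frame demo here is a hypothetical toy in the shape of demodf, used only for illustration.

```r
# Toy data in the shape of demodf (hypothetical rows, for illustration only)
demo <- data.frame(
  Season = factor(c("1954", "1956", "1987", "1987", "1933", "1958")),
  home   = factor(c("Aston Villa", "Aston Villa", "Aston Villa", "Aston Villa",
                    "Preston North End", "Preston North End"))
)

# One row per home team; the unique seasons are collapsed into a single string
res <- aggregate(Season ~ home, data = demo,
                 FUN = function(x) paste(as.character(unique(x)), collapse = ","))
res
```

Here the Season column of res holds one comma-separated string per team rather than a factor.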