In R, how to split/subset a data frame by factors in more than one column?
My data looks like this:
ID Test Type Subject Marks
1 Unit test 1 English 85
2 Unit test 1 English 75
3 Unit test 1 English 78
1 Unit test 2 English 85
2 Unit test 2 English 75
3 Unit test 2 English 78
1 Unit test 1 Maths 78
2 Unit test 1 Maths 79
3 Unit test 1 Maths 98
1 Unit test 2 Maths 95
2 Unit test 2 Maths 98
3 Unit test 2 Maths 88
I want to split the data by "Test Type" and "Subject". Which function should I use? The result I want is:
data frame 1:
ID Test Type Subject Marks
1 Unit test 1 English 85
2 Unit test 1 English 75
3 Unit test 1 English 78
data frame 2:
ID Test Type Subject Marks
1 Unit test 2 English 85
2 Unit test 2 English 75
3 Unit test 2 English 78
data frame 3:
ID Test Type Subject Marks
1 Unit test 1 Maths 78
2 Unit test 1 Maths 79
3 Unit test 1 Maths 98
data frame 4:
ID Test Type Subject Marks
1 Unit test 2 Maths 95
2 Unit test 2 Maths 98
3 Unit test 2 Maths 88
You can use split() (thanks to DrDom for the improvement).
split(df, list(df$Test.Type, df$Subject))
# $`Unit test 1.English`
# ID Test.Type Subject Marks
# 1 1 Unit test 1 English 85
# 2 2 Unit test 1 English 75
# 3 3 Unit test 1 English 78
#
# $`Unit test 2.English`
# ID Test.Type Subject Marks
# 4 1 Unit test 2 English 85
# 5 2 Unit test 2 English 75
# 6 3 Unit test 2 English 78
#
# $`Unit test 1.Maths`
# ID Test.Type Subject Marks
# 7 1 Unit test 1 Maths 78
# 8 2 Unit test 1 Maths 79
# 9 3 Unit test 1 Maths 98
#
# $`Unit test 2.Maths`
# ID Test.Type Subject Marks
# 10 1 Unit test 2 Maths 95
# 11 2 Unit test 2 Maths 98
# 12 3 Unit test 2 Maths 88
where df is the original data:
df <- structure(list(ID = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L,
2L, 3L), Test.Type = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 1L,
1L, 1L, 2L, 2L, 2L), .Label = c("Unit test 1", "Unit test 2"), class = "factor"),
Subject = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L), .Label = c("English", "Maths"), class = "factor"),
Marks = c(85L, 75L, 78L, 85L, 75L, 78L, 78L, 79L, 98L, 95L,
98L, 88L)), .Names = c("ID", "Test.Type", "Subject", "Marks"
), class = "data.frame", row.names = c(NA, -12L))
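As a small follow-up sketch (not part of the original answer), the pieces returned by split() can be pulled out by name, since each list element is named after its factor combination. The variable names pieces, maths2 and group_sizes are made up for illustration, and the data frame is rebuilt with data.frame() so the snippet runs on its own.

```r
# Rebuild the example data (same values as the dput() output above)
df <- data.frame(
  ID        = rep(1:3, times = 4),
  Test.Type = factor(rep(rep(c("Unit test 1", "Unit test 2"), each = 3), times = 2)),
  Subject   = factor(rep(c("English", "Maths"), each = 6)),
  Marks     = c(85L, 75L, 78L, 85L, 75L, 78L, 78L, 79L, 98L, 95L, 98L, 88L)
)

# drop = TRUE discards factor combinations that have no rows
pieces <- split(df, list(df$Test.Type, df$Subject), drop = TRUE)

# Each element is named "<Test.Type>.<Subject>"
names(pieces)

# Extract one piece by name
maths2 <- pieces[["Unit test 2.Maths"]]

# Number of rows in every piece
group_sizes <- sapply(pieces, nrow)
```

With this data every combination has rows, so drop = TRUE changes nothing here; it only matters when some Test.Type/Subject pairs are absent.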
Another simple solution is to use by():
list.df <- by(df, INDICES = list(df$Test.Type, df$Subject), FUN = data.frame)
Result:
> list.df
: Unit test 1
: English
ID Test.Type Subject Marks
1 1 Unit test 1 English 85
2 2 Unit test 1 English 75
3 3 Unit test 1 English 78
--------------------------------------------------------------------------------------------------
: Unit test 2
: English
ID Test.Type Subject Marks
4 1 Unit test 2 English 85
5 2 Unit test 2 English 75
6 3 Unit test 2 English 78
--------------------------------------------------------------------------------------------------
: Unit test 1
: Maths
ID Test.Type Subject Marks
7 1 Unit test 1 Maths 78
8 2 Unit test 1 Maths 79
9 3 Unit test 1 Maths 98
--------------------------------------------------------------------------------------------------
: Unit test 2
: Maths
ID Test.Type Subject Marks
10 1 Unit test 2 Maths 95
11 2 Unit test 2 Maths 98
12 3 Unit test 2 Maths 88
You can then use list.df[[1]] through list.df[[4]] to access each individual data frame.
(The dput() data comes from Richard Scriven's answer.)
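Because by() returns a two-dimensional list array (one cell per Test.Type/Subject pair), a piece can also be looked up by its level names instead of its position. The following is a minimal sketch under that assumption; maths2 and same_piece are illustrative names, and the data is rebuilt so the snippet is self-contained.

```r
# Rebuild the example data (same values as the dput() output above)
df <- data.frame(
  ID        = rep(1:3, times = 4),
  Test.Type = factor(rep(rep(c("Unit test 1", "Unit test 2"), each = 3), times = 2)),
  Subject   = factor(rep(c("English", "Maths"), each = 6)),
  Marks     = c(85L, 75L, 78L, 85L, 75L, 78L, 78L, 79L, 98L, 95L, 98L, 88L)
)

list.df <- by(df, INDICES = list(df$Test.Type, df$Subject), FUN = data.frame)

# Rows index Test.Type levels, columns index Subject levels
dim(list.df)

# Look up one piece by its pair of level names
maths2 <- list.df[["Unit test 2", "Maths"]]

# The same piece by position (cells are stored column by column)
same_piece <- list.df[[4]]
```

Indexing by name is less fragile than list.df[[4]] if the factor levels change or new subjects are added.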
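Here is code that computes the average mark for each test type/subject combination:
# Assumes df has columns named testtype, subject and marks
library(plyr)
ddply(df, .(testtype, subject), summarize, avgmark = round(mean(marks), 0))
Result: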
testtype subject avgmark
1 Unit Test 1 English 79
2 Unit Test 1 Maths 85
3 Unit Test 2 English 79
4 Unit Test 2 Maths 94
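The ddply function computes avgmark for each group and returns the result as a data frame. You can replace the expression that defines avgmark with any aggregate function. You can also add more aggregate columns after avgmark. Check out this article for more information.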
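If plyr is not installed, a similar per-group summary can be sketched in base R with aggregate(). This is a substitute for ddply, not the answer's own method, and it uses the Test.Type, Subject and Marks column names from the dput() data rather than the lowercase names in the ddply call. The names avg and best are illustrative.

```r
# Rebuild the example data (same values as the dput() output above)
df <- data.frame(
  ID        = rep(1:3, times = 4),
  Test.Type = factor(rep(rep(c("Unit test 1", "Unit test 2"), each = 3), times = 2)),
  Subject   = factor(rep(c("English", "Maths"), each = 6)),
  Marks     = c(85L, 75L, 78L, 85L, 75L, 78L, 78L, 79L, 98L, 95L, 98L, 88L)
)

# Rounded mean mark per Test.Type/Subject combination
avg <- aggregate(Marks ~ Test.Type + Subject, data = df,
                 FUN = function(x) round(mean(x), 0))
names(avg)[3] <- "avgmark"

# Swapping in another aggregate function, here the highest mark per group
best <- aggregate(Marks ~ Test.Type + Subject, data = df, FUN = max)
names(best)[3] <- "maxmark"
```

Both calls return one row per combination, so the two results can be combined with merge(avg, best) if a single table is wanted.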