How to use dplyr's summarize and which() to look up min/max values
I have the following data:
Name <- c("Sam", "Sarah", "Jim", "Fred", "James", "Sally", "Andrew", "John", "Mairin", "Kate", "Sasha", "Ray", "Ed")
Age <- c(22,12,31,35,58,82,17,34,12,24,44,67,43)
Group <- c("A", "B", "B", "B", "B", "C", "C", "D", "D", "D", "D", "D", "D")
data <- data.frame(Name, Age, Group)
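I want to use dplyr to (1) group the data by "Group", (2) show the minimum and maximum Age within each group, and (3) show the Name of the person with the minimum and the maximum age.
The following code does this: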
data %>% group_by(Group) %>%
summarize(minAge = min(Age), minAgeName = Name[which(Age == min(Age))],
maxAge = max(Age), maxAgeName = Name[which(Age == max(Age))])
Which works well:
  Group minAge minAgeName maxAge maxAgeName
1     A     22        Sam     22        Sam
2     B     12      Sarah     58      James
3     C     17     Andrew     82      Sally
4     D     12     Mairin     67        Ray
However, I run into a problem when a group has more than one minimum or maximum value:
Name <- c("Sam", "Sarah", "Jim", "Fred", "James", "Sally", "Andrew", "John", "Mairin", "Kate", "Sasha", "Ray", "Ed")
Age <- c(22,31,31,35,58,82,17,34,12,24,44,67,43)
Group <- c("A", "B", "B", "B", "B", "C", "C", "D", "D", "D", "D", "D", "D")
data <- data.frame(Name, Age, Group)
> data %>% group_by(Group) %>%
+ summarize(minAge = min(Age), minAgeName = Name[which(Age == min(Age))],
+ maxAge = max(Age), maxAgeName = Name[which(Age == max(Age))])
Error: expecting a single value
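I'm looking for two solutions:
(1) It doesn't matter which min or max name is shown; just show one (i.e. the first value found). (2) If there are ties, show all of the minimum and maximum values.
If anything is unclear, please let me know, and thanks in advance!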
You can use which.min and which.max to get the first value.
data %>% group_by(Group) %>%
summarize(minAge = min(Age), minAgeName = Name[which.min(Age)],
maxAge = max(Age), maxAgeName = Name[which.max(Age)])
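To see why this avoids the error, here is a minimal base R sketch that compares the two indexing approaches on the tied ages of group B from the second data set. The variable names age, name, all_min and first_min are made up for illustration.

```r
# Ages and names for group B in the second data set, where the minimum is tied
age  <- c(31, 31, 35, 58)
name <- c("Sarah", "Jim", "Fred", "James")

# which() returns every position that equals the minimum,
# so indexing with it can yield more than one name
all_min <- which(age == min(age))

# which.min() returns only the first position of the minimum,
# so indexing with it always yields a single name
first_min <- which.min(age)

print(name[all_min])
print(name[first_min])
```

Because which.min and which.max always return a single index, each summary column gets exactly one value per group.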
To get all values, use e.g. paste with an appropriate collapse argument.
data %>% group_by(Group) %>%
summarize(minAge = min(Age), minAgeName = paste(Name[which(Age == min(Age))], collapse = ", "),
maxAge = max(Age), maxAgeName = paste(Name[which(Age == max(Age))], collapse = ", "))
I would actually recommend keeping your data in a "long" format. Here's how I would approach it:
library(dplyr)
Keeping all values when there are ties:
data %>%
group_by(Group) %>%
arrange(Age) %>% ## optional
filter(Age %in% range(Age))
# Source: local data frame [8 x 3]
# Groups: Group
#
# Name Age Group
# 1 Sam 22 A
# 2 Sarah 31 B
# 3 Jim 31 B
# 4 James 58 B
# 5 Andrew 17 C
# 6 Sally 82 C
# 7 Mairin 12 D
# 8 Ray 67 D
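The filter step relies on range() returning the minimum and maximum as a length-two vector, so the %in% test is TRUE for every row that ties for either extreme. A small base R sketch of that logic on the ages of group B (the names age and keep are illustrative):

```r
# Ages for group B, with a tie for the minimum
age <- c(31, 31, 35, 58)

# range() returns c(min, max)
extremes <- range(age)

# TRUE for every element equal to the minimum or the maximum, ties included
keep <- age %in% extremes

print(extremes)
print(age[keep])
```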
Keeping only one value when there are ties:
data %>%
group_by(Group) %>%
arrange(Age) %>%
slice(if (length(Age) == 1) 1 else c(1, n())) ## maybe overkill?
# Source: local data frame [7 x 3]
# Groups: Group
#
# Name Age Group
# 1 Sam 22 A
# 2 Sarah 31 B
# 3 James 58 B
# 4 Andrew 17 C
# 5 Sally 82 C
# 6 Mairin 12 D
# 7 Ray 67 D
If you really want a "wide" data set, the basic idea is to gather and spread the data using tidyr:
library(dplyr)
library(tidyr)
data %>%
group_by(Group) %>%
arrange(Age) %>%
slice(c(1, n())) %>%
mutate(minmax = c("min", "max")) %>%
gather(var, val, Name:Age) %>%
unite(key, minmax, var) %>%
spread(key, val)
# Source: local data frame [4 x 5]
#
# Group max_Age max_Name min_Age min_Name
# 1 A 22 Sam 22 Sam
# 2 B 58 James 31 Sarah
# 3 C 82 Sally 17 Andrew
# 4 D 67 Ray 12 Mairin
Though it's unclear what wide form you would want when there are ties.
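For readers on a newer tidyr: gather() and spread() have since been superseded by pivot_longer() and pivot_wider(). The following is a minimal sketch of the same wide reshaping, not the answer's original method. It assumes tidyr 1.0.0 or later, and the result name wide is made up for illustration.

```r
library(dplyr)
library(tidyr)

# Second data set from the question (group B has a tie for the minimum age)
data <- data.frame(
  Name  = c("Sam", "Sarah", "Jim", "Fred", "James", "Sally", "Andrew",
            "John", "Mairin", "Kate", "Sasha", "Ray", "Ed"),
  Age   = c(22, 31, 31, 35, 58, 82, 17, 34, 12, 24, 44, 67, 43),
  Group = c("A", "B", "B", "B", "B", "C", "C", "D", "D", "D", "D", "D", "D"),
  stringsAsFactors = FALSE
)

wide <- data %>%
  group_by(Group) %>%
  arrange(Age) %>%
  slice(c(1, n())) %>%                     # first and last row per group
  mutate(minmax = c("min", "max")) %>%     # label the two rows
  ungroup() %>%
  # one row per Group; columns become Name_min, Name_max, Age_min, Age_max
  pivot_wider(names_from = minmax, values_from = c(Name, Age))

print(wide)
```

Since Name and Age are never stacked into a single column here, Age stays numeric in the result.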
Here are some data.table approaches, the first borrowed from @akrun:
library(data.table)
setDT(data)
# show one, wide format
data[,c(min=.SD[which.min(Age)],max=.SD[which.max(Age)]),by=Group]
# Group min.Name min.Age max.Name max.Age
# 1: A Sam 22 Sam 22
# 2: B Sarah 31 James 58
# 3: C Andrew 17 Sally 82
# 4: D Mairin 12 Ray 67
# show all, long format
data[,{
mina=min(Age)
maxa=max(Age)
rbind(
data.table(minmax="min",Age=mina,Name=Name[which(Age==mina)]),
data.table(minmax="max",Age=maxa,Name=Name[which(Age==maxa)])
)},by=Group]
# Group minmax Age Name
# 1: A min 22 Sam
# 2: A max 22 Sam
# 3: B min 31 Sarah
# 4: B min 31 Jim
# 5: B max 58 James
# 6: C min 17 Andrew
# 7: C max 82 Sally
# 8: D min 12 Mairin
# 9: D max 67 Ray
I think the long format is best, since it lets you filter on minmax, but the code is tortured and inefficient.
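As a sketch of what filtering on minmax looks like, the long result can be stored and subset with an ordinary data.table filter. The names dt, long and youngest are made up for illustration, and the long-format step simply repeats the approach above.

```r
library(data.table)

# Second data set from the question, built directly as a data.table
dt <- data.table(
  Name  = c("Sam", "Sarah", "Jim", "Fred", "James", "Sally", "Andrew",
            "John", "Mairin", "Kate", "Sasha", "Ray", "Ed"),
  Age   = c(22, 31, 31, 35, 58, 82, 17, 34, 12, 24, 44, 67, 43),
  Group = c("A", "B", "B", "B", "B", "C", "C", "D", "D", "D", "D", "D", "D")
)

# Long format: one row per (group, min-or-max, name), ties included
long <- dt[, {
  mina <- min(Age)
  maxa <- max(Age)
  rbind(
    data.table(minmax = "min", Age = mina, Name = Name[Age == mina]),
    data.table(minmax = "max", Age = maxa, Name = Name[Age == maxa])
  )
}, by = Group]

# Keep only the youngest people in each group
youngest <- long[minmax == "min"]
print(youngest)
```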
Here are some arguably not-so-good ways:
# show all, wide format (with a list column)
data[,{
mina=min(Age)
maxa=max(Age)
list(
minAge=mina,
maxAge=maxa,
minNames=list(Name[Age==mina]),
maxNames=list(Name[Age==maxa]))
},by=Group]
# Group minAge maxAge minNames maxNames
# 1: A 22 22 Sam Sam
# 2: B 31 58 Sarah,Jim James
# 3: C 17 82 Andrew Sally
# 4: D 12 67 Mairin Ray
# show all, wide format (with a string column)
# (just look at @shadow's answer)