Summing values in R based on column value with dplyr
I have a dataset with the following information:
Subject Value1 Value2 Value3 UniqueNumber
001 1 0 1 3
002 0 1 1 2
003 1 1 1 1
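If the value of UniqueNumber is > 0, I would like to use dplyr to sum, for each subject, the values from the first value column through the column indexed by UniqueNumber, and to calculate their mean. So for Subject 001, sum = 2 and mean = .67.
total <- numeric(nrow(Data))
average <- numeric(nrow(Data))
value_cols <- grep("^Value", names(Data))  # positions of the Value columns
for (i in seq_len(nrow(Data))) {
  n <- Data$UniqueNumber[i]
  if (n > 0) {
    # keep only the first n Value columns of row i
    vals <- unlist(Data[i, value_cols[seq_len(n)], drop = FALSE])
    total[i] <- sum(vals)
    average[i] <- mean(vals)
  }
}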
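Edit: I only want to look across the number of columns listed in the 'UniqueNumber' column. So the loop goes through each row and stops at the column listed in 'UniqueNumber'. Example: row 2 with Subject 002 should sum the values in columns 'Value1' and 'Value2', while row 3 with Subject 003 should only sum the value in column 'Value1'.
Not a tidyverse fan/expert, but I would try this using long format. Then just filter by row index within each group and run whatever functions you want on a single column (much easier this way).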
library(tidyr)
library(dplyr)
Data %>%
  gather(variable, value, -Subject, -UniqueNumber) %>%       # long format
  group_by(Subject) %>%                                      # group by Subject in order to get row counts
  filter(row_number() <= UniqueNumber) %>%                   # filter by row index
  summarise(Mean = mean(value), Total = sum(value)) %>%      # do the calculations
  ungroup()
## A tibble: 3 x 3
# Subject Mean Total
# <int> <dbl> <int>
# 1 1 0.667 2
# 2 2 0.5 1
# 3 3 1 1
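In newer tidyr versions gather() is superseded, so the same idea can be sketched with pivot_longer(). This is only a sketch: it assumes tidyr 1.0 or later and rebuilds the question's data by hand as Data. Note that a Subject whose UniqueNumber is 0 would have all of its rows filtered out and would therefore not appear in the result.

```r
library(dplyr)
library(tidyr)

# The question's data, rebuilt by hand (assumption: integer columns)
Data <- data.frame(
  Subject = 1:3,
  Value1 = c(1L, 0L, 1L),
  Value2 = c(0L, 1L, 1L),
  Value3 = c(1L, 1L, 1L),
  UniqueNumber = c(3L, 2L, 1L)
)

res <- Data %>%
  # long format: one row per Subject/Value column, in column order within each Subject
  pivot_longer(starts_with("Value"), names_to = "variable", values_to = "value") %>%
  group_by(Subject) %>%
  # keep only the first UniqueNumber rows of each Subject
  filter(row_number() <= UniqueNumber) %>%
  summarise(Mean = mean(value), Total = sum(value))

res
```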
A very similar way to achieve this is to filter by the integer in the column names. The filter step comes before the group_by, so it could potentially improve performance (or not?), but it is less robust because I am assuming that the columns of interest are called "Value#".
Data %>%
  gather(variable, value, -Subject, -UniqueNumber) %>%                                   # long format
  filter(as.numeric(gsub("Value", "", variable, fixed = TRUE)) <= UniqueNumber) %>%      # filter
  group_by(Subject) %>%                                                                  # group by Subject
  summarise(Mean = mean(value), Total = sum(value)) %>%                                  # do the calculations
  ungroup()
## A tibble: 3 x 3
# Subject Mean Total
# <int> <dbl> <int>
# 1 1 0.667 2
# 2 2 0.5 1
# 3 3 1 1
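The name-based filter depends on turning each column name into its position. A small base R sketch of that single step, using made-up column names, shows the conversion: the suffix is stripped and converted to an integer so that the comparison with UniqueNumber is numeric rather than textual.

```r
# Hypothetical column names, including a two-digit suffix
vars <- c("Value1", "Value2", "Value10")

# Strip the "Value" prefix and convert what is left to an integer position
idx <- as.integer(sub("^Value", "", vars))
idx

# Which of these columns would be kept for a row with UniqueNumber == 2
keep <- idx <= 2
vars[keep]
```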
Just for fun, adding a data.table solution
library(data.table)
data.table(Data) %>%
  melt(id = c("Subject", "UniqueNumber")) %>%
  .[as.numeric(gsub("Value", "", variable, fixed = TRUE)) <= UniqueNumber,
    .(Mean = round(mean(value), 3), Total = sum(value)),
    by = Subject]
# Subject Mean Total
# 1: 1 0.667 2
# 2: 2 0.500 1
# 3: 3 1.000 1
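Here is another approach that uses tidyr::nest to collect the Values columns into a list-column so that we can iterate over the table with map2. In each row, we select the correct values from the Values list-col and take the sum or the mean respectively.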
library(tidyverse)
tbl <- read_table2(
"Subject Value1 Value2 Value3 UniqueNumber
001 1 0 1 3
002 0 1 1 2
003 1 1 1 1"
)
tbl %>%
  filter(UniqueNumber > 0) %>%
  nest(starts_with("Value"), .key = "Values") %>%
  mutate(
    sum = map2_dbl(UniqueNumber, Values, ~ sum(.y[1:.x], na.rm = TRUE)),
    # na.rm belongs to mean(), not to as.numeric()
    mean = map2_dbl(UniqueNumber, Values, ~ mean(as.numeric(.y[1:.x]), na.rm = TRUE)),
  )
#> # A tibble: 3 x 5
#> Subject UniqueNumber Values sum mean
#> <chr> <dbl> <list> <dbl> <dbl>
#> 1 001 3 <tibble [1 × 3]> 2 0.667
#> 2 002 2 <tibble [1 × 3]> 1 0.5
#> 3 003 1 <tibble [1 × 3]> 1 1
Created on 2019-02-14 by the reprex package (v0.2.1)
Check this solution:
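With dplyr 1.0 or later, a row-wise version of the same idea might look like the sketch below. It assumes rowwise() and c_across() are available and rebuilds the table by hand as tbl; the output column names Total and Mean are my own choice, not part of the answer above.

```r
library(dplyr)

# The question's data, rebuilt by hand
tbl <- tibble(
  Subject = c("001", "002", "003"),
  Value1 = c(1, 0, 1),
  Value2 = c(0, 1, 1),
  Value3 = c(1, 1, 1),
  UniqueNumber = c(3, 2, 1)
)

res <- tbl %>%
  filter(UniqueNumber > 0) %>%
  rowwise() %>%
  mutate(
    # c_across() returns the Value columns of the current row as a vector;
    # seq_len(UniqueNumber) keeps only the first UniqueNumber of them
    Total = sum(c_across(starts_with("Value"))[seq_len(UniqueNumber)]),
    Mean  = mean(c_across(starts_with("Value"))[seq_len(UniqueNumber)])
  ) %>%
  ungroup()

res
```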
df %>%
  gather(key, val, Value1:Value3) %>%
  group_by(Subject) %>%
  mutate(
    Sum = sum(val[c(1:(UniqueNumber[1]))]),
    Mean = mean(val[c(1:(UniqueNumber[1]))]),
  ) %>%
  spread(key, val)
Output:
Subject UniqueNumber Sum Mean Value1 Value2 Value3
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 001 3 2 0.667 1 0 1
2 002 2 1 0.5 0 1 1
3 003 1 1 1 1 1 1
The OP might only be interested in a dplyr solution, but for comparison and for future readers, here is a base R option using mapply
cols <- grep("^Value", names(df))
cbind(df, t(mapply(function(x, y) {
  if (y > 0) {
    vals <- as.numeric(df[x, cols[1:y]])
    c(Sum = sum(vals, na.rm = TRUE), Mean = mean(vals, na.rm = TRUE))
  }
  else
    c(0, 0)
}, 1:nrow(df), df$UniqueNumber)))
# Subject Value1 Value2 Value3 UniqueNumber Sum Mean
#1 1 1 0 1 3 2 0.667
#2 2 0 1 1 2 1 0.500
#3 3 1 1 1 1 1 1.000
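Here, we subset each row based on its respective UniqueNumber and then calculate its sum and mean if the UniqueNumber value is greater than 0, or else just return 0.
A solution using purrr::map_df (from the same authors as dplyr).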
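As a variation on the same row-by-row idea, here is a sketch that uses vapply instead of mapply, so that the shape of each result is checked. It assumes the question's data rebuilt by hand as df, and the 0/0 fallback mirrors the choice made above.

```r
# The question's data, rebuilt by hand
df <- data.frame(
  Subject = 1:3,
  Value1 = c(1, 0, 1),
  Value2 = c(0, 1, 1),
  Value3 = c(1, 1, 1),
  UniqueNumber = c(3, 2, 1)
)

cols <- grep("^Value", names(df))  # positions of the Value columns

stats <- vapply(seq_len(nrow(df)), function(i) {
  n <- df$UniqueNumber[i]
  if (n > 0) {
    # first n Value columns of row i, flattened to a plain vector
    vals <- unlist(df[i, cols[seq_len(n)], drop = FALSE])
    c(Sum = sum(vals, na.rm = TRUE), Mean = mean(vals, na.rm = TRUE))
  } else {
    c(Sum = 0, Mean = 0)
  }
}, FUN.VALUE = c(Sum = 0, Mean = 0))

# stats has one column per row of df, so transpose before binding
result <- cbind(df, t(stats))
result
```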
library(dplyr)
library(purrr)
l_dat <- split(dat, dat$Subject) # first we need to split in a list
map_df(l_dat, function(x) {
  n_cols <- x$UniqueNumber # finds the number of columns
  x <- as.numeric(x[2:(n_cols+1)]) # subsets x and converts to numeric
  mean(x, na.rm=T) # mean to be returned
})
# output:
# # A tibble: 1 x 3
# `1` `2` `3`
# <dbl> <dbl> <dbl>
# 1 0.667 0.5 1
Another option (with an output format closer to the dplyr solution):
map_df(l_dat, function(x) {
  n_cols <- x$UniqueNumber
  id <- x$Subject
  x <- as.numeric(x[2:(n_cols+1)])
  tibble(id=id, mean_values=mean(x, na.rm=T))
})
# # A tibble: 3 x 2
# id mean_values
# <int> <dbl>
# 1 1 0.667
# 2 2 0.5
# 3 3 1
Just as an example, here I use sum() and then divide by length(x)-1:
map_df(l_dat, function(x) {
  n_cols <- x$UniqueNumber
  id <- x$Subject
  x <- as.numeric(x[2:(n_cols+1)])
  tibble(id=id,
         mean_values=sum(x, na.rm=T)/(length(x)-1)) # change here
})
# # A tibble: 3 x 2
# id mean_values
# <int> <dbl>
# 1 1 1.
# 2 2 1.
# 3 3 Inf #beware of this case where you end up dividing by 0
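One way to avoid the Inf case is to guard the division. The helper below is hypothetical (its name and the choice to return NA are mine), and unlike the code above it drops NA values before counting, so treat it as a sketch rather than a drop-in replacement.

```r
# Hypothetical helper: sum(x) / (length(x) - 1), or NA when fewer than two values remain
sum_over_n_minus_1 <- function(x) {
  x <- x[!is.na(x)]                   # drop missing values before counting
  if (length(x) < 2) return(NA_real_) # avoid dividing by zero
  sum(x) / (length(x) - 1)
}

a <- sum_over_n_minus_1(c(1, 0, 1))  # 2 / 2
b <- sum_over_n_minus_1(c(0, 1))     # 1 / 1
c1 <- sum_over_n_minus_1(1)          # NA instead of Inf
```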
Data:
tt <- "Subject Value1 Value2 Value3 UniqueNumber
001 1 0 1 3
002 0 1 1 2
003 1 1 1 1"
dat <- read.table(text=tt, header=T)
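I think the easiest way is to set to NA the values that lie beyond each row's UniqueNumber (they really should be NA), and then use rowSums and rowMeans on the appropriate subset of columns.
Data[2:4][col(Data[2:4]) > Data[[5]]] <- NA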
Data
# Subject Value1 Value2 Value3 UniqueNumber
# 1 1 1 0 1 3
# 2 2 0 1 NA 2
# 3 3 1 NA NA 1
library(dplyr)
Data %>%
  mutate(sum = rowSums(.[2:4], na.rm = TRUE),
         mean = rowMeans(.[2:4], na.rm = TRUE))
# Subject Value1 Value2 Value3 UniqueNumber sum mean
# 1 1 1 0 1 3 2 0.6666667
# 2 2 0 1 NA 2 1 0.5000000
# 3 3 1 NA NA 1 1 1.0000000
Or transform(Data, sum = rowSums(Data[2:4], na.rm = TRUE), mean = rowMeans(Data[2:4], na.rm = TRUE)) to stay in base R.
Data
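If you would rather not overwrite the original columns, the same masking can be done on a copy. The sketch below assumes the question's unmasked data rebuilt by hand as Data, with the Value columns in positions 2 to 4; a row with UniqueNumber equal to 0 would get a sum of 0 and a mean of NaN.

```r
# The question's data, rebuilt by hand (no NAs yet)
Data <- data.frame(
  Subject = 1:3,
  Value1 = c(1, 0, 1),
  Value2 = c(0, 1, 1),
  Value3 = c(1, 1, 1),
  UniqueNumber = c(3, 2, 1)
)

vals <- as.matrix(Data[2:4])  # work on a copy of the Value columns

# col(vals) holds the column index of every cell; UniqueNumber is recycled
# down each column, so cell [i, j] is compared with UniqueNumber[i]
vals[col(vals) > Data$UniqueNumber] <- NA

sums  <- rowSums(vals, na.rm = TRUE)
means <- rowMeans(vals, na.rm = TRUE)
cbind(Data, sum = sums, mean = means)
```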
Data <- structure(
list(Subject = 1:3,
Value1 = c(1L, 0L, 1L),
Value2 = c(0L, 1L, NA),
Value3 = c(1L, NA, NA),
UniqueNumber = c(3L, 2L, 1L)),
.Names = c("Subject","Value1", "Value2", "Value3", "UniqueNumber"),
row.names = c(NA, 3L), class = "data.frame")
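Notice: technical posts on this site follow the CC BY-SA 4.0 license. If you repost, please credit this site's URL or the original address. For any questions, contact yoyou2525@163.com.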