[英]Generate list of dataframes and apply function
我想生成一個數據幀列表,並將相同的功能應用於每個數據幀。 我不知道如何在沒有大量代碼行的情況下優雅地執行此操作。
從數據框df中,
id <- c('a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd', 'e')
x <- rnorm(n = 10, mean = 25, sd = 3)
y <- rnorm(n = 10, mean = 45, sd = 4.5)
z <- rnorm(n = 10, mean = 70000, sd = 10)
type <- c(rep("gold", 2),
rep("silver", 4),
rep("bronze", 4))
df <- data.frame(id, x, y, z, type)
我使用一個基於變量的簡單閾值規則創建了一堆其他數據集
df_25 <- df[df$x < 25,]
df_20 <- df[df$x < 20,]
# and so on
然后,我將函數應用於每個數據集; 我可以分別對每個數據集或數據集列表執行此操作
# individually
df <- df_18 %>%
dplyr::group_by(id) %>%
dplyr::mutate(nb1= sum(x),
nb2 = sum(x != 25))
# to a list
ls1 <- list(df_25, df_20)
func_1 <- function(x) {
x <- x %>%
dplyr::group_by(id) %>%
dplyr::mutate(nb1= sum(x),
nb2 = sum(x != 25))
}
ls1 <- lapply(ls1, function(x) {x[c("id","x")]
<- lapply(x[c("id","x")], func_1)
x})
df_25 <- ls1[[1]]
df_20 <- ls1[[2]]
在任何情況下,這都需要花費很多時間和精力,因為我要處理非常大的數據集。 如何通過上面定義的函數來簡化和固定具有正確可識別名稱的數據集的生成和新變量的創建?
我尚未找到對這個雙重問題的正確答案,歡迎您的幫助!
您可以定義threshold
向量並lapply
聚合。 在基數R中,它可能看起來像這樣:
threshold <- c(22, 24, 26)
res <- setNames(lapply(threshold, function(s) {
sst <- df[df$x < s, ]
merge(sst,
with(sst, aggregate(list(nb1=x, nb2=x != 25),
by=list(id=id), sum), by="id"))
}), threshold)
res
# $`22`
# id x y z type nb1 nb2
# 1 a 20.92786 37.61272 69976.23 gold 20.92786 1
# 2 b 20.64275 38.02056 69997.25 silver 20.64275 1
# 3 c 18.58916 46.08353 69985.98 silver 18.58916 1
#
# $`24`
# id x y z type nb1 nb2
# 1 a 22.73948 44.29524 70002.81 gold 43.66734 2
# 2 a 20.92786 37.61272 69976.23 gold 43.66734 2
# 3 b 20.64275 38.02056 69997.25 silver 20.64275 1
# 4 c 18.58916 46.08353 69985.98 silver 18.58916 1
#
# $`26`
# id x y z type nb1 nb2
# 1 a 22.73948 44.29524 70002.81 gold 43.66734 2
# 2 a 20.92786 37.61272 69976.23 gold 43.66734 2
# 3 b 20.64275 38.02056 69997.25 silver 20.64275 1
# 4 c 18.58916 46.08353 69985.98 silver 44.24036 2
# 5 c 25.65120 44.85778 70008.81 bronze 44.24036 2
# 6 d 24.84056 49.22505 69993.87 bronze 24.84056 1
數據
df <- structure(list(id = structure(c(1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L,
4L, 5L), .Label = c("a", "b", "c", "d", "e"), class = "factor"),
x = c(22.7394803492982, 20.927856140076, 30.2395154764033,
26.6955462205898, 20.6427460111819, 18.589158456851, 25.6511987559726,
24.8405634272769, 28.8534602413068, 26.5376546472448), y = c(44.2952365501829,
37.6127198429065, 45.2842176546081, 40.3835729432985, 38.0205610647157,
46.083525703352, 44.8577760657779, 49.2250487481642, 40.2699166395278,
49.3740993403725), z = c(70002.8091832317, 69976.2314543058,
70000.9974233725, 70011.435897774, 69997.249180665, 69985.9786882474,
70008.8088326676, 69993.8665395223, 69998.7334115052, 70001.2935411788
), type = structure(c(2L, 2L, 3L, 3L, 3L, 3L, 1L, 1L, 1L,
1L), .Label = c("bronze", "gold", "silver"), class = "factor")), class = "data.frame", row.names = c(NA,
-10L))
使用purrr::map
遍歷閾值向量
library(dplyr)
library(purrr)
map(c(18,20,25) %>%set_names() , ~ df %>% filter(x<.x) %>%
group_by(id) %>%
mutate(nb1= sum(x),
nb2 = sum(x != 25)))
或者使用map_if
將計算應用於nrow()>1
df子集。
map_if(c(18,20,25) %>%set_names(), ~df %>% filter(x<.x) %>% nrow()>1,
~df %>% filter(x<.x) %>% group_by(id) %>%
mutate(nb1= sum(x),
nb2 = sum(x != 25)), .else = ~NA)
使用tidyverse
我們可以將所有這些操作組合在一個鏈中。
library(tidyverse)
df %>%
group_split(x > 25, keep = FALSE) %>%
map(. %>% group_by(id) %>% mutate(nb1= sum(x),nb2 = sum(x != 25)))
#[[1]]
# A tibble: 6 x 7
# Groups: id [5]
# id x y z type nb1 nb2
# <fct> <dbl> <dbl> <dbl> <fct> <dbl> <int>
#1 a 21.4 42.9 70001. gold 21.4 1
#2 b 18.0 45.3 70005. silver 18.0 1
#3 c 23.3 42.7 70006. bronze 23.3 1
#4 d 23.4 40.9 69990. bronze 46.7 2
#5 d 23.3 41.2 70000. bronze 46.7 2
#6 e 22.3 55.9 69991. bronze 22.3 1
#[[2]]
# A tibble: 4 x 7
# Groups: id [3]
# id x y z type nb1 nb2
# <fct> <dbl> <dbl> <dbl> <fct> <dbl> <int>
#1 a 25.8 40.5 69995. gold 25.8 1
#2 b 28.3 41.5 69996. silver 54.5 2
#3 b 26.3 49.3 69993. silver 54.5 2
#4 c 26.5 44.5 69986. silver 26.5 1
在這里,我根據x
值將數據分為兩組,第一組的值低於25,第二組的值高於25。您可以根據需要更改邏輯。
這為您提供了數據幀列表作為輸出,您可以單獨訪問。
數據
set.seed(1234)
id <- c('a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd', 'e')
x <- rnorm(n = 10, mean = 25, sd = 3)
y <- rnorm(n = 10, mean = 45, sd = 4.5)
z <- rnorm(n = 10, mean = 70000, sd = 10)
type <- c(rep("gold", 2),rep("silver", 4),rep("bronze", 4))
df <- data.frame(id, x, y, z, type)
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.