
Generate list of dataframes and apply function

I want to generate a list of data frames and apply the same function to each of them. I don't know how to do this elegantly without many lines of code.

Starting from the data frame df,

id <- c('a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd', 'e')
x <- rnorm(n = 10, mean = 25, sd = 3)
y <- rnorm(n = 10, mean = 45, sd = 4.5)
z <- rnorm(n = 10, mean = 70000, sd = 10)
type <- c(rep("gold", 2),
            rep("silver", 4),
            rep("bronze", 4))
df <- data.frame(id, x, y, z, type)

I created a bunch of other datasets using a simple threshold rule on one variable:

df_25 <- df[df$x < 25,]
df_20 <- df[df$x < 20,] 
# and so on

Then I apply a function to each dataset; I can do this either for each dataset individually or for a list of datasets:

# individually
library(dplyr)

df_25 <- df_25 %>%
  dplyr::group_by(id) %>%
  dplyr::mutate(nb1 = sum(x),
                nb2 = sum(x != 25))

# to a list 
ls1 <- list(df_25, df_20)

# add the per-id summaries to one data frame
func_1 <- function(d) {
  d %>%
    dplyr::group_by(id) %>%
    dplyr::mutate(nb1 = sum(x),
                  nb2 = sum(x != 25))
}

# func_1 expects a whole data frame, so apply it to each list element
ls1 <- lapply(ls1, func_1)


df_25 <- ls1[[1]]

df_20 <- ls1[[2]]

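Either way, this takes a lot of time and effort, because I am working with very large datasets. How can I simplify and speed up both the generation of the datasets, with properly identifiable names, and the creation of the new variables through the function defined above?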

I have not found a proper answer to this two-part question yet; any help is welcome!

You could define a threshold vector and lapply over it, doing the aggregation inside. In base R, it could look like this:

threshold <- c(22, 24, 26)

res <- setNames(lapply(threshold, function(s) {
  sst <- df[df$x < s, ]  # subset for this threshold
  # per-id sums, merged back onto the subset by id
  merge(sst, 
        with(sst, aggregate(list(nb1=x, nb2=x != 25), 
                            by=list(id=id), sum)), by="id")
}), threshold)

res
# $`22`
#   id        x        y        z   type      nb1 nb2
# 1  a 20.92786 37.61272 69976.23   gold 20.92786   1
# 2  b 20.64275 38.02056 69997.25 silver 20.64275   1
# 3  c 18.58916 46.08353 69985.98 silver 18.58916   1
# 
# $`24`
#   id        x        y        z   type      nb1 nb2
# 1  a 22.73948 44.29524 70002.81   gold 43.66734   2
# 2  a 20.92786 37.61272 69976.23   gold 43.66734   2
# 3  b 20.64275 38.02056 69997.25 silver 20.64275   1
# 4  c 18.58916 46.08353 69985.98 silver 18.58916   1
# 
# $`26`
#   id        x        y        z   type      nb1 nb2
# 1  a 22.73948 44.29524 70002.81   gold 43.66734   2
# 2  a 20.92786 37.61272 69976.23   gold 43.66734   2
# 3  b 20.64275 38.02056 69997.25 silver 20.64275   1
# 4  c 18.58916 46.08353 69985.98 silver 44.24036   2
# 5  c 25.65120 44.85778 70008.81 bronze 44.24036   2
# 6  d 24.84056 49.22505 69993.87 bronze 24.84056   1
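The question also asks for identifiable names. As a minimal sketch of the same idea, the list elements can be named with a prefix, and ave() can stand in for the merge/aggregate step because it returns one value per row. The toy df, the name res_named, and the "df_" prefix are assumptions made for illustration and are not part of the answer above.

```r
# Toy data (illustrative only; replace with your own df)
df <- data.frame(
  id = c("a", "a", "b", "b", "c"),
  x  = c(21, 23, 19, 27, 25.5)
)

threshold <- c(22, 24, 26)

res_named <- lapply(threshold, function(s) {
  sst <- df[df$x < s, ]
  # ave() repeats the per-id sum on every row of that id, so no merge is needed
  sst$nb1 <- ave(sst$x, sst$id, FUN = sum)
  sst$nb2 <- ave(as.numeric(sst$x != 25), sst$id, FUN = sum)
  sst
})

# Name each element after its threshold, e.g. "df_22"
names(res_named) <- paste0("df_", threshold)
```

Each subset is then available by name, for example res_named$df_24 or res_named[["df_26"]].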

Data

df <- structure(list(id = structure(c(1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L, 
4L, 5L), .Label = c("a", "b", "c", "d", "e"), class = "factor"), 
    x = c(22.7394803492982, 20.927856140076, 30.2395154764033, 
    26.6955462205898, 20.6427460111819, 18.589158456851, 25.6511987559726, 
    24.8405634272769, 28.8534602413068, 26.5376546472448), y = c(44.2952365501829, 
    37.6127198429065, 45.2842176546081, 40.3835729432985, 38.0205610647157, 
    46.083525703352, 44.8577760657779, 49.2250487481642, 40.2699166395278, 
    49.3740993403725), z = c(70002.8091832317, 69976.2314543058, 
    70000.9974233725, 70011.435897774, 69997.249180665, 69985.9786882474, 
    70008.8088326676, 69993.8665395223, 69998.7334115052, 70001.2935411788
    ), type = structure(c(2L, 2L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 
    1L), .Label = c("bronze", "gold", "silver"), class = "factor")), class = "data.frame", row.names = c(NA, 
-10L))

Use purrr::map to iterate over a vector of thresholds:

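library(dplyr)
library(purrr)

# set_names() names each result after its threshold;
# inside filter(), x is the column and .x is the current threshold
map(c(18, 20, 25) %>% set_names(),
    ~ df %>%
      filter(x < .x) %>%
      group_by(id) %>%
      mutate(nb1 = sum(x),
             nb2 = sum(x != 25)))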

Or use map_if to apply the computation only to subsets of df with nrow() > 1.

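# the predicate checks, per threshold, whether the subset has more than one row;
# thresholds that fail the check are replaced by NA
map_if(c(18, 20, 25) %>% set_names(),
       ~ df %>% filter(x < .x) %>% nrow() > 1,
       ~ df %>%
         filter(x < .x) %>%
         group_by(id) %>%
         mutate(nb1 = sum(x),
                nb2 = sum(x != 25)),
       .else = ~ NA)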
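The map_if call filters df twice per threshold. One possible alternative, sketched here under assumptions, is to build each subset once, drop the small ones with purrr::keep, and then apply the summary. The toy df, the thresholds, the "df_" prefix, and the names subsets, res_kept, and stacked are hypothetical; dropping small subsets instead of returning NA is also a deliberate change from the code above.

```r
library(dplyr)
library(purrr)

# Toy data (illustrative only; replace with your own df)
df <- data.frame(
  id = c("a", "a", "b", "b", "c"),
  x  = c(21, 23, 19, 27, 25.5)
)

thresholds <- c(18, 22, 26)

# Build every subset once, then keep only those with more than one row
subsets <- thresholds %>%
  set_names(paste0("df_", thresholds)) %>%
  map(function(s) filter(df, x < s)) %>%
  keep(function(d) nrow(d) > 1)

# Apply the per-id summary to each remaining subset
res_kept <- map(subsets, function(d) {
  d %>%
    group_by(id) %>%
    mutate(nb1 = sum(x), nb2 = sum(x != 25)) %>%
    ungroup()
})

# Optional: stack everything into one data frame, labelled by threshold
stacked <- bind_rows(res_kept, .id = "threshold")
```

Keeping the results in a named list (or one stacked data frame with a threshold column) avoids creating many separate df_* objects in the workspace.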

Using the tidyverse, we can combine all of these operations in a single chain.

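library(tidyverse)

df %>%
  # split into x <= 25 and x > 25; .keep = FALSE drops the helper grouping column
  # (older dplyr versions spelled this argument keep)
  group_split(x > 25, .keep = FALSE) %>%
  map(. %>% group_by(id) %>% mutate(nb1 = sum(x), nb2 = sum(x != 25)))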


#[[1]]
# A tibble: 6 x 7
# Groups:   id [5]
#  id        x     y      z type     nb1   nb2
#  <fct> <dbl> <dbl>  <dbl> <fct>  <dbl> <int>
#1 a      21.4  42.9 70001. gold    21.4     1
#2 b      18.0  45.3 70005. silver  18.0     1
#3 c      23.3  42.7 70006. bronze  23.3     1
#4 d      23.4  40.9 69990. bronze  46.7     2
#5 d      23.3  41.2 70000. bronze  46.7     2
#6 e      22.3  55.9 69991. bronze  22.3     1

#[[2]]
# A tibble: 4 x 7
# Groups:   id [3]
#  id        x     y      z type     nb1   nb2
#  <fct> <dbl> <dbl>  <dbl> <fct>  <dbl> <int>
#1 a      25.8  40.5 69995. gold    25.8     1
#2 b      28.3  41.5 69996. silver  54.5     2
#3 b      26.3  49.3 69993. silver  54.5     2
#4 c      26.5  44.5 69986. silver  26.5     1

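Here I split the data into two groups based on the value of x: the first group holds the rows with x at or below 25, the second the rows with x above 25. You can change the logic to suit your needs.

This gives you a list of data frames as output, which you can access individually.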
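Note that group_split returns an unnamed list, so the pieces can only be accessed by position. If identifiable names matter, one option is base split() with a labelled grouping vector, as in the sketch below. The toy df, the labels "above_25" and "at_or_below_25", and the names pieces and res_split are assumptions chosen for illustration.

```r
library(dplyr)
library(purrr)

# Toy data (illustrative only; replace with your own df)
df <- data.frame(
  id = c("a", "a", "b", "b", "c"),
  x  = c(21, 23, 19, 27, 25.5)
)

# split() names each piece after the label it was grouped by
pieces <- split(df, ifelse(df$x > 25, "above_25", "at_or_below_25"))

# Apply the per-id summary to each named piece
res_split <- map(pieces, function(d) {
  d %>%
    group_by(id) %>%
    mutate(nb1 = sum(x), nb2 = sum(x != 25)) %>%
    ungroup()
})
```

The results can then be reached as res_split$above_25 and res_split$at_or_below_25. Unlike the thresholds in the question (x < 25, x < 20, and so on), these pieces are disjoint rather than nested.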

Data

set.seed(1234)
id <- c('a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd', 'e')
x <- rnorm(n = 10, mean = 25, sd = 3)
y <- rnorm(n = 10, mean = 45, sd = 4.5)
z <- rnorm(n = 10, mean = 70000, sd = 10)
type <- c(rep("gold", 2),rep("silver", 4),rep("bronze", 4))
df <- data.frame(id, x, y, z, type)
