Generate list of dataframes and apply function

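I want to generate a list of data frames and apply the same function to each of them. I don't know how to do this elegantly without a lot of lines of code.

From a data frame df,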

id <- c('a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd', 'e')
x <- rnorm(n = 10, mean = 25, sd = 3)
y <- rnorm(n = 10, mean = 45, sd = 4.5)
z <- rnorm(n = 10, mean = 70000, sd = 10)
type <- c(rep("gold", 2),
            rep("silver", 4),
            rep("bronze", 4))
df <- data.frame(id, x, y, z, type)

I create a bunch of other datasets using a simple threshold rule based on one variable

df_25 <- df[df$x < 25,]
df_20 <- df[df$x < 20,] 
# and so on

Then I apply a function to each dataset; I can do this either for each dataset separately or for a list of datasets

# individually
library(dplyr)  # loads the %>% pipe used below
df_25 <- df_25 %>%
  dplyr::group_by(id) %>%
  dplyr::mutate(nb1 = sum(x),
                nb2 = sum(x != 25))

# to a list
ls1 <- list(df_25, df_20)

# takes one data frame and returns it with the per-id columns added
func_1 <- function(d) {
  d %>%
    dplyr::group_by(id) %>%
    dplyr::mutate(nb1 = sum(x),
                  nb2 = sum(x != 25))
}

# apply func_1 to each data frame in the list (not to its columns)
ls1 <- lapply(ls1, func_1)


df_25 <- ls1[[1]]

df_20 <- ls1[[2]]

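In any case, this takes a lot of time and effort, because I have very large datasets to deal with. How can I simplify and speed up both the generation of datasets with proper, recognisable names and the creation of the new variables through the function defined above?

I have not yet found a proper answer to this two-part question, so your help is welcome!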

You can define a threshold vector and lapply over it, aggregating within each subset. In base R, it might look like this:

threshold <- c(22, 24, 26)

res <- setNames(lapply(threshold, function(s) {
  sst <- df[df$x < s, ]
  # by = "id" belongs to merge(), so the aggregate() call is closed first
  merge(sst, 
        with(sst, aggregate(list(nb1 = x, nb2 = x != 25), 
                            by = list(id = id), sum)),
        by = "id")
}), threshold)

res
# $`22`
#   id        x        y        z   type      nb1 nb2
# 1  a 20.92786 37.61272 69976.23   gold 20.92786   1
# 2  b 20.64275 38.02056 69997.25 silver 20.64275   1
# 3  c 18.58916 46.08353 69985.98 silver 18.58916   1
# 
# $`24`
#   id        x        y        z   type      nb1 nb2
# 1  a 22.73948 44.29524 70002.81   gold 43.66734   2
# 2  a 20.92786 37.61272 69976.23   gold 43.66734   2
# 3  b 20.64275 38.02056 69997.25 silver 20.64275   1
# 4  c 18.58916 46.08353 69985.98 silver 18.58916   1
# 
# $`26`
#   id        x        y        z   type      nb1 nb2
# 1  a 22.73948 44.29524 70002.81   gold 43.66734   2
# 2  a 20.92786 37.61272 69976.23   gold 43.66734   2
# 3  b 20.64275 38.02056 69997.25 silver 20.64275   1
# 4  c 18.58916 46.08353 69985.98 silver 44.24036   2
# 5  c 25.65120 44.85778 70008.81 bronze 44.24036   2
# 6  d 24.84056 49.22505 69993.87 bronze 24.84056   1

Data

df <- structure(list(id = structure(c(1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L, 
4L, 5L), .Label = c("a", "b", "c", "d", "e"), class = "factor"), 
    x = c(22.7394803492982, 20.927856140076, 30.2395154764033, 
    26.6955462205898, 20.6427460111819, 18.589158456851, 25.6511987559726, 
    24.8405634272769, 28.8534602413068, 26.5376546472448), y = c(44.2952365501829, 
    37.6127198429065, 45.2842176546081, 40.3835729432985, 38.0205610647157, 
    46.083525703352, 44.8577760657779, 49.2250487481642, 40.2699166395278, 
    49.3740993403725), z = c(70002.8091832317, 69976.2314543058, 
    70000.9974233725, 70011.435897774, 69997.249180665, 69985.9786882474, 
    70008.8088326676, 69993.8665395223, 69998.7334115052, 70001.2935411788
    ), type = structure(c(2L, 2L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 
    1L), .Label = c("bronze", "gold", "silver"), class = "factor")), class = "data.frame", row.names = c(NA, 
-10L))
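
The question also asks for recognisable names. As a minimal sketch of how the base R approach above could provide them, the list names can carry a prefix and each element can then be pulled out by name. The data frame `toy` and the helper `subset_and_count` are hypothetical names made up for this illustration, and `ave()` is swapped in for the `merge()`/`aggregate()` pair, so this is a variation on the answer rather than the same code.

```r
# Toy data (hypothetical values, chosen so results are easy to check by hand)
toy <- data.frame(
  id = c("a", "a", "b", "b", "c"),
  x  = c(21, 23, 19, 27, 25)
)

# Hypothetical helper: keep rows with x below `s`, then add per-id totals.
# ave() returns a vector as long as its input, so no merge step is needed.
subset_and_count <- function(d, s) {
  sst <- d[d$x < s, , drop = FALSE]
  sst$nb1 <- ave(sst$x, sst$id, FUN = sum)
  sst$nb2 <- ave(as.numeric(sst$x != 25), sst$id, FUN = sum)
  sst
}

threshold <- c(22, 24, 26)

# Name each element after its threshold, with a "df_" prefix
res <- setNames(
  lapply(threshold, function(s) subset_and_count(toy, s)),
  paste0("df_", threshold)
)

names(res)
res[["df_24"]]
```

Keeping the subsets in one named list, rather than as separate `df_22`, `df_24`, ... objects, means any later step can again be a single `lapply()` call.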

Use purrr::map to iterate over the threshold vector

library(dplyr)
library(purrr)
map(c(18, 20, 25) %>% set_names(),
    ~ df %>%
      filter(x < .x) %>%
      group_by(id) %>%
      mutate(nb1 = sum(x),
             nb2 = sum(x != 25)))
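
If one table is easier to work with than a list, the per-threshold results can be stacked with the threshold kept as a column. This is a rough sketch under assumptions: `toy` and the helper `add_counts` are hypothetical names, not part of the original answer, and the `ungroup()` call is an addition so the combined result is not left grouped.

```r
library(dplyr)
library(purrr)

# Toy data (hypothetical values, chosen so results are easy to check by hand)
toy <- data.frame(
  id = c("a", "a", "b", "b", "c"),
  x  = c(21, 23, 19, 27, 25)
)

# Hypothetical helper: filter by a cutoff, then add per-id summaries
add_counts <- function(d, cutoff) {
  d %>%
    filter(x < cutoff) %>%
    group_by(id) %>%
    mutate(nb1 = sum(x),
           nb2 = sum(x != 25)) %>%
    ungroup()
}

# set_names() names each element after its own value ("22", "24", "26")
thresholds <- set_names(c(22, 24, 26))
res <- map(thresholds, ~ add_counts(toy, .x))

# Stack the list; the list names end up in the `threshold` column
combined <- bind_rows(res, .id = "threshold")
combined
```

The list `res` is still available for element-wise access, for example `res[["24"]]`.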

Or use map_if to apply the calculation only to those subsets of df with nrow() > 1; the other thresholds return NA.

map_if(c(18, 20, 25) %>% set_names(),
       ~ df %>% filter(x < .x) %>% nrow() > 1,
       ~ df %>%
         filter(x < .x) %>%
         group_by(id) %>%
         mutate(nb1 = sum(x),
                nb2 = sum(x != 25)),
       .else = ~ NA)
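
The map_if call filters df twice for each threshold, once in the predicate and once in the function. On large data it may be worth filtering once and then discarding the small subsets. A sketch of that variation, assuming the hypothetical toy data below; note that it drops the small subsets from the list instead of returning NA for them, so the result is not identical to the map_if version.

```r
library(dplyr)
library(purrr)

# Toy data (hypothetical values, chosen so results are easy to check by hand)
toy <- data.frame(
  id = c("a", "a", "b", "b", "c"),
  x  = c(21, 23, 19, 27, 25)
)

thresholds <- set_names(c(18, 22, 26))

# Step 1: filter once per threshold
subsets <- map(thresholds, ~ filter(toy, x < .x))

# Step 2: keep only the subsets with more than one row
big_enough <- keep(subsets, ~ nrow(.x) > 1)

# Step 3: add the per-id summaries to the remaining subsets
res <- map(big_enough, ~ .x %>%
             group_by(id) %>%
             mutate(nb1 = sum(x),
                    nb2 = sum(x != 25)))

names(res)
```

Here the threshold 18 matches no rows, so only the "22" and "26" elements remain.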

Using the tidyverse, we can combine all these operations in a single chain.

library(tidyverse)

df %>%
  # `.keep` is the argument name as of dplyr 1.0.0 (earlier versions used `keep`)
  group_split(x > 25, .keep = FALSE) %>%
  map(. %>% group_by(id) %>% mutate(nb1 = sum(x), nb2 = sum(x != 25)))


#[[1]]
# A tibble: 6 x 7
# Groups:   id [5]
#  id        x     y      z type     nb1   nb2
#  <fct> <dbl> <dbl>  <dbl> <fct>  <dbl> <int>
#1 a      21.4  42.9 70001. gold    21.4     1
#2 b      18.0  45.3 70005. silver  18.0     1
#3 c      23.3  42.7 70006. bronze  23.3     1
#4 d      23.4  40.9 69990. bronze  46.7     2
#5 d      23.3  41.2 70000. bronze  46.7     2
#6 e      22.3  55.9 69991. bronze  22.3     1

#[[2]]
# A tibble: 4 x 7
# Groups:   id [3]
#  id        x     y      z type     nb1   nb2
#  <fct> <dbl> <dbl>  <dbl> <fct>  <dbl> <int>
#1 a      25.8  40.5 69995. gold    25.8     1
#2 b      28.3  41.5 69996. silver  54.5     2
#3 b      26.3  49.3 69993. silver  54.5     2
#4 c      26.5  44.5 69986. silver  26.5     1

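Here I split the data into two groups based on the value of x: the first group has values of 25 or below, and the second has values above 25. You can change the logic as needed.

This gives you a list of data frames as output, which you can access individually.

Data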

set.seed(1234)
id <- c('a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd', 'e')
x <- rnorm(n = 10, mean = 25, sd = 3)
y <- rnorm(n = 10, mean = 45, sd = 4.5)
z <- rnorm(n = 10, mean = 70000, sd = 10)
type <- c(rep("gold", 2),rep("silver", 4),rep("bronze", 4))
df <- data.frame(id, x, y, z, type)
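
As one way of changing the logic in the group_split() answer above, the split can use more than two ranges and give each piece a name. The sketch below does this in base R with `cut()` and `split()`; the toy data, the break points and the labels are all made up for illustration. Unlike the cumulative thresholds in the question (`x < 20`, `x < 25`, ...), these ranges do not overlap, so each row lands in exactly one piece.

```r
# Toy data (hypothetical values, chosen so results are easy to check by hand)
toy <- data.frame(
  id = c("a", "a", "b", "b", "c"),
  x  = c(21, 23, 19, 27, 25)
)

# Bin x into three non-overlapping ranges: (-Inf, 20], (20, 25], (25, Inf).
# The labels become the names of the list returned by split().
bins <- cut(toy$x,
            breaks = c(-Inf, 20, 25, Inf),
            labels = c("x_le_20", "x_20_25", "x_gt_25"))
parts <- split(toy, bins)

# Add the per-id summaries to every piece
parts <- lapply(parts, function(d) {
  d$nb1 <- ave(d$x, d$id, FUN = sum)
  d$nb2 <- ave(as.numeric(d$x != 25), d$id, FUN = sum)
  d
})

names(parts)
parts[["x_20_25"]]
```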
