Using the ifelse function inside an lapply/apply function

Question

I am trying to apply a function to a large dataset. Specifically, for each row I want the mean of the lowest 1000 times (df$time) set before the date (df$date) found in that row. Applying this function to a small portion of the data worked.

However, because the dataset is so large, I want to restrict the apply to just the 1% of rows where df$wr is TRUE.

This is the code I have written so far, with mean1000 as the intended name of the new variable and the dataset split by name (25 categories):

df1 <- data.frame(
 mean1000 = lapply(
    split(df, df$name), function(y) 
      df$y$mean1000 = apply(y, 1, function(x) {ifelse(x["wr" == TRUE], 
        mean(sort(df$time[df$date < x["date"]])[2:1000]), NA)})) %>% 
  unlist()
)

Result:

df1 is created, but it is just a table with 0 observations of 1 variable (mean1000).

The following error message appears 25 times:

1. Unknown or uninitialised column `y`.

I mostly followed the guidelines as outlined here, but those solutions are less complex/layered than what I'm trying to do. How can I adjust the code?

Data:

| # | time | date      | id1 | id2 | rank | name  | wr   |
|---|------|-----------|-----|-----|------|-------|------|
| 1 | 2408 | 2022-06-04| a8m2| pr9w| 24   | City01| TRUE |
| 2 | 2503 | 2022-06-25| b6p5| ur1r| 226  | City01| FALSE|
| 3 | 2672 | 2022-05-07| c8k1| py5l| 371  | City01| FALSE|

The desired result is an extra column that holds the calculated mean ( mean(sort(df$time[df$date < x["date"]])[2:1000]) ) when the wr value is TRUE, and NA otherwise.

Answer 1

Consider by (an object-oriented wrapper to tapply), which is very similar to split + lapply but more streamlined. Then run an embedded sapply for the row-wise conditional mean calculations.
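As a rough illustration of that similarity, the sketch below runs the same per-group function through by and through split + lapply on a made-up data frame. The names toy, grp, val and group_mean are hypothetical and do not come from the question.

```r
# Toy data: names and values are made up for illustration only
toy <- data.frame(
  grp = c("a", "a", "b", "b", "b"),
  val = c(1, 3, 10, 20, 30)
)

# Per-group function: receives one sub-data-frame per group
group_mean <- function(sub) mean(sub$val)

# by() splits the data frame and applies the function in one call
res_by <- by(toy, toy$grp, group_mean)

# The split() + lapply() equivalent returns a plain named list
res_split <- lapply(split(toy, toy$grp), group_mean)

# Both hold the same per-group values, in the same group order
print(as.numeric(res_by))
print(unlist(res_split))
```

Both calls visit the groups in the same order, which is why the per-group results can later be flattened back onto a data frame that has been sorted by the grouping column.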

# SORT DATA BY NAME AND DATE (df1 HOLDS THE QUESTION'S DATA)
df1 <- with(df1, df1[order(name, date),]) |> `row.names<-`(NULL)

# CONDITIONALLY CALCULATE MEAN BY GROUP
df1$mean1000 <- by(df1, df1$name, function(sub) {
    # ITERATE THROUGH EVERY DATE ROW
    mean1000 <- sapply(
         sub$date,
         # SUBSET EARLIER TIMES, SORT ASCENDING, AND CALCULATE MEAN
         FUN=\(dt) mean(sort(sub$time[sub$date < dt])[2:1000], na.rm=TRUE)
    )
    # CONDITIONALLY ADJUST BY wr FLAG
    ifelse(sub$wr == TRUE, mean1000, NA_real_)
}) |> unlist(use.names=FALSE)  # FLATTEN GROUP RESULTS INTO ONE COLUMN
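
For context on why the attempt in the question misfired, here is a small sketch of three pieces of standard R behaviour, shown on a made-up two-row data frame (toy, x, index and flagged are hypothetical names). apply hands each row over as a character vector, "wr" == TRUE is evaluated on its own before it is used as an index, and $ looks up a column by its literal name. The wording of the "Unknown or uninitialised column" message suggests df is a tibble, but that is an assumption; the sketch sticks to a base data frame.

```r
# Toy data, made up for illustration; the character column forces apply()
# to work on a character matrix
toy <- data.frame(
  time = c(2408, 2503),
  name = c("City01", "City01"),
  wr   = c(TRUE, FALSE)
)

# 1. apply() converts the data frame to a matrix, so every row arrives
#    as a character vector and wr is no longer logical
row_classes <- apply(toy, 1, class)
print(row_classes)

# 2. "wr" == TRUE compares two constants: TRUE is coerced to the string
#    "TRUE", which differs from "wr", so the index is simply FALSE
index <- "wr" == TRUE
print(index)

# Indexing with FALSE selects nothing, and ifelse() returns a result
# as long as its test, so the result has length zero
x <- c(time = "2408", wr = "TRUE")
empty <- ifelse(x[index], 1, NA)
print(length(empty))

# 3. `$` does not evaluate y as a variable; it looks for a column
#    literally named "y", which does not exist here
y <- "time"
print(is.null(toy$y))

# Iterating over row indices instead keeps the column types intact
flagged <- sapply(seq_len(nrow(toy)), function(i) {
  if (toy$wr[i]) toy$time[i] else NA_real_
})
print(flagged)
```

Working on row indices, or on whole columns as in the answer above, avoids the character coercion altogether.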
