
How to write a function to count the number of observations based on specific conditions in R?

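I have a data frame of 1401 observations of 16 variables. For each column (except the first one), I have 1 (if a condition is met) or 0 (if the condition is not met). Overall, the idea is to calculate how many observations successively meet a specific condition. We can see it as a decision tree: in the first branch you can have 1 (condition met) or 0 (condition not met); in the second branch, starting from the 0 of the first branch, you can again have 1 or 0, and so on. In my data frame, the branches are the columns. I would like to study the impact of looking at the different branches (columns) in various orders. My idea is to count the number of "1" in column Cn given that there was a "0" in column Cn-1.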

dput(droplevels(head(data,20)))
structure(list(Substance = structure(c(13L, 9L, 10L, 12L, 1L, 
19L, 16L, 17L, 5L, 2L, 14L, 7L, 4L, 6L, 20L, 18L, 15L, 3L, 11L, 
8L), .Label = c("104653-34-1", "107-02-8", "111-30-8", "12057-74-8", 
"122454-29-9", "14915-37-8", "20859-73-8", "27083-27-8", "28772-56-7", 
"3691-35-8", "55965-84-9", "56073-07-5", "56073-10-0", "5836-29-3", 
"71751-41-2", "74-90-8", "81-81-2", "86347-14-0", "90035-08-8", 
"91465-08-6"), class = "factor"), colA = c(1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), 
    colB = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L), colC = c(1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L), colD = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L), colE = c(0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 
    1L, 1L), colF = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L), colG = c(0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 
    1L), colH = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), colI = c(0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L
    ), colK = c(1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 
    0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L), colJ = c(0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 
    0L, 0L), colL = c(1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 
    0L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 0L, 0L, 1L), colM = c(NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_), colN = c(1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L), colO = c(1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L)), .Names = c("Substance", 
"Oral", "Dermal", "Inhalation", "SC", "SED", "RS", "SS", "M", 
"C", "R", "STOT.SE", "STOT.RE", "AT", "Eco.Acute", "Eco.Chronic"
), row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 9L, 10L, 12L, 13L, 
14L, 17L, 18L, 19L, 20L, 21L, 22L, 28L, 34L), class = "data.frame")
#I define the order in which I look at the columns
orderA <- colnames(data)[2:16]
#no-yes function counts chemicals which meet condition Cn when condition Cn-1 is not met
count_no_yes <- function(data, cols) {
    data <- data[, cols]
    sum(apply(data, 1, function(x) all(x == 1)))
}
endpoints <- 0:15
#scenario A with order A of the columns
counts <- sapply(1:15, function(i) count_no_yes(data, orderA[1:i]))
counts <- c(nrow(data), counts)
scenarioA <- data.frame(endpoint=endpoints, hits=counts, scenario="scenarioA")

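My problem is that I don't know how to include the information from the previous column in the code. The current version doesn't work. I get the following error: Error in apply(data, 1, function(x) all(x == 1)) : dim(X) must have a positive length.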
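The error most likely comes from `data[, cols]`: when `cols` holds a single column name, R drops the result to a plain vector, and `apply()` needs an object with dimensions. Below is a minimal sketch of one possible fix on a small made-up data frame. `toy`, `count_all_yes`, and `count_no_yes_sketch` are hypothetical names, not part of the original code, and the sketch assumes that "no-yes" means a 0 in the previous column of the chosen order and a 1 in the current column, with NA treated as not matching.

```r
# Toy data (hypothetical): 5 substances, 3 binary condition columns
toy <- data.frame(
  Substance = c("s1", "s2", "s3", "s4", "s5"),
  A = c(1L, 0L, 0L, 1L, 0L),
  B = c(0L, 1L, 1L, 1L, 0L),
  C = c(1L, 0L, 1L, 0L, 1L)
)

# drop = FALSE keeps a one-column selection as a data frame,
# so apply() still sees two dimensions
count_all_yes <- function(data, cols) {
  sub <- data[, cols, drop = FALSE]
  sum(apply(sub, 1, function(x) all(x == 1)), na.rm = TRUE)
}

# Assumed reading of "no-yes": previous column is 0 and current column is 1
count_no_yes_sketch <- function(data, prev_col, cur_col) {
  sum(data[[prev_col]] == 0 & data[[cur_col]] == 1, na.rm = TRUE)
}

order_toy <- c("A", "B", "C")

# Cumulative "all conditions met so far" counts; also works for i = 1
all_yes <- sapply(seq_along(order_toy),
                  function(i) count_all_yes(toy, order_toy[1:i]))

# "0 then 1" counts for each consecutive pair of columns in the order
no_yes <- sapply(2:length(order_toy),
                 function(i) count_no_yes_sketch(toy, order_toy[i - 1], order_toy[i]))
```

On the toy data, `all_yes` is 2, 1, 0 and `no_yes` is 2, 2. Whether the pairwise comparison is the right definition depends on how the branches of the tree are meant to be read, so treat it as a starting point.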

Then, the idea is to plot the number of observations that meet the condition for each branch (column) of the tree.

#scenario B with a different order of the columns
orderB <- colnames(data)[c(9, 10, 11, 5, 6, 8, 3, 2, 4, 13, 12, 7, 14, 15, 16)]
counts <- sapply(1:15, function(i) count_no_yes(data, orderB[1:i]))
counts <- c(nrow(data), counts)
scenarioB <- data.frame(endpoint=endpoints, hits=counts, scenario="scenarioB")
#combine the different scenarios and plot
scenarios <- rbind(scenarioA, scenarioB)
library(ggplot2)
ggplot(scenarios, aes(x=endpoint, y=hits, color=scenario, group=scenario)) + 
  geom_point() +
  geom_line()

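Maybe something like this?

We first tidy the data with tidyr::gather, then group it with dplyr::group_by(par), and then count, per column, how many times a 0 is followed by a 1 (more precisely, how many runs of 1s sit between two runs of 0s).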

library(dplyr)

my.fun <- function(x) {
  # Run-length encode the column once
  r <- rle(x)

  # v = value of each run, l = length of each run
  tmp <- data.frame(v = r$values, l = r$lengths)
  tmp <-
    tmp %>% 
    # for each column, flag a run of 1s
    # which comes right after a run of 0s
    # and is itself followed by a run of 0s
    mutate(flag = ifelse(v == 1 & lag(v) == 0 & lead(v) == 0, 1, 0))

  # return the sum of the `flag` values
  sum(tmp$flag, na.rm = TRUE)
}

df %>% 
  tidyr::gather("par", "value", everything(), -Substance) %>% 
  group_by(par) %>% 
  summarise(c = my.fun(value))


# A tibble: 15 x 2
   par             c
   <chr>       <dbl>
 1 AT              0
 2 C               0
 3 Dermal          0
 4 Eco.Acute       1
 5 Eco.Chronic     0
 6 Inhalation      0
 7 M               0
 8 Oral            0
 9 R               4
10 RS              1
11 SC              2
12 SED             1
13 SS              0
14 STOT.RE         4
15 STOT.SE         3

The rle function is the real workhorse here for analyzing consecutive runs in a vector. my.fun can be adapted to your actual needs.
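To make the role of `rle` concrete, here is a small base R sketch on a made-up vector. `count_zero_to_one` is a hypothetical variant, not the function from the answer: it only requires a run of 1s to be preceded by a run of 0s and does not need a trailing 0, which may or may not match your definition.

```r
# Made-up column of 0/1 flags
x <- c(0L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 1L)

# rle() collapses the vector into consecutive runs
r <- rle(x)
r$values   # value of each run:  0 1 0 1 0 1
r$lengths  # length of each run: 2 2 1 1 2 1

# Hypothetical variant: count runs of 1s that directly follow a run of 0s
count_zero_to_one <- function(x) {
  v <- rle(x)$values
  if (length(v) < 2) return(0L)
  prev <- v[-length(v)]  # value of the preceding run
  cur  <- v[-1]          # value of the current run
  sum(prev == 0 & cur == 1, na.rm = TRUE)
}

n_runs <- count_zero_to_one(x)
```

Here `n_runs` is 3, whereas `my.fun(x)` would return 2, because the last run of 1s has no run of 0s after it.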
