How to write a function to count the number of observations based on specific conditions in R?
I have a data frame of 1401 observations of 16 variables. For each column (except the first one), I have either 1 (if a condition is met) or 0 (if a condition is not met). Overall, the idea is to count how many observations meet certain conditions successively.
We can think about it as a decision tree: in the first branch you can have either 1 (condition is met) or 0 (condition is not met); in the second branch, starting from the 0 of the first branch, you can again have 1 or 0, and so on. In my data frame, branches are columns. I want to investigate the impact of looking at the different branches (columns) in various orders.
My idea is to count the number of "1" in column Cn if I know that there was a "0" in column Cn-1.
dput(droplevels(head(data,20)))
structure(list(Substance = structure(c(13L, 9L, 10L, 12L, 1L,
19L, 16L, 17L, 5L, 2L, 14L, 7L, 4L, 6L, 20L, 18L, 15L, 3L, 11L,
8L), .Label = c("104653-34-1", "107-02-8", "111-30-8", "12057-74-8",
"122454-29-9", "14915-37-8", "20859-73-8", "27083-27-8", "28772-56-7",
"3691-35-8", "55965-84-9", "56073-07-5", "56073-10-0", "5836-29-3",
"71751-41-2", "74-90-8", "81-81-2", "86347-14-0", "90035-08-8",
"91465-08-6"), class = "factor"), colA = c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
colB = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L), colC = c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), colD = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L), colE = c(0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L,
1L, 1L), colF = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L), colG = c(0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L,
1L), colH = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), colI = c(0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L
), colK = c(1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 0L,
0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L), colJ = c(0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 1L,
0L, 0L), colL = c(1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L,
0L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 0L, 0L, 1L), colM = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_), colN = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L), colO = c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L)), .Names = c("Substance",
"Oral", "Dermal", "Inhalation", "SC", "SED", "RS", "SS", "M",
"C", "R", "STOT.SE", "STOT.RE", "AT", "Eco.Acute", "Eco.Chronic"
), row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 9L, 10L, 12L, 13L,
14L, 17L, 18L, 19L, 20L, 21L, 22L, 28L, 34L), class = "data.frame")
# I define the order in which I look at the columns
orderA <- colnames(data)[2:16]
# no-yes function: counts chemicals which meet condition Cn when condition Cn-1 is not met
count_no_yes <- function(data, cols) {
  data <- data[, cols]
  sum(apply(data, 1, function(x) all(x == 1)))
}
endpoints <- 0:15
#scenario A with order A of the columns
counts <- sapply(1:15, function(i) count_no_yes(data, orderA[1:i]))
counts <- c(nrow(data), counts)
scenarioA <- data.frame(endpoint=endpoints, hits=counts, scenario="scenarioA")
My problem is that I don't know how to include the information from the previous observation in my code. The current code is not working. I get the following error:
Error in apply(data, 1, function(x) all(x == 1)) : dim(X) must have a positive length
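The error most likely comes from `data[, cols]`: when `cols` names a single column (as in the first iteration, `orderA[1:1]`), the subset is dropped to a plain vector, which has no `dim`, so `apply()` fails. Below is a minimal sketch of one possible fix. It assumes the intended count is "rows where the last listed column is 1 and every earlier listed column is 0", which is only one reading of the decision-tree description. The names `count_no_yes_sketch` and `toy` are hypothetical and not part of the original code.

```r
# Hypothetical helper: counts rows where the last column in `cols` is 1
# and every earlier column in `cols` is 0 (an assumed reading of the question).
count_no_yes_sketch <- function(data, cols) {
  # drop = FALSE keeps a data frame even when only one column is selected
  sub <- data[, cols, drop = FALSE]
  n <- ncol(sub)
  # NA is treated as "condition not met"
  last_is_one <- !is.na(sub[[n]]) & sub[[n]] == 1
  if (n == 1) return(sum(last_is_one))
  prev <- sub[, -n, drop = FALSE]
  # TRUE for rows in which all earlier columns are exactly 0
  prev_all_zero <- rowSums(prev == 0, na.rm = TRUE) == ncol(prev)
  sum(last_is_one & prev_all_zero)
}

# Toy data, made up purely for illustration
toy <- data.frame(id = c("a", "b", "c", "d"),
                  c1 = c(1L, 0L, 0L, 0L),
                  c2 = c(0L, 1L, 0L, 1L),
                  c3 = c(1L, 0L, 1L, NA))

res <- sapply(1:3, function(i) count_no_yes_sketch(toy, c("c1", "c2", "c3")[1:i]))
print(res)
```

If the intended condition is only "Cn is 1 and Cn-1 is 0", regardless of earlier columns, `prev` could be restricted to the single preceding column instead.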
The idea is then to plot the number of observations that meet the conditions for each branch of the tree (column).
#scenario B with a different order of the columns
orderB <- colnames(data)[c(9, 10, 11, 5, 6, 8, 3, 2, 4, 13, 12, 7, 14, 15, 16)]
counts <- sapply(1:15, function(i) count_no_yes(data, orderB[1:i]))
counts <- c(nrow(data), counts)
scenarioB <- data.frame(endpoint=endpoints, hits=counts, scenario="scenarioB")
#combine the different scenarios and plot
scenarios <- rbind(scenarioA, scenarioB)
library(ggplot2)
ggplot(scenarios, aes(x = endpoint, y = hits, color = scenario, group = scenario)) +
  geom_point() +
  geom_line()
Could it be this?
We tidy the data with tidyr::gather, then dplyr::group_by(par), and count the number of times a 0 is followed by a 1.
library(dplyr)

my.fun <- function(x) {
  # Run-length encode x once: values and lengths of the consecutive runs
  r <- rle(x)
  tmp <- data.frame(v = r$values, l = r$lengths)
  tmp <- tmp %>%
    # for each column, flag a run of 1s which comes right after a run of 0s
    # and which is itself followed by a run of 0s
    mutate(flag = ifelse(v == 1 & lag(v) == 0 & lead(v) == 0, 1, 0))
  # return the sum of the `flag` values
  sum(tmp$flag, na.rm = TRUE)
}

df %>%
  tidyr::gather("par", "value", everything(), -Substance) %>%
  group_by(par) %>%
  summarise(c = my.fun(value))
# A tibble: 15 x 2
par c
<chr> <dbl>
1 AT 0
2 C 0
3 Dermal 0
4 Eco.Acute 1
5 Eco.Chronic 0
6 Inhalation 0
7 M 0
8 Oral 0
9 R 4
10 RS 1
11 SC 2
12 SED 1
13 SS 0
14 STOT.RE 4
15 STOT.SE 3
The rle function is a real gem for analyzing consecutiveness in a vector. my.fun can probably be adjusted to your exact needs.
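As a small illustration of what rle returns, here is a base-R sketch on a toy vector (the vector `x` and the name `n_zero_to_one` are made up for this example). It counts every place where a run of 0s is directly followed by a run of 1s. Unlike my.fun, it does not also require the run of 1s to be followed by a 0.

```r
# Toy vector, made up for illustration
x <- c(0L, 0L, 1L, 1L, 0L, 1L)

r <- rle(x)
print(r$values)   # value of each consecutive run
print(r$lengths)  # length of each consecutive run

# Compare each run's value with the value of the run just before it:
# a run of 1s whose predecessor is a run of 0s is one 0 -> 1 transition.
n_zero_to_one <- sum(r$values[-1] == 1 & head(r$values, -1) == 0, na.rm = TRUE)
print(n_zero_to_one)
```

Which variant is appropriate depends on whether a trailing run of 1s at the end of a column should count.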