Count number of columns by row that exceed a value in dataframe
I am working with a big dataframe in R, and I need to compute, for each row, the number of columns that exceed a limit saved in another column of the same dataframe. My dataframe Base looks like this (I add the dput() version at the end):
ID NT1 NT2 NT3 NT4 NT5 NT6 Limit1 Limit2
1 001 1 1 1 NA NA NA 2 3
2 002 2 1 5 4 NA NA 2 3
3 003 3 NA 1 NA 1 NA 2 3
4 004 3 NA 3 NA 8 NA 2 3
5 005 4 5 1 NA NA NA 4 5
6 006 9 9 9 NA NA 8 8 9
7 007 1 3 5 9 NA NA 5 4
8 008 NA NA 6 7 9 8 6 5
9 009 1 1 NA NA NA NA 1 2
10 010 3 4 5 5 5 5 2 2
I need to count, for each row, the columns whose names start with NT and whose values exceed the column named Limit1. This value has to be saved in another column. The same goes for Limit2: I have to count the columns that start with NT and exceed the value of Limit2, and the result also has to be saved in a new column. I have tried the following code, but it doesn't work:
Base$Count1=apply(Base[c(2:7,8)],1,function(x) length(which(x>Base[8] & !is.na(x))))
Moreover, and this is the important point, Base is a sample of a big dataframe with 200000 rows and 60 columns. For this reason my apply tests either don't finish or give an error. I would like to get a result like this:
ID NT1 NT2 NT3 NT4 NT5 NT6 Limit1 Limit2 Count1 Count2
1 001 1 1 1 NA NA NA 2 3 0 0
2 002 2 1 5 4 NA NA 2 3 2 2
3 003 3 NA 1 NA 1 NA 2 3 1 0
4 004 3 NA 3 NA 8 NA 2 3 3 1
5 005 4 5 1 NA NA NA 4 5 1 0
6 006 9 9 9 NA NA 8 8 9 3 0
7 007 1 3 5 9 NA NA 5 4 1 2
8 008 NA NA 6 7 9 8 6 5 3 4
9 009 1 1 NA NA NA NA 1 2 0 0
10 010 3 4 5 5 5 5 2 2 6 6
Where Count1 holds the number of columns that start with NT, are not NA, and exceed Limit1. Count2 is the same but using Limit2. The dput() version of my dataframe is as follows:
Base<-structure(list(ID = c("001", "002", "003", "004", "005", "006",
"007", "008", "009", "010"), NT1 = c(1, 2, 3, 3, 4, 9, 1, NA,
1, 3), NT2 = c(1, 1, NA, NA, 5, 9, 3, NA, 1, 4), NT3 = c(1, 5,
1, 3, 1, 9, 5, 6, NA, 5), NT4 = c(NA, 4, NA, NA, NA, NA, 9, 7,
NA, 5), NT5 = c(NA, NA, 1, 8, NA, NA, NA, 9, NA, 5), NT6 = c(NA,
NA, NA, NA, NA, 8, NA, 8, NA, 5), Limit1 = c(2, 2, 2, 2, 4, 8,
5, 6, 1, 2), Limit2 = c(3, 3, 3, 3, 5, 9, 4, 5, 2, 2)), .Names = c("ID",
"NT1", "NT2", "NT3", "NT4", "NT5", "NT6", "Limit1", "Limit2"), row.names = c(NA,
-10L), class = "data.frame")
Many thanks for your help.
I suggest something like
Base$Count1 <- rowSums(Base[, grep("^NT", names(Base))] > Base$Limit1, na.rm = TRUE)
Base$Count2 <- rowSums(Base[, grep("^NT", names(Base))] > Base$Limit2, na.rm = TRUE)
This produces
ID NT1 NT2 NT3 NT4 NT5 NT6 Limit1 Limit2 Count1 Count2
1 001 1 1 1 NA NA NA 2 3 0 0
2 002 2 1 5 4 NA NA 2 3 2 2
3 003 3 NA 1 NA 1 NA 2 3 1 0
4 004 3 NA 3 NA 8 NA 2 3 3 1
5 005 4 5 1 NA NA NA 4 5 1 0
6 006 9 9 9 NA NA 8 8 9 3 0
7 007 1 3 5 9 NA NA 5 4 1 2
8 008 NA NA 6 7 9 8 6 5 3 4
9 009 1 1 NA NA NA NA 1 2 0 0
10 010 3 4 5 5 5 5 2 2 6 6
as desired.
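The reason no explicit loop is needed: when a data frame is compared with a vector whose length equals the number of rows, the vector is lined up against each column, so every NT value is compared with the limit from its own row. A minimal sketch of that behaviour, assuming a made-up three-row frame (the name toy and its values are illustrative only and do not come from the question):

```r
# Toy data, invented purely for illustration
toy <- data.frame(NT1 = c(1, 5, NA),
                  NT2 = c(3, NA, 2),
                  Limit1 = c(0, 4, 6))

# Select the NT columns by name prefix
nt <- toy[, grep("^NT", names(toy))]

# Element-wise comparison: each column is compared with Limit1, row by row
exceeds <- nt > toy$Limit1
print(exceeds)

# rowSums counts the TRUE values per row; na.rm = TRUE skips the NA cells
counts <- rowSums(exceeds, na.rm = TRUE)
print(counts)
```

Cells that are NA stay NA in the comparison and are dropped by na.rm = TRUE, which is why no separate !is.na() check is required.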
If you have a big data frame, I'd suggest you avoid doing this by row. Instead, run the comparison once for each Limit column you have to compare against (here df is your data frame):
sapply(grep("Limit", names(df), value = TRUE),
       function(x) rowSums(df[grepl("^NT", names(df))] > df[, x],
                           na.rm = TRUE))
# Limit1 Limit2
# 1 0 0
# 2 2 2
# 3 1 0
# 4 3 1
# 5 1 0
# 6 3 0
# 7 1 2
# 8 3 4
# 9 0 0
# 10 6 6
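The sapply call returns a matrix whose columns are named after the Limit columns, so it still has to be attached to the data frame. One possible way to do that is sketched below, assuming the new columns should be called Count1 and Count2 as in the question. To keep it short, the data is only the first four rows and the NT1 to NT3 columns of the question's Base; the names limit_cols, nt_cols and counts are introduced here for illustration.

```r
# First four rows of the question's data, NT1-NT3 only (shortened for the sketch)
df <- data.frame(ID = c("001", "002", "003", "004"),
                 NT1 = c(1, 2, 3, 3),
                 NT2 = c(1, 1, NA, NA),
                 NT3 = c(1, 5, 1, 3),
                 Limit1 = c(2, 2, 2, 2),
                 Limit2 = c(3, 3, 3, 3))

# Work out the column names once, before any Count columns are added
limit_cols <- grep("^Limit", names(df), value = TRUE)
nt_cols <- grep("^NT", names(df), value = TRUE)

# One vectorised pass per Limit column; the result is a matrix
counts <- sapply(limit_cols,
                 function(x) rowSums(df[nt_cols] > df[[x]], na.rm = TRUE))

# Rename Limit1 -> Count1, Limit2 -> Count2, then bind to the data frame
colnames(counts) <- sub("^Limit", "Count", colnames(counts))
df <- cbind(df, counts)
print(df)
```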
If you want to do this using data.table, you can update your columns by reference, using
library(data.table)
setDT(df)[, c("Count1", "Count2") :=
            lapply(grep("Limit", names(df), value = TRUE),
                   function(x) rowSums(.SD[,
                     grepl("^NT", names(df)), with = FALSE] >
                     .SD[[x]], na.rm = TRUE))
         ]
The code you are using is a bit off, and this fixes the problem:
apply(Base[c(2:7, 8)],1,function(x) length(which(x>tail(x, 1) & !is.na(x))))
While the function is being applied, x is the single row you are operating on. Comparing it with Base[8] therefore compares one row against the whole Limit1 column (a one-column data frame), and that's where the answer goes wrong. Because Limit1 is the last column selected, tail(x, 1) picks out that row's own limit instead.
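The same row-wise idea can be written so that it works for either limit column by looking the limit up by name rather than by position. This is only a sketch: count_over is a hypothetical helper introduced here, and the data is the first four rows and the NT1 to NT3 columns of the question's Base. Only numeric columns are passed to apply, because apply converts the data frame to a matrix and a character column such as ID would turn every value into a string.

```r
# First four rows of the question's Base, NT1-NT3 only (shortened for the sketch)
Base <- data.frame(ID = c("001", "002", "003", "004"),
                   NT1 = c(1, 2, 3, 3),
                   NT2 = c(1, 1, NA, NA),
                   NT3 = c(1, 5, 1, 3),
                   Limit1 = c(2, 2, 2, 2),
                   Limit2 = c(3, 3, 3, 3))

# Hypothetical helper: per row, count NT values above that row's own limit
count_over <- function(data, limit_col) {
  nt_cols <- grep("^NT", names(data), value = TRUE)
  # Pass only the numeric NT and limit columns so apply() sees a numeric matrix
  apply(data[c(nt_cols, limit_col)], 1, function(x) {
    # x is a named numeric vector holding one row
    sum(x[nt_cols] > x[[limit_col]], na.rm = TRUE)
  })
}

Base$Count1 <- count_over(Base, "Limit1")
Base$Count2 <- count_over(Base, "Limit2")
print(Base)
```

This still calls the function once per row, so on a data frame with 200000 rows a vectorised rowSums comparison is likely to be the faster choice.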