
Count number of columns by row that exceed a value in dataframe

I am working with a big data frame in R, and for each row I need to count the number of columns whose value exceeds a limit stored in another column of the same data frame. My data frame Base looks like this (the dput() version is at the end):

    ID NT1 NT2 NT3 NT4 NT5 NT6 Limit1 Limit2
1  001   1   1   1  NA  NA  NA      2      3
2  002   2   1   5   4  NA  NA      2      3
3  003   3  NA   1  NA   1  NA      2      3
4  004   3  NA   3  NA   8  NA      2      3
5  005   4   5   1  NA  NA  NA      4      5
6  006   9   9   9  NA  NA   8      8      9
7  007   1   3   5   9  NA  NA      5      4
8  008  NA  NA   6   7   9   8      6      5
9  009   1   1  NA  NA  NA  NA      1      2
10 010   3   4   5   5   5   5      2      2

For each row, I need to count the columns whose name starts with NT and whose value exceeds the column named Limit1. This count has to be saved in a new column. The same applies to Limit2: I have to count the columns that start with NT and exceed the value of Limit2, and save the result in another new column. I have tried the following code, but it doesn't work:

Base$Count1=apply(Base[c(2:7,8)],1,function(x) length(which(x>Base[8] & !is.na(x))))

Moreover, and this is the important part, Base is only a sample of a big data frame with 200000 rows and 60 columns. For this reason my apply tests either don't finish or throw an error. I would like to get a result like this:

    ID NT1 NT2 NT3 NT4 NT5 NT6 Limit1 Limit2 Count1 Count2
1  001   1   1   1  NA  NA  NA      2      3      0      0
2  002   2   1   5   4  NA  NA      2      3      2      2
3  003   3  NA   1  NA   1  NA      2      3      1      0
4  004   3  NA   3  NA   8  NA      2      3      3      1
5  005   4   5   1  NA  NA  NA      4      5      1      0
6  006   9   9   9  NA  NA   8      8      9      3      0
7  007   1   3   5   9  NA  NA      5      4      1      2
8  008  NA  NA   6   7   9   8      6      5      3      4
9  009   1   1  NA  NA  NA  NA      1      2      0      0
10 010   3   4   5   5   5   5      2      2      6      6

Here Count1 holds the number of columns that start with NT, are not NA, and exceed Limit1. Count2 is the same but uses Limit2. The dput() version of my data frame is:

Base<-structure(list(ID = c("001", "002", "003", "004", "005", "006", 
"007", "008", "009", "010"), NT1 = c(1, 2, 3, 3, 4, 9, 1, NA, 
1, 3), NT2 = c(1, 1, NA, NA, 5, 9, 3, NA, 1, 4), NT3 = c(1, 5, 
1, 3, 1, 9, 5, 6, NA, 5), NT4 = c(NA, 4, NA, NA, NA, NA, 9, 7, 
NA, 5), NT5 = c(NA, NA, 1, 8, NA, NA, NA, 9, NA, 5), NT6 = c(NA, 
NA, NA, NA, NA, 8, NA, 8, NA, 5), Limit1 = c(2, 2, 2, 2, 4, 8, 
5, 6, 1, 2), Limit2 = c(3, 3, 3, 3, 5, 9, 4, 5, 2, 2)), .Names = c("ID", 
"NT1", "NT2", "NT3", "NT4", "NT5", "NT6", "Limit1", "Limit2"), row.names = c(NA, 
-10L), class = "data.frame")

Many thanks for your help.

I suggest something like

Base$Count1 <- rowSums(Base[, grep("^NT", names(Base))] > Base$Limit1, na.rm = TRUE)
Base$Count2 <- rowSums(Base[, grep("^NT", names(Base))] > Base$Limit2, na.rm = TRUE)

This produces

    ID NT1 NT2 NT3 NT4 NT5 NT6 Limit1 Limit2 Count1 Count2
1  001   1   1   1  NA  NA  NA      2      3      0      0
2  002   2   1   5   4  NA  NA      2      3      2      2
3  003   3  NA   1  NA   1  NA      2      3      1      0
4  004   3  NA   3  NA   8  NA      2      3      3      1
5  005   4   5   1  NA  NA  NA      4      5      1      0
6  006   9   9   9  NA  NA   8      8      9      3      0
7  007   1   3   5   9  NA  NA      5      4      1      2
8  008  NA  NA   6   7   9   8      6      5      3      4
9  009   1   1  NA  NA  NA  NA      1      2      0      0
10 010   3   4   5   5   5   5      2      2      6      6

as desired.
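To see why this works row by row: comparing a data frame with a vector that has one element per row applies the comparison column by column, so every value is tested against the limit of its own row. A minimal sketch on a toy data frame (the name `toy` and its values are made up for illustration and are not the asker's data):

```r
# Toy data frame, made up for illustration (not the asker's data)
toy <- data.frame(NT1    = c(3, 5, NA),
                  NT2    = c(3, NA, 2),
                  Limit1 = c(2, 4, 6))

nt_cols <- grep("^NT", names(toy))   # positions of the NT columns

# Each NT column is compared element-wise with Limit1,
# so row i is always tested against toy$Limit1[i]
exceeds <- toy[, nt_cols] > toy$Limit1
print(exceeds)

# TRUE counts as 1; na.rm = TRUE means NA cells contribute nothing
toy$Count1 <- rowSums(exceeds, na.rm = TRUE)
print(toy)
```

The intermediate `exceeds` is a logical matrix with the same shape as the NT columns, which is why a single `rowSums` call is enough and no explicit loop over rows is needed.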

If you have a big data frame, I'd suggest you avoid doing this by row. Instead, run the vectorized comparison once for each Limit column you have to compare against (df below stands for your data frame, Base in the question):

sapply(grep("Limit", names(df), value = TRUE), 
        function(x) rowSums(df[grepl("^NT", names(df))] > df[, x], 
        na.rm = TRUE))

#    Limit1 Limit2
# 1       0      0
# 2       2      2
# 3       1      0
# 4       3      1
# 5       1      0
# 6       3      0
# 7       1      2
# 8       3      4
# 9       0      0
# 10      6      6
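The result above is a matrix with one column per Limit column. One possible way to store it back in the data frame as Count columns is sketched below; the toy data and the helper names `limit_cols`, `nt_cols`, and `count_cols` are assumptions made for illustration, not part of the original answer:

```r
# Toy data frame, made up for illustration (not the asker's data)
df <- data.frame(ID     = c("001", "002", "003"),
                 NT1    = c(3, 5, NA),
                 NT2    = c(3, NA, 2),
                 Limit1 = c(2, 4, 6),
                 Limit2 = c(1, 6, 1),
                 stringsAsFactors = FALSE)

limit_cols <- grep("^Limit", names(df), value = TRUE)  # "Limit1" "Limit2"
nt_cols    <- grepl("^NT", names(df))                  # logical column selector

# One vectorized comparison per Limit column, no loop over rows
counts <- sapply(limit_cols,
                 function(x) rowSums(df[nt_cols] > df[, x], na.rm = TRUE))

# Derive the new column names (Limit1 -> Count1, ...) and attach the counts
count_cols <- sub("^Limit", "Count", limit_cols)
df[count_cols] <- as.data.frame(counts)
print(df)
```

Deriving the names with `sub()` keeps the sketch usable when there are more than two Limit columns, provided they all share the `Limit` prefix.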

If you want to do this using data.table, you can update your columns by reference, using

library(data.table)
setDT(df)[, c("Count1", "Count2") := 
            lapply(grep("Limit", names(df), value = TRUE),
                   function(x) rowSums(.SD[, 
                     grepl("^NT", names(df)), with = FALSE] > 
                     .SD[[x]], na.rm = TRUE))
          ]

The code you are using is a bit off; this fixes the problem:

apply(Base[c(2:7, 8)],1,function(x) length(which(x>tail(x, 1) & !is.na(x))))

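Inside the function passed to apply, x is the single row currently being processed. Comparing it with Base[8] compares that one row against the entire Limit1 column (a one-column data frame) instead of against the row's own limit, and that is where the answer goes wrong. tail(x, 1) picks the last element of the row, which is its Limit1 value because Limit1 is the last column selected.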
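The same row-wise idea can be wrapped so that it works for either limit column. This is only a sketch: the helper `count_exceeding` and the toy data are hypothetical names and values introduced for illustration. Being row-wise, it may well be slower than a `rowSums`-based approach on a data frame with hundreds of thousands of rows.

```r
# Toy data frame, made up for illustration (not the asker's data)
toy <- data.frame(NT1    = c(3, 5, NA),
                  NT2    = c(3, NA, 2),
                  Limit1 = c(2, 4, 6),
                  Limit2 = c(1, 6, 1))

# Hypothetical helper: count NT values above the row's own limit
count_exceeding <- function(data, limit_name) {
  # Select the NT columns and put the chosen limit column last
  cols <- c(grep("^NT", names(data), value = TRUE), limit_name)
  apply(data[cols], 1, function(x) {
    limit  <- tail(x, 1)    # last element: this row's limit
    values <- head(x, -1)   # everything else: this row's NT values
    sum(values > limit, na.rm = TRUE)   # NA comparisons are dropped
  })
}

toy$Count1 <- count_exceeding(toy, "Limit1")
toy$Count2 <- count_exceeding(toy, "Limit2")
print(toy)
```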
