R: find number of columns > 0 per row for a group of column names with a partial string match

Question

I have a dataframe that resembles the following:

ID	X	Y	A_1_l	A_2_m	B_1_n	B_2_l	C_1_m	C_2_n	C_3_l
w	X	Y	0	0	0	0	0	0	0
x	X	Y	0	0	3	0	0	0	0
y	X	Y	0	1	0	4	0	1	0
z	X	Y	3	4	5	6	2	1	5

In each column name, the first letter denotes a sample, the number a repetition, and the second letter a batch. For each ID, I am trying to count the number of samples that have at least one value > 0, and to store these counts in a list.

This is the desired result, as a list that I can append to an existing dataframe:

0,1,3,3

For a previous analysis I used strsplit to count the total number of samples per batch.

colsList <- colnames(df)
cols <- grep("_", colsList, value=TRUE)
splitList <- strsplit(cols, "_\\d_")
stats <- data.frame(t(as.data.frame.list(splitList)))
rownames(stats) <- NULL
names(stats) <- c("Sample", "Batch")
perSample <- aggregate(Sample ~ Batch, stats, 
                      function(x) length(unique(x))) # number of strains

I was also able to find the total number of columns with a value > 0 using rowSums(df[sapply(df, is.numeric)] > 0), but I can't figure out how to combine the two to find the total number of samples with a value > 0.

Answer 1

First filter the data to keep only the numeric columns.

Use split.default to divide the columns into groups, so that all the 'A' columns are in one group, the 'B' columns in another, and so on. Within each group, return TRUE for a row if at least one of its values is greater than 0 (rowSums(x) > 0 checks this as long as no value is negative). Then sum the TRUE values across the groups to get the final count per row.

tmp <- Filter(is.numeric, df)

rowSums(sapply(split.default(tmp, sub('_.*', '', names(tmp))), 
        function(x) rowSums(x) > 0))

#[1] 0 1 3 3
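
For anyone who wants to see the intermediate objects, here is a minimal sketch of the same idea broken into steps. It rebuilds the sample data from the question (called df1 here, matching the data section of this thread). The names sample_id, groups, hit, counts and n_samples are my own, not from the answer. It also tests x > 0 cell by cell instead of rowSums(x) > 0, which is a small variation so that negative values, if they ever occur, cannot cancel out positive ones.

```r
# Sample data rebuilt from the question (same values as df1 in the data section)
df1 <- data.frame(
  ID = c("w", "x", "y", "z"),
  X = "X", Y = "Y",
  A_1_l = c(0L, 0L, 0L, 3L), A_2_m = c(0L, 0L, 1L, 4L),
  B_1_n = c(0L, 3L, 0L, 5L), B_2_l = c(0L, 0L, 4L, 6L),
  C_1_m = c(0L, 0L, 0L, 2L), C_2_n = c(0L, 0L, 1L, 1L),
  C_3_l = c(0L, 0L, 0L, 5L)
)

# Keep only the numeric columns (drops ID, X and Y)
tmp <- Filter(is.numeric, df1)

# Sample label = everything before the first underscore
sample_id <- sub("_.*", "", names(tmp))

# One data frame of columns per sample (A, B, C)
groups <- split.default(tmp, sample_id)

# Logical matrix: one row per ID, one column per sample;
# TRUE where that sample has at least one value > 0 in that row
hit <- sapply(groups, function(x) rowSums(x > 0) > 0)

# Number of samples with at least one value > 0, per row
counts <- unname(rowSums(hit))

# Append to the existing data frame (column name is arbitrary)
df1$n_samples <- counts
```

Printing hit is a quick way to check which sample contributes to the count for each ID before the columns are summed.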

Answer 2

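We can do this with the tidyverse: reshape the numeric columns to long format, strip everything after the first underscore so that only the sample letter remains, count the values > 0 per ID and sample, and then count, per ID, the samples where that count is greater than 0.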

library(dplyr)
library(stringr)
library(tidyr)
df1 %>%  
    select(ID, where(is.numeric)) %>%
    pivot_longer(cols = -ID) %>%
    mutate(name = str_remove(name, "_.*")) %>% 
    group_by(ID, name) %>% 
    summarise(value = sum(value > 0), .groups = 'drop_last') %>% 
    summarise(value = sum(value > 0))
# A tibble: 4 x 2
  ID    value
  <chr> <int>
1 w         0
2 x         1
3 y         3
4 z         3
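
Since the question asks for counts that can be appended to an existing data frame, here is a hedged sketch of one way to do that with the same pipeline. It assumes dplyr >= 1.0, tidyr and stringr are installed, and that ID is unique per row. The names counts_tbl, any_pos, n_samples and result are my own. Joining by ID instead of pulling the column out as a vector avoids relying on the row order, because group_by() sorts the groups.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Sample data rebuilt from the question (same values as df1 in the data section)
df1 <- data.frame(
  ID = c("w", "x", "y", "z"),
  X = "X", Y = "Y",
  A_1_l = c(0L, 0L, 0L, 3L), A_2_m = c(0L, 0L, 1L, 4L),
  B_1_n = c(0L, 3L, 0L, 5L), B_2_l = c(0L, 0L, 4L, 6L),
  C_1_m = c(0L, 0L, 0L, 2L), C_2_n = c(0L, 0L, 1L, 1L),
  C_3_l = c(0L, 0L, 0L, 5L)
)

counts_tbl <- df1 %>%
  select(ID, where(is.numeric)) %>%
  pivot_longer(cols = -ID) %>%
  # keep only the sample letter, e.g. "A_1_l" becomes "A"
  mutate(name = str_remove(name, "_.*")) %>%
  group_by(ID, name) %>%
  # TRUE if this sample has at least one value > 0 for this ID
  summarise(any_pos = any(value > 0), .groups = "drop_last") %>%
  # number of such samples per ID
  summarise(n_samples = sum(any_pos))

# Attach the counts to the original data frame, matched on ID
result <- left_join(df1, counts_tbl, by = "ID")
```

result keeps the original row order of df1 and gains one extra column, n_samples.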

data

df1 <- structure(list(ID = c("w", "x", "y", "z"), X = c("X", "X", "X", 
"X"), Y = c("Y", "Y", "Y", "Y"), A_1_l = c(0L, 0L, 0L, 3L), A_2_m = c(0L, 
0L, 1L, 4L), B_1_n = c(0L, 3L, 0L, 5L), B_2_l = c(0L, 0L, 4L, 
6L), C_1_m = c(0L, 0L, 0L, 2L), C_2_n = c(0L, 0L, 1L, 1L), C_3_l = c(0L, 
0L, 0L, 5L)), class = "data.frame", row.names = c(NA, -4L))
