
R: find number of columns > 0 per row for a group of column names with a partial string match

I have a dataframe that resembles the following:

ID X Y A_1_l A_2_m B_1_n B_2_l C_1_m C_2_n C_3_l
w X Y 0 0 0 0 0 0 0
x X Y 0 0 3 0 0 0 0
y X Y 0 1 0 4 0 1 0
z X Y 3 4 5 6 2 1 5

The first letter denotes a sample, the number a repetition and the second letter a batch. I am trying to find a count of the number of samples with at least one value > 0 for each ID and store these numbers in a list.

This is the desired result, as a list that I can append to an existing dataframe:

0,1,3,3

For a previous analysis I used strsplit to count the total number of samples per batch.

colsList <- colnames(df)
cols <- grep("_", colsList, value=TRUE)
splitList <- strsplit(cols, "_\\d_")
stats <- data.frame(t(as.data.frame.list(splitList)))
rownames(stats) <- NULL
names(stats) <- c("Sample", "Batch")
perSample <- aggregate(Sample ~ Batch, stats, 
                      function(x) length(unique(x))) # number of strains

I was also able to find the total number of columns with a value > 0 using rowSums(df[sapply(df, is.numeric)] > 0), but I can't figure out how to combine the two to find the total number of samples with a value > 0.

First filter the data to keep only the numeric columns.

Use split.default to divide the columns into groups by sample, so that all the 'A' columns are in one group, all the 'B' columns in another, and so on. Within each group, return TRUE for a row if its values sum to more than 0, which for non-negative data means that at least one value is greater than 0. Then sum the TRUE values across the groups to get the final count for each row.

tmp <- Filter(is.numeric, df)

rowSums(sapply(split.default(tmp, sub('_.*', '', names(tmp))), 
        function(x) rowSums(x) > 0))

#[1] 0 1 3 3
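
The same steps can be spelled out with the intermediate results named, which may make it easier to see what each stage does. This is only a sketch: the names num, sample_id, has_signal and n_samples are made up for illustration, and the data frame is copied from the "data" section of this page. It also tests rowSums(cols > 0) > 0 instead of rowSums(cols) > 0. Both give the same answer on non-negative data, but the first form would still be correct if the columns could contain negative values.

```r
# Example data, copied from the "data" section of this page.
df1 <- data.frame(
  ID = c("w", "x", "y", "z"),
  X = "X", Y = "Y",
  A_1_l = c(0L, 0L, 0L, 3L), A_2_m = c(0L, 0L, 1L, 4L),
  B_1_n = c(0L, 3L, 0L, 5L), B_2_l = c(0L, 0L, 4L, 6L),
  C_1_m = c(0L, 0L, 0L, 2L), C_2_n = c(0L, 0L, 1L, 1L),
  C_3_l = c(0L, 0L, 0L, 5L),
  stringsAsFactors = FALSE
)

# Keep only the numeric measurement columns.
num <- Filter(is.numeric, df1)

# Sample label = everything before the first underscore ("A", "B", "C").
sample_id <- sub("_.*", "", names(num))

# One logical column per sample: TRUE where any column of that sample
# has a value > 0 in the row. vapply() checks that each group returns
# exactly one logical per row.
has_signal <- vapply(
  split.default(num, sample_id),
  function(cols) rowSums(cols > 0) > 0,
  logical(nrow(num))
)

# Number of samples with at least one value > 0, per row.
n_samples <- rowSums(has_signal)

# Append the counts to the data frame (hypothetical column name).
df1$n_samples <- n_samples
```

Here has_signal is a matrix with one row per ID and one column per sample, so it can also be inspected directly to see which samples contributed to each count.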

We can also do this with the tidyverse:

library(dplyr)
library(stringr)
library(tidyr)
df1 %>%  
    select(ID, where(is.numeric)) %>%
    pivot_longer(cols = -ID) %>%
    mutate(name = str_remove(name, "_.*")) %>% 
    group_by(ID, name) %>% 
    summarise(value = sum(value > 0), .groups = 'drop_last') %>% 
    summarise(value = sum(value > 0))
# A tibble: 4 x 2
  ID    value
  <chr> <int>
1 w         0
2 x         1
3 y         3
4 z         3

data

df1 <- structure(list(ID = c("w", "x", "y", "z"), X = c("X", "X", "X", 
"X"), Y = c("Y", "Y", "Y", "Y"), A_1_l = c(0L, 0L, 0L, 3L), A_2_m = c(0L, 
0L, 1L, 4L), B_1_n = c(0L, 3L, 0L, 5L), B_2_l = c(0L, 0L, 4L, 
6L), C_1_m = c(0L, 0L, 0L, 2L), C_2_n = c(0L, 0L, 1L, 1L), C_3_l = c(0L, 
0L, 0L, 5L)), class = "data.frame", row.names = c(NA, -4L))


 