
Bin rows with very small row sums into a new combined row

In this data set, two taxa (rows) contribute very little to the overall data. I would like to gather all rows whose row sums are less than n% of the total of the entire dataset, where n could be 1, 2, 3, ...

df <- data.frame(A=c(1000,100,1,0), B=c(100,1000,1,1), C=c(10,900,0,1))
row.names(df) <- c("Tax1", "Tax2", "Tax3", "Tax4") 


> df
        A    B   C
Tax1 1000  100  10
Tax2  100 1000 900
Tax3    1    1   0
Tax4    0    1   1

After identifying these low-sum rows, I would like to bin them into a single row, e.g. "Other":

> df
         A    B   C
Tax1  1000  100  10
Tax2   100 1000 900
Other    1    2   1

Thank you!

#Set n
n <- 0.1 #10%
#Calculate proportions of their row sums
rows <- prop.table(rowSums(df)) < n
#combine the rows and add a new row with 'Other'
rbind(df[!rows, ], Other = colSums(df[rows, ]))

#         A    B   C
#Tax1  1000  100  10
#Tax2   100 1000 900
#Other    1    2   1
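
If the threshold has to be tried at several values of n, the steps above can be wrapped in a small helper. This is only a sketch: the function name bin_small_rows and its label argument are made up for illustration, and it assumes every column is numeric. Unlike the one-liner, it returns the data frame unchanged when no row falls below the threshold, instead of appending an all-zero "Other" row.

```r
# Hypothetical helper (not from the original answer): bins rows whose share
# of the grand total is below n into one row named by `label`.
bin_small_rows <- function(df, n, label = "Other") {
  share <- prop.table(rowSums(df))   # each row's fraction of the grand total
  small <- share < n                 # TRUE for rows to be binned
  if (!any(small)) return(df)        # nothing to bin: leave df as it is

  # Sum the small rows column-wise and turn the result into a one-row data frame
  other <- as.data.frame(t(colSums(df[small, , drop = FALSE])))
  rownames(other) <- label

  rbind(df[!small, , drop = FALSE], other)
}

df <- data.frame(A = c(1000, 100, 1, 0),
                 B = c(100, 1000, 1, 1),
                 C = c(10, 900, 0, 1),
                 row.names = c("Tax1", "Tax2", "Tax3", "Tax4"))

res <- bin_small_rows(df, n = 0.1)
print(res)
```

Calling bin_small_rows(df, n = 0.01) or any other cutoff reuses the same logic without editing the script.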

A tidyverse / dplyr approach using a couple of tibble functions

df <- data.frame(A=c(1000,100,1,0), B=c(100,1000,1,1), C=c(10,900,0,1))
row.names(df) <- c("Tax1", "Tax2", "Tax3", "Tax4")

library(tidyverse)
N <- 0.05 # 5 per cent

df %>% rownames_to_column('row') %>%
  filter(rowSums(cur_data()[-1]) >= N * sum(cur_data()[-1])) %>%
  bind_rows(df %>% rownames_to_column('row') %>%
              filter(rowSums(cur_data()[-1]) < N * sum(cur_data()[-1])) %>%
              summarise(across(-row, sum),
                        row = 'other')
              ) %>% column_to_rownames('row')

#>          A    B   C
#> Tax1  1000  100  10
#> Tax2   100 1000 900
#> other    1    2   1

Created on 2021-06-04 by the reprex package (v2.0.0)
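
A possible variation, sketched here under the assumption that all columns are numeric: compute the low-sum flag once in base R, relabel the flagged rows, and let group_by() / summarise() do the binning, so the pipeline is not written twice. The variable names small and binned are illustrative only. Note that group_by() sorts the groups by name, so the kept rows come out in alphabetical order, with "Other" moved to the end by arrange().

```r
library(dplyr)
library(tibble)

df <- data.frame(A = c(1000, 100, 1, 0),
                 B = c(100, 1000, 1, 1),
                 C = c(10, 900, 0, 1),
                 row.names = c("Tax1", "Tax2", "Tax3", "Tax4"))
N <- 0.05 # 5 per cent

# Flag rows whose row sum is below N times the grand total (plain logical vector)
small <- unname(rowSums(df) < N * sum(df))

binned <- df %>%
  rownames_to_column("row") %>%
  mutate(row = ifelse(small, "Other", row)) %>%  # relabel the low-sum rows
  group_by(row) %>%
  summarise(across(everything(), sum)) %>%       # rows sharing a label are summed
  arrange(row == "Other") %>%                    # push "Other" to the bottom
  column_to_rownames("row")

print(binned)
```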


A dplyr-only answer (reusing df and N from above)

df %>% filter(rowSums(cur_data()) >= N * sum(cur_data())) %>%
  bind_rows(df %>% 
              filter(rowSums(cur_data()) < N * sum(cur_data())) %>%
              summarise(across(everything(), sum)) %>% `row.names<-.data.frame`('Other')
              )

         A    B   C
Tax1  1000  100  10
Tax2   100 1000 900
Other    1    2   1

You can also use the following solution:

library(dplyr)
library(purrr)
library(tibble)
library(tidyr)  # needed for replace_na()

df %>% 
  filter(pmap_lgl(df, ~ sum(c(...)) >= 0.1 * sum(rowSums(df)))) %>%
  rownames_to_column() %>%
  bind_rows(df %>%
              filter(pmap_lgl(df, ~ sum(c(...)) < 0.1 * sum(rowSums(df)))) %>%
              summarise(across(A:C, ~ sum(.x)))) %>%
  replace_na(list(rowname = "Other"))


  rowname    A    B   C
1    Tax1 1000  100  10
2    Tax2  100 1000 900
3   Other    1    2   1
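
For comparison with the pipelines above, the same binning can probably be done without any packages using base R's rowsum(), which sums rows that share a group label. The sketch below assumes all columns are numeric; the names small and grp are illustrative. Passing reorder = FALSE keeps the groups in order of first appearance, so "Other" ends up after the rows that were kept as long as the first low-sum row comes after them, as it does in this data.

```r
df <- data.frame(A = c(1000, 100, 1, 0),
                 B = c(100, 1000, 1, 1),
                 C = c(10, 900, 0, 1),
                 row.names = c("Tax1", "Tax2", "Tax3", "Tax4"))
n <- 0.1 # 10%

# TRUE for rows contributing less than n of the grand total
small <- unname(prop.table(rowSums(df)) < n)

# Group label per row: the row's own name, or "Other" for low-sum rows
grp <- ifelse(small, "Other", rownames(df))

# Sum rows by label; reorder = FALSE keeps first-appearance order
out <- rowsum(df, group = grp, reorder = FALSE)
print(out)
```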
