简体   繁体   中英

How can I normalize each column value by the sum of its column with data.table library?

I have a data table shown below (dt <- fread("~/data.csv"):

chr    gene_id       S1         S2         S3           S4
chr1      a          30         50         70           90
chr2      b          40         60         80           100
chr3      c          50         70         90           120
chr4      d          60         80         100          130

As a final output, I want to normalize each column value by the sum of its column.

chr    gene_id       S1              S2           S3            S4
chr1     a        30/sum(S1)     50/sum(S2)    70/sum(S3)   90/sum(S4)
chr2     b        40/sum(S1)     60/sum(S2)    80/sum(S3)  100/sum(S4)
chr3     c        50/sum(S1)     70/sum(S2)    90/sum(S3)  120/sum(S4)
chr4     d        60/sum(S1)     80/sum(S2)   100/sum(S3)  130/sum(S4)

How can I get the final output above using data.table not dplyr from tidyverse?

You will only want to apply this change to numeric columns, so those must be identified. Then use lapply. This creates a NEW data.table. Chinsoon's answer will update by reference - which may be better depending on the scenario.

library(data.table)
iris_example <- data.table(iris)

numeric_cols <- names(iris)[sapply(iris, is.numeric)]

iris_example[, lapply(.SD, function(x) x/sum(x)), .SDcols = numeric_cols]

# if you want to update by reference
iris_example[, (numeric_cols) := lapply(.SD, function(x) x/sum(x)), .SDcols = numeric_cols]

You can use get as follows:

cols <- paste0("S", 1:4)
dt[, (cols) := lapply(cols, function(x) get(x) / sum(get(x)))]

edit: added some timings for future reference

library(data.table) 

DT <- as.data.table(matrix(rnorm(4*1e7), ncol=4))
#DT <- as.data.table(matrix(rnorm(4*1e4), nrow=4))

library(microbenchmark)
microbenchmark(

    DT[, lapply(.SD, function(x) x/sum(x)), .SDcols=names(DT)],

    DT[, .SD / lapply(.SD, sum), .SDcols=names(DT)],

    DT[, lapply(names(DT), function(x) get(x) / sum(get(x)))],

    #if you change ncol to nrow in matrix function above, you will understand why pple always ask you to go long format
    melt(DT)[, value / sum(value), by=variable],#[,
        #lapply(names(DT), function(x) .SD[variable==x, V1])],

    times=3L
)

timings:

Unit: milliseconds
                                                         expr       min        lq      mean    median        uq       max neval
 DT[, lapply(.SD, function(x) x/sum(x)), .SDcols = names(DT)]  245.9107  250.3230  278.4220  254.7353  294.6776  334.6200     3
              DT[, .SD/lapply(.SD, sum), .SDcols = names(DT)]  534.5416  540.0630  602.4990  545.5845  636.4777  727.3708     3
      DT[, lapply(names(DT), function(x) get(x)/sum(get(x)))]  248.1819  381.0036  440.7233  513.8254  536.9940  560.1625     3
                  melt(DT)[, value/sum(value), by = variable] 1219.2584 1250.5846 1425.0604 1281.9107 1527.9614 1774.0120     3

sessionInfo:

data.table_1.10.4-3 
R version 3.4.4 (2018-03-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)

We can also consider the following method, which will work on the base R data.frame, too.

library(data.table)

# Create example data.table
DT <- fread(
  "chr    gene_id       S1         S2         S3           S4
chr1      a          30         50         70           90
  chr2      b          40         60         80           100
  chr3      c          50         70         90           120
  chr4      d          60         80         100          130"
)

# Use lapply to loop through column 3 to 6
DT[, 3:6] <- lapply(DT[, 3:6], function(x) x/sum(x))

print(DT)
#     chr gene_id        S1        S2        S3        S4
# 1: chr1       a 0.1666667 0.1923077 0.2058824 0.2045455
# 2: chr2       b 0.2222222 0.2307692 0.2352941 0.2272727
# 3: chr3       c 0.2777778 0.2692308 0.2647059 0.2727273
# 4: chr4       d 0.3333333 0.3076923 0.2941176 0.2954545

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM