How do I get the minimum and maximum values of a column?

Question

I have a dataset in which the values have been collapsed, so each column of each row holds multiple entries.

For example:

Gene   Score1                      
Gene1  NA, NA, NA, 0.03, -0.3 
Gene2  NA, 0.2, 0.1, ., .

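I want to create two new columns holding the minimum and maximum of that column. In practice I have 70 columns, so I wrote code to produce the min and max columns for all of them at once: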

get_range <- function(x) {
  x <- type.convert(str_split(x, ",\\s+", simplify = TRUE), na.strings = ".")
  x <- t(apply(x, 1L, function(i) {
    i <- i[!is.na(i)]
    if (length(i) < 1L) c(NA_real_, NA_real_) else range(i)
  }))
  dimnames(x)[[2L]] <- c("min", "max")
  x
}

dt <- dt[, c(Gene = .(Gene), lapply(.SD, get_range)), .SDcols = -"Gene"]

However, the min and max columns my code produces look like this:

Gene   Score1.min  Score1.max                     
Gene1    1             5 
Gene2    3             5

The expected output is actually:

Gene   Score1.min  Score1.max                     
Gene1    -0.3          0.03 
Gene2    0.1           0.2

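These values are completely different from the actual values I started with, and I can't tell how my code produces them. Is something in my code causing the values to no longer be treated as the numbers they originally were?

Input data: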

structure(list(Gene = c("Gene1", "Gene2"), Score1 = c("NA, NA, NA, 0.03, -0.3", 
"NA, 0.2, 0.1, ., .")), row.names = c(NA, -2L), class = c("data.table", 
"data.frame"))

Answer 1

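type.convert treats only the strings listed in na.strings as missing values. By default, that is "NA". You set na.strings = ".", which means "NA" no longer counts as missing. Instead, you need na.strings = c(".", "NA"), because both appear in your data.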

## The string split result is `character`, of course, with both `"."` and `"NA"` values
(ss = str_split(dt$Score1, ",\\s+", simplify = TRUE))
#      [,1] [,2]  [,3]  [,4]   [,5]  
# [1,] "NA" "NA"  "NA"  "0.03" "-0.3"
# [2,] "NA" "0.2" "0.1" "."    "."   

## What you have creates a factor with `"NA"` as a level
type.convert(ss, na.strings = c("."))
#      [,1] [,2] [,3] [,4] [,5]
# [1,] NA   NA   NA   0.03 -0.3
# [2,] NA   0.2  0.1  <NA> <NA>
# Levels: -0.3 0.03 0.1 0.2 NA

## Here is the solution to get it to be numeric with `type.convert`
type.convert(ss, na.strings = c(".", "NA"))
#      [,1] [,2] [,3] [,4] [,5]
# [1,]   NA   NA   NA 0.03 -0.3
# [2,]   NA  0.2  0.1   NA   NA
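
A likely explanation for the 1/5 and 3/5 values in the question is that the factor's integer codes (positions in the level list), rather than the numbers themselves, ended up in range(). The sketch below uses base R only and flattens the example values into a plain vector for simplicity; as.is is spelled out explicitly, and the variable names are only for illustration.

```r
# Values from the example data, already split on ", " (base R only).
row1 <- c("NA", "NA", "NA", "0.03", "-0.3")
row2 <- c("NA", "0.2", "0.1", ".", ".")

# With only "." treated as missing, the literal string "NA" blocks numeric
# parsing, so the result is a factor (as.is = FALSE requests factors).
f <- type.convert(c(row1, row2), na.strings = ".", as.is = FALSE)
is.factor(f)

# The integer codes behind a factor are level positions, not the values.
codes <- as.integer(f)
codes_row1 <- range(codes[1:5], na.rm = TRUE)
codes_row2 <- range(codes[6:10], na.rm = TRUE)

# Treating both "." and "NA" as missing yields real numbers instead.
num <- type.convert(c(row1, row2), na.strings = c(".", "NA"), as.is = TRUE)
range_row1 <- range(num[1:5], na.rm = TRUE)   # -0.30  0.03
range_row2 <- range(num[6:10], na.rm = TRUE)  #  0.1   0.2
```

Passing as.is = TRUE also guards against a silent fallback to factors if some other non-numeric string shows up in the data; in that case the result stays character, which fails loudly in range() instead of returning level positions.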

Answer 2

The tstrsplit function from data.table is quite handy for this:

library(data.table)
dt <- data.table(
  Gene = c("Gene1", "Gene2"), 
  Score1 = c("NA, NA, NA, 0.03, -0.3", "NA, 0.2, 0.1, ., .")
)
# split Score1, transpose it, and create one row per (gene, score)
# as mentioned earlier, force the values to numeric with as.numeric();
# "." becomes NA here (with a coercion warning)
dt <- dt[, .(Score1 = as.numeric(unlist(tstrsplit(Score1, ",")))), by = Gene]

# then take the min and the max per gene
dt[, .(Min = min(Score1, na.rm = TRUE), Max = max(Score1, na.rm = TRUE)), by = Gene]
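
Since the question mentions about 70 such columns, the same idea can be stretched across columns by reshaping to long format first. This is only a sketch under assumptions: the Score2 column and the parse_scores helper are hypothetical, made up for illustration, and all-missing cells are mapped to NA rather than Inf/-Inf.

```r
library(data.table)

dt <- data.table(
  Gene   = c("Gene1", "Gene2"),
  Score1 = c("NA, NA, NA, 0.03, -0.3", "NA, 0.2, 0.1, ., ."),
  Score2 = c("1, 2, .", "., NA")  # hypothetical second column
)

# Hypothetical helper: turn one collapsed cell into a numeric vector,
# mapping both "." and "NA" to missing before conversion (no warnings).
parse_scores <- function(x) {
  parts <- trimws(strsplit(x, ",", fixed = TRUE)[[1]])
  parts[parts %in% c(".", "NA")] <- NA
  as.numeric(parts)
}

# One row per (Gene, score column), then min/max per cell.
long <- melt(dt, id.vars = "Gene", variable.name = "Score", value.name = "raw")
res <- long[, {
  v <- parse_scores(raw)
  v <- v[!is.na(v)]
  if (length(v) == 0L) .(Min = NA_real_, Max = NA_real_)
  else .(Min = min(v), Max = max(v))
}, by = .(Gene, Score)]
```

If the wide layout from the question is needed afterwards, dcast(res, Gene ~ Score, value.var = c("Min", "Max")) should bring it back to one row per gene.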

Solution 1
2 2021-01-28 16:10:18

Solution 2
1 2021-01-29 06:37:01
