How to get the min and max values of a column?
I have a dataset where the values are collapsed, so each row holds multiple comma-separated inputs in a single column.
For example:
Gene Score1
Gene1 NA, NA, NA, 0.03, -0.3
Gene2 NA, 0.2, 0.1, ., .
I want to create 2 new columns that hold the min and max of the values in that column. In reality I have 70 columns, so I wrote code to get all the min and max columns at once:
library(data.table)
library(stringr)
get_range <- function(x) {
  x <- type.convert(str_split(x, ",\\s+", simplify = TRUE), na.strings = ".")
  x <- t(apply(x, 1L, function(i) {
    i <- i[!is.na(i)]
    if (length(i) < 1L) c(NA_real_, NA_real_) else range(i)
  }))
  dimnames(x)[[2L]] <- c("min", "max")
  x
}
dt <- dt[, c(Gene = .(Gene), lapply(.SD, get_range)), .SDcols = -"Gene"]
However, the min and max columns produced by this code look like this:
Gene Score1.min Score1.max
Gene1 1 5
Gene2 3 5
The expected output is:
Gene Score1.min Score1.max
Gene1 -0.3 0.03
Gene2 0.1 0.2
The values are nothing like the ones I started with, and I have no idea how my code produces them. Is something in my code causing the values to no longer be treated as the numbers they originally were?
Input data:
structure(list(Gene = c("Gene1", "Gene2"), Score1 = c("NA, NA, NA, 0.03, -0.3",
"NA, 0.2, 0.1, ., .")), row.names = c(NA, -2L), class = c("data.table",
"data.frame"))
type.convert only considers the strings listed in na.strings as missing values. By default, this is "NA". You set na.strings = ".", which means "NA" strings are no longer counted as missing. With "NA" left as an ordinary string, the data cannot be parsed as numeric and becomes a factor; the 1, 3 and 5 in your output match the integer codes of that factor's levels, not the original values. Instead, you need na.strings = c(".", "NA") because both appear in your data.
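To see how a factor can turn -0.3 and 0.03 into 1 and 5, here is a minimal sketch. It assumes the level order `-0.3, 0.03, 0.1, 0.2, NA` that `type.convert` reports for this data; the variable names are only illustrative.

```r
# Gene1's values as a factor, with the levels written out explicitly
# (assumed to match what type.convert() produced for this data).
f <- factor(c("NA", "NA", "NA", "0.03", "-0.3"),
            levels = c("-0.3", "0.03", "0.1", "0.2", "NA"))

# Integer codes are positions in the level vector, not the values themselves.
codes <- as.integer(f)
codes_range <- range(codes)

# Going through character first recovers the real numbers;
# the string "NA" becomes a genuine missing value.
vals <- suppressWarnings(as.numeric(as.character(f)))
vals_range <- range(vals, na.rm = TRUE)
```

Here `codes_range` is `1 5`, the same pair shown for Gene1 in the question, while `vals_range` is `-0.3 0.03`.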
library(stringr)
## The string split result is `character`, of course, with both `"."` and `"NA"` values
(ss = str_split(dt$Score1, ",\\s+", simplify = TRUE))
# [,1] [,2] [,3] [,4] [,5]
# [1,] "NA" "NA" "NA" "0.03" "-0.3"
# [2,] "NA" "0.2" "0.1" "." "."
## What you have creates a factor with `"NA"` as a level
type.convert(ss, na.strings = c("."))
# [,1] [,2] [,3] [,4] [,5]
# [1,] NA NA NA 0.03 -0.3
# [2,] NA 0.2 0.1 <NA> <NA>
# Levels: -0.3 0.03 0.1 0.2 NA
## Here is the solution to get it to be numeric with `type.convert`
type.convert(ss, na.strings = c(".", "NA"))
# [,1] [,2] [,3] [,4] [,5]
# [1,] NA NA NA 0.03 -0.3
# [2,] NA 0.2 0.1 NA NA
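Putting that `na.strings` setting back into the function from the question might look like the sketch below. The `as.is = TRUE` argument is my addition, not part of the original code; it is there on the assumption that you never want a factor back, even if some other stray string slips through.

```r
library(stringr)

# Same logic as get_range() in the question, with both "." and "NA"
# declared as missing. as.is = TRUE (an addition) keeps unparseable
# input as character instead of converting it to a factor.
get_range <- function(x) {
  x <- str_split(x, ",\\s+", simplify = TRUE)
  x <- type.convert(x, na.strings = c(".", "NA"), as.is = TRUE)
  x <- t(apply(x, 1L, function(i) {
    i <- i[!is.na(i)]
    if (length(i) < 1L) c(NA_real_, NA_real_) else range(i)
  }))
  dimnames(x)[[2L]] <- c("min", "max")
  x
}

res <- get_range(c("NA, NA, NA, 0.03, -0.3", "NA, 0.2, 0.1, ., ."))
```

For the sample data, `res` is a 2 x 2 numeric matrix with columns `min` and `max`, so the `lapply(.SD, get_range)` call from the question can stay as it is.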
The tstrsplit function from data.table is quite convenient:
library(data.table)
dt <- data.table(
  Gene = c("Gene1", "Gene2"),
  Score1 = c("NA, NA, NA, 0.03, -0.3", "NA, 0.2, 0.1, ., .")
)
# split Score1 on commas, transpose it and create one row per (gene, score);
# as.numeric() forces the pieces to numeric, so entries such as "." become NA
# (expect a coercion warning)
dt <- dt[, .(Score1 = as.numeric(unlist(tstrsplit(Score1, ",")))), by = Gene]
# then take the min and the max per gene, ignoring missing values
dt[, .(Min = min(Score1, na.rm = TRUE), Max = max(Score1, na.rm = TRUE)), by = Gene]
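The snippet above handles a single column, while the question mentions about 70. One possible way to extend the same split-then-summarise idea to every score column is sketched below. The `Score2` column and the `min_max()` helper are hypothetical, made up for illustration.

```r
library(data.table)

dt <- data.table(
  Gene   = c("Gene1", "Gene2"),
  Score1 = c("NA, NA, NA, 0.03, -0.3", "NA, 0.2, 0.1, ., ."),
  Score2 = c("1, 2, .", "NA, NA, .")  # hypothetical second column
)

# Hypothetical helper: per-row min and max of a collapsed character column.
min_max <- function(x) {
  vals <- lapply(strsplit(x, ",", fixed = TRUE), function(p) {
    p <- trimws(p)
    p[p %in% c(".", "NA")] <- NA_character_  # both markers mean "missing"
    as.numeric(p)
  })
  # Return NA (rather than Inf/-Inf) when a row has no usable values.
  pick <- function(v, f) if (all(is.na(v))) NA_real_ else f(v, na.rm = TRUE)
  list(min = vapply(vals, pick, numeric(1), f = min),
       max = vapply(vals, pick, numeric(1), f = max))
}

score_cols <- setdiff(names(dt), "Gene")
# unlist(..., recursive = FALSE) flattens to Score1.min, Score1.max, ...
res <- dt[, c(list(Gene = Gene),
              unlist(lapply(.SD, min_max), recursive = FALSE)),
          .SDcols = score_cols]
```

Unlike `min(..., na.rm = TRUE)` on its own, the helper returns `NA` for a row whose values are all missing, which avoids the `Inf` results and warnings that `min()` and `max()` give on empty input.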