How to convert character column containing “0/0” to numeric
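I have a data frame in which every column is of class character. Each column contains fractions. I want to convert the columns to numeric, but some columns contain the fraction "0/0", which R doesn't seem to like. I tried the following:
df2 <- as.numeric(df)
and got the following error:
Error: (list) object cannot be coerced to type 'double'
I couldn't find a post explaining how to convert characters containing the fraction "0/0" to the numeric value 0. I realize R has a reason to give me trouble over division by zero. I'm just working with genetic data, and converting the data to numeric and summing everything up is a lot easier than doing some kind of replace function. The actual data frame has tens of millions of rows and 500+ columns.
Here is a sample data frame: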
df <- structure(list(`GEN[5].GT` = c("0/1", "0/0", "0/0", "0/0",
"0/1", "0/0", "0/0", "1/1", "0/0", "0/0"), `GEN[1].GT` = c("0/0",
"0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0"
), `GEN[6].GT` = c("1/1", "0/0", "0/0", "0/0", "0/0", "0/0",
"0/1", "0/0", "0/0", "0/0"), `GEN[9].GT` = c("0/0", "0/0",
"0/0", "0/0", "0/1", "0/0", "0/0", "0/1", "0/0", "0/0"), `GEN[89].GT` = c("0/0",
"0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0"
), `GEN[453].GT` = c("0/0", "0/0", "0/1", "0/0", "0/0", "0/0",
"0/0", "0/0", "0/0", "0/0"), `GEN[554].GT` = c("0/0", "0/0",
"0/0", "0/0", "0/0", "0/0", "1/1", "0/0", "0/0", "0/0"), `GEN[9864].GT` = c("0/0",
"0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0", "0/0"
), `GEN[1234].GT` = c("1/1", "0/0", "0/0", "0/0", "0/0", "0/0",
"0/0", "0/0", "0/0", "0/0"), `GEN[3333].GT` = c("0/0", "0/0",
"0/0", "0/0", "0/0", "1/1", "0/0", "0/1", "0/0", "0/0")), row.names = c(NA,
10L), class = "data.frame")
# Expected output
df2 <- structure(list(`GEN[5].GT` = c("0.5", "0", "0", "0",
"0.5", "0", "0", "1", "0", "0"), `GEN[1].GT` = c("0",
"0", "0", "0", "0", "0", "0", "0", "0", "0"
), `GEN[6].GT` = c("1", "0", "0", "0", "0", "0",
"0.5", "0", "0", "0"), `GEN[9].GT` = c("0", "0",
"0", "0", "0.5", "0", "0", "0.5", "0", "0"), `GEN[89].GT` = c("0",
"0", "0", "0", "0", "0", "0", "0", "0", "0"
), `GEN[453].GT` = c("0", "0", "0.5", "0", "0", "0",
"0", "0", "0", "0"), `GEN[554].GT` = c("0", "0",
"0", "0", "0", "0", "1", "0", "0", "0"), `GEN[9864].GT` = c("0",
"0", "0", "0", "0", "0", "0", "0", "0", "0"
), `GEN[1234].GT` = c("1", "0", "0", "0", "0", "0",
"0", "0", "0", "0"), `GEN[3333].GT` = c("0", "0",
"0", "0", "0", "1", "0", "0.5", "0", "0")), row.names = c(NA,
10L), class = "data.frame")
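As background for the answers below, here is a minimal sketch of why the attempt fails. as.numeric() works on atomic vectors, while a data frame is a list of columns, and it does not evaluate a string such as "0/1" as arithmetic. The small demo data frame is a hypothetical stand-in for the real data, not part of the original question.

```r
# Hypothetical two-column stand-in for the real data
demo <- data.frame(a = c("0/1", "0/0"), b = c("1/1", "0/0"),
                   stringsAsFactors = FALSE)

# A data frame is a list of columns, so coercing it in one call fails;
# tryCatch() captures the error message instead of stopping the script
res <- tryCatch(as.numeric(demo), error = function(e) conditionMessage(e))
print(res)

# Column by column the call runs, but "0/1" is not a number literal,
# so every value becomes NA (normally with a coercion warning)
col_a <- suppressWarnings(as.numeric(demo$a))
print(col_a)
```

So the "0/0" cells are not the only obstacle: every cell needs to be parsed or looked up before it can be treated as a number.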
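We can create a row-name column (rownames_to_column from tibble), then split each column into rows at the delimiter (/) with separate_rows while automatically converting the type (convert = TRUE), group by 'rn', and take the mean of each column.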
library(dplyr)
library(tibble)
library(tidyr)
df %>%
  rownames_to_column('rn') %>%
  # row names come back as character; make them integer so the groups
  # sort as 1, 2, ..., 10 rather than "1", "10", "2", ...
  mutate(rn = as.integer(rn)) %>%
  separate_rows(-1, convert = TRUE) %>%
  group_by(rn) %>%
  summarise_all(mean) %>%
  select(-rn)
# A tibble: 10 x 10
# `GEN[5].GT` `GEN[1].GT` `GEN[6].GT` `GEN[9].GT` `GEN[89].GT` `GEN[453].GT` `GEN[554].GT` `GEN[9864].GT` `GEN[1234].GT` `GEN[3333].GT`
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 0.5 0 1 0 0 0 0 0 1 0
# 2 0 0 0 0 0 0 0 0 0 0
# 3 0 0 0 0 0 0.5 0 0 0 0
# 4 0 0 0 0 0 0 0 0 0 0
# 5 0.5 0 0 0.5 0 0 0 0 0 0
# 6 0 0 0 0 0 0 0 0 0 1
# 7 0 0 0.5 0 0 0 1 0 0 0
# 8 1 0 0 0.5 0 0 0 0 0 0.5
# 9 0 0 0 0 0 0 0 0 0 0
#10 0 0 0 0 0 0 0 0 0 0
Or another option, mentioned by @IceCreamToucan in the comments, which splits each cell with strsplit and takes the mean:
library(purrr)
df %>%
mutate_all(~ map_dbl(strsplit(., '/'), ~ mean(as.numeric(.))))
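The same split-and-average idea can be sketched in base R alone, which may be convenient if purrr is not available. The demo data frame and the helper name mean_alleles are hypothetical, introduced only for this illustration.

```r
# Hypothetical stand-in for the real data
demo <- data.frame(a = c("0/1", "0/0", "1/1"),
                   b = c("1/1", "0/0", "0/1"),
                   stringsAsFactors = FALSE)

# Hypothetical helper: split each cell on "/" and average the two parts
mean_alleles <- function(x) {
  vapply(strsplit(x, "/", fixed = TRUE),
         function(parts) mean(as.numeric(parts)),
         numeric(1))
}

# Assigning into demo_num[] keeps the data frame shape and column names
demo_num <- demo
demo_num[] <- lapply(demo, mean_alleles)
print(demo_num)
```

vapply() with numeric(1) checks that every cell yields exactly one number, so a malformed cell raises an error instead of silently producing a list column.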
Or another option that is possibly more efficient (in base R) is to pass a named vector and replace the values by matching:
nm1 <- setNames(c(0, 0.5, 0.5, 1), c("0/0", "1/0", "0/1", "1/1"))
df[] <- lapply(df, function(x) nm1[x])
df
# GEN[5].GT GEN[1].GT GEN[6].GT GEN[9].GT GEN[89].GT GEN[453].GT GEN[554].GT GEN[9864].GT GEN[1234].GT GEN[3333].GT
#1 0.5 0 1.0 0.0 0 0.0 0 0 1 0.0
#2 0.0 0 0.0 0.0 0 0.0 0 0 0 0.0
#3 0.0 0 0.0 0.0 0 0.5 0 0 0 0.0
#4 0.0 0 0.0 0.0 0 0.0 0 0 0 0.0
#5 0.5 0 0.0 0.5 0 0.0 0 0 0 0.0
#6 0.0 0 0.0 0.0 0 0.0 0 0 0 1.0
#7 0.0 0 0.5 0.0 0 0.0 1 0 0 0.0
#8 1.0 0 0.0 0.5 0 0.0 0 0 0 0.5
#9 0.0 0 0.0 0.0 0 0.0 0 0 0 0.0
#10 0.0 0 0.0 0.0 0 0.0 0 0 0 0.0
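Since the stated goal is to add the values up, here is a small sketch of the lookup approach followed by the sums. It assumes a hypothetical demo data frame and includes one code, "./.", that is deliberately absent from the lookup table to show what happens to unmatched values; whether such codes occur in the real data is an assumption.

```r
# Hypothetical stand-in; "./." is an example of a code not in the lookup table
demo <- data.frame(s1 = c("0/1", "0/0", "1/1"),
                   s2 = c("1/0", "./.", "0/0"),
                   stringsAsFactors = FALSE)

# Named vector used as a lookup table, as in the answer above
lookup <- c("0/0" = 0, "1/0" = 0.5, "0/1" = 0.5, "1/1" = 1)

# Index the table by the cell values; unname() drops the genotype labels
demo_num <- demo
demo_num[] <- lapply(demo, function(x) unname(lookup[x]))

# Codes missing from the table become NA, so decide how sums treat them
col_totals <- colSums(demo_num, na.rm = TRUE)
row_totals <- rowSums(demo_num, na.rm = TRUE)
print(col_totals)
print(row_totals)
```

Counting the NA values per column (for example with colSums(is.na(demo_num))) before summing is a cheap way to check that the lookup table covers every code in the data.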
We can use gsub to capture both sides and wrap them in parentheses, replacing / with +, then divide by 2.
library(dplyr)
df %>%
rowwise() %>%
#try mutate_all(~gsub('(.*)/(.*)','(\\1+\\2)/2',.)) to see the underlying formula
mutate_all(~eval(parse(text=gsub('(.*)/(.*)','(\\1+\\2)/2',.)))) %>%
ungroup()
# A tibble: 10 x 10
`GEN[5].GT` `GEN[1].GT` `GEN[6].GT` `GEN[9].GT` `GEN[89].GT` `GEN[453].GT` `GEN[554].GT` `GEN[9864].GT` `GEN[1234].GT`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.5 0 1 0 0 0 0 0 1
2 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0.5 0 0 0
4 0 0 0 0 0 0 0 0 0
5 0.5 0 0 0.5 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0
7 0 0 0.5 0 0 0 1 0 0
8 1 0 0 0.5 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0
10 0 0 0 0 0 0 0 0 0
# ... with 1 more variable: `GEN[3333].GT` <dbl>
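The same (a + b) / 2 arithmetic can also be sketched without eval(parse()), which evaluates one string at a time and may be slow on tens of millions of rows. The version below extracts each side with sub() and works on whole columns at once; the demo data frame and the helper name half_sum are hypothetical.

```r
# Hypothetical stand-in for the real data
demo <- data.frame(a = c("0/1", "0/0", "1/1"),
                   b = c("1/1", "0/0", "0/1"),
                   stringsAsFactors = FALSE)

# Hypothetical helper: take the text before and after "/", convert both
# to numeric, and average them for the whole vector in one pass
half_sum <- function(x) {
  left  <- as.numeric(sub("/.*$", "", x))
  right <- as.numeric(sub("^.*/", "", x))
  (left + right) / 2
}

demo_num <- demo
demo_num[] <- lapply(demo, half_sum)
print(demo_num)
```

Because nothing is evaluated as code, a cell with unexpected content yields NA with a warning rather than running arbitrary text.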