R: convert a data frame to a list of unique memberships per column for each row
This is what I have:
> miniDF
site1 site2 site3 site4 site5
Alpha G T A C T
Beta G T A T T
Delta G T G C T
Gamma G C A T T
Eps G T A T T
Pi A T A T T
Omi G T A C A
miniDF = structure(list(site1 = c("G", "G", "G", "G", "G", "A", "G"),
site2 = c("T", "T", "T", "C", "T", "T", "T"), site3 = c("A",
"A", "G", "A", "A", "A", "A"), site4 = c("C", "T", "C", "T",
"T", "T", "C"), site5 = c("T", "T", "T", "T", "T", "T", "A"
)), row.names = c("Alpha", "Beta", "Delta", "Gamma", "Eps",
"Pi", "Omi"), class = "data.frame")
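I would like to convert it into a list structure for a Venn diagram or UpSet plot, where the presence of a unique letter in a column puts that site into the list entry for that row name: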
myList = list('Alpha'=c('site4'), 'Beta'=c(), 'Delta'=c('site3', 'site4'), 'Gamma'=c('site2'), 'Eps'=c(), 'Pi'=c('site1'), 'Omi'=c('site4','site5'))
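Alpha has only one unique site (a column in which its cell is unique), Beta has none, and Delta and Omi each have two unique sites.
"Unique" in this context means that the cell differs from the other cells in that column. So for site1, A is the unique value (all other values are G), and Pi therefore includes that site in its vector.
For columns in which more than one cell holds a differing value, such as site4, I take the first row's value as the unique value, so Alpha, Delta and Omi include site4 in their vectors.
Assume I have a few hundred columns.
How can I do this?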
Here is a solution in the tidyverse.
First load the tidyverse and generate the dataset miniDF.
library(tidyverse)
# ...
# Code to generate 'miniDF'.
# ...
Then define a custom function are_unique() that identifies, within each column, which values count as "unique" under your definition.
are_unique <- function(x) {
  # Return an empty logical vector for an empty input.
  if(length(x) < 1) {
    return(logical(0))
  }
  # Identify which values are properly unique.
  are_unique <- !x %in% x[duplicated(x)]
  # If unique values actually exist, return that identification as is...
  if(any(are_unique)) {
    return(are_unique)
  }
  # ...and otherwise default to the first value as "unique"...
  token_unique <- x[1]
  # ...and identify its every occurrence.
  x == token_unique
}
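To see how the fallback rule behaves, here is a small sketch that applies are_unique() to two of the sample columns. The function body is a condensed copy, repeated only so the snippet runs on its own; the vector names site1 and site4 and the inner name singletons are chosen here for illustration.

```r
# Condensed copy of are_unique(), repeated so this snippet is self-contained.
are_unique <- function(x) {
  if (length(x) < 1) return(logical(0))
  # Values that occur exactly once in the column.
  singletons <- !x %in% x[duplicated(x)]
  if (any(singletons)) return(singletons)
  # Fallback: flag every occurrence of the first value as "unique".
  x == x[1]
}

site1 <- c("G", "G", "G", "G", "G", "A", "G")  # one singleton: "A" (row Pi)
site4 <- c("C", "T", "C", "T", "T", "T", "C")  # no singleton, so fall back to "C"

are_unique(site1)
# [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
are_unique(site4)
# [1]  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE
```

In site4 both letters occur more than once, so the first row's value "C" is used, which flags Alpha, Delta and Omi as the question requires.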
Finally, apply this tidy workflow:
miniDF %>%
  # Make the letters (row names) a column of their own.
  rownames_to_column("letter") %>%
  # In every other column, identify which values you consider "unique".
  mutate(across(!letter, are_unique)) %>%
  # Pivot into 'col_name | is_unique' format for easy filtration.
  pivot_longer(!letter, names_to = "col_name", values_to = "is_unique") %>%
  # Split by letter into a list, with the subset of rows for each letter.
  split(.$letter) %>%
  # Convert each subset into the vector of 'col_name's that filter as "unique".
  # lapply() always returns a list; sapply() could simplify to a vector or matrix
  # when every letter happens to have the same number of unique sites.
  lapply(function(x){x$col_name[x$is_unique]})
Given a miniDF like your sample
miniDF <- structure(
list(
site1 = c("G", "G", "G", "G", "G", "A", "G"),
site2 = c("T", "T", "T", "C", "T", "T", "T"),
site3 = c("A", "A", "G", "A", "A", "A", "A"),
site4 = c("C", "T", "C", "T", "T", "T", "C"),
site5 = c("T", "T", "T", "T", "T", "T", "A")
),
row.names = c("Alpha", "Beta", "Delta", "Gamma", "Eps", "Pi", "Omi"),
class = "data.frame"
)
this solution should produce the following list:
list(
Alpha = "site4",
Beta = character(0),
Delta = c("site3", "site4"),
Eps = character(0),
Gamma = "site2",
Omi = c("site4", "site5"),
Pi = "site1"
)
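Note that the list above comes out in alphabetical order rather than in the row order of miniDF. If the original order matters, one minimal sketch is to index the list by the original row names; result and row_order are names chosen here for illustration, and the list is typed in by hand so the snippet runs without the tidyverse.

```r
# The list as shown above, in alphabetical order.
result <- list(
  Alpha = "site4",
  Beta  = character(0),
  Delta = c("site3", "site4"),
  Eps   = character(0),
  Gamma = "site2",
  Omi   = c("site4", "site5"),
  Pi    = "site1"
)

# Row names of miniDF in their original order.
row_order <- c("Alpha", "Beta", "Delta", "Gamma", "Eps", "Pi", "Omi")

# Indexing a named list by a character vector reorders it.
result <- result[row_order]
identical(names(result), row_order)
# [1] TRUE
```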
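@GregorThomas's answer will probably supersede my own. Although my answer was technically posted first, I deleted it to fix a bug, and Gregor's functional solution was posted before I finally undeleted mine.
In any case, Gregor's is probably more elegant.
We create a function to find the "unique" value, apply it to each column, and finally go through each row to see which columns hold the unique value.
I've used only base R. The code could probably be more concise if we switched to purrr functions, or more efficient if we used a matrix instead of a data frame.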
pseudo_unique = function(x) {
tx = sort(table(x))
if(tx[1] == 1) return(names(tx[1])) else return(x[1])
}
u_vals = lapply(miniDF, pseudo_unique)
result = lapply(
row.names(miniDF),
\(row) names(miniDF)[which(unlist(Map("==", u_vals, miniDF[row, ])))]
)
names(result) = row.names(miniDF)
result
# $Alpha
# [1] "site4"
#
# $Beta
# character(0)
#
# $Delta
# [1] "site3" "site4"
#
# $Gamma
# [1] "site2"
#
# $Eps
# character(0)
#
# $Pi
# [1] "site1"
#
# $Omi
# [1] "site4" "site5"
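To make the intermediate step concrete, this sketch prints the single token that pseudo_unique() selects for each sample column. The function is the one from this answer, and miniDF is rebuilt with data.frame() (an equivalent construction, assumed here for brevity) so the snippet runs on its own.

```r
pseudo_unique <- function(x) {
  tx <- sort(table(x))  # counts per value, rarest first
  # If the rarest value occurs once, use it; otherwise use the first row's value.
  if (tx[1] == 1) names(tx[1]) else x[1]
}

miniDF <- data.frame(
  site1 = c("G", "G", "G", "G", "G", "A", "G"),
  site2 = c("T", "T", "T", "C", "T", "T", "T"),
  site3 = c("A", "A", "G", "A", "A", "A", "A"),
  site4 = c("C", "T", "C", "T", "T", "T", "C"),
  site5 = c("T", "T", "T", "T", "T", "T", "A"),
  row.names = c("Alpha", "Beta", "Delta", "Gamma", "Eps", "Pi", "Omi"),
  stringsAsFactors = FALSE
)

u_vals <- unlist(lapply(miniDF, pseudo_unique))
u_vals
# site1 site2 site3 site4 site5 
#   "A"   "C"   "G"   "C"   "A" 
```

Unlike are_unique() in the other answer, this function keeps only one token per column, so if a column had two different singleton values only one of them would be used.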
Here is a matrix version that gives the same result. With a few hundred columns, I would recommend this version.
miniMat = as.matrix(miniDF)
u_vals = apply(miniMat, 2, pseudo_unique)
result = apply(miniMat, 1, \(row) colnames(miniMat)[row == u_vals], simplify = FALSE)
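As a quick check of the matrix version, the sketch below runs it end to end on the sample data and counts the unique sites per row with lengths(). Both the \(row) lambda syntax and the simplify argument of apply() require R 4.1.0 or later; miniDF is rebuilt with data.frame() here only so the snippet is self-contained.

```r
pseudo_unique <- function(x) {
  tx <- sort(table(x))  # counts per value, rarest first
  if (tx[1] == 1) names(tx[1]) else x[1]
}

miniDF <- data.frame(
  site1 = c("G", "G", "G", "G", "G", "A", "G"),
  site2 = c("T", "T", "T", "C", "T", "T", "T"),
  site3 = c("A", "A", "G", "A", "A", "A", "A"),
  site4 = c("C", "T", "C", "T", "T", "T", "C"),
  site5 = c("T", "T", "T", "T", "T", "T", "A"),
  row.names = c("Alpha", "Beta", "Delta", "Gamma", "Eps", "Pi", "Omi"),
  stringsAsFactors = FALSE
)

miniMat <- as.matrix(miniDF)
# One token per column, then one vector of matching column names per row.
u_vals <- apply(miniMat, 2, pseudo_unique)
result <- apply(miniMat, 1, \(row) colnames(miniMat)[row == u_vals], simplify = FALSE)

# Number of unique sites per row, in the original row order.
lengths(result)
# Alpha  Beta Delta Gamma   Eps    Pi   Omi 
#     1     0     2     1     0     1     2 
```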