
R convert dataframe to list of unique memberships per column for each row

This is what I have:

> miniDF
      site1 site2 site3 site4 site5
Alpha     G     T     A     C     T
Beta      G     T     A     T     T
Delta     G     T     G     C     T
Gamma     G     C     A     T     T
Eps       G     T     A     T     T
Pi        A     T     A     T     T
Omi       G     T     A     C     A
miniDF = structure(list(site1 = c("G", "G", "G", "G", "G", "A", "G"), 
    site2 = c("T", "T", "T", "C", "T", "T", "T"), site3 = c("A", 
    "A", "G", "A", "A", "A", "A"), site4 = c("C", "T", "C", "T", 
    "T", "T", "C"), site5 = c("T", "T", "T", "T", "T", "T", "A"
    )), row.names = c("Alpha", "Beta", "Delta", "Gamma", "Eps", 
"Pi", "Omi"), class = "data.frame")

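I want to convert it to a list structure for a Venn diagram or UpSet plot, where the presence of a unique letter in a column puts that site into the list entry for the corresponding row name: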

myList = list('Alpha'=c('site4'), 'Beta'=c(), 'Delta'=c('site3', 'site4'), 'Gamma'=c('site2'), 'Eps'=c(), 'Pi'=c('site1'), 'Omi'=c('site4','site5'))

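Alpha has only one unique site (a column in which its cell is unique), Beta has none, and Delta and Omi each have two unique sites.

Unique in this context means the cell differs from the other cells in that column. For site1, A is the unique value (all other values are G), so Pi includes that site in its array.

For columns where more than one cell differs, such as site4, I take the value in the first row as the unique one, so Alpha, Delta and Omi include site4 in their arrays.

Assume I have a few hundred columns.

How can I do this?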

Here is a solution in the tidyverse.

Solution

First, load the tidyverse and generate the dataset miniDF.

library(tidyverse)

# ...
# Code to generate 'miniDF'.
# ...

Then define a custom function are_unique() that identifies which values in each column you consider "unique".

are_unique <- function(x) {
  # Return an empty logical vector for an empty input.
  if(length(x) < 1) {
    return(logical(0))
  }
  
  # Identify which values are properly unique.
  are_unique <- !x %in% x[duplicated(x)]
  
  # If unique values actually exist, return that identification as is...
  if(any(are_unique)) {
    return(are_unique)
  }
  
  # ...and otherwise default to the first value as "unique"...
  token_unique <- x[1]
  # ...and identify its every occurrence.
  x == token_unique
}
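To see the two branches of are_unique() at work, here is a small sketch on two columns from the sample data. It only replays the intermediate expressions from the function body; the variable names (strict_site1, fallback_site4, and so on) are illustrative and not part of the answer.

```r
# Two columns copied from the sample miniDF.
site1 <- c("G", "G", "G", "G", "G", "A", "G")
site4 <- c("C", "T", "C", "T", "T", "T", "C")

# Branch 1: a value that occurs exactly once is strictly unique.
# x[duplicated(x)] holds every value that appears more than once.
strict_site1 <- !site1 %in% site1[duplicated(site1)]
strict_site1
# FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  (only the "A" in row 6)

# Branch 2: in site4 both "C" and "T" repeat, so nothing is strictly unique...
strict_site4 <- !site4 %in% site4[duplicated(site4)]
any(strict_site4)
# FALSE

# ...and the function falls back to flagging every occurrence of the first value.
fallback_site4 <- site4 == site4[1]
fallback_site4
# TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE  (rows 1, 3 and 7 hold "C")
```

Rows 1, 3 and 7 are Alpha, Delta and Omi, which matches the rule described in the question for site4.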

Finally, apply this tidy workflow:

miniDF %>%
  # Make the letters (row names) a column of their own.
  rownames_to_column("letter") %>%
  # In every other column, identify which values you consider "unique".
  mutate(across(!letter, are_unique)) %>%
  # Pivot into 'col_name | is_unique' format for easy filtration.
  pivot_longer(!letter, names_to = "col_name", values_to = "is_unique") %>%
  # Split by letter into a list, with the subset of rows for each letter.
  split(.$letter) %>%
  # Convert each subset into the vector of 'col_name's that filter as "unique".
  # lapply() always returns a list; sapply() could simplify it to a vector or matrix.
  lapply(function(x) x$col_name[x$is_unique])

Result

Given a miniDF like your sample

miniDF <- structure(
  list(
    site1 = c("G", "G", "G", "G", "G", "A", "G"), 
    site2 = c("T", "T", "T", "C", "T", "T", "T"),
    site3 = c("A", "A", "G", "A", "A", "A", "A"),
    site4 = c("C", "T", "C", "T", "T", "T", "C"),
    site5 = c("T", "T", "T", "T", "T", "T", "A")
  ),
  row.names = c("Alpha", "Beta", "Delta", "Gamma", "Eps", "Pi", "Omi"),
  class = "data.frame"
)

this solution should produce the following list

list(
  Alpha = "site4",
  Beta  = character(0),
  Delta = c("site3", "site4"),
  Eps   = character(0),
  Gamma = "site2",
  Omi   = c("site4", "site5"),
  Pi    = "site1"
)
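Note that the list above is in alphabetical order rather than the row order of miniDF. This comes from split(), which groups by a factor whose levels are sorted. If the original order matters, indexing the result by the original row names restores it. A minimal sketch with a toy data frame (toy, row_order, by_letter and restored are illustrative names, not part of the answer):

```r
# Row names in the order they appear in the sample miniDF.
row_order <- c("Alpha", "Beta", "Delta", "Gamma", "Eps", "Pi", "Omi")

# A toy long-format table with one row per letter.
toy <- data.frame(letter = row_order, value = seq_along(row_order))

# split() orders the groups by sorted factor levels, not by first appearance.
by_letter <- split(toy, toy$letter)
names(by_letter)

# Indexing by the original names puts the groups back in row order.
restored <- by_letter[row_order]
names(restored)
```

The same indexing step, applied to the final list with rownames(miniDF), should give the order used in the question.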

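Note

@GregorThomas's answer may well supersede my own. Although my answer was technically posted first, I deleted it to fix a bug, and Gregor's functional solution was posted before I finally undeleted mine.

In any case, Gregor's is probably more elegant.

We create a function that finds the "unique" value, apply it to each column, and finally go through each row to see which columns hold that unique value.

I have used only base R. The code could be more concise if we switched to purrr functions, or more efficient if we used a matrix instead of a data frame.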

pseudo_unique = function(x) {
  tx = sort(table(x))
  if(tx[1] == 1) return(names(tx[1])) else return(x[1])
}

u_vals = lapply(miniDF, pseudo_unique)
result = lapply(
  row.names(miniDF),
  \(row) names(miniDF)[which(unlist(Map("==", u_vals, miniDF[row, ])))]
)
names(result) = row.names(miniDF)  
result
# $Alpha
# [1] "site4"
# 
# $Beta
# character(0)
# 
# $Delta
# [1] "site3" "site4"
# 
# $Gamma
# [1] "site2"
# 
# $Eps
# character(0)
# 
# $Pi
# [1] "site1"
# 
# $Omi
# [1] "site4" "site5"
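The pseudo_unique() helper above relies on sort(table(x)) placing the rarest value first. The sketch below shows what that table looks like for two of the sample columns, plus one caveat that is an assumption about data not shown in the question: if a column holds several values that each occur once, only one of them is returned. The names tab1, tab4 and ties are illustrative.

```r
site1 <- c("G", "G", "G", "G", "G", "A", "G")
site4 <- c("C", "T", "C", "T", "T", "T", "C")

# Counts sorted in increasing order: the rarest value comes first.
tab1 <- sort(table(site1))
tab4 <- sort(table(site4))

names(tab1)[1]  # "A", which occurs once, so it is the unique value
tab1[[1]]       # 1

names(tab4)[1]  # "C", but it occurs 3 times...
tab4[[1]]       # 3, so pseudo_unique() falls back to the first element, site4[1]

# Caveat (hypothetical column): two singletons, yet only tx[1] is inspected.
ties <- c("A", "C", "G", "G")
sum(table(ties) == 1)  # 2 values occur exactly once
```

If such columns can occur in the real data, the rule for them needs to be decided separately; the question does not specify it.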

Here is a matrix version that gives the same result. With a few hundred columns, I would recommend this version.

miniMat = as.matrix(miniDF)
u_vals = apply(miniMat, 2, pseudo_unique)
result = apply(miniMat, 1, \(row) colnames(miniMat)[row == u_vals], simplify = FALSE)
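Both the `\(row)` shorthand and the `simplify` argument of apply() were introduced in R 4.1.0. For older installations, the following is a sketch of an equivalent that avoids both, together with a check against the list requested in the question. It restates pseudo_unique() so that it runs on its own; result_compat and expected are illustrative names.

```r
miniDF <- data.frame(
  site1 = c("G", "G", "G", "G", "G", "A", "G"),
  site2 = c("T", "T", "T", "C", "T", "T", "T"),
  site3 = c("A", "A", "G", "A", "A", "A", "A"),
  site4 = c("C", "T", "C", "T", "T", "T", "C"),
  site5 = c("T", "T", "T", "T", "T", "T", "A"),
  row.names = c("Alpha", "Beta", "Delta", "Gamma", "Eps", "Pi", "Omi"),
  stringsAsFactors = FALSE
)

# Same logic as the answer's helper: rarest value if it occurs once, else the first value.
pseudo_unique <- function(x) {
  tx <- sort(table(x))
  if (tx[[1]] == 1) names(tx)[1] else x[[1]]
}

miniMat <- as.matrix(miniDF)

# One "unique" value per column, as a plain character vector.
u_vals <- vapply(
  seq_len(ncol(miniMat)),
  function(j) pseudo_unique(miniMat[, j]),
  character(1)
)

# For each row, keep the column names whose cell equals that column's unique value.
result_compat <- setNames(
  lapply(
    seq_len(nrow(miniMat)),
    function(i) colnames(miniMat)[miniMat[i, ] == u_vals]
  ),
  rownames(miniMat)
)

# The list requested in the question, with character(0) for the empty entries.
expected <- list(
  Alpha = "site4", Beta = character(0), Delta = c("site3", "site4"),
  Gamma = "site2", Eps = character(0), Pi = "site1", Omi = c("site4", "site5")
)
identical(result_compat, expected)
```

Unlike the tidyverse pipeline, this keeps the rows in their original order.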

