Iterative subtraction based on factors in a data frame using R

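I am struggling to come up with a workable solution to what seems like a simple problem. I have a data frame that contains both data and factors, and I would like to use the factors to determine which data points need to be subtracted from which others, so as to generate a new data frame of the compared values.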

The data frame looks like this:

> str(means)
Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 32 obs. of  5 variables:
 $ rat          : Factor w/ 8 levels "Rat1","Rat2",..: 1 1 1 1 2 2 2 2 3 3 ...
 $ gene         : Factor w/ 4 levels "gene1","gene2",..: 1 2 3 4 1 2 3 4 1 2 ...
 $ gene_category: Factor w/ 2 levels "control","experimental": 2 2 1 1 2 2 1 1 2 2 ...
 $ timepoint1   : num  23.4 18.3 42.1 40.1 25.3 ...
 $ timepoint2   : num  23.5 18.4 41.5 39.9 22.8 ...
> head(means)
Source: local data frame [6 x 5]
Groups: rat, gene [6]

 rat   gene gene_category timepoint1 timepoint2
(fctr) (fctr)        (fctr)      (dbl)      (dbl)
1   Rat1  gene1  experimental   23.36667   23.49667
2   Rat1  gene2  experimental   18.26000   18.38000
3   Rat1  gene3       control   42.05500   41.45000
4   Rat1  gene4       control   40.08667   39.89500
5   Rat2  gene1  experimental   25.29333   22.83000
6   Rat2  gene2  experimental   19.72667   19.19333

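For each rat (8 rats in total), I would like to subtract the "control" gene values (genes 3 and 4) from the "experimental" gene values (genes 1 and 2). This needs to be done iteratively, so that every experimental gene value has every control gene value subtracted from it (within each rat, but not across rats). The same should be done for each of the timepoint columns.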

I have been playing around with a solution using dplyr. I have the grouping down, but I cannot figure out how to do the rest:

diffs <- means %>% group_by(rat, gene, gene_category) # %>% ... this is where I don't know what to do

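There is a solution to a similar problem here, but I think it would give me all pairwise operations, which is not what I want. It also deals with only two factors, whereas I have three that I need to take into account.

Here is another solution to a similar problem, but again some things about it make it less than ideal. It handles only one factor, and I am not sure how to apply it to a dataset with three factors and two data vectors.

I know this problem has been solved in the implementation of methods such as pairwise comparisons for statistical significance (multiple t-tests, ANOVA, MANOVA, etc.), but the packages and base stat functions I am familiar with that perform these tests keep this basic operation under the hood. I would like a simple solution that uses base R or dplyr/plyr/reshape2, etc., with as few loops as possible.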

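I think the solution is going to involve generating the comparisons you want and then passing them to the standard-evaluation mutate_, rather than fighting with group_by and summarize.

First, here is the data read in (note that genes 3 and 4 have been added for Rat2):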

library(dplyr)
library(tidyr)

means <-
  read.table(text =
" rat   gene gene_category timepoint1 timepoint2
1   Rat1  gene1  experimental   23.36667   23.49667
2   Rat1  gene2  experimental   18.26000   18.38000
3   Rat1  gene3       control   42.05500   41.45000
4   Rat1  gene4       control   40.08667   39.89500
5   Rat2  gene1  experimental   25.29333   22.83000
6   Rat2  gene2  experimental   19.72667   19.19333
7   Rat2  gene3       control   42.05500   41.45000
8   Rat2  gene4       control   40.08667   39.89500")

Next, generate the set of genes in each category:

geneLists <-
  means %>%
  {split(.$gene, .$`gene_category`)} %>%
  lapply(unique) %>%
  lapply(as.character) %>%
  lapply(function(x){paste0("`", x, "`")})

Note that the backticks ("`") are there to protect against potentially invalid column names (e.g., names containing spaces). This gives:

$control
[1] "`gene3`" "`gene4`"

$experimental
[1] "`gene1`" "`gene2`"

Then, paste together the desired comparisons:

colsToCreate <-
  outer(geneLists[["experimental"]]
        , geneLists[["control"]]
        , paste, sep = " - ") %>%
  as.character()

giving:

[1] "`gene1` - `gene3`" "`gene2` - `gene3`" "`gene1` - `gene4`" "`gene2` - `gene4`"
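The order of that vector follows from how outer fills its result: column by column, so the first argument varies fastest. A minimal sketch of that behaviour, using hand-written labels (exp_genes, ctrl_genes, pairs_matrix and pairs_vector are illustrative names, not part of the answer's code):

```r
# Illustrative labels only; any two character vectors behave the same way.
exp_genes  <- c("gene1", "gene2")
ctrl_genes <- c("gene3", "gene4")

# outer() builds a 2 x 2 character matrix:
# rows follow exp_genes, columns follow ctrl_genes.
pairs_matrix <- outer(exp_genes, ctrl_genes, paste, sep = " - ")

# as.character() drops the dimensions and flattens column by column.
pairs_vector <- as.character(pairs_matrix)
print(pairs_vector)
```

Swapping the two arguments would put the control gene on the left of each string, which would flip the sign of every difference.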

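Then, use tidyr to spread the data, generating one row per rat. Note that if you want to spread both timepoint1 and timepoint2, you will likely need to gather first (to put both times in one column), then create an id column that holds both the time and the gene, and then spread using that id column. That would also require a change to the construction of colsToCreate.

After spreading, pass in the vector of columns to generate, and you should have what you want: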

means %>%
  select(rat, gene, timepoint1) %>%
  spread(gene, timepoint1) %>%
  mutate_(.dots = colsToCreate)

Voilà:

   rat    gene1    gene2  gene3    gene4 gene1 - gene3 gene2 - gene3 gene1 - gene4 gene2 - gene4
1 Rat1 23.36667 18.26000 42.055 40.08667     -18.68833     -23.79500     -16.72000     -21.82667
2 Rat2 25.29333 19.72667 42.055 40.08667     -16.76167     -22.32833     -14.79334     -20.36000

Actually, getting both timepoints turned out to be even easier than I thought:

means %>%
  select(-gene_category) %>%
  gather("timepoint", "value", starts_with("timepoint")) %>%
  spread(gene, value) %>%
  mutate_(.dots = colsToCreate)

gives:

   rat  timepoint    gene1    gene2  gene3    gene4 gene1 - gene3 gene2 - gene3 gene1 - gene4 gene2 - gene4
1 Rat1 timepoint1 23.36667 18.26000 42.055 40.08667     -18.68833     -23.79500     -16.72000     -21.82667
2 Rat1 timepoint2 23.49667 18.38000 41.450 39.89500     -17.95333     -23.07000     -16.39833     -21.51500
3 Rat2 timepoint1 25.29333 19.72667 42.055 40.08667     -16.76167     -22.32833     -14.79334     -20.36000
4 Rat2 timepoint2 22.83000 19.19333 41.450 39.89500     -18.62000     -22.25667     -17.06500     -20.70167
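The underscore verbs such as mutate_ have since been deprecated in dplyr, and gather and spread have been superseded in tidyr by pivot_longer and pivot_wider. What follows is a sketch of the same reshaping with the newer functions, assuming reasonably recent versions of dplyr, tidyr and rlang. The names toy, comparisons, comparison_exprs and diffs are hypothetical, the data are the Rat1 rows from above, and the comparison strings are written out by hand rather than built with outer.

```r
library(dplyr)
library(tidyr)
library(rlang)

# Rat1 rows from the example data (gene_category is not needed here).
toy <- data.frame(
  rat        = "Rat1",
  gene       = c("gene1", "gene2", "gene3", "gene4"),
  timepoint1 = c(23.36667, 18.26000, 42.05500, 40.08667),
  timepoint2 = c(23.49667, 18.38000, 41.45000, 39.89500)
)

# Same comparisons as colsToCreate, written out by hand for brevity.
comparisons <- c("gene1 - gene3", "gene2 - gene3",
                 "gene1 - gene4", "gene2 - gene4")

# Parse each string into an expression and name it after the comparison,
# so the new columns get readable names.
comparison_exprs <- setNames(parse_exprs(comparisons), comparisons)

diffs <- toy %>%
  # Stack both timepoint columns into one "value" column.
  pivot_longer(starts_with("timepoint"),
               names_to = "timepoint", values_to = "value") %>%
  # One column per gene, one row per rat and timepoint.
  pivot_wider(names_from = gene, values_from = value) %>%
  # Splice the parsed expressions in as new columns.
  mutate(!!!comparison_exprs)

print(diffs)
```

Gene names that are not syntactic R names would still need backticks inside the strings passed to parse_exprs, as in the original colsToCreate.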

Also note that you can name the vector that holds the column formulas, like so:

colsToCreate2 <-
  setNames(colsToCreate
           , c("nameA", "nameB", "nameC", "nameD"))

means %>%
  select(rat, gene, timepoint1) %>%
  spread(gene, timepoint1) %>%
  mutate_(.dots = colsToCreate2)

gives:

   rat    gene1    gene2  gene3    gene4     nameA     nameB     nameC     nameD
1 Rat1 23.36667 18.26000 42.055 40.08667 -18.68833 -23.79500 -16.72000 -21.82667
2 Rat2 25.29333 19.72667 42.055 40.08667 -16.76167 -22.32833 -14.79334 -20.36000
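Hard-coding names such as nameA only works while the number of comparisons is fixed. As a sketch of one alternative, the names could be derived from the comparison strings themselves. The "_minus_" convention and the variable names below are assumptions made for illustration only.

```r
# Comparison strings in the same form as colsToCreate.
comparisons <- c("`gene1` - `gene3`", "`gene2` - `gene3`",
                 "`gene1` - `gene4`", "`gene2` - `gene4`")

# Drop the backticks, then replace " - " with "_minus_"
# so that the resulting column names are syntactic.
no_ticks  <- gsub("`", "", comparisons, fixed = TRUE)
new_names <- gsub(" - ", "_minus_", no_ticks, fixed = TRUE)

named_comparisons <- setNames(comparisons, new_names)
print(named_comparisons)
```

The resulting named vector can be passed on in the same way as colsToCreate2.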

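I am not sure why, but this question got me excited enough that I wanted to finish the idea. Here, I gather the comparisons back into long form, then mutate the timepoint into a number using parse_number from readr, and separate the compared genes out into their own columns to allow efficient access and filtering. Note that the repeated use of each gene violates the assumption of independence, so do not run statistics on these values without thinking very carefully about how to control for that.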

longForm <-
  means %>%
  select(-gene_category) %>%
  gather("timepoint", "value", starts_with("timepoint")) %>%
  spread(gene, value) %>%
  mutate_(.dots = colsToCreate) %>%
  select_(.dots = paste0("-",unlist(geneLists))) %>%
  gather(Comparison, Difference, -rat, -timepoint) %>%
  mutate(time = parse_number(timepoint)) %>%
  separate(Comparison, c("exp_Gene", "cont_Gene"), " - ")

head(longForm)

   rat  timepoint exp_Gene cont_Gene Difference time
1 Rat1 timepoint1    gene1     gene3  -18.68833    1
2 Rat1 timepoint2    gene1     gene3  -17.95333    2
3 Rat2 timepoint1    gene1     gene3  -16.76167    1
4 Rat2 timepoint2    gene1     gene3  -18.62000    2
5 Rat1 timepoint1    gene2     gene3  -23.79500    1
6 Rat1 timepoint2    gene2     gene3  -23.07000    2
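The last two steps of that pipeline can be tried in isolation. The sketch below uses two hand-written rows in the shape produced by the second gather call (toy_long and toy_split are hypothetical names) and assumes readr and tidyr are installed.

```r
library(readr)
library(tidyr)

# Two hand-written rows shaped like the output of the second gather() step.
toy_long <- data.frame(
  rat        = c("Rat1", "Rat1"),
  timepoint  = c("timepoint1", "timepoint2"),
  Comparison = c("gene1 - gene3", "gene1 - gene3"),
  Difference = c(-18.68833, -17.95333)
)

# parse_number() extracts the first number found in each string.
toy_long$time <- parse_number(toy_long$timepoint)

# separate() splits Comparison on " - " into two new columns
# and drops the original column.
toy_split <- separate(toy_long, Comparison,
                      into = c("exp_Gene", "cont_Gene"), sep = " - ")
print(toy_split)
```

With the genes in separate columns, a single comparison can be picked out with an ordinary filter such as toy_split[toy_split$cont_Gene == "gene3", ].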

Then, we can plot the results:

longForm %>%
  ggplot(aes(x = time
             , y = Difference
             , col = rat)) +
  geom_line() +
  facet_grid(exp_Gene ~ cont_Gene)


Here is a solution using the latest development version (1.9.7+) of data.table:

library(data.table)
setDT(means)

# join on rat being same and gene categories not being same, discard unmatched rows
# then extract interesting columns
means[means, on = .(rat, gene_category > gene_category), nomatch = 0,
      .(rat, gene.exp = gene, gene.ctrl = i.gene,
        timediff1 = timepoint1 - i.timepoint1, timediff2 = timepoint2 - i.timepoint2)]
#    rat gene.exp gene.ctrl timediff1 timediff2
#1: Rat1    gene1     gene3 -18.68833 -17.95333
#2: Rat1    gene2     gene3 -23.79500 -23.07000
#3: Rat1    gene1     gene4 -16.72000 -16.39833
#4: Rat1    gene2     gene4 -21.82667 -21.51500
#5: Rat2    gene1     gene3 -16.76167 -18.62000
#6: Rat2    gene2     gene3 -22.32833 -22.25667
#7: Rat2    gene1     gene4 -14.79334 -17.06500
#8: Rat2    gene2     gene4 -20.36000 -20.70167

And if you want to generalize this to an arbitrary number of "timepoint" columns:

nm = grep("timepoint", names(means), value = TRUE)

means[means, on = .(rat, gene_category > gene_category), nomatch = 0,
      c(.(rat = rat, gene.exp = gene, gene.ctrl = i.gene),
        setDT(mget(nm)) - mget(paste0('i.', nm)))]
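The same pairing idea can also be expressed in base R, which the question lists as an acceptable option. This is a minimal sketch rather than a translation of the data.table code: it splits the rows by category and uses merge on rat alone, so every experimental row is paired with every control row of the same rat. The names toy, exp_rows, ctrl_rows, paired and result are hypothetical, and the data are the Rat1 rows from above.

```r
# Rat1 rows from the example data.
toy <- data.frame(
  rat           = "Rat1",
  gene          = c("gene1", "gene2", "gene3", "gene4"),
  gene_category = c("experimental", "experimental", "control", "control"),
  timepoint1    = c(23.36667, 18.26000, 42.05500, 40.08667),
  timepoint2    = c(23.49667, 18.38000, 41.45000, 39.89500)
)

# Pick up however many timepoint columns there are.
time_cols <- grep("^timepoint", names(toy), value = TRUE)
keep      <- c("rat", "gene", time_cols)

exp_rows  <- toy[toy$gene_category == "experimental", keep]
ctrl_rows <- toy[toy$gene_category == "control", keep]

# Merging on rat alone pairs each experimental row with each control row
# within the same rat; the suffixes mark which side a column came from.
paired <- merge(exp_rows, ctrl_rows, by = "rat", suffixes = c(".exp", ".ctrl"))

# Subtract control from experimental for all timepoint columns at once.
diffs <- paired[paste0(time_cols, ".exp")] - paired[paste0(time_cols, ".ctrl")]
names(diffs) <- sub("timepoint", "timediff", time_cols)

result <- cbind(paired[c("rat", "gene.exp", "gene.ctrl")], diffs)
print(result)
```

Because merge builds the full within-rat cross product, this is likely to use more memory than the join on large inputs, but for a table of this size the difference should not matter.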
