
R - Add Missing Pairs in Data Frame by Key and Update Value

I have a data frame with various subjects, each of whom contributed at least one tissue sample (e.g., Blood, Heart, Liver), while many of them contributed samples of multiple tissues. There are 31 unique tissues, and I want to create a 31 x 31 matrix indicating tissue pairs collected from a single subject. With row and column names being the names of the tissues, the diagonal would give the total number of subjects from whom a tissue sample was collected, and the off-diagonal entries would give the number of subjects who had given both (i.e., if a subject had given a heart and a lung sample, the intersection of the heart row/column and lung column/row would increase by 1).
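To make the target concrete, here is a minimal sketch of the matrix described above, built from toy data. The `samples` data frame and its `subject` and `tissue` columns are hypothetical stand-ins, since the original raw data is not shown; the idea is only that a subject-by-tissue 0/1 table multiplied by its own transpose gives per-tissue totals on the diagonal and shared-subject counts off the diagonal.

```r
# Toy data: one row per collected sample (hypothetical column names).
samples <- data.frame(
  subject = c("s1", "s1", "s2", "s2", "s2", "s3"),
  tissue  = c("Heart", "Lung", "Heart", "Lung", "Liver", "Liver")
)

# Subject-by-tissue incidence matrix: 1 if the subject gave that tissue.
tab <- table(samples$subject, samples$tissue)
incidence <- (unclass(tab) > 0) * 1

# t(incidence) %*% incidence:
#   diagonal     = number of subjects who gave each tissue
#   off-diagonal = number of subjects who gave both tissues
cooccur <- crossprod(incidence)
print(cooccur)
```

In this toy case two subjects gave both Heart and Lung, so the Heart/Lung cell is 2, while only one subject gave both Heart and Liver.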

So far, I have been able to get the data (using plyr) into a data frame counts that includes each unique pair found, along with the number of subjects who have contributed both tissue types. When SMTS1 and SMTS2 match, the value in Count indicates the total number of samples of that tissue.

> head(counts, n = 32L)
        SMTS1           SMTS2      Count
1  Adipose Tissue  Adipose Tissue   439
2  Adipose Tissue   Adrenal Gland   137
3  Adipose Tissue         Bladder    11
4  Adipose Tissue           Blood   423
5  Adipose Tissue    Blood Vessel   368
6  Adipose Tissue           Brain   146
7  Adipose Tissue          Breast   190
8  Adipose Tissue    Cervix Uteri     8
9  Adipose Tissue           Colon   248
10 Adipose Tissue       Esophagus   341
11 Adipose Tissue  Fallopian Tube     6
12 Adipose Tissue           Heart   266
13 Adipose Tissue          Kidney    33
14 Adipose Tissue           Liver   119
15 Adipose Tissue            Lung   285
16 Adipose Tissue          Muscle   380
17 Adipose Tissue           Nerve   290
18 Adipose Tissue           Ovary    99
19 Adipose Tissue        Pancreas   174
20 Adipose Tissue       Pituitary   102
21 Adipose Tissue        Prostate   105
22 Adipose Tissue  Salivary Gland    64
23 Adipose Tissue            Skin   423
24 Adipose Tissue Small Intestine    97
25 Adipose Tissue          Spleen   110
26 Adipose Tissue         Stomach   182
27 Adipose Tissue          Testis   168
28 Adipose Tissue         Thyroid   290
29 Adipose Tissue          Uterus    81
30 Adipose Tissue          Vagina    86
31  Adrenal Gland  Adipose Tissue   137
32  Adrenal Gland   Adrenal Gland   159
... [823 Additional Rows]

The way this is set up, each of the 31 tissues is present in counts$SMTS1, and counts$SMTS2 contains all of the tissues for which a pair exists. You'll see that for Adipose Tissue there are only 30 entries, indicating that there is one tissue type that is not found with Adipose Tissue.

What I would like to do is make it so that each unique value in SMTS1 is paired with each of the 31 possible tissues. In the case shown, for example, Adipose Tissue only has 30 pairs, indicating that one pair does not exist. In this case, the missing partner is Bone Marrow. What I would like, then, is for my counts data frame, upon recognizing that, to gain two additional rows

        SMTS1           SMTS2       Count
1  Adipose Tissue     Bone Marrow     0
2    Bone Marrow     Adipose Tissue   0

giving 0 values indicating that a pair doesn't exist. From there, I should have 961 numeric values, which will ultimately be the entries of my 31 x 31 matrix.

Here is what I have tried:

# Vector of 31 Tissues
tissues <- names(sampleTypes)
names(tissues) <- c("SMTS2")

# Replicate 31 times, one for each unique tissue in SMTS1
rep.tissues <- rep(tissues, 31)

# Make data frame column for merge
rep.df <- as.data.frame(t(rep.tissues))
names(rep.df) <- "SMTS2"

# Merge
match <- merge(counts, rep.df, by = "SMTS2", all.x = TRUE)

However, the output for this is large because of duplicates and, after removing those, I'm left with a data frame that is identical to the original counts. Additionally, I realize that this does nothing to fill in the counts$Count value with a 0 for each new row created.
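For reference, a minimal base R sketch of how a merge-based approach could produce the missing rows. Two things differ from the attempt above: the join table holds every (SMTS1, SMTS2) combination rather than a single column of tissue names, and the all-pairs table is the left side of the merge, because all.x = TRUE only keeps unmatched rows from the first argument. The toy `counts` data and the names `all.pairs` and `full` are assumptions for illustration, not the real data.

```r
# Toy stand-in for `counts`, with the (a, c) and (c, a) pairs missing.
counts <- data.frame(
  SMTS1 = c("a", "a", "b", "b", "b", "c", "c"),
  SMTS2 = c("a", "b", "a", "b", "c", "b", "c"),
  Count = c(5, 2, 2, 4, 1, 1, 3),
  stringsAsFactors = FALSE
)

tissues <- sort(unique(c(counts$SMTS1, counts$SMTS2)))

# Every ordered pair of tissues, one row per pair.
all.pairs <- expand.grid(SMTS1 = tissues, SMTS2 = tissues,
                         stringsAsFactors = FALSE)

# Left join on both key columns: every pair is kept,
# and pairs absent from `counts` get NA in Count.
full <- merge(all.pairs, counts, by = c("SMTS1", "SMTS2"), all.x = TRUE)

# Replace the NAs introduced by the join with 0.
full$Count[is.na(full$Count)] <- 0
print(full)
```

With 31 tissues the same steps would give the 961 rows mentioned above, with 0 in Count for each pair that was never observed.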

TL;DR: I need to create all missing pairwise values and update a third column with a 0 for each row created. These will be used to fill in a 31 x 31 matrix showing which tissues had been collected together from the same subject.

You can use tidyr::spread:

# Some simulated data

library(tidyverse)  # will conflict with plyr
df <- expand.grid(c1 = letters[1:4], c2 = letters[1:4]) %>%
  mutate(Count = round(runif(16, 1, 100))) %>%
  slice(-c(3, 7, 12))  # missing pairs

df %>% spread(key = c2, value = Count, fill = 0)

# A tibble: 4 x 5
      c1     a     b     c     d
* <fctr> <dbl> <dbl> <dbl> <dbl>
1      a     5    16    18    16
2      b    23    38    58    93
3      c     0     0    81    47
4      d    78    32     0    34

The fill argument puts zeros in where there is no data.
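If the end goal is an actual numeric matrix rather than a wide data frame, a base R alternative is xtabs, which sums Count over each (SMTS1, SMTS2) combination and leaves 0 where a combination never occurs. This is only a sketch on toy data; the `counts` values below are made up, and setting the factor levels explicitly is an assumption added so that a tissue missing from one column still gets its own row and column.

```r
# Toy stand-in for `counts`, with the (a, c) and (c, a) pairs missing.
counts <- data.frame(
  SMTS1 = c("a", "a", "b", "b", "b", "c", "c"),
  SMTS2 = c("a", "b", "a", "b", "c", "b", "c"),
  Count = c(5, 2, 2, 4, 1, 1, 3),
  stringsAsFactors = FALSE
)

# Use the same level set for both columns so the result is square.
tissues <- sort(unique(c(counts$SMTS1, counts$SMTS2)))
counts$SMTS1 <- factor(counts$SMTS1, levels = tissues)
counts$SMTS2 <- factor(counts$SMTS2, levels = tissues)

# Cross-tabulate: cells with no matching row are 0.
tab <- xtabs(Count ~ SMTS1 + SMTS2, data = counts)

# Drop the table class to get a plain numeric matrix with dimnames.
mat <- matrix(tab, nrow = length(tissues),
              dimnames = list(tissues, tissues))
print(mat)
```

The missing pairs show up as zeros in the matrix without ever adding rows to counts.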
