R - Add Missing Pairs in Data Frame by Key and Update Value
I have a data frame of subjects, each of whom contributed at least one tissue sample (e.g., Blood, Heart, Liver), and many of whom contributed samples of multiple tissues. There are 31 unique tissues, and I want to create a 31 x 31 matrix indicating which tissue pairs were collected from a single subject. With the tissue names as row and column names, the diagonal would give the total number of subjects from whom a sample of that tissue was collected, and the off-diagonal entries would give the number of subjects who gave both tissues (i.e., if a subject gave a heart and a lung sample, the cell at the intersection of the heart row/column and the lung column/row would increase by 1).
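For illustration, a matrix of this shape can be computed directly from per-sample records when those are available. The sketch below is only that, a sketch: it assumes a hypothetical data frame samples with one row per sample and columns SUBJID (subject) and SMTS (tissue). Those names and the toy data are assumptions, not part of the original post.

```r
# Toy per-sample records; the column names SUBJID and SMTS are assumptions.
samples <- data.frame(
  SUBJID = c("s1", "s1", "s1", "s2", "s2", "s3"),
  SMTS   = c("Blood", "Heart", "Lung", "Blood", "Heart", "Blood"),
  stringsAsFactors = FALSE
)

# Subject-by-tissue incidence matrix: 1 if the subject gave that tissue
# at least once, 0 otherwise.
incidence <- (unclass(table(samples$SUBJID, samples$SMTS)) > 0) * 1

# crossprod(x) is t(x) %*% x: the diagonal counts subjects per tissue,
# and each off-diagonal cell counts subjects who gave both tissues.
cooc <- crossprod(incidence)
print(cooc)
```

Tissue pairs that never occur together come out as 0 automatically, so no rows need to be added by hand with this route.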
So far, I have been able to get the data (using plyr) into a data frame counts that includes each unique pair found, along with the number of subjects who contributed both tissue types. When SMTS1 and SMTS2 match, the value in Count indicates the total number of samples of that tissue.
> head(counts, n = 32L)
            SMTS1           SMTS2 Count
 1 Adipose Tissue  Adipose Tissue   439
 2 Adipose Tissue   Adrenal Gland   137
 3 Adipose Tissue         Bladder    11
 4 Adipose Tissue           Blood   423
 5 Adipose Tissue    Blood Vessel   368
 6 Adipose Tissue           Brain   146
 7 Adipose Tissue          Breast   190
 8 Adipose Tissue    Cervix Uteri     8
 9 Adipose Tissue           Colon   248
10 Adipose Tissue       Esophagus   341
11 Adipose Tissue  Fallopian Tube     6
12 Adipose Tissue           Heart   266
13 Adipose Tissue          Kidney    33
14 Adipose Tissue           Liver   119
15 Adipose Tissue            Lung   285
16 Adipose Tissue          Muscle   380
17 Adipose Tissue           Nerve   290
18 Adipose Tissue           Ovary    99
19 Adipose Tissue        Pancreas   174
20 Adipose Tissue       Pituitary   102
21 Adipose Tissue        Prostate   105
22 Adipose Tissue  Salivary Gland    64
23 Adipose Tissue            Skin   423
24 Adipose Tissue Small Intestine    97
25 Adipose Tissue          Spleen   110
26 Adipose Tissue         Stomach   182
27 Adipose Tissue          Testis   168
28 Adipose Tissue         Thyroid   290
29 Adipose Tissue          Uterus    81
30 Adipose Tissue          Vagina    86
31  Adrenal Gland  Adipose Tissue   137
32  Adrenal Gland   Adrenal Gland   159
... [823 Additional Rows]
The way this is set up, each of the 31 tissues is present in counts$SMTS1, and counts$SMTS2 contains all of the tissues for which a pair exists. You'll see that Adipose Tissue has only 30 entries, indicating that there is one tissue type that is never found together with Adipose Tissue.
What I would like to do is pair each unique value in SMTS1 with each of the 31 possible tissues. In the case shown, for example, Adipose Tissue has only 30 pairs, indicating that one pair does not exist. In this case, the missing partner is Bone Marrow. What I would like, then, is for my counts data frame, upon recognizing that, to gain two additional rows:
           SMTS1          SMTS2 Count
1 Adipose Tissue    Bone Marrow     0
2    Bone Marrow Adipose Tissue     0
giving 0 values to indicate that the pair doesn't exist. From there, I should have 961 numeric values, which will ultimately become the entries of my 31 x 31 matrix.
Here is what I have tried:
# Vector of 31 Tissues
tissues <- names(sampleTypes)
names(tissues) <- c("SMTS2")
# Replicate 31 times, one for each unique tissue in SMTS1
rep.tissues <- rep(tissues, 31)
# Make data frame column for merge
rep.df <- as.data.frame(t(rep.tissues))
names(rep.df) <- "SMTS2"
# Merge
match <- merge(counts, rep.df, by = "SMTS2", all.x = TRUE)
However, the output of this is large because of duplicates, and after removing those I'm left with a data frame that is identical to the original counts. Additionally, I realize that this does nothing to fill in the counts$Count value with a 0 for each new row created.
TL;DR: I need to create all missing pairwise values and set a third column to 0 for each row created. These will be used to fill in a 31 x 31 matrix showing which tissues were collected together from the same subject.
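One way to make a merge-based approach fill in the gaps is to start from the full grid of pairs rather than from counts, so that missing pairs show up as NA and can then be set to 0. The following is a minimal base R sketch under that assumption; it uses a three-tissue stand-in for counts, and the Count values other than 439 and 423 are made up for illustration.

```r
# Toy stand-in for `counts`; the pair Adipose Tissue / Bone Marrow is absent.
# Values other than 439 and 423 are invented for this example.
counts <- data.frame(
  SMTS1 = c("Adipose Tissue", "Adipose Tissue", "Blood", "Blood", "Blood",
            "Bone Marrow", "Bone Marrow"),
  SMTS2 = c("Adipose Tissue", "Blood", "Adipose Tissue", "Blood",
            "Bone Marrow", "Blood", "Bone Marrow"),
  Count = c(439, 423, 423, 500, 90, 90, 120),
  stringsAsFactors = FALSE
)

tissues <- sort(unique(c(counts$SMTS1, counts$SMTS2)))

# Every ordered pair of tissues (length(tissues)^2 rows).
all.pairs <- expand.grid(SMTS1 = tissues, SMTS2 = tissues,
                         stringsAsFactors = FALSE)

# Keep every row of the full grid; pairs missing from `counts` get NA.
filled <- merge(all.pairs, counts, by = c("SMTS1", "SMTS2"), all.x = TRUE)

# Replace the NA counts with 0 and sort for readability.
filled$Count[is.na(filled$Count)] <- 0
filled <- filled[order(filled$SMTS1, filled$SMTS2), ]
print(filled)
```

The key difference from the attempt above is which side of the merge is kept whole: all.x = TRUE is applied to the complete grid, so rows that have no match in counts survive the merge.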
You can use tidyr::spread:
# Some simulated data
library(tidyverse)  # will conflict with plyr
df <- expand.grid(c1 = letters[1:4], c2 = letters[1:4]) %>%
  mutate(Count = round(runif(16, 1, 100))) %>%
  slice(-c(3, 7, 12))  # missing pairs
df %>% spread(key = c2, value = Count, fill = 0)
# A tibble: 4 x 5
      c1     a     b     c     d
* <fctr> <dbl> <dbl> <dbl> <dbl>
1      a     5    16    18    16
2      b    23    38    58    93
3      c     0     0    81    47
4      d    78    32     0    34
The fill argument puts zeros wherever there is no data.
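If the end goal is a numeric matrix rather than a wide data frame, base R's xtabs may be another option, since cells with no matching row come out as 0. This is a small sketch on made-up data that reuses the question's column names; it is not the answerer's method, and the Count values other than 439 and 423 are invented.

```r
# Toy stand-in for `counts`; the pair Adipose Tissue / Bone Marrow is absent.
counts <- data.frame(
  SMTS1 = c("Adipose Tissue", "Adipose Tissue", "Blood", "Blood", "Blood",
            "Bone Marrow", "Bone Marrow"),
  SMTS2 = c("Adipose Tissue", "Blood", "Adipose Tissue", "Blood",
            "Bone Marrow", "Blood", "Bone Marrow"),
  Count = c(439, 423, 423, 500, 90, 90, 120),
  stringsAsFactors = FALSE
)

tissues <- sort(unique(c(counts$SMTS1, counts$SMTS2)))

# Shared factor levels guarantee a square table with rows and columns
# in the same order.
counts$SMTS1 <- factor(counts$SMTS1, levels = tissues)
counts$SMTS2 <- factor(counts$SMTS2, levels = tissues)

# Sum Count within each (SMTS1, SMTS2) cell; empty cells become 0.
tab <- xtabs(Count ~ SMTS1 + SMTS2, data = counts)

# Drop the table classes to get a plain numeric matrix.
m <- matrix(tab, nrow = length(tissues), dimnames = list(tissues, tissues))
print(m)
```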