Comparing two data sets in R
I've searched here and on Google for hours and cannot find a solution to my problem.
I have two datasets containing genes. One is my own dataset (snap), and I need to see whether its genes appear in a second, larger dataset (catalog). I want the second column of snap (proxy) and the 21st column of catalog. This is what my datasets look like:
> head(snap)
SNP Proxy Distance RSquared DPrime
1 rs4246511 rs7540233 4541 0.874 1
2 rs4246511 rs4970634 15768 0.874 1
3 rs4246511 rs4532801 18960 0.874 1
4 rs4246511 rs9438982 22242 0.874 1
5 rs4246511 rs9438979 25034 0.874 1
6 rs4246511 rs4414011 25868 0.874 1
> head(catalog)
SNPS MERGED SNP_ID_CURRENT CONTEXT INTERGENIC
1 rs7079041 0 7079041 intron 0
2 rs7244261 0 7244261 intergenic 1
3 rs10448044 0 10448044 intergenic 1
4 rs2610025 0 2610025 intergenic 1
5 rs1472147 0 1472147 intron 0
6 rs2648708 0 2648708 intron 0
*This is only a small part of each dataset.
To make it more complicated, I'd also like to be able to pull the whole row of data from both data sets.
For the first part of my question I tried using compare() (which I found in another similar question here). To simplify things, I extracted the columns I needed (proxy is my column from snap and catalogsnps is the column from catalog):
comparison <- compare(proxy, catalogsnps, allowAll=TRUE)
comparison$tM
difference <- data.frame(lapply(2:ncol(proxy),function(i)setdiff(cacheGenericsMetaData[,i],comparison$tM[,i])))
colnames(difference) <- colnames(proxy)
write.table(difference, file="difference.csv", sep=";", dec=".")
However, with this syntax my output is simply a list of all my SNPs from snap.
Output
1054 6267
1055 6273
1056 6297
1057 6297
1058 6314
1059 6331
1060 6340
1061 6345
1062 6346
1063 6350
1064 6364
1065 6412
1066 6417
1067 6417
1068 6430
Since this was hard to read, I added the line that writes the result to a file I can open in Excel; it looks like this:
x
1 rs7079041
2 rs7244261
3 rs10448044
4 rs2610025
5 rs1472147
6 rs2648708
7 rs11891
8 rs1801725
9 rs6852678
10 rs3135758
11 rs6838240
12 rs6838240
13 rs603894
14 rs3764796
15 rs3764796
16 rs2073214
17 rs4971100
18 rs4971100
19 rs11718502
20 rs10888073
21 rs7032317
I also found another possible solution on here, but again I only got a list of my SNPs.
rows.diff <- function(catalog, proxy)
{
catalogsnps.vec <- apply(catalogsnps, 1, paste, collapse="")
proxy.vec <- apply(proxy, 1, paste, collaspse= "")
rows.diff <- catalogsnps[!catalogsnps.vec %in% proxy.vec,]
return(rows.diff)
}
write.table(rows.diff(catalogsnps, proxy), file="rowdiff.csv", sep=";", dec=",")
For the second part of my question, I am completely lost about where to start.
Many thanks for any help.
Claire
Why not just:
new.data <- merge(snap, catalog, by.x='Proxy', by.y='SNPS')
This should give you a new data frame containing only the rows where the two key columns match, with all of the columns from both original data frames. Note that by.x and by.y must be column names in snap and catalog themselves; going by the head() output in the question, those are Proxy and SNPS rather than the names of the extracted vectors (proxy and catalogsnps).
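As a rough illustration of both parts of the question, here is a minimal sketch on toy data. The two small data frames are stand-ins built from a few values in the head() output; one proxy (rs7079041) has been swapped in purely so that there is an overlap to find, and the column names Proxy and SNPS are assumed from that output. It uses %in% for the membership check and merge() for pulling full rows.

```r
# Toy stand-ins for snap and catalog. Values are taken from the head() output
# in the question, except that the second Proxy is replaced with rs7079041
# (an assumption, for illustration only) so that one SNP is shared.
snap <- data.frame(
  SNP      = c("rs4246511", "rs4246511", "rs4246511"),
  Proxy    = c("rs7540233", "rs7079041", "rs4532801"),
  Distance = c(4541, 15768, 18960),
  stringsAsFactors = FALSE
)
catalog <- data.frame(
  SNPS    = c("rs7079041", "rs7244261", "rs10448044"),
  CONTEXT = c("intron", "intergenic", "intergenic"),
  stringsAsFactors = FALSE
)

# Part 1: which proxies from snap also appear in the catalog?
in_catalog <- snap$Proxy %in% catalog$SNPS
shared     <- snap$Proxy[in_catalog]

# Negating the test gives the opposite set (proxies absent from the catalog),
# which is what setdiff() or `!x %in% y` returns.
not_shared <- snap$Proxy[!in_catalog]

# Part 2: full rows from both data frames for the shared SNPs (inner join).
matched <- merge(snap, catalog, by.x = "Proxy", by.y = "SNPS")
print(matched)
```

By default merge() keeps only the rows with a match in both data frames; passing all.x = TRUE would instead keep every row of snap and fill the catalog columns with NA where there is no match.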