
Comparing two data sets in R

I've searched here and on Google for hours and cannot find a solution to my problem.

I have two datasets containing genes. One is my own dataset (snap), and I need to see whether its genes are in the second, bigger dataset (catalog). The columns I want to compare are the second column in snap (Proxy) and the 21st column from catalog. This is what my datasets look like:

> head(snap)
        SNP     Proxy Distance RSquared DPrime
1 rs4246511 rs7540233     4541    0.874      1
2 rs4246511 rs4970634    15768    0.874      1
3 rs4246511 rs4532801    18960    0.874      1
4 rs4246511 rs9438982    22242    0.874      1
5 rs4246511 rs9438979    25034    0.874      1
6 rs4246511 rs4414011    25868    0.874      1 

> head(catalog)
        SNPS MERGED SNP_ID_CURRENT    CONTEXT INTERGENIC
1  rs7079041      0        7079041     intron          0
2  rs7244261      0        7244261 intergenic          1
3 rs10448044      0       10448044 intergenic          1
4  rs2610025      0        2610025 intergenic          1
5  rs1472147      0        1472147     intron          0
6  rs2648708      0        2648708     intron          0

*This is only a small part of the datasets.

To make it more complicated, I'd also like to be able to pull the whole row of data from both datasets for each match.

For the first part of my question I have tried using compare() (which I found in another similar question here). To simplify things, I extracted the columns I needed (proxy is my column from snap and catalogsnps is the column from catalog):

comparison <- compare(proxy, catalogsnps, allowAll=TRUE)
comparison$tM

difference <- data.frame(lapply(2:ncol(proxy),function(i)setdiff(cacheGenericsMetaData[,i],comparison$tM[,i])))
colnames(difference) <- colnames(proxy)
write.table(difference, file="difference.csv", sep=";", dec=".")

However, with this syntax my output is simply a list of all my SNPs from snap.

Output

1054  6267
1055  6273
1056  6297
1057  6297
1058  6314
1059  6331
1060  6340
1061  6345
1062  6346
1063  6350
1064  6364
1065  6412
1066  6417
1067  6417
1068  6430

Since this was hard to read, I added the write.table line to get a file I could open in Excel. It looks like this:

x   
1   rs7079041
2   rs7244261
3   rs10448044
4   rs2610025
5   rs1472147
6   rs2648708
7   rs11891
8   rs1801725
9   rs6852678
10  rs3135758
11  rs6838240
12  rs6838240
13  rs603894
14  rs3764796
15  rs3764796
16  rs2073214
17  rs4971100
18  rs4971100
19  rs11718502
20  rs10888073
21  rs7032317

I also found another possible solution on here, but again I only got a list of my SNPs.

rows.diff <- function(catalog, proxy)
{
  catalogsnps.vec <- apply(catalogsnps, 1, paste, collapse="")
  proxy.vec <- apply(proxy, 1, paste, collaspse= "")
  rows.diff <- catalogsnps[!catalogsnps.vec %in% proxy.vec,]
  return(rows.diff)
}
write.table(rows.diff(catalogsnps, proxy), file="rowdiff.csv", sep=";", dec=",")

For the second part of my question, I am completely lost about where to start.

Many thanks for any help,

Claire

Why not just:

new.data <- merge(snap, catalog, by.x = "Proxy", by.y = "SNPS")

This should give you a new data frame whose rows are only those where Proxy (in snap) and SNPS (in catalog) match, and whose columns include all of the columns from both original data frames.
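As a rough illustration of both parts of the question, here is a minimal sketch on toy data. The column names Proxy and SNPS are read off the head() output in the question. The toy rows are assumptions: rs7244261 is placed in snap purely so that one match exists, and only a few columns are included. `%in%` answers the "is it in the catalog?" part, and `merge()` (an inner join by default) pulls the full rows from both data frames.

```r
# Toy stand-ins for the two data frames. Values are borrowed from the
# head() output in the question, except that rs7244261 is put into snap
# for illustration only, so that exactly one proxy matches the catalog.
snap <- data.frame(
  SNP      = c("rs4246511", "rs4246511", "rs4246511"),
  Proxy    = c("rs7540233", "rs7244261", "rs4532801"),
  RSquared = c(0.874, 0.874, 0.874),
  stringsAsFactors = FALSE
)
catalog <- data.frame(
  SNPS    = c("rs7079041", "rs7244261", "rs10448044"),
  CONTEXT = c("intron", "intergenic", "intergenic"),
  stringsAsFactors = FALSE
)

# Part 1: logical vector marking which proxies appear in the catalog
in_catalog <- snap$Proxy %in% catalog$SNPS
snap[in_catalog, ]

# Part 2: full rows from both data frames for the matching SNPs
# (all = FALSE is the default, so non-matching rows are dropped)
matched <- merge(snap, catalog, by.x = "Proxy", by.y = "SNPS")
matched
```

If the real data gives zero matches, it may be worth checking for stray whitespace or factor columns in the two ID columns before merging.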


 