How can I compare rows of ONE dataframe in R?

I have a dataframe with many rows and at least 13 columns. I need to compare each row with the previous one to see whether it is identical in two columns and different in the rest.

If two rows are equal in those two columns, I would like to put those rows in a new dataframe.

Here is my dataframe.

[image]

The first three rows have the same "Sample", but only two of them have the same "Gene". Rows 7 and 8 also have the same sample and gene.

I would like to have a NEW DATAFRAME with only the rows that have the same sample and the same gene. Like this:

[image]

I wrote this code:

Vec_sample <- c()    
Vec_genes <- c()
Vec_variants <- c()
Vec_chr <- c()
Vec_coordinate <- c()
Vec_aa <- c()
Vec_Rs <- c()
`%notin%` <- Negate(`%in%`)


for (row in 1:nrow(dataframe))
{
  for (row_compare in 1:nrow(dataframe))
  {
    if ((dataframe$Gene[row] == dataframe$Gene[row_compare]) 
        & (row != row_compare))
    {
      if ((dataframe$Sample[row] %notin% Vec_sample) &
          (dataframe$Sample[row] == dataframe$Sample[row_compare]))
      {
        
        Vec_sample <- c(Vec_sample , dataframe$Sample[row])
        Vec_sample <- c(Vec_sample , dataframe$Sample[row_compare])
        Vec_genes <- c(Vec_genes, dataframe$Gene[row])
        Vec_genes <- c(Vec_genes, dataframe$Gene[row_compare])
        Vec_variants <- c(Vec_variants , dataframe$Variant[row])
        Vec_variants <- c(Vec_variants , dataframe$Variant[row_compare])
        Vec_chr <- c(Vec_chr , dataframe$Chr[row])
        Vec_chr <- c(Vec_chr , dataframe$Chr[row_compare])
        Vec_coordinate <- c(Vec_coordinate, dataframe$Coordinate[row])
        Vec_coordinate <- c(Vec_coordinate, dataframe$Coordinate[row_compare])
        Vec_aa <- c(Vec_aa , dataframe$aa[row])
        Vec_aa <- c(Vec_aa , dataframe$aa[row_compare])
        Vec_Rs <- c(Vec_Rs , dataframe$Rs[row])
        Vec_Rs <- c(Vec_Rs , dataframe$Rs[row_compare])
      }
    }
  }
}

Finally, when the loops are finished, I create a dataframe with the results.

final_dataframe <- data.frame(Vec_sample, Vec_genes, Vec_variants, Vec_chr, Vec_coordinate, Vec_aa, Vec_Rs)

Everything is added twice inside the loops because I need both rows of each pair that share the same sample and gene (and, of course, the rest of their information). I wrote two for loops because I wanted to compare the current gene with every other one.

The problem: once a sample is already saved in the vector "Vec_sample", my script won't save another pair with the same sample. (For example, with sample 14-043, it first saves the ALG9 pair, but it won't save the MNS1 pair.)

Here is my wrong new dataframe.

[image]

I added that exception because, when the two loops run, the table is checked more than once, so each gene pair would be saved several times and repeated.

Sorry if my syntax or way of programming is inefficient; I'm just starting out and I'm not an expert.

I hope I have explained myself well. Thank you very much in advance.

Here is the input data:

 structure(list(Sample = c("14-043", "14-043", "14-043", "14-043", 
"14-043", "14-043", "14-077", "14-077", "13-340", "15-642", "15-642", 
"15-642", "12-975"), Gene = c("ALG9", "ALG10B", "ALG9", "SLC5A9", 
"MNS1", "MNS1", "ALG9", "ALG9", "GPI", "MNS1", "HK3", "MNS1", 
"HK3"), Variant = c("T>T/G", "C>A/G", "C>C/G", "A>A/T", "A>T/T", 
"C>C/T", "T>T/G", "C>C/G", "C>G/G", "A>T/T", "T>T/A", "C>C/T", 
"T>T/A"), Chr = c(4, 4, 4, 13, 2, 2, 4, 4, 20, 2, 8, 2, 8), Coordinate = c(23410158, 
3422351, 23410451, 2341043423, 324652341, 3246520, 23410158, 
23410451, 234541, 324652341, 23412341, 3246520, 23412341), aa = c("Gly44Thr", 
"His8Pro", "Ser44Thr", "Thr4Pro", "Ala45Ala", "Ala45Leu", "Gly44Thr", 
"Ser44Thr", "Phe3Ala", "Ala45Ala", "Val34His", "Ala45Leu", "Val34His"
), Rs = c("rs1715919", "rs1734532413", "rs1732413", "rs173240", 
"rs12305", "rs10356", "rs1715919", "rs1732413", "rs12342", "rs12305", 
"rs9997", "rs10356", "rs9997")), row.names = c(NA, -13L), class = "data.frame")

If I understand you correctly, you can do this with dplyr::filter, using lead and lag to check the previous and next rows:

df <- structure(list(
  Sample = c("14-043", "14-043", "14-043", "14-043", "14-043", "14-077",
             "14-077", "13-340", "15-642", "15-642", "12-975"),
  Gene = c("ALG9", "ALG9", "SLC5A9", "MNS1", "MNS1", "ALG9", "ALG9", "GPI",
           "MNS1", "MNS1", "HK3"),
  Variant = c("T>T/G", "C>C/G", "A>A/T", "A>T/T", "C>C/T", "T>T/G", "C>C/G",
              "C>G/G", "A>T/T", "C>C/T", "T>T/A"),
  Chr = c(4, 4, 13, 2, 2, 4, 4, 20, 2, 2, 8),
  Coordinate = c(23410158, 23410451, 2341043423, 324652341, 3246520,
                 23410158, 23410451, 234541, 324652341, 3246520, 23412341),
  aa = c("Gly44Thr", "Ser44Thr", "Thr4Pro", "Ala45Ala", "Ala45Leu",
         "Gly44Thr", "Ser44Thr", "Phe3Ala", "Ala45Ala", "Ala45Leu",
         "Val34His"),
  Rs = c("rs1715919", "rs1732413", "rs173240", "rs12305", "rs10356",
         "rs1715919", "rs1732413", "rs12342", "rs12305", "rs10356",
         "rs9997")
), row.names = c(NA, -11L), class = "data.frame")
library(tidyverse)
df %>% filter(Sample == lag(Sample) | Sample == lead(Sample), 
              Gene == lag(Gene) | Gene == lead(Gene))
#>   Sample Gene Variant Chr Coordinate       aa        Rs
#> 1 14-043 ALG9   T>T/G   4   23410158 Gly44Thr rs1715919
#> 2 14-043 ALG9   C>C/G   4   23410451 Ser44Thr rs1732413
#> 3 14-043 MNS1   A>T/T   2  324652341 Ala45Ala   rs12305
#> 4 14-043 MNS1   C>C/T   2    3246520 Ala45Leu   rs10356
#> 5 14-077 ALG9   T>T/G   4   23410158 Gly44Thr rs1715919
#> 6 14-077 ALG9   C>C/G   4   23410451 Ser44Thr rs1732413
#> 7 15-642 MNS1   A>T/T   2  324652341 Ala45Ala   rs12305
#> 8 15-642 MNS1   C>C/T   2    3246520 Ala45Leu   rs10356
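
One thing to watch with the filter call above: its two conditions are evaluated independently, so a row whose Sample matches the previous row and whose Gene matches the next row would also pass, even though neither neighbour matches on both columns. The following is a minimal base R sketch, not part of the original answer, that requires both columns to match the same adjacent row. The helper name adjacent_pairs and the toy data are made up for illustration, and it assumes Sample and Gene contain no missing values.

```r
# Hypothetical helper (not from the original answer): keep only rows whose
# Sample AND Gene both match the same adjacent row (previous or next).
# Assumes the Sample and Gene columns contain no NA values.
adjacent_pairs <- function(d) {
  n <- nrow(d)
  if (n < 2) return(d[0, , drop = FALSE])
  # same_as_next[i] is TRUE when row i and row i + 1 share Sample and Gene
  same_as_next <- d$Sample[-n] == d$Sample[-1] & d$Gene[-n] == d$Gene[-1]
  # a row is kept if it matches the row after it or the row before it
  keep <- c(same_as_next, FALSE) | c(FALSE, same_as_next)
  d[keep, , drop = FALSE]
}

# Toy data (made up): row 4 shares Sample with row 5 and Gene with row 3,
# but has no neighbour that matches on both columns.
toy <- data.frame(
  Sample = c("s1", "s1", "s1", "s2", "s2"),
  Gene   = c("A", "B", "B", "B", "C")
)
adjacent_pairs(toy)
```

On this toy data the helper keeps rows 2 and 3 only, whereas the two independent lag/lead conditions would also keep row 4.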

Created on 2020-08-11 by the reprex package (v0.3.0)
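
The lag/lead approach only compares neighbouring rows. In the data posted in the question, the two 14-043/ALG9 rows are separated by an ALG10B row, and the two 15-642/MNS1 rows by an HK3 row, so an adjacent-row check would miss them. If the goal is every row whose Sample/Gene combination occurs more than once, wherever it sits in the table, one possible alternative is base R's duplicated(). This is a sketch under that assumption, not the method of the answer above; the names dat, key, in_pair and pairs_df are chosen here for illustration.

```r
# Sample and Gene columns copied from the question's dput() output;
# the remaining columns are omitted here for brevity.
dat <- data.frame(
  Sample = c("14-043", "14-043", "14-043", "14-043", "14-043", "14-043",
             "14-077", "14-077", "13-340", "15-642", "15-642", "15-642",
             "12-975"),
  Gene   = c("ALG9", "ALG10B", "ALG9", "SLC5A9", "MNS1", "MNS1", "ALG9",
             "ALG9", "GPI", "MNS1", "HK3", "MNS1", "HK3")
)

key <- dat[c("Sample", "Gene")]
# duplicated() flags the second and later occurrences of a combination;
# with fromLast = TRUE it flags the earlier ones instead. The union marks
# every row that belongs to a repeated Sample/Gene combination.
in_pair <- duplicated(key) | duplicated(key, fromLast = TRUE)
pairs_df <- dat[in_pair, ]
pairs_df
```

With this data, pairs_df contains rows 1, 3, 5, 6, 7, 8, 10 and 12. Because each row is tested once against the whole table, no pair is saved twice and no per-sample exception is needed.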
