
How can I compare rows of ONE dataframe in R?

I have a dataframe with many rows and at least 13 columns. I need to compare each row with the previous one to see whether it matches in two columns and differs in the rest.

If two rows are equal in those two columns, I would like to put them in a new dataframe.

Here is my dataframe:

[image: enter image description here]

The first three rows have the same "Sample", but only two of them have the same "Gene". Rows 7 and 8 also have the same sample and gene.

I would like to have a NEW DATAFRAME with only the rows that have the same sample and same gene. Like this:

[image: enter image description here]

I wrote this code:

Vec_sample <- c()    
Vec_genes <- c()
Vec_variants <- c()
Vec_chr <- c()
Vec_coordinate <- c()
Vec_aa <- c()
Vec_Rs <- c()
`%notin%` <- Negate(`%in%`)


for (row in 1:nrow(dataframe))
{
  for (row_compare in 1:nrow(dataframe))
  {
    if ((dataframe$Gene[row] == dataframe$Gene[row_compare]) 
        & (row != row_compare))
    {
      if ((dataframe$Sample[row] %notin% Vec_sample) &
          (dataframe$Sample[row] == dataframe$Sample[row_compare]))
      {
        
        Vec_sample <- c(Vec_sample , dataframe$Sample[row])
        Vec_sample <- c(Vec_sample , dataframe$Sample[row_compare])
        Vec_genes <- c(Vec_genes, dataframe$Gene[row])
        Vec_genes <- c(Vec_genes, dataframe$Gene[row_compare])
        # The column is named "Variant" in the input data, not "Variants"
        Vec_variants <- c(Vec_variants , dataframe$Variant[row])
        Vec_variants <- c(Vec_variants , dataframe$Variant[row_compare])
        Vec_chr <- c(Vec_chr , dataframe$Chr[row])
        Vec_chr <- c(Vec_chr , dataframe$Chr[row_compare])
        Vec_coordinate <- c(Vec_coordinate, dataframe$Coordinate[row])
        Vec_coordinate <- c(Vec_coordinate, dataframe$Coordinate[row_compare])
        Vec_aa <- c(Vec_aa , dataframe$aa[row])
        Vec_aa <- c(Vec_aa , dataframe$aa[row_compare])
        Vec_Rs <- c(Vec_Rs , dataframe$Rs[row])
        Vec_Rs <- c(Vec_Rs , dataframe$Rs[row_compare])
      }
    }
  }
}

Finally, when the loops finish, I create a dataframe with the results.

final_dataframe <- data.frame(Vec_sample, Vec_genes, Vec_variants, Vec_chr, Vec_coordinate, Vec_aa, Vec_Rs)

Everything is duplicated in the loops because I need both rows of each pair that share the same sample and gene (and, of course, the rest of their information). I wrote two for loops because I wanted to compare the current row's gene with every other row's gene.

The problem: once a sample is saved in the vector "Vec_sample", my script won't save any other pair with that same sample. (For example, with sample 14-043 it first saves the ALG9 pair, but then it won't save the MNS1 pair.)
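One way to avoid this is to test the (Sample, Gene) pair as a unit instead of tracking samples alone. The following is a minimal base R sketch, not the asker's code: it assumes a reduced copy of the posted input with only the Sample, Gene and Rs columns, and it keeps every row whose pair occurs more than once, wherever it sits in the table. Rows that are identical in every column would also be kept, so an extra check would be needed if those must be excluded.

```r
# Reduced copy of the posted input (only three of the columns, for brevity)
dataframe <- data.frame(
  Sample = c("14-043", "14-043", "14-043", "14-043", "14-043", "14-043",
             "14-077", "14-077", "13-340", "15-642", "15-642", "15-642",
             "12-975"),
  Gene   = c("ALG9", "ALG10B", "ALG9", "SLC5A9", "MNS1", "MNS1",
             "ALG9", "ALG9", "GPI", "MNS1", "HK3", "MNS1", "HK3"),
  Rs     = c("rs1715919", "rs1734532413", "rs1732413", "rs173240",
             "rs12305", "rs10356", "rs1715919", "rs1732413", "rs12342",
             "rs12305", "rs9997", "rs10356", "rs9997"),
  stringsAsFactors = FALSE
)

# The key is the pair of columns, not the sample alone
key <- dataframe[, c("Sample", "Gene")]

# duplicated() flags every occurrence after the first; scanning again from
# the end flags the first occurrence too, so both rows of a pair are marked
is_repeated <- duplicated(key) | duplicated(key, fromLast = TRUE)

final_dataframe <- dataframe[is_repeated, ]
print(final_dataframe)
```

Because the test is on the pair, both the ALG9 and the MNS1 rows of sample 14-043 are retained.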

Here is my wrong new dataframe:

[image: enter image description here]

I added that exception because the two loops check the table more than once, so without it each gene pair would be saved several times.
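If the loops are kept, the repeated saving can be avoided without the "Vec_sample" exception by recording row indices and removing duplicate indices at the end. This is only a sketch under assumptions: the `keep` vector is a name introduced here for illustration, and the data is a reduced copy of the posted input with three of its columns.

```r
# Reduced copy of the posted input (only three of the columns, for brevity)
dataframe <- data.frame(
  Sample = c("14-043", "14-043", "14-043", "14-043", "14-043", "14-043",
             "14-077", "14-077", "13-340", "15-642", "15-642", "15-642",
             "12-975"),
  Gene   = c("ALG9", "ALG10B", "ALG9", "SLC5A9", "MNS1", "MNS1",
             "ALG9", "ALG9", "GPI", "MNS1", "HK3", "MNS1", "HK3"),
  Rs     = c("rs1715919", "rs1734532413", "rs1732413", "rs173240",
             "rs12305", "rs10356", "rs1715919", "rs1732413", "rs12342",
             "rs12305", "rs9997", "rs10356", "rs9997"),
  stringsAsFactors = FALSE
)

keep <- integer(0)   # hypothetical helper: indices of rows to retain
n <- nrow(dataframe)

# Compare each row only with the rows after it, so every pair is visited once
for (row in seq_len(max(n - 1, 0))) {
  for (row_compare in (row + 1):n) {
    if (dataframe$Sample[row] == dataframe$Sample[row_compare] &&
        dataframe$Gene[row] == dataframe$Gene[row_compare]) {
      keep <- c(keep, row, row_compare)
    }
  }
}

# A row that takes part in several matches is listed once
keep <- sort(unique(keep))

# Indexing by row keeps all columns, so no per-column vectors are needed
final_dataframe <- dataframe[keep, ]
print(final_dataframe)
```

Collecting indices rather than one vector per column also means a misspelled column name cannot silently drop a column from the result.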

Sorry if my syntax or programming style is inefficient; I'm just starting out and I'm not an expert.

I hope I have explained myself well. Thank you very much in advance.

Here is the input data:

 structure(list(Sample = c("14-043", "14-043", "14-043", "14-043", 
"14-043", "14-043", "14-077", "14-077", "13-340", "15-642", "15-642", 
"15-642", "12-975"), Gene = c("ALG9", "ALG10B", "ALG9", "SLC5A9", 
"MNS1", "MNS1", "ALG9", "ALG9", "GPI", "MNS1", "HK3", "MNS1", 
"HK3"), Variant = c("T>T/G", "C>A/G", "C>C/G", "A>A/T", "A>T/T", 
"C>C/T", "T>T/G", "C>C/G", "C>G/G", "A>T/T", "T>T/A", "C>C/T", 
"T>T/A"), Chr = c(4, 4, 4, 13, 2, 2, 4, 4, 20, 2, 8, 2, 8), Coordinate = c(23410158, 
3422351, 23410451, 2341043423, 324652341, 3246520, 23410158, 
23410451, 234541, 324652341, 23412341, 3246520, 23412341), aa = c("Gly44Thr", 
"His8Pro", "Ser44Thr", "Thr4Pro", "Ala45Ala", "Ala45Leu", "Gly44Thr", 
"Ser44Thr", "Phe3Ala", "Ala45Ala", "Val34His", "Ala45Leu", "Val34His"
), Rs = c("rs1715919", "rs1734532413", "rs1732413", "rs173240", 
"rs12305", "rs10356", "rs1715919", "rs1732413", "rs12342", "rs12305", 
"rs9997", "rs10356", "rs9997")), row.names = c(NA, -13L), class = "data.frame")

If I understand you correctly, you can do this with dplyr::filter, using lag and lead to check the previous and next rows:

df <- structure(list(
  Sample = c("14-043", "14-043", "14-043", "14-043", "14-043", "14-077",
             "14-077", "13-340", "15-642", "15-642", "12-975"),
  Gene = c("ALG9", "ALG9", "SLC5A9", "MNS1", "MNS1", "ALG9",
           "ALG9", "GPI", "MNS1", "MNS1", "HK3"),
  Variant = c("T>T/G", "C>C/G", "A>A/T", "A>T/T", "C>C/T", "T>T/G",
              "C>C/G", "C>G/G", "A>T/T", "C>C/T", "T>T/A"),
  Chr = c(4, 4, 13, 2, 2, 4, 4, 20, 2, 2, 8),
  Coordinate = c(23410158, 23410451, 2341043423, 324652341, 3246520,
                 23410158, 23410451, 234541, 324652341, 3246520, 23412341),
  aa = c("Gly44Thr", "Ser44Thr", "Thr4Pro", "Ala45Ala", "Ala45Leu",
         "Gly44Thr", "Ser44Thr", "Phe3Ala", "Ala45Ala", "Ala45Leu", "Val34His"),
  Rs = c("rs1715919", "rs1732413", "rs173240", "rs12305", "rs10356",
         "rs1715919", "rs1732413", "rs12342", "rs12305", "rs10356", "rs9997")
), row.names = c(NA, -11L), class = "data.frame")
library(tidyverse)
df %>% filter(Sample == lag(Sample) | Sample == lead(Sample), 
              Gene == lag(Gene) | Gene == lead(Gene))
#>   Sample Gene Variant Chr Coordinate       aa        Rs
#> 1 14-043 ALG9   T>T/G   4   23410158 Gly44Thr rs1715919
#> 2 14-043 ALG9   C>C/G   4   23410451 Ser44Thr rs1732413
#> 3 14-043 MNS1   A>T/T   2  324652341 Ala45Ala   rs12305
#> 4 14-043 MNS1   C>C/T   2    3246520 Ala45Leu   rs10356
#> 5 14-077 ALG9   T>T/G   4   23410158 Gly44Thr rs1715919
#> 6 14-077 ALG9   C>C/G   4   23410451 Ser44Thr rs1732413
#> 7 15-642 MNS1   A>T/T   2  324652341 Ala45Ala   rs12305
#> 8 15-642 MNS1   C>C/T   2    3246520 Ala45Leu   rs10356

Created on 2020-08-11 by the reprex package (v0.3.0)
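
Note that lag and lead only look at the rows directly before and after, so the filter above relies on matching rows being adjacent, as they are in the 11-row example data used here. In the 13-row input posted with the question, the two ALG9 rows of sample 14-043 are separated by an ALG10B row. A possible alternative that does not depend on row order is to group by both columns and keep groups with more than one row. This is a sketch rather than part of the original answer, and it assumes a reduced copy of the question's input with three of its columns:

```r
library(dplyr)

# Reduced copy of the question's 13-row input (three columns, for brevity)
dataframe <- data.frame(
  Sample = c("14-043", "14-043", "14-043", "14-043", "14-043", "14-043",
             "14-077", "14-077", "13-340", "15-642", "15-642", "15-642",
             "12-975"),
  Gene   = c("ALG9", "ALG10B", "ALG9", "SLC5A9", "MNS1", "MNS1",
             "ALG9", "ALG9", "GPI", "MNS1", "HK3", "MNS1", "HK3"),
  Rs     = c("rs1715919", "rs1734532413", "rs1732413", "rs173240",
             "rs12305", "rs10356", "rs1715919", "rs1732413", "rs12342",
             "rs12305", "rs9997", "rs10356", "rs9997"),
  stringsAsFactors = FALSE
)

final_dataframe <- dataframe %>%
  group_by(Sample, Gene) %>%   # one group per (Sample, Gene) pair
  filter(n() > 1) %>%          # keep pairs that occur more than once
  ungroup()

print(final_dataframe)
```

On this input the result has eight rows, including the non-adjacent ALG9 rows of sample 14-043.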
