I have a dataframe with many rows and at least 13 columns. I need to compare each row with the previous one to see whether it has exactly the same values in two columns and differs in the rest.
If two rows are equal in those two columns, I would like to put those rows in a new dataframe.
Here is my dataframe.
The first three rows have the same "Sample", but only two of them have the same "Gene". Rows 7 and 8 also share the same sample and gene.
I would like to have a NEW DATAFRAME containing only the rows that have the same sample and the same gene. Like this:
I wrote this code:
Vec_sample <- c()
Vec_genes <- c()
Vec_variants <- c()
Vec_chr <- c()
Vec_coordinate <- c()
Vec_aa <- c()
Vec_Rs <- c()
`%notin%` <- Negate(`%in%`)
# Compare every row with every other row
for (row in 1:nrow(dataframe))
{
  for (row_compare in 1:nrow(dataframe))
  {
    if ((dataframe$Gene[row] == dataframe$Gene[row_compare])
        & (row != row_compare))
    {
      # Only save the pair if this sample has not been saved before
      if ((dataframe$Sample[row] %notin% Vec_sample) &
          (dataframe$Sample[row] == dataframe$Sample[row_compare]))
      {
        Vec_sample <- c(Vec_sample, dataframe$Sample[row])
        Vec_sample <- c(Vec_sample, dataframe$Sample[row_compare])
        Vec_genes <- c(Vec_genes, dataframe$Gene[row])
        Vec_genes <- c(Vec_genes, dataframe$Gene[row_compare])
        Vec_variants <- c(Vec_variants, dataframe$Variant[row])
        Vec_variants <- c(Vec_variants, dataframe$Variant[row_compare])
        Vec_chr <- c(Vec_chr, dataframe$Chr[row])
        Vec_chr <- c(Vec_chr, dataframe$Chr[row_compare])
        Vec_coordinate <- c(Vec_coordinate, dataframe$Coordinate[row])
        Vec_coordinate <- c(Vec_coordinate, dataframe$Coordinate[row_compare])
        Vec_aa <- c(Vec_aa, dataframe$aa[row])
        Vec_aa <- c(Vec_aa, dataframe$aa[row_compare])
        Vec_Rs <- c(Vec_Rs, dataframe$Rs[row])
        Vec_Rs <- c(Vec_Rs, dataframe$Rs[row_compare])
      }
    }
  }
}
Finally, when the loops are finished, I create a dataframe with the results:
final_dataframe <- data.frame(Vec_sample, Vec_genes, Vec_variants, Vec_chr, Vec_coordinate, Vec_aa, Vec_Rs)
Everything is duplicated inside the loops because I need both rows of each pair with the same sample and gene (and, of course, the rest of their information). I wrote two for loops because I wanted to compare the current gene with every other one.
The problem: if the sample is already saved in the vector "Vec_sample" and there is another pair with the same sample, my script won't save that pair. (For example, for sample 14-043 it first saves the pair for gene ALG9, but it won't save the pair for gene MNS1.)
Here is my wrong new dataframe.
I added that check because, when the two loops run, the table is scanned more than once, so each gene pair would otherwise be saved several times and the rows would be repeated.
Sorry if my syntax or my way of programming is inefficient; I'm just starting out in this world and I'm not an expert.
I hope I have explained myself well. Thank you very much in advance.
Here is the input data:
structure(list(Sample = c("14-043", "14-043", "14-043", "14-043",
"14-043", "14-043", "14-077", "14-077", "13-340", "15-642", "15-642",
"15-642", "12-975"), Gene = c("ALG9", "ALG10B", "ALG9", "SLC5A9",
"MNS1", "MNS1", "ALG9", "ALG9", "GPI", "MNS1", "HK3", "MNS1",
"HK3"), Variant = c("T>T/G", "C>A/G", "C>C/G", "A>A/T", "A>T/T",
"C>C/T", "T>T/G", "C>C/G", "C>G/G", "A>T/T", "T>T/A", "C>C/T",
"T>T/A"), Chr = c(4, 4, 4, 13, 2, 2, 4, 4, 20, 2, 8, 2, 8), Coordinate = c(23410158,
3422351, 23410451, 2341043423, 324652341, 3246520, 23410158,
23410451, 234541, 324652341, 23412341, 3246520, 23412341), aa = c("Gly44Thr",
"His8Pro", "Ser44Thr", "Thr4Pro", "Ala45Ala", "Ala45Leu", "Gly44Thr",
"Ser44Thr", "Phe3Ala", "Ala45Ala", "Val34His", "Ala45Leu", "Val34His"
), Rs = c("rs1715919", "rs1734532413", "rs1732413", "rs173240",
"rs12305", "rs10356", "rs1715919", "rs1732413", "rs12342", "rs12305",
"rs9997", "rs10356", "rs9997")), row.names = c(NA, -13L), class = "data.frame")
If I understand you correctly, you can do this with dplyr::filter, using lead and lag to check the previous and next rows:
df <- structure(list(Sample = c("14-043", "14-043", "14-043", "14-043",
"14-043", "14-077", "14-077", "13-340", "15-642", "15-642", "12-975"
), Gene = c("ALG9", "ALG9", "SLC5A9", "MNS1", "MNS1", "ALG9",
"ALG9", "GPI", "MNS1", "MNS1", "HK3"), Variant = c("T>T/G", "C>C/G",
"A>A/T", "A>T/T", "C>C/T", "T>T/G", "C>C/G", "C>G/G", "A>T/T",
"C>C/T", "T>T/A"), Chr = c(4, 4, 13, 2, 2, 4, 4, 20, 2, 2, 8),
Coordinate = c(23410158, 23410451, 2341043423, 324652341,
3246520, 23410158, 23410451, 234541, 324652341, 3246520,
23412341), aa = c("Gly44Thr", "Ser44Thr", "Thr4Pro", "Ala45Ala",
"Ala45Leu", "Gly44Thr", "Ser44Thr", "Phe3Ala", "Ala45Ala",
"Ala45Leu", "Val34His"), Rs = c("rs1715919", "rs1732413",
"rs173240", "rs12305", "rs10356", "rs1715919", "rs1732413",
"rs12342", "rs12305", "rs10356", "rs9997")), row.names = c(NA,
-11L), class = "data.frame")
library(tidyverse)
df %>% filter(Sample == lag(Sample) | Sample == lead(Sample),
Gene == lag(Gene) | Gene == lead(Gene))
#> Sample Gene Variant Chr Coordinate aa Rs
#> 1 14-043 ALG9 T>T/G 4 23410158 Gly44Thr rs1715919
#> 2 14-043 ALG9 C>C/G 4 23410451 Ser44Thr rs1732413
#> 3 14-043 MNS1 A>T/T 2 324652341 Ala45Ala rs12305
#> 4 14-043 MNS1 C>C/T 2 3246520 Ala45Leu rs10356
#> 5 14-077 ALG9 T>T/G 4 23410158 Gly44Thr rs1715919
#> 6 14-077 ALG9 C>C/G 4 23410451 Ser44Thr rs1732413
#> 7 15-642 MNS1 A>T/T 2 324652341 Ala45Ala rs12305
#> 8 15-642 MNS1 C>C/T 2 3246520 Ala45Leu rs10356
Created on 2020-08-11 by the reprex package (v0.3.0)
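Note that lag and lead only look at neighbouring rows, so the filter above relies on matching rows being adjacent, as they are in the 11-row df used here. In the 13-row input from the question, the two ALG9 rows of sample 14-043 are separated by an ALG10B row. Below is a minimal base R sketch that does not depend on row order, assuming the goal is to keep every row whose Sample/Gene pair occurs more than once. Only three of the question's columns are reproduced to keep it short, and keys and is_dup are names introduced here for illustration.

```r
# First three columns of the 13-row input data from the question
dataframe <- data.frame(
  Sample = c("14-043", "14-043", "14-043", "14-043", "14-043", "14-043",
             "14-077", "14-077", "13-340", "15-642", "15-642", "15-642",
             "12-975"),
  Gene = c("ALG9", "ALG10B", "ALG9", "SLC5A9", "MNS1", "MNS1", "ALG9",
           "ALG9", "GPI", "MNS1", "HK3", "MNS1", "HK3"),
  Variant = c("T>T/G", "C>A/G", "C>C/G", "A>A/T", "A>T/T", "C>C/T", "T>T/G",
              "C>C/G", "C>G/G", "A>T/T", "T>T/A", "C>C/T", "T>T/A"),
  stringsAsFactors = FALSE
)

# The two columns that must match between rows
keys <- dataframe[, c("Sample", "Gene")]

# duplicated() flags the second and later occurrences of a key;
# fromLast = TRUE flags the first and earlier ones.
# Combining both marks every row that belongs to a repeated Sample/Gene pair.
is_dup <- duplicated(keys) | duplicated(keys, fromLast = TRUE)

# Keep only those rows (here: rows 1, 3, 5, 6, 7, 8, 10 and 12)
final_dataframe <- dataframe[is_dup, ]
print(final_dataframe)
```

With dplyr, grouping by both columns and keeping only groups with more than one row, e.g. df %>% group_by(Sample, Gene) %>% filter(n() > 1) %>% ungroup(), should select the same rows.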