
How to remove rows from a data frame when the values in one column are not increasing in a consecutive way

I have a data frame in R and I want to remove the rows whose values in column B do not increase consecutively. That is, the value in each row has to be higher than the previous one but lower than the next one. I do not want to sort the data frame by column B because I want to keep the order of column A. I think I can do this with if statements, but I do not have enough experience in R. Thanks in advance.

What I have is this, and I have to remove the starred values.

A       B   
26.00   11158115 
27.00   16722714* 
27.08   11881252 
90.25   69428973 
90.27   69749777 
93.30   64207240* 
95.90   71428751 
96.00   71670964 
107.65  100385980 
107.75  226164158* 
107.8   103280320 

I need this:

A       B   
26.00   11158115 
27.08   11881252 
90.25   69428973 
90.27   69749777 
95.90   71428751 
96.00   71670964 
107.65  100385980 
107.80  103280320 

Here is a solution, sort of:

A <- c(26.00, 27.00, 27.08, 90.25, 90.27, 93.30, 95.90, 96.00, 107.65, 107.75, 107.8)
B <- c(11158115, 16722714, 11881252, 69428973, 69749777, 64207240, 71428751, 71670964, 100385980,
       226164158, 103280320)
d <- data.frame(A, B)
repeat {
   delta <- diff(d$B)
               # delta gives you the difference between successive values of B
               # delta[1] corresponds to the difference between B[2] and B[1]
   if(all(delta > 0)) {
      break
   }
   iWrong <- 1 + which(delta <= 0)
               # '1 +' means that if the next value is not larger than the previous value
               # (delta is not positive), we delete the next value
               # remove '1 +' to delete the earlier value of the pair instead
   d <- d[-iWrong,]
}

I say "sort of" because it is unclear to me exactly which rows should be removed. Why remove row 2 instead of row 3? Either choice gives you increasing values in B. With my solution you will get:

1   26.00  11158115
2   27.00  16722714
4   90.25  69428973
5   90.27  69749777
7   95.90  71428751
8   96.00  71670964
9  107.65 100385980
10 107.75 226164158
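
The loop above can also be written without iteration. The following is a minimal sketch, assuming the goal is the same as in that loop: keep a row only if its B is strictly larger than every B before it. The names prev_max, keep and d_kept are illustrative and not part of the original answer. On this data it selects the same rows as the loop.

```r
A <- c(26.00, 27.00, 27.08, 90.25, 90.27, 93.30, 95.90, 96.00, 107.65, 107.75, 107.8)
B <- c(11158115, 16722714, 11881252, 69428973, 69749777, 64207240, 71428751, 71670964,
       100385980, 226164158, 103280320)
d <- data.frame(A, B)

# Largest B seen before each row; -Inf for the first row so it is always kept
prev_max <- c(-Inf, head(cummax(d$B), -1))

# Keep a row only if its B is strictly greater than everything before it
keep <- d$B > prev_max
d_kept <- d[keep, ]
d_kept
```

Like the loop, this keeps row 2 and drops row 3, so it does not resolve the ambiguity about which row of an out-of-order pair should go.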

I can't find a better solution, but at least it works.

df = read.table(text = "A,B
26.00,11158115
27.00,16722714
27.08,11881252
90.25,69428973
90.27,69749777
93.30,64207240
95.90,71428751
96.00,71670964
107.65,100385980
107.75,226164158
107.8,103280320", header = TRUE, sep = ",", stringsAsFactors = FALSE)

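r <- 2
# Stop before the last row: it has no following value to compare against
while (r < nrow(df)) {
    prev_b <- df$B[r - 1]
    next_b <- df$B[r + 1]
    # Drop row r if it is out of order relative to its neighbours,
    # provided the neighbours themselves are in increasing order
    if ((df$B[r] < prev_b || df$B[r] > next_b) && prev_b < next_b) {
        df <- df[-r, ]
    } else {
        r <- r + 1
    }
}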

df

Output:

        A         B
1   26.00  11158115
3   27.08  11881252
4   90.25  69428973
5   90.27  69749777
7   95.90  71428751
8   96.00  71670964
9  107.65 100385980
11 107.80 103280320

Explanation:

We walk through the data frame starting at the second row (the first row is always kept). A row is deleted when it breaks the rule that each value must be higher than the previous one and lower than the next one, that is, when B[r] < B[r-1] or B[r] > B[r+1]. That criterion alone does not give the expected result, because it would also delete a valid row whose successor is the real outlier, so before deleting row r we additionally check that the next value is higher than the previous one (B[r-1] < B[r+1]).
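
To see why the extra check matters, here is a small sketch that evaluates the deletion rule on individual rows of the original B vector. The helper should_drop is a hypothetical name introduced only for this illustration; it is not part of the answer's code.

```r
B <- c(11158115, 16722714, 11881252, 69428973, 69749777, 64207240, 71428751, 71670964,
       100385980, 226164158, 103280320)

# Hypothetical helper: TRUE if row r would be deleted by the rule described above.
# Only valid for 2 <= r <= length(B) - 1, since it looks at both neighbours.
should_drop <- function(B, r) {
  out_of_order  <- B[r] < B[r - 1] || B[r] > B[r + 1]
  neighbours_ok <- B[r - 1] < B[r + 1]
  out_of_order && neighbours_ok
}

should_drop(B, 2)  # TRUE:  16722714 exceeds the next value, and rows 1 and 3 are in order
should_drop(B, 5)  # FALSE: 69749777 exceeds the next value, but row 4 is not below row 6
should_drop(B, 6)  # TRUE:  64207240 is below the previous value, and rows 5 and 7 are in order
```

Without the second condition, row 5 would be removed even though row 6 is the value that is out of place.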

