简体   繁体   中英

Removing rows based on a stretch of repeated values in R, but not the whole data frame

I have a data frame of ~20,000 rows that is two columns: Time and Steps. There are stretches duplicate/repeated readings of steps that are a few seconds to a few minutes apart, so unique() does not catch them. !duplicates removes duplicates for the whole data frame, so that there is only one row with X steps, even if a legit record days later has X steps. I want to remove these stretches of repeats, leaving one unique record, it doesn't matter which one is kept. Also, the number of repeats is not consistent. It may be 8, it may be 2.

>View(df) 
          Time                Steps
    13737 2017-08-07 09:53:14 16
    13738 2017-08-07 09:53:14 16
    13739 2017-08-07 09:57:51 61
    13740 2017-08-07 09:59:05 61
    13741 2017-08-07 09:59:07 61
    13742 2017-08-07 09:59:11 61
    13743 2017-08-07 09:59:14 61
    13744 2017-08-07 09:59:14 61
    13745 2017-08-07 10:03:51 29
    13746 2017-08-07 10:05:05 29
    13747 2017-08-07 10:05:07 29
    13748 2017-08-07 10:05:11 29
    13749 2017-08-07 10:05:14 29
    13750 2017-08-07 10:05:14 29
    13751 2017-08-07 10:09:51 54
    13752 2017-08-07 10:11:05 54
    13753 2017-08-07 10:11:07 54

Desired output:

13737 2017-08-07 09:53:14 16
13739 2017-08-07 09:57:51 61
13745 2017-08-07 10:03:51 29
13751 2017-08-07 10:09:51 54

Another solution I've pondered is just paring down the dataset to just take rows every 10 minutes or so, and if that is ideal, I could use some guidance there too

We can use dplyr . Arrange the data frame by Steps and Time first and then group by Steps . Finally, slice the data frame for the first row per group.

library(dplyr)
library(lubridate)

df2 <- df %>%
  mutate(Time = ymd_hms(Time)) %>%
  arrange(Steps, Time) %>%
  group_by(Steps) %>%
  slice(1)

Or this

df2 <- df %>%
  mutate(Time = ymd_hms(Time)) %>%
  arrange(Steps, Time) %>%
  group_by(Steps) %>%
  filter(row_number() == 1)

Or we can use the data.table .

library(data.table)
dt <- as.data.table(df)
dt2 <- dt[, .SD[1], by = Steps][, c("Time", "Steps")]

Data Preparation

df <- read.table(text = "          Time                Steps
    13737 '2017-08-07 09:53:14' 16
    13738 '2017-08-07 09:53:14' 16
    13739 '2017-08-07 09:57:51' 61
    13740 '2017-08-07 09:59:05' 61
    13741 '2017-08-07 09:59:07' 61
    13742 '2017-08-07 09:59:11' 61
    13743 '2017-08-07 09:59:14' 61
    13744 '2017-08-07 09:59:14' 61
    13745 '2017-08-07 10:03:51' 29
    13746 '2017-08-07 10:05:05' 29
    13747 '2017-08-07 10:05:07' 29
    13748 '2017-08-07 10:05:11' 29
    13749 '2017-08-07 10:05:14' 29
    13750 '2017-08-07 10:05:14' 29
    13751 '2017-08-07 10:09:51' 54
    13752 '2017-08-07 10:11:05' 54
    13753 '2017-08-07 10:11:07' 54",
                 header = TRUE, stringsAsFactors = FALSE)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM