Extract min and max information by sequential similar parts of data frame in R

I have a data frame that corresponds to the path taken by a river, describing elevation and distance. I need to evaluate each stretch of ground the river crosses (each consecutive run of the same soil type) and extract the minimum and maximum distance and elevation for it.

Example:

df = data.frame(Soil = c("Forest", "Forest",
               "Grass", "Grass","Grass",
               "Scrub", "Scrub","Scrub","Scrub",
               "Grass", "Grass","Grass","Grass",
               "Forest","Forest","Forest","Forest","Forest","Forest"),
      Distance = c(1, 5, 
                   10, 15, 56,
                   59, 67, 89, 99,
                   102, 105, 130, 139,
                   143, 145, 167, 189, 190, 230),
      Elevation = c(1500, 1499,
                    1470, 1467, 1456,
                    1450, 1445, 1440, 1435,
                    1430, 1420, 1412, 1400,
                    1390, 1387, 1384, 1380, 1376, 1370))

     Soil Distance Elevation
1  Forest        1      1500
2  Forest        5      1499
3   Grass       10      1470
4   Grass       15      1467
5   Grass       56      1456
6   Scrub       59      1450
7   Scrub       67      1445
8   Scrub       89      1440
9   Scrub       99      1435
10  Grass      102      1430
11  Grass      105      1420
12  Grass      130      1412
13  Grass      139      1400
14 Forest      143      1390
15 Forest      145      1387
16 Forest      167      1384
17 Forest      189      1380
18 Forest      190      1376
19 Forest      230      1370

But I need something like this:

    Soil Distance.Min Distance.Max Elevation.Min Elevation.Max
1 Forest            1            5          1499          1500
2  Grass           10           56          1456          1470
3  Scrub           59           99          1435          1450
4  Grass          102          139          1400          1430
5 Forest          143          230          1370          1390

I tried to use group_by() and which.min(Soil), but that takes into account the whole df, not each path.

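Grouping by Soil alone would merge the two separate Forest stretches (and the two Grass stretches), so we need run-length encoding to give each consecutive run of the same Soil value its own group id.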

Using this function (fashioned to mimic data.table::rleid):

myrleid <- function (x) {
    r <- rle(x)
    rep(seq_along(r$lengths), times = r$lengths)
}
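As a quick illustration of the grouping key this produces, the sketch below applies the same rle() steps to the Soil column from the question. The column is rebuilt here with rep() so the snippet runs on its own, and the names soil, r, and run_id are only for illustration. Note that rle() expects an atomic vector, so a factor column would need as.character() first.

```r
# Soil column from the question's example, rebuilt compactly.
soil <- rep(c("Forest", "Grass", "Scrub", "Grass", "Forest"),
            times = c(2, 3, 4, 4, 6))

# rle() collapses consecutive repeats into (value, length) pairs.
r <- rle(soil)
r$values   # "Forest" "Grass" "Scrub" "Grass" "Forest"
r$lengths  # 2 3 4 4 6

# Expand the run index back to one id per row.
run_id <- rep(seq_along(r$lengths), times = r$lengths)
run_id     # 1 1 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5 5 5
```

The two Forest stretches get ids 1 and 5, so they stay separate when grouping on the id rather than on Soil.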

We can then group by run instead of by Soil:

library(dplyr)

df %>%
  group_by(grp = myrleid(Soil)) %>%
  summarize(Soil = Soil[1], across(c(Distance, Elevation), list(min = min, max = max))) %>%
  select(-grp)
# # A tibble: 5 x 5
#   Soil   Distance_min Distance_max Elevation_min Elevation_max
#   <chr>         <dbl>        <dbl>         <dbl>         <dbl>
# 1 Forest            1            5          1499          1500
# 2 Grass            10           56          1456          1470
# 3 Scrub            59           99          1435          1450
# 4 Grass           102          139          1400          1430
# 5 Forest          143          230          1370          1390
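If dplyr is not available, the same run-based grouping can be sketched in base R. This is only an illustrative alternative, not part of the approach above: the names r, run_id, and result are made up for the example, and the data frame is rebuilt from the question so the snippet runs by itself.

```r
# Example data from the question (Soil rebuilt with rep() for brevity).
df <- data.frame(
  Soil      = rep(c("Forest", "Grass", "Scrub", "Grass", "Forest"),
                  times = c(2, 3, 4, 4, 6)),
  Distance  = c(1, 5, 10, 15, 56, 59, 67, 89, 99,
                102, 105, 130, 139, 143, 145, 167, 189, 190, 230),
  Elevation = c(1500, 1499, 1470, 1467, 1456, 1450, 1445, 1440, 1435,
                1430, 1420, 1412, 1400, 1390, 1387, 1384, 1380, 1376, 1370)
)

# One id per consecutive run of Soil, as in myrleid().
r <- rle(df$Soil)
run_id <- rep(seq_along(r$lengths), times = r$lengths)

# tapply() applies min/max within each run; r$values holds one Soil per run,
# in the same order as the run ids.
result <- data.frame(
  Soil          = r$values,
  Distance.Min  = as.vector(tapply(df$Distance,  run_id, min)),
  Distance.Max  = as.vector(tapply(df$Distance,  run_id, max)),
  Elevation.Min = as.vector(tapply(df$Elevation, run_id, min)),
  Elevation.Max = as.vector(tapply(df$Elevation, run_id, max))
)
result
```

The column names here follow the Distance.Min style requested in the question rather than the Distance_min style that across() generates.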

You can try this:

library(dplyr)  # data.table must also be installed for rleid()

df = df %>% mutate(id=data.table::rleid(Soil))

inner_join(
  distinct(df %>% select(Soil,id)),  
  df %>% 
    group_by(id) %>% 
    summarize(across(Distance:Elevation, .fns = list("min" = min,"max"=max))),
  by="id"
) %>% select(!id)

Output:

    Soil Distance_min Distance_max Elevation_min Elevation_max
1 Forest            1            5          1499          1500
2  Grass           10           56          1456          1470
3  Scrub           59           99          1435          1450
4  Grass          102          139          1400          1430
5 Forest          143          230          1370          1390

Or, more concisely, thanks to r2evans:

df %>% 
  group_by(id = data.table::rleid(Soil)) %>% 
  summarize(Soil=first(Soil),across(Distance:Elevation, .fns = list("min" = min,"max"=max))) %>% 
  select(!id)
