
How to add a column to my data frame that calculates the distance between each lat/long point and the previous point with a matching ID

I have a data frame of individual animals with a unique ID, the lat/long where they were found, and the date they were found. The database has frequent returns of the same individual, and I have over 2,000 individuals. I want to add a column to my data frame that calculates the Euclidean distance between the current location and the previous location of the same individual, and a second column that tells me which calculation number I'm on for each individual. The data frame is already sorted by date. I'm trying to solve this in R.

Event  ID  Lat    Long
1      1   31.89  -80.98
2      2   31.54  -80.12
3      1   31.45  -81.92
4      1   31.64  -81.82
5      2   31.23  -80.98

Add a column so that now it looks like

Event  ID  Lat    Long    Dist.                          Calculation #
1      1   31.89  -80.98  -                              0
2      2   31.54  -80.12  -                              0
3      1   31.45  -81.92  Distance between events 1 & 3  1
4      1   31.64  -81.82  Distance between events 3 & 4  2
5      2   31.23  -80.98  Distance between events 2 & 5  1

Is there a faster way to do this than a for loop? I'm stuck on where to start. I know I can use a distance function from a geospatial package once I have the unique IDs sorted, but I'm having trouble iterating through my data.

Here is one option that leans on the sf package and dplyr. The function sf::st_distance() calculates distances between pairs of points, and dplyr::lag() can be used to look "one row behind". You will want to confirm your coordinate reference system; I have guessed WGS84 (EPSG:4326) here.

library(dplyr)
library(sf)



dat <- read.table(text = " Event    ID  Lat Long
1   1   31.89   -80.98
2   2   31.54   -80.12
3   1   31.45   -81.92
4   1   31.64   -81.82
5   2   31.23   -80.98", header = TRUE)


dat_sf <- st_as_sf(dat, coords = c('Long', 'Lat'), crs = 4326)


dat_sf %>%
  arrange(ID) %>%
  group_by(ID) %>%
  mutate(distance = as.numeric(st_distance(geometry, lag(geometry), by_element = TRUE)),
         calculation = row_number() - 1)
#> Simple feature collection with 5 features and 4 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: -81.92 ymin: 31.23 xmax: -80.12 ymax: 31.89
#> Geodetic CRS:  WGS 84
#> # A tibble: 5 x 5
#> # Groups:   ID [2]
#>   Event    ID       geometry distance calculation
#> * <int> <int>    <POINT [°]>    <dbl>       <dbl>
#> 1     1     1 (-80.98 31.89)      NA            0
#> 2     3     1 (-81.92 31.45)  101524.           1
#> 3     4     1 (-81.82 31.64)   23155.           2
#> 4     2     2 (-80.12 31.54)      NA            0
#> 5     5     2 (-80.98 31.23)   88615.           1

Created on 2022-11-14 by the reprex package (v2.0.0)
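A lighter-weight variant (my addition, not part of the answer above): the same grouped lag logic can be done with plain dplyr plus geosphere, skipping the sf conversion entirely. Note that distHaversine() returns metres and will differ slightly from the geodesic distances sf reports.

```r
library(dplyr)
library(geosphere)

dat <- data.frame(
  Event = 1:5,
  ID    = c(1, 2, 1, 1, 2),
  Lat   = c(31.89, 31.54, 31.45, 31.64, 31.23),
  Long  = c(-80.98, -80.12, -81.92, -81.82, -80.98)
)

res <- dat %>%
  arrange(ID, Event) %>%
  group_by(ID) %>%
  mutate(
    # lag() supplies the previous point within each ID;
    # the first row of each group gets NA
    distance    = distHaversine(cbind(Long, Lat),
                                cbind(lag(Long), lag(Lat))),
    calculation = row_number() - 1
  ) %>%
  ungroup()
```

This keeps the result as an ordinary data frame, which can be convenient if you do not need sf geometries downstream.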

Try this:

  1. load the geosphere library
  2. create demo data
  3. get all unique IDs and sort the data frame by ID and event
  4. append the last known coordinates of each animal to each row
  5. apply the distance function to each row
library(geosphere)
library(dplyr)     # for arrange() and mutate()
library(magrittr)  # for %>% and %<>%
df <- data.frame(
    event = seq(5),
    id = c(1, 2, 1, 1, 2),
    lat = c(31.89, 31.54, 31.45, 31.64, 31.23),
    long = c(-80.98, -80.12, -81.92, -81.82, -80.98)
)

keys <- df$id %>% unique
df %<>% dplyr::arrange(id, event)
df <- keys %>% lapply(
    function(key){
        tmp <- df[df$id == key, ]
        # seed the "previous" coords, then shift them down one row;
        # the guard avoids broken indexing when an ID has only one row
        tmp$last_lat <- tmp$lat
        tmp$last_long <- tmp$long
        if (nrow(tmp) > 1) {
            tmp[2:nrow(tmp), ]$last_lat  <- tmp[seq_len(nrow(tmp) - 1), ]$lat
            tmp[2:nrow(tmp), ]$last_long <- tmp[seq_len(nrow(tmp) - 1), ]$long
        }
        tmp
    }
) %>% do.call(rbind, .)


df %<>% mutate(dist = distHaversine(cbind(long, lat), cbind(last_long, last_lat)))

Since you said you need speed, below is the same code as above but run in parallel:

library(tictoc)
library(parallel)

tic()
clust <- makeCluster(detectCores() - 1)

df <- data.frame(
    event = seq(5),
    id = c(1, 2, 1, 1, 2),
    lat = c(31.89, 31.54, 31.45, 31.64, 31.23),
    long = c(-80.98, -80.12, -81.92, -81.82, -80.98)
)
keys <- df$id %>% unique
df %<>% dplyr::arrange(id, event)

clusterExport(clust, "df")
clusterEvalQ(clust, library(magrittr))
df <- keys %>% parLapply(
    clust, ., 
    function(key){
        tmp <- df[df$id == key, ]
        # seed the "previous" coords, then shift them down one row;
        # the guard avoids broken indexing when an ID has only one row
        tmp$last_lat <- tmp$lat
        tmp$last_long <- tmp$long
        if (nrow(tmp) > 1) {
            tmp[2:nrow(tmp), ]$last_lat  <- tmp[seq_len(nrow(tmp) - 1), ]$lat
            tmp[2:nrow(tmp), ]$last_long <- tmp[seq_len(nrow(tmp) - 1), ]$long
        }
        tmp
    }
) %>% do.call(rbind, .)

df %<>% mutate(dist = distHaversine(cbind(long, lat), cbind(last_long, last_lat)))

stopCluster(clust)
toc()

Above, tictoc just records the execution time. I created a cluster with one fewer worker than your number of CPU cores and changed the lapply call to parLapply. The second version will be slower than the first on a small dataset (due to the overhead of setting up the parallel computation), but on a large dataset it will be much faster.
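If raw speed on a large table is the goal, another option (my addition, not part of the original answer) is data.table, whose shift() does the grouped "previous row" lookup in a single pass, with no explicit split over IDs and no cluster setup:

```r
library(data.table)
library(geosphere)

dt <- data.table(
  event = 1:5,
  id    = c(1, 2, 1, 1, 2),
  lat   = c(31.89, 31.54, 31.45, 31.64, 31.23),
  long  = c(-80.98, -80.12, -81.92, -81.82, -80.98)
)

setorder(dt, id, event)
dt[, `:=`(
  # shift() lags within each id group; the first row per group is NA
  dist        = distHaversine(cbind(long, lat),
                              cbind(shift(long), shift(lat))),
  calculation = seq_len(.N) - 1L
), by = id]
```

For a few thousand individuals this is typically fast enough that parallelisation is unnecessary.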
