How to add a column to my data frame that calculates the distance between lat/long points between the previous point with matching IDs

I have a data frame of individual animals with a unique ID, the lat/long where they were found, and the date they were found. The database has frequent returns of the same individual. I have over 2000 individuals. I want to add a column to my data frame to calculate the Euclidean distance between the current location and the previous location of the same individual. I want to add a second column to tell me which calculation number I'm on for each individual. The data frame is already organized by sequential date. I'm trying to solve this in R.

Event  ID  Lat    Long
1      1   31.89  -80.98
2      2   31.54  -80.12
3      1   31.45  -81.92
4      1   31.64  -81.82
5      2   31.23  -80.98

Add a column so that now it looks like:

Event  ID  Lat    Long    Dist.                         Calculation #
1      1   31.89  -80.98  -                             0
2      2   31.54  -80.12  -                             0
3      1   31.45  -81.92  Distance between event 1 & 3  1
4      1   31.64  -81.82  Distance between event 3 & 4  2
5      2   31.23  -80.98  Distance between event 2 & 5  1

Is there a faster way to do this without a for loop? I'm stuck on where to start. I know I can use a distance function from the geospatial package once I have the uniqueID sorted, but I'm having trouble iterating through my data.

Here is one option which leans on the sf package and dplyr. The function sf::st_distance calculates distances between pairs of points, and dplyr::lag can be used to look "one row behind". You will want to confirm your coordinate system, which I guessed here is WGS84 / 4326. Note that with a geographic CRS such as 4326, st_distance returns geodesic distances in metres rather than planar Euclidean distances.

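library(dplyr)
library(sf)

# Example data from the question
dat <- read.table(text = " Event    ID  Lat Long
1   1   31.89   -80.98
2   2   31.54   -80.12
3   1   31.45   -81.92
4   1   31.64   -81.82
5   2   31.23   -80.98", header = TRUE)

# Convert to an sf object; crs = 4326 (WGS84) is a guess, confirm it for your data
dat_sf <- st_as_sf(dat, coords = c('Long', 'Lat'), crs = 4326)

# Within each ID, measure the distance to the previous row's point
# and number the calculations starting from 0
dat_sf %>%
  arrange(ID) %>%
  group_by(ID) %>%
  mutate(distance = as.numeric(st_distance(geometry, lag(geometry), by_element = TRUE)),
         calculation = row_number() - 1)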
#> Simple feature collection with 5 features and 4 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: -81.92 ymin: 31.23 xmax: -80.12 ymax: 31.89
#> Geodetic CRS:  WGS 84
#> # A tibble: 5 x 5
#> # Groups:   ID [2]
#>   Event    ID       geometry distance calculation
#> * <int> <int>    <POINT [°]>    <dbl>       <dbl>
#> 1     1     1 (-80.98 31.89)      NA            0
#> 2     3     1 (-81.92 31.45)  101524.           1
#> 3     4     1 (-81.82 31.64)   23155.           2
#> 4     2     2 (-80.12 31.54)      NA            0
#> 5     5     2 (-80.98 31.23)   88615.           1

Created on 2022-11-14 by the reprex package (v2.0.0)
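To make the "one row behind" idea concrete, here is a minimal sketch of how dplyr::lag and row_number behave inside group_by, without any spatial step. The `toy` data frame and the `prev_lat` column are hypothetical names used only for illustration; the values are the Lat column from the question.

```r
library(dplyr)

# Hypothetical toy data: two individuals, rows already in date order
toy <- data.frame(
  ID  = c(1, 2, 1, 1, 2),
  Lat = c(31.89, 31.54, 31.45, 31.64, 31.23)
)

lagged <- toy %>%
  group_by(ID) %>%
  mutate(
    prev_lat    = lag(Lat),          # previous Lat for the same ID, NA on an ID's first row
    calculation = row_number() - 1   # 0 for the first sighting of an ID, then 1, 2, ...
  ) %>%
  ungroup()

print(lagged)
```

Because group_by does not reorder rows, the result stays in the original event order, which matches the layout requested in the question. The arrange(ID) step in the answer above is only needed if you want each individual's rows listed together.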

Try this:

  1. Load the geosphere library
  2. Create demo data
  3. Get all unique IDs and sort the data frame by ID and event
  4. Append the last known coordinates of each animal to each row
  5. Apply the distance function to each row
library(geosphere)
library(dplyr)     # arrange(), mutate()
library(magrittr)  # %>% and %<>%

df <- data.frame(
    event = seq(5),
    id = c(1, 2, 1, 1, 2),
    lat = c(31.89, 31.54, 31.45, 31.64, 31.23),
    long = c(-80.98, -80.12, -81.92, -81.82, -80.98)
)

keys <- df$id %>% unique
df %<>% dplyr::arrange(id, event)
df <- keys %>% lapply(
    function(key){
        tmp <- df[df$id == key, ]
        # Previous coordinates; the first row falls back to its own position,
        # so its distance comes out as 0. head(x, -1) also copes with
        # individuals that have only a single record.
        tmp$last_lat <- c(tmp$lat[1], head(tmp$lat, -1))
        tmp$last_long <- c(tmp$long[1], head(tmp$long, -1))
        tmp
    }
) %>% do.call(rbind, .)

# Haversine distance in metres between each point and the previous one
df %<>% mutate(dist = distHaversine(cbind(long, lat), cbind(last_long, last_lat)))
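As a side note, the five steps above can probably be written more compactly with a grouped mutate instead of an explicit lapply over IDs. The following is only a sketch under that assumption; the names `res`, `prev_long` and `prev_lat` are illustrative, and the first row of each individual is set to NA rather than 0. It also adds the calculation counter that the question asked for.

```r
library(dplyr)
library(geosphere)

df <- data.frame(
  event = 1:5,
  id    = c(1, 2, 1, 1, 2),
  lat   = c(31.89, 31.54, 31.45, 31.64, 31.23),
  long  = c(-80.98, -80.12, -81.92, -81.82, -80.98)
)

res <- df %>%
  group_by(id) %>%
  mutate(
    # Previous position of the same individual; the first row reuses its own
    # position so that distHaversine never receives missing coordinates
    prev_long   = lag(long, default = first(long)),
    prev_lat    = lag(lat,  default = first(lat)),
    dist        = distHaversine(cbind(long, lat), cbind(prev_long, prev_lat)),
    # There is no previous point on an individual's first row
    dist        = if_else(row_number() == 1, NA_real_, dist),
    calculation = row_number() - 1
  ) %>%
  ungroup()

print(res)
```

The rows stay in event order here because nothing is sorted; add arrange(id, event) before group_by if the input is not already in date order.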

Since you said you need speed, below is the same lapply-based code but run in parallel:

library(geosphere)
library(dplyr)     # arrange(), mutate()
library(magrittr)  # %>% and %<>%
library(tictoc)
library(parallel)

tic()
# Use all cores but one, and at least one worker
clust <- makeCluster(max(1, detectCores() - 1))

df <- data.frame(
    event = seq(5),
    id = c(1, 2, 1, 1, 2),
    lat = c(31.89, 31.54, 31.45, 31.64, 31.23),
    long = c(-80.98, -80.12, -81.92, -81.82, -80.98)
)
keys <- df$id %>% unique
df %<>% dplyr::arrange(id, event)

# Make the sorted data available on every worker
clusterExport(clust, "df")
df <- parLapply(
    clust, keys,
    function(key){
        tmp <- df[df$id == key, ]
        tmp$last_lat <- c(tmp$lat[1], head(tmp$lat, -1))
        tmp$last_long <- c(tmp$long[1], head(tmp$long, -1))
        tmp
    }
) %>% do.call(rbind, .)

# Release the worker processes
stopCluster(clust)

df %<>% mutate(dist = distHaversine(cbind(long, lat), cbind(last_long, last_lat)))

toc()

Above, tictoc just records the execution time. I created a cluster with the number of your CPU cores minus 1 and changed the lapply part to parLapply. The second version will be slower than the first if you have a small dataset (due to the overhead of setting up the parallel computation). But if you have a large dataset, the second version will be much faster.
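If speed is the main concern, it may also be worth timing a fully vectorized version that needs neither a per-ID loop nor a cluster. The sketch below uses base R only. The helper `haversine_m` is a hypothetical hand-written Haversine formula, not a function from geosphere, and the Earth radius of 6371000 m is an assumed mean value, so results can differ slightly from distHaversine or sf::st_distance.

```r
# Hypothetical helper: Haversine great-circle distance in metres.
# r is an assumed mean Earth radius.
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

df <- data.frame(
  event = 1:5,
  id    = c(1, 2, 1, 1, 2),
  lat   = c(31.89, 31.54, 31.45, 31.64, 31.23),
  long  = c(-80.98, -80.12, -81.92, -81.82, -80.98)
)

# Sort once so that each individual's rows are contiguous and in event order
df <- df[order(df$id, df$event), ]
n <- nrow(df)

# TRUE where the row directly above belongs to the same individual
same_id <- c(FALSE, df$id[-1] == df$id[-n])

# Coordinates of the row directly above (NA for the very first row)
prev_lat  <- c(NA, df$lat[-n])
prev_long <- c(NA, df$long[-n])

# Distance to the previous sighting; NA on each individual's first row
df$dist <- ifelse(same_id,
                  haversine_m(prev_long, prev_lat, df$long, df$lat),
                  NA_real_)

# Running count per individual, starting at 0
df$calculation <- ave(seq_len(n), df$id, FUN = seq_along) - 1

print(df)
```

The whole computation is a handful of vector operations over the sorted data, so there is no per-individual overhead; whether it beats the parallel version on your data is something to measure rather than assume.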
