简体   繁体   中英

Average geodetic distance between subsets of points (same ID) using distm(distVincentyEllipsoid) and storing the results in a new dataframe in R

My database has the following structure:

    > long <- c(13.2345, 14.2478, 16.2001, 11.2489, 17.4784, 27.6478, 14.2500, 12.2100, 11.2014, 12.2147)
    > lat <- c(47.1247, 48.2013, 41.2547, 41.2147, 40.3247, 46.4147, 42.4786, 41.2478, 48.2147, 47.2157)
    > hh_id <- 1:10
    > vill_id <- c(rep(100, 4), rep(101, 3), rep(102, 2), 103)

    > df <- matrix(c(long, lat, hh_id, vill_id), nrow = 10, ncol = 4)
    > colnames(df) <- c("longitude", "latitude", "hh_id", "vill_id") 
    > df <- as.data.frame(df)
    > df
       longitude latitude hh_id vill_id
       13.2345  47.1247     1     100
       14.2478  48.2013     2     100
       16.2001  41.2547     3     100
       11.2489  41.2147     4     100
       17.4784  40.3247     5     101
       27.6478  46.4147     6     101
       14.2500  42.4786     7     101
       12.2100  41.2478     8     102
       11.2014  48.2147     9     102
       12.2147  47.2157    10     103

hh_id - households IDs

vill_id - village IDs

Households with identical ID belong to the same village.

My aim: calculate the mean distance between all points with the same vill_id and store the result in a new data frame:

vill_id    mean_dist
100        587553.5
101        …………………
102        …………………
103        ………………

My approach: To calculate the geodetic distance between points I have used the distm command from the geosphere package (distVincentyEllipsoid should be most accurate)

> library(geosphere)
> df_100 <- df[df$vill_id == 100, ]
> dist_100 <- distm(df_100, fun = distVincentyEllipsoid)
Error in .pointsToMatrix(p1) : Wrong length for a vector, should be 2 --> 
> df_100_2 <- df_100[, c(1, 2)]
> dist_100_2 <- distm(df_100_2, fun = distVincentyEllipsoid)
> dist_100_2
         [,1]     [,2]     [,3]     [,4]
[1,]      0.0 141844.7 693867.8 675556.9
[2,] 141844.7      0.0 787217.4 811777.4
[3,] 693867.8 787217.4      0.0 415056.6
[4,] 675556.9 811777.4 415056.6      0.0

So a symmetric distance matrix for all points with vill_id = 100 was generated. To calculate the mean distance I need to to decompose this matrix (or drop all of the diagonal values (0)).

> diag(dist_100_2) = NA
> dist_100_2_final <- dist_100_2[!is.na(dist_100_2)]
> dist_100_2_final
 [1] 141844.7 693867.8 675556.9 141844.7 787217.4 811777.4 693867.8 787217.4 415056.6 675556.9
[11] 811777.4 415056.6
> mean(dist_100_2_final)
[1] 587553.5 (in m)

So far so good. Now I need to create a new dataframe which stores the mean distances for all subsets with the same ID (my original database has over 200 villages (vill_id) and almost 2000 households (hh_id)). Can you please help me how to finish the code? I think I have to use loops (or maybe there is another package to solve this problem)? Many thanks for your help.

Yesterday I have posted similar question with the difference that the mean_dist were already part of my original dataframe (computed in ArcGIS) but now I want to calculate these in R to compare the results. I have tried to implement the recommended codes from my previous question but without success.

Consider base R's by since you need to run an operation across different levels of factors (ie, vill_id ). Inside by , you can call a defined or anonymous function which will return a list of dataframes that you can row bind back to one dataframe:

dfList <- by(df, df[c("vill_id")], FUN = function(i){
     sub <- i[, c(1, 2)]
     tmp <- distm(sub, fun = distVincentyEllipsoid)
     diag(tmp) = NA
     i$mean_dist <- mean(tmp[!is.na(tmp)])                  # NEW COLUMN ADDED
     return(i)
})

finaldf <- do.call(rbind, dfList)

Should you need vill_id and hh_id subset, add to the factor list:

dfList <- by(df, df[c("vill_id", "hh_id")], FUN = function(i){ ... })

And if you only need vill_id and mean_dist returned from function, change return value:

newdf <- unique(i[c("vill_id", "mean_dist")]
return(newdf)

Specifically, the following block of code:

df_100 <- df[df$vill_id == 100, ]                            # BY REPLACES THIS LINE
df_100_2 <- df_100[, c(1, 2)]
dist_100_2 <- distm(df_100_2, fun = distVincentyEllipsoid)                 
diag(dist_100_2) = NA
dist_100_2_final <- dist_100_2[!is.na(dist_100_2)]
mean(dist_100_2_final)

Is translated as the following where i is the by function variable:

sub <- i[,c(1, 2)]
tmp <- distm(sub, fun = distVincentyEllipsoid)
diag(tmp) = NA
i$mean_dist <- mean(tmp[!is.na(tmp)])

Another way would be to use lapply() . I basically revised your code. One thing I added was to split your data by vill_id and create a list. Then, I applied your chunk of code for calculating distance to each split data frame in lapply() . Finally, I created a data frame with mean values.

library(geosphere)

mylist <- split(df, f = df$vill_id)

unlist(lapply(mylist, function(x){

        foo <- x[, 1:2]
        foo <- distm(foo, fun = distVincentyEllipsoid)
        diag(foo) = NA
        out <- foo[!is.na(foo)]
        average <- mean(out)
        average
      })
) -> mean_dist

data.frame(vill_id = unique(df$vill_id),
           mean_dist = mean_dist)

#    vill_id mean_dist
#100     100  587553.5
#101     101  858785.6
#102     102  778299.1
#103     103       NaN

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM