My database has the following structure:
> long <- c(13.2345, 14.2478, 16.2001, 11.2489, 17.4784, 27.6478, 14.2500, 12.2100, 11.2014, 12.2147)
> lat <- c(47.1247, 48.2013, 41.2547, 41.2147, 40.3247, 46.4147, 42.4786, 41.2478, 48.2147, 47.2157)
> hh_id <- 1:10
> vill_id <- c(rep(100, 4), rep(101, 3), rep(102, 2), 103)
> df <- matrix(c(long, lat, hh_id, vill_id), nrow = 10, ncol = 4)
> colnames(df) <- c("longitude", "latitude", "hh_id", "vill_id")
> df <- as.data.frame(df)
> df
longitude latitude hh_id vill_id
13.2345 47.1247 1 100
14.2478 48.2013 2 100
16.2001 41.2547 3 100
11.2489 41.2147 4 100
17.4784 40.3247 5 101
27.6478 46.4147 6 101
14.2500 42.4786 7 101
12.2100 41.2478 8 102
11.2014 48.2147 9 102
12.2147 47.2157 10 103
hh_id - households IDs
vill_id - village IDs
Households with identical ID belong to the same village.
My aim: calculate the mean distance between all points with the same vill_id and store the result in a new data frame:
vill_id mean_dist
100 587553.5
101 …………………
102 …………………
103 ………………
My approach: To calculate the geodetic distance between points I have used the distm command from the geosphere package (distVincentyEllipsoid should be most accurate)
> library(geosphere)
> df_100 <- df[df$vill_id == 100, ]
> dist_100 <- distm(df_100, fun = distVincentyEllipsoid)
Error in .pointsToMatrix(p1) : Wrong length for a vector, should be 2 -->
> df_100_2 <- df_100[, c(1, 2)]
> dist_100_2 <- distm(df_100_2, fun = distVincentyEllipsoid)
> dist_100_2
[,1] [,2] [,3] [,4]
[1,] 0.0 141844.7 693867.8 675556.9
[2,] 141844.7 0.0 787217.4 811777.4
[3,] 693867.8 787217.4 0.0 415056.6
[4,] 675556.9 811777.4 415056.6 0.0
So a symmetric distance matrix for all points with vill_id = 100 was generated. To calculate the mean distance I need to to decompose this matrix (or drop all of the diagonal values (0)).
> diag(dist_100_2) = NA
> dist_100_2_final <- dist_100_2[!is.na(dist_100_2)]
> dist_100_2_final
[1] 141844.7 693867.8 675556.9 141844.7 787217.4 811777.4 693867.8 787217.4 415056.6 675556.9
[11] 811777.4 415056.6
> mean(dist_100_2_final)
[1] 587553.5 (in m)
So far so good. Now I need to create a new dataframe which stores the mean distances for all subsets with the same ID (my original database has over 200 villages (vill_id) and almost 2000 households (hh_id)). Can you please help me how to finish the code? I think I have to use loops (or maybe there is another package to solve this problem)? Many thanks for your help.
Yesterday I have posted similar question with the difference that the mean_dist were already part of my original dataframe (computed in ArcGIS) but now I want to calculate these in R to compare the results. I have tried to implement the recommended codes from my previous question but without success.
Consider base R's by
since you need to run an operation across different levels of factors (ie, vill_id ). Inside by
, you can call a defined or anonymous function which will return a list of dataframes that you can row bind back to one dataframe:
dfList <- by(df, df[c("vill_id")], FUN = function(i){
sub <- i[, c(1, 2)]
tmp <- distm(sub, fun = distVincentyEllipsoid)
diag(tmp) = NA
i$mean_dist <- mean(tmp[!is.na(tmp)]) # NEW COLUMN ADDED
return(i)
})
finaldf <- do.call(rbind, dfList)
Should you need vill_id and hh_id subset, add to the factor list:
dfList <- by(df, df[c("vill_id", "hh_id")], FUN = function(i){ ... })
And if you only need vill_id and mean_dist returned from function, change return value:
newdf <- unique(i[c("vill_id", "mean_dist")]
return(newdf)
Specifically, the following block of code:
df_100 <- df[df$vill_id == 100, ] # BY REPLACES THIS LINE
df_100_2 <- df_100[, c(1, 2)]
dist_100_2 <- distm(df_100_2, fun = distVincentyEllipsoid)
diag(dist_100_2) = NA
dist_100_2_final <- dist_100_2[!is.na(dist_100_2)]
mean(dist_100_2_final)
Is translated as the following where i is the by
function variable:
sub <- i[,c(1, 2)]
tmp <- distm(sub, fun = distVincentyEllipsoid)
diag(tmp) = NA
i$mean_dist <- mean(tmp[!is.na(tmp)])
Another way would be to use lapply()
. I basically revised your code. One thing I added was to split your data by vill_id
and create a list. Then, I applied your chunk of code for calculating distance to each split data frame in lapply()
. Finally, I created a data frame with mean values.
library(geosphere)
mylist <- split(df, f = df$vill_id)
unlist(lapply(mylist, function(x){
foo <- x[, 1:2]
foo <- distm(foo, fun = distVincentyEllipsoid)
diag(foo) = NA
out <- foo[!is.na(foo)]
average <- mean(out)
average
})
) -> mean_dist
data.frame(vill_id = unique(df$vill_id),
mean_dist = mean_dist)
# vill_id mean_dist
#100 100 587553.5
#101 101 858785.6
#102 102 778299.1
#103 103 NaN
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.