[英]Average geodetic distance between subsets of points (same ID) using distm(distVincentyEllipsoid) and storing the results in a new dataframe in R
My database has the following structure: 我的数据库具有以下结构:
> long <- c(13.2345, 14.2478, 16.2001, 11.2489, 17.4784, 27.6478, 14.2500, 12.2100, 11.2014, 12.2147)
> lat <- c(47.1247, 48.2013, 41.2547, 41.2147, 40.3247, 46.4147, 42.4786, 41.2478, 48.2147, 47.2157)
> hh_id <- 1:10
> vill_id <- c(rep(100, 4), rep(101, 3), rep(102, 2), 103)
> df <- matrix(c(long, lat, hh_id, vill_id), nrow = 10, ncol = 4)
> colnames(df) <- c("longitude", "latitude", "hh_id", "vill_id")
> df <- as.data.frame(df)
> df
longitude latitude hh_id vill_id
13.2345 47.1247 1 100
14.2478 48.2013 2 100
16.2001 41.2547 3 100
11.2489 41.2147 4 100
17.4784 40.3247 5 101
27.6478 46.4147 6 101
14.2500 42.4786 7 101
12.2100 41.2478 8 102
11.2014 48.2147 9 102
12.2147 47.2157 10 103
hh_id - households IDs hh_id-家庭ID
vill_id - village IDs vill_id-村庄ID
Households with identical ID belong to the same village. 具有相同ID的家庭属于同一村庄。
My aim: calculate the mean distance between all points with the same vill_id and store the result in a new data frame: 我的目标是:计算具有相同vill_id的所有点之间的平均距离,并将结果存储在新的数据框中:
vill_id mean_dist
100 587553.5
101 …………………
102 …………………
103 ………………
My approach: To calculate the geodetic distance between points I have used the distm command from the geosphere package (distVincentyEllipsoid should be most accurate) 我的方法:要计算点之间的测地距离,我已使用geosphere包中的distm命令(distVincentyEllipsoid应该是最准确的)
> library(geosphere)
> df_100 <- df[df$vill_id == 100, ]
> dist_100 <- distm(df_100, fun = distVincentyEllipsoid)
Error in .pointsToMatrix(p1) : Wrong length for a vector, should be 2 -->
> df_100_2 <- df_100[, c(1, 2)]
> dist_100_2 <- distm(df_100_2, fun = distVincentyEllipsoid)
> dist_100_2
[,1] [,2] [,3] [,4]
[1,] 0.0 141844.7 693867.8 675556.9
[2,] 141844.7 0.0 787217.4 811777.4
[3,] 693867.8 787217.4 0.0 415056.6
[4,] 675556.9 811777.4 415056.6 0.0
So a symmetric distance matrix for all points with vill_id = 100 was generated. 因此,生成了vill_id = 100的所有点的对称距离矩阵。 To calculate the mean distance I need to to decompose this matrix (or drop all of the diagonal values (0)).
要计算平均距离,我需要分解该矩阵(或除去所有对角线值(0))。
> diag(dist_100_2) = NA
> dist_100_2_final <- dist_100_2[!is.na(dist_100_2)]
> dist_100_2_final
[1] 141844.7 693867.8 675556.9 141844.7 787217.4 811777.4 693867.8 787217.4 415056.6 675556.9
[11] 811777.4 415056.6
> mean(dist_100_2_final)
[1] 587553.5 (in m)
So far so good. 到现在为止还挺好。 Now I need to create a new dataframe which stores the mean distances for all subsets with the same ID (my original database has over 200 villages (vill_id) and almost 2000 households (hh_id)).
现在,我需要创建一个新的数据框,以存储具有相同ID的所有子集的平均距离(我的原始数据库有200多个村庄(vill_id)和近2000户(hh_id))。 Can you please help me how to finish the code?
您能帮我完成代码吗? I think I have to use loops (or maybe there is another package to solve this problem)?
我想我必须使用循环(或者也许有另一个软件包来解决这个问题)? Many thanks for your help.
非常感谢您的帮助。
Yesterday I have posted similar question with the difference that the mean_dist were already part of my original dataframe (computed in ArcGIS) but now I want to calculate these in R to compare the results. 昨天我发布了类似的问题,不同之处在于mean_dist已经是我的原始数据框的一部分(在ArcGIS中计算),但现在我想在R中计算这些值以比较结果。 I have tried to implement the recommended codes from my previous question but without success.
我尝试实施上一个问题中推荐的代码,但没有成功。
Consider base R's by
since you need to run an operation across different levels of factors (ie, vill_id ). 考虑基数R
by
因为您需要在不同级别的因子(即vill_id )上运行操作。 Inside by
, you can call a defined or anonymous function which will return a list of dataframes that you can row bind back to one dataframe: 在
by
,您可以调用定义的或匿名函数,该函数将返回一个数据框列表,您可以将其行绑定回一个数据框:
dfList <- by(df, df[c("vill_id")], FUN = function(i){
sub <- i[, c(1, 2)]
tmp <- distm(sub, fun = distVincentyEllipsoid)
diag(tmp) = NA
i$mean_dist <- mean(tmp[!is.na(tmp)]) # NEW COLUMN ADDED
return(i)
})
finaldf <- do.call(rbind, dfList)
Should you need vill_id and hh_id subset, add to the factor list: 如果您需要vill_id和hh_id子集,请添加到因子列表中:
dfList <- by(df, df[c("vill_id", "hh_id")], FUN = function(i){ ... })
And if you only need vill_id and mean_dist returned from function, change return value: 如果只需要从函数返回vill_id和mean_dist ,则更改返回值:
newdf <- unique(i[c("vill_id", "mean_dist")]
return(newdf)
Specifically, the following block of code: 具体来说,下面的代码块:
df_100 <- df[df$vill_id == 100, ] # BY REPLACES THIS LINE
df_100_2 <- df_100[, c(1, 2)]
dist_100_2 <- distm(df_100_2, fun = distVincentyEllipsoid)
diag(dist_100_2) = NA
dist_100_2_final <- dist_100_2[!is.na(dist_100_2)]
mean(dist_100_2_final)
Is translated as the following where i is the by
function variable: 转换为以下内容,其中i是
by
函数变量:
sub <- i[,c(1, 2)]
tmp <- distm(sub, fun = distVincentyEllipsoid)
diag(tmp) = NA
i$mean_dist <- mean(tmp[!is.na(tmp)])
Another way would be to use lapply()
. 另一种方法是使用
lapply()
。 I basically revised your code. 我基本上修改了您的代码。 One thing I added was to split your data by
vill_id
and create a list. 我添加的一件事是按
vill_id
拆分数据并创建一个列表。 Then, I applied your chunk of code for calculating distance to each split data frame in lapply()
. 然后,我将您的代码块应用于
lapply()
每个拆分数据帧的距离计算。 Finally, I created a data frame with mean values. 最后,我创建了一个带有平均值的数据框。
library(geosphere)
mylist <- split(df, f = df$vill_id)
unlist(lapply(mylist, function(x){
foo <- x[, 1:2]
foo <- distm(foo, fun = distVincentyEllipsoid)
diag(foo) = NA
out <- foo[!is.na(foo)]
average <- mean(out)
average
})
) -> mean_dist
data.frame(vill_id = unique(df$vill_id),
mean_dist = mean_dist)
# vill_id mean_dist
#100 100 587553.5
#101 101 858785.6
#102 102 778299.1
#103 103 NaN
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.