Subsetting a matrix based on the contents of a data frame
I have a 100x100 correlation matrix with zip codes as the column and row names. I also have a data frame that contains the latitude and longitude for all zip codes, and a function that calculates the distance based on latitude and longitude.
Here is a snippet of the correlation matrix:
08846 48186 90621 92602 92701 92702 92703 92705 92706 92712
08846 1.00000000 -0.18704668 0.17631080 -0.0195590 -0.08640209 -0.09109788 -0.04251868 -0.1586506 -0.0778115 -0.0572327
48186 -0.18704668 1.00000000 -0.09365048 0.1616530 0.20468051 0.17682056 0.18009911 0.1417840 0.1958971 0.1938676
90621 0.17631080 -0.09365048 1.00000000 0.5880756 0.75200501 0.74694849 0.76071605 0.6593806 0.7640519 0.7657806
92602 -0.01955900 0.16165299 0.58807565 1.0000000 0.88187818 0.88947447 0.89310793 0.9615530 0.8926566 0.8926482
92701 -0.08640209 0.20468051 0.75200501 0.8818782 1.00000000 0.99314798 0.98011569 0.9294281 0.9827633 0.9886139
92702 -0.09109788 0.17682056 0.74694849 0.8894745 0.99314798 1.00000000 0.98791442 0.9470895 0.9853157 0.9933086
92703 -0.04251868 0.18009911 0.76071605 0.8931079 0.98011569 0.98791442 1.00000000 0.9321385 0.9938496 0.9981231
92705 -0.15865058 0.14178399 0.65938061 0.9615530 0.92942815 0.94708954 0.93213849 1.0000000 0.9268797 0.9357917
92706 -0.07781150 0.19589706 0.76405191 0.8926566 0.98276329 0.98531570 0.99384961 0.9268797 1.0000000 0.9948550
92712 -0.05723270 0.19386757 0.76578065 0.8926482 0.98861389 0.99330864 0.99812312 0.9357917 0.9948550 1.0000000
Here is a snippet of the table of zip codes:
zip city state latitude longitude
1 00210 Portsmouth NH 43.0059 -71.0132
2 00211 Portsmouth NH 43.0059 -71.0132
3 00212 Portsmouth NH 43.0059 -71.0132
4 00213 Portsmouth NH 43.0059 -71.0132
5 00214 Portsmouth NH 43.0059 -71.0132
6 00215 Portsmouth NH 43.0059 -71.0132
And here is the function that calculates the distance between two latitude/longitude points:
Calc_Dist <- function (long1, lat1, long2, lat2)
{
    rad <- pi/180
    a1 <- lat1 * rad
    a2 <- long1 * rad
    b1 <- lat2 * rad
    b2 <- long2 * rad
    dlon <- b2 - a2
    dlat <- b1 - a1
    a <- (sin(dlat/2))^2 + cos(a1) * cos(b1) * (sin(dlon/2))^2
    c <- 2 * atan2(sqrt(a), sqrt(1 - a))
    R <- 6378.145
    d <- R * c
    return(d)
}
My goal here is to subset the correlation matrix to only include pairs of zip codes that are more than 500 miles apart (right now the distance calculation outputs kilometers, but that can easily be changed and is immaterial for now). The less expensive the better, as I may have to do this with larger correlation matrices (~10000 x 10000). Any suggestions?
Thanks in advance, Ben
Is it critical that you use that distance function? I think `dist` should be much more efficient.
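# Making your zip.table a data.table helps us with speed
library(reshape)      # melt() for matrices
library(data.table)
setDT(zip.table)
# Keep only the zip codes in the correlation matrix and sort them by zip
# (assumes zip.table$zip uses the same labels as rownames(zip.corr))
zip.table <- zip.table[zip %in% rownames(zip.corr)]
setorder(zip.table, zip)
# Calculate the distance matrix (Euclidean, in degrees) and put it into table form
coords <- as.matrix(zip.table[, .(longitude, latitude)])
rownames(coords) <- as.character(zip.table$zip)  # carry the zip codes through dist() as labels
zip.dist <- as.matrix(dist(coords))
# reshape::melt is called explicitly because data.table also exports a melt()
zip.dist <- reshape::melt(zip.dist)[reshape::melt(upper.tri(zip.dist))$value, ]
setDT(zip.dist)
setnames(zip.dist, c("zip1", "zip2", "distance"))
# Do a very similar procedure with your correlation matrix
# It is important that you sorted your zip.table by zip before applying `cor`
zip.corr <- as.matrix(zip.corr)
zip.corr <- reshape::melt(zip.corr)[reshape::melt(upper.tri(zip.corr))$value, ]
setDT(zip.corr)
setnames(zip.corr, c("zip1", "zip2", "cor"))
# Subset zip.dist to only include zip codes more than 500 miles apart
# 1 degree of latitude ~ 69 miles; a degree of longitude is shorter away from the equator
zip.dist <- zip.dist[distance * 69 > 500]
# Merge together
setkey(zip.dist, zip1, zip2)
setkey(zip.corr, zip1, zip2)
result.table <- zip.dist[zip.corr, nomatch = 0]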
Since these places are all pretty close to one another, I don't think you lose much by using Euclidean distance, especially since each zip code is just one lat/lon point inside of a large county.
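If the Euclidean shortcut turns out to be too rough (the snippet in the question mixes zip codes from New Jersey, Michigan and California), a great-circle version can be sketched in base R with `outer`. This is only a sketch under assumptions: `calc_dist_miles` is a hypothetical miles variant of the question's `Calc_Dist`, the coordinates are approximate values made up for illustration, and the toy correlations are rounded from the question's snippet.

```r
# Haversine distance in miles, vectorised over its arguments.
# Same formula as Calc_Dist in the question; only the radius unit differs.
calc_dist_miles <- function(long1, lat1, long2, lat2) {
  rad  <- pi / 180
  dlat <- (lat2 - lat1) * rad
  dlon <- (long2 - long1) * rad
  a <- sin(dlat / 2)^2 + cos(lat1 * rad) * cos(lat2 * rad) * sin(dlon / 2)^2
  R <- 6378.145 / 1.609344   # the question's radius in km, converted to miles
  2 * atan2(sqrt(a), sqrt(1 - a)) * R
}

# Toy zip table (approximate, illustrative coordinates only)
zips <- data.frame(
  zip       = c("08846", "48186", "90621"),
  latitude  = c(40.57, 42.29, 33.87),
  longitude = c(-74.50, -83.37, -118.00),
  stringsAsFactors = FALSE
)

# Pairwise distances: outer() passes index vectors, the function is vectorised
idx <- seq_len(nrow(zips))
d <- outer(idx, idx, function(i, j) {
  calc_dist_miles(zips$longitude[i], zips$latitude[i],
                  zips$longitude[j], zips$latitude[j])
})
dimnames(d) <- list(zips$zip, zips$zip)

# Toy correlation matrix with zip codes as dimnames
zip.corr <- matrix(c( 1.00, -0.19,  0.18,
                     -0.19,  1.00, -0.09,
                      0.18, -0.09,  1.00),
                   nrow = 3, dimnames = list(zips$zip, zips$zip))

# Align the distances to the correlation matrix by name,
# then blank out every pair that is 500 miles apart or less
d <- d[rownames(zip.corr), colnames(zip.corr)]
far.corr <- zip.corr
far.corr[d <= 500] <- NA
```

Here `far.corr` keeps the shape of the correlation matrix and marks the excluded pairs (including the diagonal) as `NA`. For a 10000 x 10000 matrix, `outer` builds several vectors of 10^8 elements, so memory may become the limiting factor and processing the rows in blocks could be worth considering.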