Subsetting Matrix Based on contents of dataframe

I have a 100 x 100 correlation matrix with zip codes as the column and row names. I also have a data frame that contains the latitude and longitude for all zip codes and a function that calculates the distance based on lat and long.

Here is a snippet of the correlation matrix:

            08846       48186       90621      92602       92701       92702       92703      92705      92706      92712
08846  1.00000000 -0.18704668  0.17631080 -0.0195590 -0.08640209 -0.09109788 -0.04251868 -0.1586506 -0.0778115 -0.0572327
48186 -0.18704668  1.00000000 -0.09365048  0.1616530  0.20468051  0.17682056  0.18009911  0.1417840  0.1958971  0.1938676
90621  0.17631080 -0.09365048  1.00000000  0.5880756  0.75200501  0.74694849  0.76071605  0.6593806  0.7640519  0.7657806
92602 -0.01955900  0.16165299  0.58807565  1.0000000  0.88187818  0.88947447  0.89310793  0.9615530  0.8926566  0.8926482
92701 -0.08640209  0.20468051  0.75200501  0.8818782  1.00000000  0.99314798  0.98011569  0.9294281  0.9827633  0.9886139
92702 -0.09109788  0.17682056  0.74694849  0.8894745  0.99314798  1.00000000  0.98791442  0.9470895  0.9853157  0.9933086
92703 -0.04251868  0.18009911  0.76071605  0.8931079  0.98011569  0.98791442  1.00000000  0.9321385  0.9938496  0.9981231
92705 -0.15865058  0.14178399  0.65938061  0.9615530  0.92942815  0.94708954  0.93213849  1.0000000  0.9268797  0.9357917
92706 -0.07781150  0.19589706  0.76405191  0.8926566  0.98276329  0.98531570  0.99384961  0.9268797  1.0000000  0.9948550
92712 -0.05723270  0.19386757  0.76578065  0.8926482  0.98861389  0.99330864  0.99812312  0.9357917  0.9948550  1.0000000

Here is a snippet of the table of zip codes:

    zip       city state latitude longitude
1 00210 Portsmouth    NH  43.0059  -71.0132
2 00211 Portsmouth    NH  43.0059  -71.0132
3 00212 Portsmouth    NH  43.0059  -71.0132
4 00213 Portsmouth    NH  43.0059  -71.0132
5 00214 Portsmouth    NH  43.0059  -71.0132
6 00215 Portsmouth    NH  43.0059  -71.0132

And here is the function that calculates the distance between two lat/long pairs:

Calc_Dist <- function (long1, lat1, long2, lat2)
{
  rad <- pi/180
  a1 <- lat1 * rad
  a2 <- long1 * rad
  b1 <- lat2 * rad
  b2 <- long2 * rad
  dlon <- b2 - a2
  dlat <- b1 - a1
  a <- (sin(dlat/2))^2 + cos(a1) * cos(b1) * (sin(dlon/2))^2
  c <- 2 * atan2(sqrt(a), sqrt(1 - a))
  R <- 6378.145
  d <- R * c
  return(d)
}

My goal here is to subset the correlation matrix to only include zip codes that are more than 500 miles apart (right now the distance calculation outputs kilometers, but that can be easily changed and is immaterial for now). The less expensive the better, as I may have to do this with larger correlation matrices (~10000 x 10000). Any suggestions?

Thanks in advance, Ben

Is it critical that you use that distance function? I think dist should be much more efficient.

#Making your zip.table a data.table helps us with speed
library(reshape)
library(data.table)
setDT(zip.table) 

#Calculate distance matrix and put into table form
setorder(zip.table,zip) #sort the zip table itself; zip.dist does not exist yet
zip.dist <- dist(zip.table[,.(longitude=abs(longitude),latitude)])
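zip.dist <- as.matrix(zip.dist)
#Label rows and columns by zip code; otherwise melt returns row indices and the merge below finds no matches
dimnames(zip.dist) <- list(zip.table$zip, zip.table$zip)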
zip.dist <- melt(zip.dist)[melt(upper.tri(zip.dist))$value,]
setDT(zip.dist)
setnames(zip.dist,c("zip1", "zip2", "distance"))

#Do a very similar procedure with your correlation matrix
#It is important that you sorted your zip.table by zip before applying `cor`
zip.corr <- as.matrix(zip.corr)
zip.corr <- melt(zip.corr)[melt(upper.tri(zip.corr))$value,]
setDT(zip.corr)
setnames(zip.corr,c("zip1", "zip2", "cor"))

#Subset zip.dist to only include zip codes more than 500 miles apart
zip.dist <- zip.dist[distance*69 > 500] #roughly 69 miles per degree of lat/lon

#Merge together
setkey(zip.dist,zip1,zip2)
setkey(zip.corr,zip1,zip2)
result.table <- zip.dist[zip.corr, nomatch=0]
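
The factor of 69 used in the threshold above can be checked with a little arithmetic. The sketch below is only an illustration: it assumes a spherical Earth with the radius from the question's Calc_Dist, and the names R_km, km_per_mile, miles_per_deg_lat and miles_per_deg_lon are made up for this example.

```r
# Length of one degree in miles, using the Earth radius from the question (km)
R_km <- 6378.145
km_per_mile <- 1.609344
miles_per_deg_lat <- (2 * pi * R_km / 360) / km_per_mile

# A degree of longitude shrinks with the cosine of the latitude
miles_per_deg_lon <- function(lat_deg) miles_per_deg_lat * cos(lat_deg * pi / 180)

miles_per_deg_lat                    # about 69.2 miles
miles_per_deg_lon(c(0, 33.7, 45))    # gets smaller as latitude increases
```

At roughly 33.7 degrees north (around the 927xx zip codes), a degree of longitude is closer to 57.5 miles, so distance*69 overstates east-west separations by about 20%. Pairs whose scaled distance lands near the 500-mile cutoff may therefore be misclassified.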

Since these places are all pretty close to one another, I don't think you lose much by using Euclidean distance, especially since each zip code is just one lat/lon inside of a large county.
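
If the great-circle distance does matter, one possible base R alternative is to build the full pairwise distance matrix with outer and blank out the correlations for pairs that are too close. This is a rough sketch, not a tuned solution: the coordinates and correlations are approximate toy values, Calc_Dist is a compact copy of the question's function, and dist.miles and far.corr are names invented here.

```r
# Haversine distance in kilometres; compact copy of Calc_Dist from the question
Calc_Dist <- function(long1, lat1, long2, lat2) {
  rad <- pi / 180
  a1 <- lat1 * rad; a2 <- long1 * rad
  b1 <- lat2 * rad; b2 <- long2 * rad
  a <- sin((b1 - a1) / 2)^2 + cos(a1) * cos(b1) * sin((b2 - a2) / 2)^2
  6378.145 * 2 * atan2(sqrt(a), sqrt(1 - a))
}

# Toy inputs (approximate, illustrative values only)
zip.table <- data.frame(
  zip       = c("08846", "48186", "92602", "92701"),
  latitude  = c(40.57, 42.29, 33.73, 33.75),
  longitude = c(-74.50, -83.37, -117.77, -117.86),
  stringsAsFactors = FALSE
)
zips <- zip.table$zip
zip.corr <- matrix(c( 1.00, -0.19, -0.02, -0.09,
                     -0.19,  1.00,  0.16,  0.20,
                     -0.02,  0.16,  1.00,  0.88,
                     -0.09,  0.20,  0.88,  1.00),
                   nrow = 4, dimnames = list(zips, zips))

# Align the coordinates with the row order of the correlation matrix
idx <- match(rownames(zip.corr), zip.table$zip)
lat <- zip.table$latitude[idx]
lon <- zip.table$longitude[idx]

# Pairwise distances in miles; outer() hands whole index vectors to the
# function, which works because Calc_Dist is vectorized
dist.miles <- outer(seq_along(idx), seq_along(idx),
                    function(i, j) Calc_Dist(lon[i], lat[i], lon[j], lat[j])) / 1.609344
dimnames(dist.miles) <- dimnames(zip.corr)

# Keep correlations only for pairs more than 500 miles apart
far.corr <- zip.corr
far.corr[dist.miles <= 500] <- NA
```

For a 10000 x 10000 matrix the distance matrix alone takes about 800 MB as doubles, so processing it in blocks of rows may be necessary.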
