Best way to store a symmetric data matrix, n x n = 2.6 billion

Question

I have PostgreSQL with the PostGIS extension installed and a table of zip codes with lat/long stored as a point field. I wish to return the zips within a variable distance of some zip, like

return all zips within x miles of zip 12345

There are about 51,000 zip codes. Precomputing all the distances would allow lookups without any computation. Right now I'm doing the computations on the fly. The computed data could be arranged in a symmetric matrix.

I was thinking of this solution:

If we accept that the distance of a zip from itself is implied to be zero, then I could load a table with n(n-1)/2 rows (about 1.3 billion rows), with columns z1, z2, d, and then put a compound index on (z1, d) to return my queryset containing the list of z2.
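The row counts can be checked with a few lines of arithmetic. This is only an illustrative sketch: the helper names `pair_count` and `canonical_pair` are hypothetical, and it assumes each unordered pair is stored once with the smaller zip in z1.

```python
from typing import Tuple


def pair_count(n: int) -> int:
    """Number of unordered pairs of distinct items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def canonical_pair(zip_a: str, zip_b: str) -> Tuple[str, str]:
    """Order a pair so that (a, b) and (b, a) map to the same stored row."""
    return (zip_a, zip_b) if zip_a <= zip_b else (zip_b, zip_a)


n = 51_000
print(pair_count(n))  # half matrix without the diagonal: 1300474500
print(n * n)          # full matrix: 2601000000
```

With a half matrix, every lookup has to go through something like `canonical_pair`, or search both z1 and z2, because a given zip can sit in either column.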

My question is how you would handle this for efficient on-the-fly retrieval. Possibly abandon SQL once all the distances are calculated? Leave it as I have it, doing the computations at query time? I don't care too much about the time needed for the complete distance computation or for indexing. I'd do these annually, or at most quarterly. Might storage also be a concern?

Answer 1

That's an interesting question. I think an RDBMS is perfect for this task. No need to abandon it.

As to storing pre-computed distances: I would only do this if really needed, i.e. if you have performance issues. After all, it's redundant data that must be maintained. If you decide on such a table, I agree with Vesper: store all n^2 rows, for otherwise you will always have to combine two queries, one to look up your zip code in z1 and one to look it up in z2.
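The difference between the two layouts can be sketched with Python's built-in sqlite3 module, used here only as a stand-in for PostgreSQL. The table names `dist_half` and `dist_full`, the function names, and the three distances are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dist_half (z1 TEXT, z2 TEXT, d REAL)")  # each pair once, z1 < z2
conn.execute("CREATE TABLE dist_full (z1 TEXT, z2 TEXT, d REAL)")  # both directions stored
conn.execute("CREATE INDEX ix_half_z1 ON dist_half (z1, d)")
conn.execute("CREATE INDEX ix_half_z2 ON dist_half (z2, d)")
conn.execute("CREATE INDEX ix_full_z1 ON dist_full (z1, d)")

# Made-up distances between three hypothetical zip codes.
pairs = [("10001", "10002", 1.2), ("10001", "10003", 2.5), ("10002", "10003", 1.9)]
conn.executemany("INSERT INTO dist_half VALUES (?, ?, ?)", pairs)
conn.executemany("INSERT INTO dist_full VALUES (?, ?, ?)", pairs)
conn.executemany("INSERT INTO dist_full VALUES (?, ?, ?)", [(b, a, d) for a, b, d in pairs])


def within_half(conn, zip_code, max_d):
    """Half matrix: the zip may be in z1 or z2, so two lookups are combined."""
    rows = conn.execute(
        "SELECT z2, d FROM dist_half WHERE z1 = ? AND d <= ? "
        "UNION ALL "
        "SELECT z1, d FROM dist_half WHERE z2 = ? AND d <= ? "
        "ORDER BY d",
        (zip_code, max_d, zip_code, max_d),
    ).fetchall()
    return [r[0] for r in rows]


def within_full(conn, zip_code, max_d):
    """Full matrix: a single lookup on the (z1, d) index is enough."""
    rows = conn.execute(
        "SELECT z2 FROM dist_full WHERE z1 = ? AND d <= ? ORDER BY d",
        (zip_code, max_d),
    ).fetchall()
    return [r[0] for r in rows]


print(within_half(conn, "10002", 2.0))
print(within_full(conn, "10002", 2.0))
```

Both functions return the same neighbours; the half-matrix version pays for its smaller table with a second index and a more complex query.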

But maybe you can speed up your existing query. I don't know how you went about it. I remember the formula for distances being quite complicated. So what I would do is first calculate the extreme latitudes and longitudes that are still within the desired range (i.e. if I stay at the same latitude, what are the minimum and maximum longitudes still in range; if I stay at the same longitude, what are the minimum and maximum latitudes). With those values calculated you can select all zip codes in that rectangle with BETWEEN (so indexes on longitude and latitude might come in handy) and then apply the exact formula only to the records thus found.
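A minimal sketch of this two-step filter follows, assuming a spherical Earth with a mean radius of 3958.8 miles and using the haversine formula as the "exact" distance. The function names and the sample coordinates are hypothetical. The longitude margin uses the widest extent of the circle rather than the extent at the centre's own latitude, so the rectangle is guaranteed to enclose the circle; the sketch does not handle the poles or the 180° meridian.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius (spherical approximation)


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/long points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(min(1.0, math.sqrt(a)))


def bounding_box(lat, lon, radius_miles):
    """Return (min_lat, max_lat, min_lon, max_lon) enclosing the search circle."""
    ang = radius_miles / EARTH_RADIUS_MILES  # angular radius in radians
    dlat = math.degrees(ang)
    dlon = math.degrees(math.asin(math.sin(ang) / math.cos(math.radians(lat))))
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon


def zips_within(center, candidates, radius_miles):
    """Cheap rectangle test first, exact distance only for the survivors."""
    lat, lon = center
    min_lat, max_lat, min_lon, max_lon = bounding_box(lat, lon, radius_miles)
    hits = []
    for zip_code, (clat, clon) in candidates.items():
        if min_lat <= clat <= max_lat and min_lon <= clon <= max_lon:
            d = haversine_miles(lat, lon, clat, clon)
            if d <= radius_miles:
                hits.append((zip_code, d))
    return sorted(hits, key=lambda h: h[1])


# Hypothetical sample points around (40.0, -75.0).
candidates = {
    "near": (40.05, -75.0),     # a few miles north of the centre
    "corner": (40.14, -74.82),  # inside the rectangle, outside the circle
    "far": (41.0, -75.0),       # outside the rectangle
}
print([z for z, _ in zips_within((40.0, -75.0), candidates, 10)])
```

In SQL, the rectangle test corresponds to the two BETWEEN conditions, which plain indexes on latitude and longitude can serve.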

EDIT: I have given it more thought. If this database only exists for the task you describe, then yes, why not have another table for this particular purpose. You are right to mention storage. This table will need several GB and the index will take a lot of space, too. But with enough hard disk space available, this should be no problem.
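For a rough sense of scale, the raw payload can be estimated as rows times bytes per row. The sketch below assumes z1 and z2 are stored as 4-byte integers and d as a 4-byte float; those column types and the helper name `raw_table_bytes` are assumptions, not something stated in the question. Real tables add per-row overhead and the index comes on top, so treat the result as a lower bound.

```python
def raw_table_bytes(rows: int, bytes_per_row: int) -> int:
    """Payload size only: no per-row overhead, no indexes."""
    return rows * bytes_per_row


n = 51_000
payload = 4 + 4 + 4  # assumed: z1 int4, z2 int4, d float4

full_rows = n * n              # every ordered pair
half_rows = n * (n - 1) // 2   # each unordered pair once, no diagonal

print(raw_table_bytes(full_rows, payload) / 1e9)  # GB, full matrix
print(raw_table_bytes(half_rows, payload) / 1e9)  # GB, half matrix
```

Under these assumptions the full matrix carries roughly 31 GB of raw data and the half matrix roughly 16 GB, before any overhead.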

Answer 2

Have you considered using EarthDistance? With it, you can index "boxes", which are areas that basically "square off" your search area instead of it being round, so it can be indexed more easily. Then, within your query, you also include a "radius" type condition that eliminates the extra results returned by the box method.

http://www.postgresql.org/docs/9.2/static/earthdistance.html

Answer 3

Postgres/PostGIS spatial indexes are designed to do exactly this kind of search. They are based on R-trees, http://en.wikipedia.org/wiki/R_tree , which essentially subdivide your spatial data into boxes, i.e., the index is 2-dimensional. There is a function, ST_DWithin, which returns true when two geometries are within distance x of each other, so it can be used to find all the geometries within that distance of some other geometry. So, given a table of zip codes and points (called geom) representing lat/long locations, you can write queries such as

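-- columns are qualified because both z and s expose a geom column
-- 10000 is in the units of geom's spatial reference (meters assumed here)
select z.zip, z.geom
from zipcodes z,
  (select geom from zipcodes where zip = 12345) s
where ST_DWithin(s.geom, z.geom, 10000)
order by ST_Distance(s.geom, z.geom)
limit 5;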

which will return the nearest 5 zip codes within 10 km of zip code 12345.

As you can index both the zip code and the geometry field very efficiently, it would be unnecessary, IMHO, to store a matrix of all the possible distances, as spatial indexes perform well with tens of millions of rows.

Creating a spatial index in PostGIS is as easy as:

create index ix_spatial_zips on zipcodes using gist(geom);

I realize that this doesn't answer your original question exactly, but it means you will only need to store 51,000 rows, rather than the Cartesian product of that number, and the performance will be better too.


Answer 1
1 2014-05-06 12:27:37

Answer 2
1 2014-05-06 14:45:55

Answer 3
1 2014-05-06 19:07:04
