
Titan geo data on Cassandra

Posted 2014-03-15 03:14:20. Tags: graph, cassandra, titan

I'm looking at using Titan to build a scalable geospatial data store (I'm thinking of R-trees). The documentation describes a GeoShape query and says that Titan can handle geo data with Lucene or ElasticSearch. However, it seems like this would be very slow, because traversing nodes stored in Cassandra essentially means doing join queries in Cassandra, which is a really bad idea. I think I might be misunderstanding the data representation.

I read the Titan Data Model doc, and I still don't quite get it. If all of a vertex's edges are stored in one Cassandra row, then Titan would still have to "join" against a vertex table. One way to solve this would be to store the edge property data in the column value, so that the vertex data and the edge data are neatly packaged into the same row. However, this breaks down for queries that go deeper than one hop, and we're back to the join problem again.

So: is Titan emulating join queries in Cassandra? And how well does it perform at geo lookups under these conditions?

1 Answer

I think the question conflates edge traversal with geospatial index lookups. These are separate at both the API and implementation levels. The index is not shown in the data model diagrams.

Let's make this a little more specific. Say I run Titan with ES and Cassandra, using Murmur3Partitioner or RandomPartitioner. I declare an ES geospatial index over the edge key "place", as documented on the Getting Started page. Looking up edges by a geospatial query, such as the "WITHIN" example in the Getting Started docs, first hits ES. ES returns IDs that Titan can use to look up the associated vertex/edge data in Cassandra quickly, without doing anything analogous to a relational join.
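To make the two-step flow concrete, here is a rough sketch in plain Python. It is a toy model, not Titan's or ES's API: `EDGE_STORE`, `GEO_INDEX`, `index_within`, and `edges_within` are hypothetical names, the coordinates are made-up sample data, and the "index" is a linear scan where a real one would use a spatial structure. The point is only that the index hands back IDs, and each ID is then fetched with a keyed read.

```python
import math

# Toy stand-in for Cassandra: edge ID -> stored edge data (one keyed read per ID).
EDGE_STORE = {
    101: {"label": "battled", "place": (38.1, 23.7)},
    102: {"label": "battled", "place": (37.7, 23.9)},
    103: {"label": "battled", "place": (39.0, 22.0)},
}

# Toy stand-in for the geo index: it only knows (point, edge ID) pairs.
GEO_INDEX = [(data["place"], edge_id) for edge_id, data in EDGE_STORE.items()]


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def index_within(center, radius_km):
    """Step 1: the index answers the WITHIN-style query and returns IDs only."""
    return [eid for point, eid in GEO_INDEX
            if haversine_km(point, center) <= radius_km]


def edges_within(center, radius_km):
    """Step 2: fetch each matching edge by ID; no scan or join over the store."""
    return [EDGE_STORE[eid] for eid in index_within(center, radius_km)]
```

Calling `edges_within((37.97, 23.72), 50)` touches the store once per ID returned by the index, no matter how many other edges the store holds.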

The cost of these edge lookups by geospatial data should be roughly the cost of ES's WITHIN implementation (which I think is delegated to Spatial4j), plus the lookups Titan makes on Cassandra after getting the IDs, which should be roughly linear in the number of edges found by ES. This is just a back-of-the-envelope estimate, so please take it with a big grain of salt.
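The linear term in that estimate can be illustrated with a small counting model. This is only a sketch under the assumption that each matched ID costs one keyed read; `CountingStore`, `fetch_matches`, and `reads_for` are hypothetical names, and the model says nothing about the cost of the index query itself.

```python
class CountingStore:
    """Key-value store that counts keyed reads (a toy stand-in for Cassandra)."""

    def __init__(self, rows):
        self._rows = dict(rows)
        self.reads = 0

    def get(self, key):
        self.reads += 1
        return self._rows[key]


def fetch_matches(store, matched_ids):
    """Fetch one row per ID returned by the index."""
    return [store.get(i) for i in matched_ids]


def reads_for(k, total=1000):
    """Number of store reads needed when the index matched k of `total` rows."""
    store = CountingStore({i: {"id": i} for i in range(total)})
    fetch_matches(store, range(k))
    return store.reads
```

In this model `reads_for(10)` and `reads_for(10, total=100000)` give the same count: the read cost follows the number of matches, not the size of the store.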

After I get the place edges by geo matching, if I then want to run arbitrary traversals in the neighborhood of each edge in the set, I would look at rooting a MultiQuery on the head/tail vertices and enabling database-level caching. If the query misses the cache, or the cache is cold or disabled, Titan will still attempt to retrieve all the edges the traversal cares about in a single Cassandra slice per vertex, when possible. If you're concerned about Titan's edge traversal efficiency, you might find "Boutique Graph Data with Titan" interesting.
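As a rough picture of "one slice per vertex, with a cache in front", here is a toy Python model. It is not Titan's MultiQuery API or its actual storage format: `VertexRow`, `slice`, and `neighbors` are hypothetical names, and the sketch assumes each vertex's edges sit in one row as columns sorted by (label, neighbor ID), so that all edges with a given label form one contiguous range.

```python
from bisect import bisect_left


class VertexRow:
    """One row per vertex; columns are kept sorted by (label, neighbor_id)."""

    def __init__(self, edges):
        self.columns = sorted(edges)
        self.slice_reads = 0  # counts range reads against this row

    def slice(self, label):
        """Return neighbor IDs for one label via a single contiguous range read."""
        self.slice_reads += 1
        start = bisect_left(self.columns, (label,))
        # label + "\x00" is the smallest string greater than label.
        end = bisect_left(self.columns, (label + "\x00",))
        return [neighbor for _, neighbor in self.columns[start:end]]


def neighbors(rows, cache, vertex_ids, label):
    """Look up one label's neighbors for many vertices, consulting the cache first."""
    result = {}
    for vid in vertex_ids:
        key = (vid, label)
        if key not in cache:  # cache miss: one slice read on that vertex's row
            cache[key] = rows[vid].slice(label)
        result[vid] = cache[key]
    return result
```

With a cold cache, each vertex costs one slice read for the label the traversal cares about; a repeated call for the same vertices and label is served from the cache without touching the rows.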