
App Engine: 13 StringPropertys vs. 1 StringListProperty (w.r.t. indexing/storage and query performance)

A bit of background first: GeoModel is a library I wrote that adds very basic geospatial indexing and querying functionality to App Engine apps. It is similar in approach to geohashing. The equivalent location hash in GeoModel is called a 'geocell.'

Currently, the GeoModel library adds 13 properties (location_geocell_n, n = 1..13) to each location-aware entity. For example, an entity can have property values such as:

location_geocell_1 = 'a'
location_geocell_2 = 'a3'
location_geocell_3 = 'a3f'
...

This is required so that spatial queries can use equality filters and do not use up an inequality filter.
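As a minimal sketch of the prefix scheme above (the helper name `geocell_properties` is hypothetical and not part of GeoModel), each property holds the first n characters of the entity's most precise geocell, so a search cell of length n can be matched with a plain equality test:

```python
def geocell_properties(cell, max_levels=13):
    """Map a full-resolution geocell string to its per-level prefix properties.

    Hypothetical helper for illustration; GeoModel's real code may differ.
    """
    cell = cell[:max_levels]
    return {
        'location_geocell_%d' % n: cell[:n]
        for n in range(1, len(cell) + 1)
    }


props = geocell_properties('a3f')

# An equality test on the property that matches the search cell's length
# stands in for a prefix (inequality) scan over a single geocell property.
search_cell = 'a3'
matches = props['location_geocell_%d' % len(search_cell)] == search_cell
```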

The problem with the 13-properties approach is that, for any geo query an app would like to run, 13 new indexes must be defined and built. This is definitely a maintenance hassle, as I've just painfully realized while rewriting the demo app for the project. This leads to my first question:

QUESTION 1: Is there any significant storage overhead per index? That is, if I have 13 indexes with n entities in each, versus 1 index with 13n entities in it, is the former much worse than the latter in terms of storage?

It seems like the answer to (1) is no, per this article, but I'd just like to see if anyone has had a different experience.

Now, I'm considering adjusting the GeoModel library so that instead of 13 string properties, there'd only be one StringListProperty called location_geocells, i.e.:

location_geocells = ['a', 'a3', 'a3f']

This results in a much cleaner index.yaml. But I do question the performance implications:

QUESTION 2: If I switch from 13 string properties to 1 StringListProperty, will query performance be adversely affected? My current filter looks like:

query.filter('location_geocell_%d =' % len(search_cell), search_cell)

and the new filter would look like:

query.filter('location_geocells =', search_cell)

Note that the first query has a search space of n entities, whereas the second query has a search space of 13n entities.
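The difference in search space can be pictured with a toy in-memory model. This is only a sketch: plain Python dictionaries stand in for datastore indexes, the entity keys and cells are made up, and three levels are used instead of 13. Both layouts end up with the same total number of index entries, and an equality lookup returns the same keys either way:

```python
from collections import defaultdict

# Made-up entities: key -> most precise geocell.
entities = {'e1': 'a3f', 'e2': 'a3c', 'e3': 'b41'}

# Layout A: one index per level (13 in GeoModel; 3 in this toy example).
per_level = defaultdict(lambda: defaultdict(set))
# Layout B: a single index over every value in the list property.
combined = defaultdict(set)

for key, cell in entities.items():
    for n in range(1, len(cell) + 1):
        per_level[n][cell[:n]].add(key)
        combined[cell[:n]].add(key)


def count_entries(index):
    """Count (value, key) rows in one index."""
    return sum(len(keys) for keys in index.values())


entries_a = sum(count_entries(idx) for idx in per_level.values())
entries_b = count_entries(combined)

search_cell = 'a3'
result_a = per_level[len(search_cell)][search_cell]
result_b = combined[search_cell]
```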

It seems like the answer to (2) is that both result in equal query performance, per tip #6 in this blog post, but again, I'd like to see if anyone has any differing real-world experiences with this.

Lastly, if anyone has any other suggestions or tips that can help improve storage utilization, query performance and/or ease of use (specifically w.r.t. index.yaml), please do let me know! The source can be found here: geomodel & geomodel.py

You're correct that there's no significant overhead per-index - 13n entries in one index is more or less equivalent to n entries in 13 indexes. There's a total index count limit of 100, though, so this eats up a fair chunk of your available indexes.

That said, using a ListProperty is definitely a far superior approach from usability and index consumption perspectives. There is, as you supposed, no performance difference between querying a small index and a much larger index, supposing both queries return the same number of rows.

The only reason I can think of for using separate properties is if you knew you only needed to index on certain levels of detail - but that could be accomplished better at insert-time by specifying the levels of detail you want added to the list in the first place.
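A minimal sketch of that insert-time selection, assuming a hypothetical helper `geocells_for_levels` (not part of GeoModel) that keeps only the prefixes at the requested levels of detail:

```python
def geocells_for_levels(cell, levels):
    """Return only the prefixes of `cell` whose length is in `levels`.

    Hypothetical helper; levels beyond the cell's own length are ignored.
    """
    return [cell[:n] for n in sorted(set(levels)) if 1 <= n <= len(cell)]


# Index only levels 2, 4 and 6 of a made-up six-character geocell.
location_geocells = geocells_for_levels('a3f9c0', levels=(2, 4, 6))
```

The resulting list would then be assigned to the list property before the entity is stored, so only those levels occupy index rows.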

Note, though, that in either case you only need the indexes if you intend to query the geocell properties in conjunction with a sort order or inequality filter - in all other cases, the automatic indexing will suffice.

"Lastly, if anyone has any other suggestions or tips that can help improve storage utilization, query performance and/or ease of use"

The StringListProperty is the way to go for the reasons mentioned above, but in actual usage one might want to add the geocells to one's own previously existing StringList so one could query against multiple properties.

So, if you were to provide a lower-level API, it could work with full-text search implementations like Bill Katz's:

def point2StringList(Point, stub="blah"):
    .....
    return ["blah_1:a", "blah_2:a3", "blah_3:a3f", ....]

def boundingbox2Wheresnippet(Box, stringlist="words", stub="blah"):
    .....
    return "words='%s_3:a3f' AND words='%s_3:b4g' ..." % (stub, stub)

etc.
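One way the token idea above might look in runnable form is sketched below. It assumes the geocell string has already been computed from the point, and the name `cell_to_tokens` is hypothetical; it is not an existing GeoModel function:

```python
def cell_to_tokens(cell, stub='blah'):
    """Turn a geocell into level-tagged tokens such as 'blah_2:a3'.

    Hypothetical helper: the level number in each token keeps geocell
    prefixes from colliding with ordinary words in the same string list.
    """
    return ['%s_%d:%s' % (stub, n, cell[:n]) for n in range(1, len(cell) + 1)]


# Append the geocell tokens to an existing, made-up keyword list so that a
# single list property can serve both keyword and location filters.
words = ['pizza', 'cafe'] + cell_to_tokens('a3f', stub='loc')
```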

Looks like you ended up with 13 indices because you encoded in hex (for human readability / map levels?). If you had utilized the full potential of a byte (ByteString), you'd have had 256 cells instead of 16 cells per character (byte), thereby needing far fewer indices for the same precision.

ByteString is just a subclass of str and is indexed similarly if it is less than 500 bytes in length.

However, the number of levels might be lower; to me, 4 or 5 levels is practically good enough for most situations on 'the Earth'. For a larger planet, or when cataloging each sand particle, more divisions might need to be introduced anyway, irrespective of the encoding used. In either case ByteString is better than hex encoding and helps reduce indexing substantially.

  • For representing 4 billion low(est) level cells, all we need is 4 bytes or just 4 indices. (From basic computer architecture or memory addressing.)
  • For representing the same, we'd need 8 hex digits or 8 indices.
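Level counts like these can be checked with a short calculation. The sketch below assumes one index per level and a fixed number of cells per level; `levels_needed` is an illustrative name only:

```python
def levels_needed(total_cells, cells_per_level):
    """Smallest number of levels whose combined cell count reaches total_cells."""
    levels, capacity = 0, 1
    while capacity < total_cells:
        capacity *= cells_per_level
        levels += 1
    return levels


target = 2 ** 32  # about 4.3 billion lowest-level cells
byte_levels = levels_needed(target, 256)  # one byte per level
hex_levels = levels_needed(target, 16)    # one hex digit per level
```

Under these assumptions the byte-per-level scheme needs 4 levels and the hex-digit scheme needs 8, since two hex digits carry the same information as one byte.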

I could be wrong. Maybe having the number of index levels match the map zoom levels is more important. Please correct me. I'm planning to try this instead of hex if just one (other) person here finds this meaningful :)

Or a solution that has fewer large cells (16) at the top but more (128, 256) per level as we go down the hierarchy. Any thoughts?

e.g.:

  • [0-15][0-31][0-63][0-127][0-255] gives 1G low level cells with 5 indices with log2 decrement in size.
  • [0-15][0-63][0-255][0-255][0-255] gives 16G low level cells with 5 indices.
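The cell counts in these two layouts are the products of the per-level fan-outs. The sketch below checks them and shows one possible way to pack a path into a byte string with one byte per level; both helper names are hypothetical, and the packing is an assumption rather than anything GeoModel does:

```python
from functools import reduce
from operator import mul


def total_cells(radices):
    """Number of lowest-level cells for a hierarchy with the given fan-out per level."""
    return reduce(mul, radices, 1)


def encode_path(path, radices):
    """Pack one child index per level into a byte string (one byte per level)."""
    if len(path) != len(radices):
        raise ValueError('path needs one index per level')
    for index, radix in zip(path, radices):
        if not 0 <= index < radix:
            raise ValueError('index %d out of range for radix %d' % (index, radix))
    return bytes(bytearray(path))


graded = (16, 32, 64, 128, 256)          # fan-out doubles at each level
coarse_top = (16, 64, 256, 256, 256)     # coarse at the top, fine below

graded_cells = total_cells(graded)
coarse_top_cells = total_cells(coarse_top)
deepest = encode_path((15, 31, 63, 127, 255), graded)
```

Prefixes of the packed byte string would then play the same role as the hex prefixes in the original scheme, one list entry per level.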
