
App Engine: 13 StringPropertys vs. 1 StringListProperty (w.r.t. indexing/storage and query performance)

A bit of background first: GeoModel is a library I wrote that adds very basic geospatial indexing and querying functionality to App Engine apps. It is similar in approach to geohashing. The equivalent location hash in GeoModel is called a 'geocell.'

Currently, the GeoModel library adds 13 properties (location_geocell_n, n = 1..13) to each location-aware entity. For example, an entity can have property values such as:

location_geocell_1 = 'a'
location_geocell_2 = 'a3'
location_geocell_3 = 'a3f'
...
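The 13 per-level values are simply prefixes of the finest-resolution geocell, so they can all be derived from one string. A minimal sketch (the helper name is mine, not part of GeoModel):

```python
def geocell_prefixes(geocell):
    """Return every resolution prefix of a full geocell string.

    For 'a3f' this yields the values stored in
    location_geocell_1 .. location_geocell_3.
    """
    return [geocell[:i] for i in range(1, len(geocell) + 1)]

print(geocell_prefixes('a3f'))  # ['a', 'a3', 'a3f']
```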

This is required in order to not use up an inequality filter during spatial queries.

The problem with the 13-properties approach is that, for any geo query an app would like to run, 13 new indexes must be defined and built. This is definitely a maintenance hassle, as I've just painfully realized while rewriting the demo app for the project. This leads to my first question:

QUESTION 1: Is there any significant storage overhead per index? I.e., if I have 13 indexes with n entities in each, versus 1 index with 13n entities in it, is the former much worse than the latter in terms of storage?

It seems like the answer to (1) is no, per this article, but I'd just like to see if anyone has had a different experience.

Now, I'm considering adjusting the GeoModel library so that instead of 13 string properties, there'd be only one StringListProperty called location_geocells, i.e.:

location_geocells = ['a', 'a3', 'a3f']

This results in a much cleaner index.yaml. But I do question the performance implications:

QUESTION 2: If I switch from 13 string properties to 1 StringListProperty, will query performance be adversely affected? My current filter looks like:

query.filter('location_geocell_%d =' % len(search_cell), search_cell)

and the new filter would look like:

query.filter('location_geocells =', search_cell)

Note that the first query has a search space of n entities, whereas the second query has a search space of 13n entities.

It seems like the answer to (2) is that both result in equal query performance, per tip #6 in this blog post, but again, I'd like to see if anyone has any differing real-world experiences with this.
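One way to see why the two should perform alike: conceptually, the datastore indexes a list property as one index row per (value, entity key) pair, and an equality query scans only the contiguous run of rows for that value. A toy model of this idea in plain Python (not the datastore API) under that assumption:

```python
import bisect

def build_index(entities):
    """One sorted index row per (geocell value, entity key) pair,
    mimicking how a list property is indexed."""
    return sorted((cell, key)
                  for key, cells in entities.items()
                  for cell in cells)

def query_geocell(index_rows, search_cell):
    """Equality scan: binary-search to the run of matching rows.
    Cost tracks the number of matches, not total index size."""
    lo = bisect.bisect_left(index_rows, (search_cell, ''))
    hi = bisect.bisect_right(index_rows, (search_cell, '\U0010ffff'))
    return [key for _, key in index_rows[lo:hi]]

rows = build_index({'e1': ['a', 'a3', 'a3f'],
                    'e2': ['a', 'a4'],
                    'e3': ['b']})
print(query_geocell(rows, 'a3'))  # ['e1']
```

So even though the single list-property index holds roughly 13n rows, a query for one search_cell touches only the rows that actually match, which is the same work as scanning the matching rows of one small per-level index.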

Lastly, if anyone has any other suggestions or tips that can help improve storage utilization, query performance and/or ease of use (specifically wrt index.yaml), please do let me know! The source can be found here: geomodel & geomodel.py

You're correct that there's no significant overhead per-index - 13n entries in one index is more or less equivalent to n entries in 13 indexes. There's a total index count limit of 100, though, so this eats up a fair chunk of your available indexes.

That said, using a ListProperty is definitely a far superior approach from usability and index consumption perspectives. There is, as you supposed, no performance difference between querying a small index and a much larger index, supposing both queries return the same number of rows.

The only reason I can think of for using separate properties is if you knew you only needed to index on certain levels of detail - but that could be accomplished better at insert-time by specifying the levels of detail you want added to the list in the first place.

Note that in either case you only need the indexes if you intend to query the geocell properties in conjunction with a sort order or inequality filter, though - in all other cases, the automatic indexing will suffice.
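To illustrate that last point (kind and property names here are hypothetical, not from GeoModel): with the list approach, a composite entry in index.yaml is only required for queries that combine a geocell equality filter with a sort order or inequality, e.g.:

```yaml
# index.yaml (hypothetical kind/property names) - needed only when a
# geocell equality filter is combined with a sort order or inequality;
# plain equality queries are served by the automatic single-property index.
indexes:
- kind: Store
  properties:
  - name: location_geocells
  - name: created
    direction: desc
```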

Lastly, if anyone has any other suggestions or tips that can help improve storage utilization, query performance and/or ease of use

The StringListProperty is the way to go for the reasons mentioned above, but in actual usage you might want to add the geocells to your own pre-existing StringListProperty, so that you can query against multiple properties at once.

So, if you were to provide a lower-level API, it could work with full-text search implementations like Bill Katz's:

def point2StringList(Point, stub="blah"):
    ...
    return ["blah_1:a", "blah_2:a3", "blah_3:a3f"]  # one entry per level

def boundingbox2Wheresnippet(Box, stringlist="words", stub="blah"):
    ...
    return "words='%s_3:a3f' AND words='%s_3:b4g'" % (stub, stub)

etc.

Looks like you ended up with 13 indices because you encoded in hex (for human readability / map levels?). If you had utilized the full potential of a byte (ByteString), you'd have had 256 cells instead of 16 per character (byte), thereby reducing to far fewer indices for the same precision.

ByteString is just a subclass of str and is indexed the same way if it is less than 500 bytes in length.

However, the number of levels needed might be lower than you think; to me, 4 or 5 levels is practically good enough for most situations on Earth. For a larger planet, or when cataloging each grain of sand, more subdivisions would need to be introduced regardless of the encoding used. In either case, ByteString is better than hex encoding, and helps reduce indexing substantially.

  • To represent 4 billion lowest-level cells, all we need is 4 bytes, or just 4 indices (from basic computer architecture / memory addressing).
  • To represent the same in hex, we'd need 8 hex digits, or 8 indices.
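The arithmetic behind those bullet points can be checked with a small sketch (an integer loop, to avoid floating-point log rounding):

```python
def levels_needed(total_cells, branching):
    """Number of index levels (characters) needed to address
    `total_cells` lowest-level cells, when each character
    subdivides a cell into `branching` children."""
    levels, capacity = 0, 1
    while capacity < total_cells:
        capacity *= branching
        levels += 1
    return levels

print(levels_needed(2 ** 32, 256))  # 4 levels with full bytes
print(levels_needed(2 ** 32, 16))   # 8 levels with hex digits
```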

I could be wrong. Maybe having the number of index levels match map zoom levels is more important. Please correct me. I'm planning to try this instead of hex if just one (other) person here finds it meaningful :)

Or a solution that has fewer large cells (16) at the top but more (128, 256) as we go down the hierarchy. Any thoughts?

eg:

  • [0-15][0-31][0-63][0-127][0-255] gives 1G low level cells with 5 indices with log2 decrement in size.
  • [0-15][0-63][0-255][0-255][0-255] gives 16G low level cells with 5 indices.
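Those counts check out; a quick sketch (each list entry is the branching factor at one level of the hierarchy):

```python
from functools import reduce
from operator import mul

def leaf_cells(branchings):
    """Total lowest-level cells for a per-level branching scheme."""
    return reduce(mul, branchings, 1)

print(leaf_cells([16, 32, 64, 128, 256]) == 2 ** 30)   # True: 1G cells, 5 levels
print(leaf_cells([16, 64, 256, 256, 256]) == 2 ** 34)  # True: 16G cells, 5 levels
```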
