Spring Data GemFire custom partition and performance

We are using Spring Data GemFire server, client and locator. All of our GemFire PARTITION Regions have complex keys.

For example:

class Key { 
  String id1;
  String id2;
  Date date;
}

We would like to create a custom partition based on this entire key. In the getObject() method we are planning to return a | delimited string of these 3 fields.

Is this a best practice or is there any other way to return the object?

We are also planning to create key indexes. In this case we would have to create indexes individually on Key.id1, Key.id2 and Key.date, as our searches will be based on the key dates and the key id1 and id2.

Is this a right way to create the key index for improving the performance?

Based on the GemFire documentation, we are planning to use Functions to improve performance, passing a filter argument so that the search happens in a specific partition.

Do we just need to send the complex key object in the filter set, or should we pass whatever our partition logic returns from getObject()?

First of all, this problem is independent of whether you started your GemFire (data) servers using Spring Data GemFire (SDG) or some other way, such as Gfsh. Having said that, there are significant advantages to using Spring, and specifically SDG, to bootstrap and configure your servers, Locators, and clients. I simply wanted to make this distinction for other interested readers.

By the getObject() method, I assume you are actually referring to PartitionResolver.getRoutingObject()? See the Javadoc.

In general, I'd say it is nearly always preferable to use simple, scalar types as keys in your Regions, such as Long, Integer, String, etc. Most searching should be based on the value, or properties of the value (i.e. the Object), rather than individual components (e.g. id1) of the key.

Additionally, I will also point out that I disagree with the PartitionResolver Javadoc, bullet #1, where it states, "The key class can implement the PartitionResolver interface to enable custom partitioning". I think this is a naive approach for many reasons, not the least of which is that it couples your key class to GemFire. You should always prefer #2 when a PartitionResolver is needed.

But is a PartitionResolver actually needed in your case?

Since your "entire" key defines the "route" (i.e. all properties [id1, id2, date] of the Key class), you don't really need to involve a custom PartitionResolver at all.

All you need to do is provide a proper implementation of the equals(Object) and hashCode() methods in your Key class.
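A minimal sketch of what such a Key class could look like follows. The field names come from the question; the constructor, the accessors, and the choice of java.io.Serializable are my assumptions for illustration (GemFire also supports other serialization mechanisms). What matters is that equality and the hash are derived from all 3 fields and that the hash is deterministic across JVMs, which holds for String and Date.

```java
import java.io.Serializable;
import java.util.Date;
import java.util.Objects;

// Sketch of the question's Key class with value-based equality.
// Constructor, accessors and Serializable are illustrative assumptions.
final class Key implements Serializable {

  private final String id1;
  private final String id2;
  private final Date date;

  Key(String id1, String id2, Date date) {
    this.id1 = id1;
    this.id2 = id2;
    // Defensive copy, because java.util.Date is mutable.
    this.date = new Date(date.getTime());
  }

  String getId1() { return id1; }

  String getId2() { return id2; }

  Date getDate() { return new Date(date.getTime()); }

  @Override
  public boolean equals(Object obj) {
    if (this == obj) {
      return true;
    }
    if (!(obj instanceof Key)) {
      return false;
    }
    Key that = (Key) obj;
    // All 3 fields take part, so the entire key defines identity.
    return Objects.equals(id1, that.id1)
        && Objects.equals(id2, that.id2)
        && Objects.equals(date, that.date);
  }

  @Override
  public int hashCode() {
    // Must be consistent with equals and identical on every JVM.
    return Objects.hash(id1, id2, date);
  }
}
```

Keeping the fields final also prevents a key from changing its hash after it has been stored in the Region.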

TIP: Keep in mind that GemFire Regions, at a basic, fundamental level, are simply a java.util.Map, a key-value data structure. Yes, they are distributed (in most cases) as well as partitioned for PARTITION Regions, but a Region is fundamentally based on a Map and the "hash" of your key. If your entire key defines the partition (or route), then no custom PartitionResolver is necessary.

TIP: Furthermore, a PARTITION Region is a logical Region that is divided up into 113 buckets (by default, ignoring primaries & secondaries for a moment), and those buckets are distributed across the (data-hosting) servers in your cluster. This makes the Region physically dispersed, assuming your servers are individual processes on separate machines. This is what constitutes a "logical" Region, because to your application it is simply 1 holistic data structure.
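To make the "hash of your key" idea concrete, here is a simplified model of how a routing object could be mapped to one of the buckets. This is only a sketch: the class and method names are hypothetical, and the real bucket selection is internal to GemFire, so treat the modulo step as an assumption rather than the actual implementation.

```java
// Simplified model of bucket selection. The names here are hypothetical and
// the real algorithm is internal to GemFire.
final class BucketModel {

  // Default number of buckets in a PARTITION Region, as stated above.
  static final int DEFAULT_TOTAL_BUCKETS = 113;

  // Maps a routing object (the key itself when no PartitionResolver is
  // configured) to a bucket id in the range [0, totalBuckets).
  static int bucketFor(Object routingObject, int totalBuckets) {
    // floorMod keeps the result non-negative even for negative hash codes.
    return Math.floorMod(routingObject.hashCode(), totalBuckets);
  }
}
```

Under this model, two routing objects that are equal always land in the same bucket, which is why a correct hashCode() is all the routing needs.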

You would implement a custom PartitionResolver if only a portion of the key was used to determine the partition (or route) of the key/value pairing. This is useful if you want to group certain key/value pairings together, at the same physical location (i.e. server/process & machine in the cluster).

For example, suppose you want to group similar key/value pairings based on the date of your key. Then...

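// Package names assume GemFire 9+ / Apache Geode; older releases use com.gemstone.gemfire.cache.
import org.apache.geode.cache.EntryOperation;
import org.apache.geode.cache.PartitionResolver;

class KeyDatePartitionResolver implements PartitionResolver<Key, Object> {

  @Override
  public String getName() {
    return getClass().getName();
  }

  // Route on the date alone; assumes Key exposes a getDate() accessor.
  @Override
  public Object getRoutingObject(EntryOperation<Key, Object> entryOp) {
    Key key = entryOp.getKey();
    return key.getDate();
  }

  @Override
  public void close() { } // no resources to release
}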

Now all entries (key/values) whose keys carry the same date would be routed to the same partition (or bucket) in the logical PARTITION Region. Note that two java.util.Date instances are equal only if they match to the millisecond, so in practice you would usually truncate the date and route the key/value pairings based on year/month/day or simply year/month, however you choose. Again, all that matters is that the Object returned from the getRoutingObject(..) method in your custom PartitionResolver implements the equals(Object) and hashCode() methods. Java's java.util.Date class (Javadoc) does.
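As a sketch of routing by year/month/day, the helper below truncates a Date to its calendar day. The class name, the method name, and the choice of UTC are my assumptions; a resolver's getRoutingObject(..) could return the result of such a helper instead of the raw Date, since java.time.LocalDate implements both equals(Object) and hashCode().

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.Date;

// Hypothetical helper for day-level routing; UTC is an assumed convention.
final class DayRouting {

  // Truncates a Date to its calendar day in UTC, so every timestamp within
  // the same UTC day yields an equal routing object.
  static LocalDate routingDay(Date date) {
    return Instant.ofEpochMilli(date.getTime())
        .atZone(ZoneOffset.UTC)
        .toLocalDate();
  }
}
```

Fixing the time zone matters: if servers derived the day from their local zone, the same key could resolve to different routing objects on different machines.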

Regarding...

"Is this a right way to create the key index for improving the performance?"

Well, it depends on your application search cases. Are your search cases for certain values based on the components (i.e. [id1, id2, date]) of the key collectively or individually?

For example, if you search by the combinations [id1, date] as well as [id2, date], then you would create 2 (KEY) Indexes with these fields from the Key class. If you searched by all 3 fields [id1, id2, date], then your (KEY) Index would include all 3 fields. If you searched by all 3 combinations, then you would (generally) need all 3 KEY Indexes for optimal performance.

Essentially, a field or combination of fields used in a query predicate expression should be indexed for potentially more optimal performance.

There is no guarantee, though. Remember, when values change (are added, updated, removed, etc.), Indexes need to be updated to some degree. Therefore, there are "maintenance costs" associated with Indexes, and the more you have, the more it can potentially cost.

You also have to weigh the number of key/value pairings against whether an Index is warranted at all. If the data is mostly referential in nature, with a relatively small data set (e.g. < 1000 entries, perhaps), then sometimes a full scan can still be more efficient than using an Index. A full scan is equivalent to a full table scan in an RDBMS. Just remember, Indexes are not free. They take up space (memory) and time (CPU) to maintain.

I'd also say it is generally better to (again) use simple keys and maintain "searchable" state in the values associated with the keys. This boils down to design preference, though. Use (simple) keys for partitioning/routing.

For additional (and relevant) information, see: here , here , here , and here .

Lastly, regarding Functions, the filter is a set of "keys" (Javadoc). The keys are used to find, or route to, the (bucket of the) partition in the logical PARTITION Region.

If you also configured a custom PartitionResolver with the PARTITION Region, I believe it will also apply the resolver to the filter (the set of keys) passed to the Function when the Function is executed.

Either way, you simply pass the entire key, which in your case is an instance of your Key class. You can pass multiple instances (hence, the "Set") depending on which keys you want to filter by.
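The following sketch models that behavior in plain Java, without the GemFire API. Everything in it is hypothetical (the class, the method, the router function standing in for a PartitionResolver, and the modulo bucket selection); it only illustrates that each key in the filter is first turned into a routing object, and that keys sharing a routing object target the same bucket.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;

// Hypothetical model of how a Function filter could resolve to buckets.
// The router stands in for PartitionResolver.getRoutingObject(..).
final class FilterModel {

  static <K> Set<Integer> targetBuckets(
      Set<K> filter, Function<? super K, ?> router, int totalBuckets) {

    Set<Integer> buckets = new TreeSet<>();
    for (K key : filter) {
      // The resolver is applied to each filter key, not to a precomputed value.
      Object routingObject = router.apply(key);
      // Assumed bucket selection; the real one is internal to GemFire.
      buckets.add(Math.floorMod(routingObject.hashCode(), totalBuckets));
    }
    return buckets;
  }
}
```

With the real API, the same Set of keys is what you would hand to FunctionService.onRegion(region).withFilter(filter) before calling execute(..); you would not compute the routing objects yourself.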

Anyway, I hope this all makes sense.

As always, when these sorts of questions are asked, the answer varies significantly based on your use case (or data access patterns), requirements, and data set. The proper thing to do here is to try things and test.

Good luck!

