
Implementing a Mondrian shared SegmentCache

I am trying to implement a Mondrian SegmentCache. The cache is to be shared by multiple JVMs running the Mondrian library. We are using Redis as the backing store; however, for the purposes of this question, any persistent key-value store should be fine.

Will the Stack Overflow community help complete this implementation? The documentation and Google searches are not yielding enough detail. Here we go:

import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import java.util

import com.redis.RedisClient
import mondrian.spi.{SegmentBody, SegmentCache, SegmentHeader}
import mondrian.spi.SegmentCache.SegmentCacheListener

new SegmentCache {

    private val logger = Logger("my-segment-cache")
    import logger._

    import com.redis.serialization.Parse
    import Parse.Implicits.parseByteArray
    private def redis = new RedisClient("localhost", 6379)

    def get(header: SegmentHeader): SegmentBody = {
        val result = redis.get[Array[Byte]](header.getUniqueID) map { bytes ⇒
            val st = new ByteArrayInputStream(bytes)
            val o = new ObjectInputStream(st)
            try o.readObject.asInstanceOf[SegmentBody] finally o.close()
        }
        info(s"cache get\nHEADER $header\nRESULT $result")
        result.orNull
    }

    def getSegmentHeaders: util.List[SegmentHeader] = ???

    def put(header: SegmentHeader, body: SegmentBody): Boolean = {
        info(s"cache put\nHEADER $header\nBODY $body")
        val s = new ByteArrayOutputStream
        val o = new ObjectOutputStream(s)
        o.writeObject(body)
        o.close()
        redis.set(header.getUniqueID, s.toByteArray)
        true
    }

    def remove(header: SegmentHeader): Boolean = ???

    def tearDown() {}

    def addListener(listener: SegmentCacheListener) {}

    def removeListener(listener: SegmentCacheListener) {}

    def supportsRichIndex(): Boolean = true
}

Some immediate questions:

  • is SegmentHeader.getUniqueID the appropriate key to use in the cache?
  • how should getSegmentHeaders be implemented? The current implementation above just throws an exception and doesn't ever seem to be called by Mondrian. How do we make the SegmentCache re-use existing cache records on startup?
  • how are addListener and removeListener meant to be used? I assume they have something to do with coordinating cache changes across nodes sharing the cache. But how?
  • what should supportsRichIndex return? In general, how does someone implementing a SegmentCache know what value to return?

I feel like these are basic issues that should be covered in the documentation, but they are not (as far as I can find). Perhaps we can correct the lack of available information here. Thanks!

is SegmentHeader.getUniqueID the appropriate key to use in the cache?

Yes and no. The UUID is convenient on systems like memcached, where everything boils down to a key/value match. If you use the UUID, you'll need to have supportsRichIndex() return false. The reason is that excluded regions are not part of the UUID. That's by design, for good reasons.

What we recommend is an implementation that serializes the SegmentHeader (it implements Serializable and overrides hashCode() and equals()) and uses the result directly as a binary key that you propagate, so that it retains the invalidated regions and keeps everything nicely in sync.
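The serialize-the-header approach can be sketched with two small helpers. These names (`toBytes`, `fromBytes`) are mine, not Mondrian's; any Serializable value, such as a SegmentHeader, can be round-tripped this way:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical helpers (not part of Mondrian): turn any Serializable value,
// e.g. a SegmentHeader, into a byte-array key and back. Because SegmentHeader
// overrides equals()/hashCode(), a deserialized header compares equal to the
// original, so the full header (excluded regions included) can act as the key.
def toBytes(value: java.io.Serializable): Array[Byte] = {
  val bos = new ByteArrayOutputStream
  val oos = new ObjectOutputStream(bos)
  try oos.writeObject(value) finally oos.close()
  bos.toByteArray
}

def fromBytes[T](bytes: Array[Byte]): T = {
  val ois = new ObjectInputStream(new ByteArrayInputStream(bytes))
  try ois.readObject().asInstanceOf[T] finally ois.close()
}
```

One caveat with using the raw bytes as a Redis key: Java serialization is not guaranteed to be byte-stable for every object graph, so when matching headers it is safer to deserialize and compare with equals() than to compare the byte arrays directly.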

You should look at how we've implemented it in the default in-memory cache.

There is also an implementation using Hazelcast.

We at Pentaho have also used Infinispan with great success.

how should getSegmentHeaders be implemented?

Again, take a look at the default in-memory implementation. You simply need to return the list of all the currently known SegmentHeaders. If you can't provide that list for whatever reason, either because you've stored only the UUIDs or because your storage backend can't enumerate its keys (as with memcached), return an empty list. Mondrian then won't be able to use in-memory rollup and won't be able to share the segments, unless it hits the right UUIDs in cache.
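Conceptually, getSegmentHeaders just enumerates the store. A self-contained sketch, where a concurrent map stands in for the real backing store and the SegmentHeader/SegmentBody classes below are simplified placeholders rather than Mondrian's real mondrian.spi types:

```scala
import java.util
import scala.collection.JavaConverters._
import scala.collection.concurrent.TrieMap

// Simplified stand-ins for mondrian.spi types, just to keep the sketch self-contained.
case class SegmentHeader(uniqueId: String) extends Serializable
class SegmentBody extends Serializable

// A concurrent map stands in for the real backing store (Redis, Hazelcast, ...).
val store = TrieMap.empty[SegmentHeader, SegmentBody]

// getSegmentHeaders enumerates every header the store currently knows about.
// If the backend cannot enumerate its keys (memcached), return an empty list instead.
def getSegmentHeaders: util.List[SegmentHeader] =
  new util.ArrayList[SegmentHeader](store.keys.toSeq.asJava)
```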

how are addListener and removeListener meant to be used?

Mondrian needs to be notified when new elements appear in the cache; these could have been created by other nodes. Mondrian maintains an index of all the segments it should know about (thus enabling in-memory operations), and the listeners are the mechanism for propagating updates to that index. You need to bridge the backend with the Mondrian instances here. Take a look at how the Hazelcast implementation does it.

The idea behind this is that Mondrian maintains a spatial index of the currently known cells and will only query the necessary/missing cells from SQL if it absolutely must. This is necessary to achieve greater scalability: fetching cells from SQL is extremely slow compared to fetching objects maintained in an in-memory data grid.
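The bridging pattern can be sketched as follows. The CacheListener trait below is a simplified placeholder for mondrian.spi.SegmentCache.SegmentCacheListener (the real interface receives a richer event object); the wiring, not the exact signatures, is the point:

```scala
import java.util.concurrent.CopyOnWriteArrayList
import scala.collection.JavaConverters._

// Simplified placeholder for Mondrian's SegmentCacheListener; the real interface
// is handed a SegmentCacheEvent describing the affected segment.
trait CacheListener { def entryCreated(headerKey: String): Unit }

// Listeners registered by the local Mondrian instance via addListener.
val listeners = new CopyOnWriteArrayList[CacheListener]

def addListener(l: CacheListener): Unit    = { listeners.add(l); () }
def removeListener(l: CacheListener): Unit = { listeners.remove(l); () }

// Invoke this from the backend's change feed (a Redis pub/sub subscriber, a
// Hazelcast entry listener, ...) so every local Mondrian instance hears about
// segments that other nodes have just written.
def onRemoteEntryCreated(headerKey: String): Unit =
  listeners.asScala.foreach(_.entryCreated(headerKey))
```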

How do we make the SegmentCache re-use existing cache records on startup?

This is a caveat. Currently it is possible by applying this patch. The patch wasn't ported to the master codeline because it is a mess and is tangled with the fixes for another case. It has been reported to work, but it wasn't tested internally by us. The relevant code is about here. If you get around to testing this, we always welcome contributions. Let us know on the mailing list if you're interested. There are a ton of people who will gladly help.

One workaround is to update the local index through the listener when your cache implementation starts.
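That workaround amounts to a startup replay: enumerate whatever the backend already holds and push each entry through the same notification path a live "entry created" event would take, so the local index learns about pre-existing segments. All names below are illustrative, not Mondrian's API:

```scala
import scala.collection.concurrent.TrieMap

// Illustrative: a string key per stored header, and a local index of known segments.
type HeaderKey = String
val knownIndex = TrieMap.empty[HeaderKey, Boolean]

// On startup, replay every key already present in the backing store (the iterator
// stands in for e.g. a Redis KEYS/SCAN pass) into the local index, firing the same
// notification callback that a live remote-entry-created event would fire.
def warmUpIndex(existingKeys: Iterator[HeaderKey],
                notify: HeaderKey => Unit): Unit =
  existingKeys.foreach { key =>
    knownIndex.put(key, true) // record it locally
    notify(key)               // tell Mondrian's listener machinery about it
  }
```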
