
Inverted index with ehcache

Original post: 2020-05-13 07:01:25. Tags: java, indexing, ehcache, terracotta, inverted-index

Let's say I want to create an inverted index on a document with 4 unique words in it. It will look like word1 -> document, word2 -> document, word3 -> document, word4 -> document. Using a size-limited ehcache cache along with a Terracotta cluster, I can put all four associations separately in the cache.

But here's what I'm wondering about: would the cache maintain one copy of the document, or would it store four of them? My guess is it'd be four serialised copies (which is undesirable in my case). If that's true, what's a better way to do this?

1 answer

You are correct that any storage layer in Ehcache, with the exception of the in-memory one, will use a serialized version of the value, so your document will effectively be duplicated.

As suggested in a comment, you could add a level of indirection between the words and the document. You could also store only an ID in the cache and have the document live elsewhere.
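To illustrate the indirection idea, here is a minimal sketch in plain Java. It is not Ehcache code: the two `HashMap` fields are stand-ins for two caches (or for one cache plus an external document store), and the class name, method names, and the `doc-1` ID are all hypothetical. The point is only that each word maps to a small document ID, so the document body is stored once.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: word -> document ID -> document.
// The maps stand in for caches; they are NOT the Ehcache API.
class IndirectInvertedIndex {
    // First level: each word points at the IDs of the documents containing it.
    private final Map<String, Set<String>> wordToDocIds = new HashMap<>();
    // Second level: each document is stored exactly once, keyed by its ID.
    private final Map<String, String> docIdToDocument = new HashMap<>();

    void index(String docId, String document) {
        docIdToDocument.put(docId, document);
        for (String word : document.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                wordToDocIds.computeIfAbsent(word, w -> new HashSet<>()).add(docId);
            }
        }
    }

    // Resolves a word to documents by going through the ID level.
    Set<String> lookup(String word) {
        Set<String> result = new HashSet<>();
        Set<String> ids = wordToDocIds.getOrDefault(word.toLowerCase(),
                Collections.<String>emptySet());
        for (String id : ids) {
            String doc = docIdToDocument.get(id);
            if (doc != null) {   // with a real cache, the entry may have been evicted
                result.add(doc);
            }
        }
        return result;
    }

    int storedDocumentCount() {
        return docIdToDocument.size();
    }

    public static void main(String[] args) {
        IndirectInvertedIndex idx = new IndirectInvertedIndex();
        idx.index("doc-1", "word1 word2 word3 word4");
        System.out.println(idx.lookup("word3"));
        System.out.println(idx.storedDocumentCount());
    }
}
```

The trade-off is an extra lookup per query, and with a size-limited cache the ID-to-document entry can be evicted independently of the word entries, so the lookup has to tolerate a missing document.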

What is clear is that with direct mappings you should not rely on modifications made to the document through one mapping being visible through the other mappings. That would be abusing the cache.
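The following sketch shows why such modifications do not propagate. It assumes the storage layer behaves like plain Java serialization: bytes are written on put and a fresh object is rebuilt on get. The `Document` class, the `Map<String, byte[]>` store, and all method names are hypothetical stand-ins, not Ehcache types.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical simulation of a serializing store; not the Ehcache API.
class SerializedCopiesDemo {
    static class Document implements Serializable {
        private static final long serialVersionUID = 1L;
        String text;
        Document(String text) { this.text = text; }
    }

    static byte[] serialize(Document d) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(d);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static Document deserialize(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (Document) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns true only if an edit made through "word1" shows up through "word2".
    static boolean editVisibleThroughOtherMapping() {
        Document original = new Document("v1");
        Map<String, byte[]> store = new HashMap<>();
        // Two direct mappings to the same document: each put writes its own bytes.
        store.put("word1", serialize(original));
        store.put("word2", serialize(original));

        // Modify the document obtained through word1 and write it back.
        Document viaWord1 = deserialize(store.get("word1"));
        viaWord1.text = "v2";
        store.put("word1", serialize(viaWord1));

        // The copy held under word2 was never touched.
        return "v2".equals(deserialize(store.get("word2")).text);
    }

    public static void main(String[] args) {
        System.out.println(editVisibleThroughOtherMapping());
    }
}
```

Under these assumptions each mapping holds its own byte array, so an update through one key leaves the others stale unless every mapping is rewritten.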