How would an LRU cache work for a trie data structure?

Let's say I have a trie/prefix trie with a total limit of 10 nodes. I'm limiting it to 10 nodes to simulate memory being exceeded. (If I cannot load the entire tree into memory, the remaining nodes, i.e. total minus 10, are stored on disk.)

I now insert a new string into the trie that will cause the tree to exceed the 10 node limit, so now it's time for the LRU cache to evict the least recently accessed node from the trie.

Let's say the tree contains the words hello, help, and hi, and the LRU node is "h". This would mean I need to delete "h" from the trie, which in this case will delete the entire tree. My confusion lies in also updating the cache itself to delete all the children. How does this work in this case?

I assume the cache has nodes like "h", "he", "hel", "help", etc. If I delete the "h" node, I assume I need to delete everything in the cache prefixed with "h"? My entire assumption seems really inefficient.
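To make the scenario concrete, the structure described above can be sketched roughly as follows. This is a hypothetical illustration, not code from the question: `LRUTrie`, `TrieNode`, keying the LRU structure by prefix, and evicting whole subtrees are all assumptions made up for this sketch.

```python
from collections import OrderedDict

class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_word = False

class LRUTrie:
    """Sketch: a capacity-limited trie whose nodes are tracked in an
    LRU structure keyed by prefix. Evicting a node drops its subtree."""

    def __init__(self, capacity):
        self.root = TrieNode()
        self.capacity = capacity
        self.lru = OrderedDict()  # prefix -> node, least recently used first

    def insert(self, word):
        node = self.root
        for i, ch in enumerate(word):
            prefix = word[:i + 1]
            if ch not in node.children:
                node.children[ch] = TrieNode()
                self.lru[prefix] = node.children[ch]
            node = node.children[ch]
            self.lru.move_to_end(prefix)  # touch: mark as recently used
        node.is_word = True
        while len(self.lru) > self.capacity:
            self._evict()

    def _evict(self):
        # Drop the least recently used prefix and its entire subtree.
        prefix, _ = self.lru.popitem(last=False)
        # Also drop every cached prefix underneath it; this full scan of
        # the cache is exactly the inefficiency the question points out.
        for p in [p for p in self.lru if p.startswith(prefix)]:
            del self.lru[p]
        # Unlink the subtree from its parent node.
        parent = self.root
        for ch in prefix[:-1]:
            parent = parent.children.get(ch)
            if parent is None:
                return
        parent.children.pop(prefix[-1], None)

t = LRUTrie(capacity=10)
for w in ["hello", "help", "hi"]:
    t.insert(w)
t.lru.move_to_end("h", last=False)  # force "h" to be the LRU entry
t._evict()                          # evicting "h" deletes the whole tree
print(len(t.lru), len(t.root.children))  # 0 0
```

As the question suspects, evicting "h" here wipes out everything, and the eviction itself has to scan the whole cache for matching prefixes.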

One thing to keep in mind when talking about a cache is that it is a redundant data structure, whose only goal is to speed up data fetches.
So, when a piece of data is evicted from the cache, it has no consequence (other than on execution speed) for the program which uses this data, because it will then be fetched from the main memory. So, in any case, your trie will have exactly the same behavior, regardless of which pieces of it are located in the cache.
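A tiny sketch of this point (the `CachedStore` name and the dict standing in for the slow backing store are hypothetical): evicting an entry never changes what `get` returns, only whether it takes the slow path.

```python
from collections import OrderedDict

class CachedStore:
    """Sketch: a small LRU cache in front of a slow backing store.
    Eviction loses no data; the next read just goes back to the store."""

    def __init__(self, backing, capacity):
        self.backing = backing          # stands in for disk / main memory
        self.capacity = capacity
        self.cache = OrderedDict()
        self.misses = 0

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)     # mark as most recently used
            return self.cache[key]
        self.misses += 1                    # slow path: fetch from the store
        value = self.backing[key]
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
        return value

store = {"h": 1, "he": 2, "hel": 3}
cs = CachedStore(store, capacity=2)
assert cs.get("h") == 1 and cs.get("he") == 2
cs.get("hel")            # evicts "h" (least recently used)
assert cs.get("h") == 1  # still the right answer, just a second miss
print(cs.misses)  # 4
```

The program's results are identical with or without the cache; only the miss count (i.e. speed) changes.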

This is very important, because it allows us to code in high-level languages, such as Java, without caring about the replacement policy of the cache implemented by the processor. If that were not the case, it would be a nightmare, because we would have to take into account every existing (and future?) replacement policy implemented in processors. And these policies are not as simple as LRU: caches are divided into sets and lines, whose behavior is closely tied to their physical structure, and the place a piece of data occupies in the cache depends on its address in main memory, which will not necessarily be the same for each execution of the code.

In short, the two things you mention (trie nodes in Java, and LRU cache replacement policies) are too far apart (one is very, very low-level programming, the other high-level). That is why we rarely, if ever, consider their interactions.
If you implement a trie in Java, your job is to make sure that it works well in all situations, that it is well designed so maintenance will be easier (or at least possible), and that it is readable so other programmers can work on it some day. Eventually, if it still runs too slowly, you can try to optimize it (after determining where the bottlenecks are, never before).
But if you wanted to link your trie to cache hits/misses and replacement policies, you would have to study the translation of your implementation into bytecode (done by the JVM).

PS: in your post, you talk of simulating memory being exceeded. There is no such thing for a program. When the cache is full, we fall back on the main memory. When the main memory is full, operating systems usually reserve a part of the hard drive to play the role of the main memory (we call this swapping, and when it happens, the computer is as good as frozen). When the swap is full, programs crash. All of them.
In the 'mind' of a program, the operating system gives it an absolutely gigantic amount of memory (which is virtual, but for the program it's as good as real), which will never be filled up. The program itself is not 'conscious' of the way memory is managed, or of the amount of memory left, for a lot of good reasons (security, guaranteeing that all programs get a fair share of the resources, ...).
