How to store a huge Markov chain on disk, while being able to query it without using too much RAM?

I am representing a Markov chain as a nested data structure: in Python, a dict of dicts of dicts. To illustrate, given the sentence 'this is purely an example, this is not serious.', I generate all consecutive pairs of tokens and record the tokens that follow each pair, along with their frequencies:

{',': {'this': {'is': 1}},
 'an': {'example': {',': 1}},
 'example': {',': {'this': 1}},
 'is': {'not': {'serious': 1}, 'purely': {'an': 1}},
 'not': {'serious': {'.': 1}},
 'purely': {'an': {'example': 1}},
 'this': {'is': {'not': 1, 'purely': 1}}}

Then I can query it using repeated item access. For example, I can see that 'this is' is followed by either 'not' or 'purely', each with frequency 1.
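As a minimal sketch of how such a structure could be built and queried: the function name `build_chain` and the regex tokenizer below are assumptions made for illustration, not the code actually used.

```python
import re
from collections import defaultdict


def build_chain(text):
    """Build a state-size-2 chain as chain[w1][w2][next_token] -> count."""
    # Hypothetical tokenizer: runs of word characters, or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    chain = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    # Slide a window of three tokens: two for the state, one for the follower.
    for w1, w2, nxt in zip(tokens, tokens[1:], tokens[2:]):
        chain[w1][w2][nxt] += 1
    return chain


chain = build_chain("this is purely an example, this is not serious.")
followers = dict(chain["this"]["is"])  # repeated item access
print(followers)
```

Every extra level of state adds another level of nested dicts, which is where the memory overhead comes from.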

In this contrived example the chain has a state size of 2, but I generate chains with state sizes of 3, 4, 5 and 6. The text corpus is also huge, and as a result the dictionary representing the chain takes tens of GB of RAM.

I have been investigating alternative ways to store the Markov chain on disk. I've considered Neo4J, but it does not appear very well suited to this specific use case. The same applies to Postgres' ltree structure. I've then settled on a simple table in a relational database, like the following (state size 4):

CREATE TABLE chain (
    w1       varchar(20),
    w2       varchar(20),
    w3       varchar(20),
    w4       varchar(20),
    children json,
    PRIMARY KEY(w1, w2, w3, w4)
);
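A rough sketch of how a table like this could be filled and queried from Python follows. It uses the standard-library `sqlite3` module with an in-memory database only so that it is self-contained. Since SQLite has no dedicated `json` column type, `children` is stored as JSON text here. The helpers `put_state` and `get_children` are hypothetical names.

```python
import json
import sqlite3

# Assumption: ":memory:" keeps the sketch self-contained; a real chain would
# use a file path so that the data lives on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE chain (
           w1 TEXT, w2 TEXT, w3 TEXT, w4 TEXT,
           children TEXT,
           PRIMARY KEY (w1, w2, w3, w4)
       )"""
)


def put_state(state, children):
    """Store the follower frequencies of a 4-word state as JSON text."""
    conn.execute(
        "INSERT OR REPLACE INTO chain VALUES (?, ?, ?, ?, ?)",
        (*state, json.dumps(children)),
    )


def get_children(state):
    """Fetch one state's followers; only that single row is loaded into RAM."""
    row = conn.execute(
        "SELECT children FROM chain WHERE w1=? AND w2=? AND w3=? AND w4=?",
        state,
    ).fetchone()
    return json.loads(row[0]) if row else {}


put_state(("this", "is", "purely", "an"), {"example": 1})
result = get_children(("this", "is", "purely", "an"))
print(result)
```

The primary key doubles as the lookup index, so a query touches one row rather than the whole chain.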

There's a performance tradeoff when constructing the structure, but since that cost is paid only once, it's acceptable.

Are there better ways to store big Markov chains on disk that allow querying without needing huge amounts of RAM?

A Markov process is in a sense a probabilistic state machine that satisfies the Markov property: the transition probabilities depend only on the current state, so you can start the state machine from any state and past events will not affect the probabilities.

So, you should store a state index, by which you will query, and a blob (or something more descriptive) that contains the states you can transition to and their probabilities.
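One possible reading of this, sketched with Python's standard-library `dbm` key-value store: the key is the state and the value is a JSON blob of next states and probabilities. The separator character, the helper names and the choice of `dbm` are all assumptions for illustration.

```python
import dbm
import json
import os
import tempfile

SEP = "\x1f"  # assumption: this separator never appears inside a token


def store_state(db, state, transitions):
    """Write one state -> {next_token: probability} blob to disk."""
    db[SEP.join(state).encode("utf-8")] = json.dumps(transitions).encode("utf-8")


def load_state(db, state):
    """Read the blob for a single state; nothing else is loaded into RAM."""
    key = SEP.join(state).encode("utf-8")
    return json.loads(db[key].decode("utf-8")) if key in db else {}


# A temporary directory is used only to keep the sketch self-contained.
with tempfile.TemporaryDirectory() as tmp:
    with dbm.open(os.path.join(tmp, "chain"), "c") as db:
        store_state(db, ("this", "is"), {"not": 0.5, "purely": 0.5})
        probs = load_state(db, ("this", "is"))
        missing = load_state(db, ("no", "such"))

print(probs)
```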

When building the state index, you should not just use an incrementing index, but rather some kind of binary-search-like scheme that makes sense in the domain of your machine learning application.

For example, you could have states 1000, 1100, 0100 and 0000 for "is", "not", "purely" and "this" (I am leaving out ",", "an" and "example" for simplicity). Then the state "this is" would be 0001, the first 00 denoting "this" and the second 01 denoting "is". Here I am assuming that "this is" contains the full state, i.e. that there will not be another "this is" in your data set. If there were, I believe that would be a breach of the Markov property or a flaw in your query logic (instead of bigrams, you should be querying something else).
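A small sketch of one way to pack word codes into a single integer key follows. The two-bit codes are hypothetical, chosen only so that "this is" packs to 0001 as in the example above. They are not the four-bit codes listed for the individual words, and a real vocabulary would need far more bits per word.

```python
# Assumption: hypothetical 2-bit codes for a 4-word vocabulary.
WORD_CODES = {"this": 0b00, "is": 0b01, "not": 0b10, "purely": 0b11}
BITS_PER_WORD = 2


def pack_state(words):
    """Concatenate the fixed-width word codes into one integer key."""
    key = 0
    for word in words:
        key = (key << BITS_PER_WORD) | WORD_CODES[word]
    return key


def unpack_state(key, n_words):
    """Recover the words from a packed key, most significant code first."""
    codes_to_words = {code: word for word, code in WORD_CODES.items()}
    mask = (1 << BITS_PER_WORD) - 1
    words = []
    for shift in range((n_words - 1) * BITS_PER_WORD, -1, -BITS_PER_WORD):
        words.append(codes_to_words[(key >> shift) & mask])
    return tuple(words)


key = pack_state(["this", "is"])
print(format(key, "04b"))
```

Because the first word occupies the most significant bits, states that share a prefix end up next to each other when the keys are sorted, which is what makes binary search or range scans over the index possible.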

Anyway, this should be RAM-efficient and could enable many kinds of search strategies.
