
How to efficiently implement an immutable graph of heterogeneous immutable objects in C++?

Out of curiosity, I am writing a parser for programming language text. Say I want to define a graph of tokens, immutable at runtime, as vertices/nodes. These are naturally of different types: some tokens are keywords, some are identifiers, and so on. However, they all share a common trait: each token in the graph points to another. This property lets the parser know what may follow a particular token, and so the graph defines the formal grammar of the language. My problem is that I stopped using C++ on a daily basis some years ago and have used a lot of higher-level languages since then, so my understanding of heap allocation, stack allocation and the like is completely fragmented. Alas, my C++ is rusty.

Still, I would like to climb the steep hill at once and set myself the goal of defining this graph in this imperative language in the most performant way. For instance, I want to avoid allocating each token object separately on the heap with 'new'. I think that allocating the entire graph of tokens back to back (linearly, like elements in an array) would benefit performance through locality of reference. That is, compacting the entire graph to take up minimal space along a 'line' in memory, rather than having its token objects at random locations, should be a plus? Anyway, as you can see, this is a rather open question.

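class token
{

};

class word: public token
{
public:
    const char* chars;

    word(const char* s): chars(s)
    {
    }
};

class ident: public token
{
    /// haven't thought about these details yet
};

template<int N> class composite_token: public token
{
public:
    // pointers rather than values: an array of token values would slice derived objects
    const token* tokens[N];
};

class graph
{
public:
    const token* p_root_token;
};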

The immediate question is: what would be the procedure to create this graph object? It is immutable and its intended structure is known at compile time, which is why I can and want to avoid copying things by value and so on. It should be possible to compose this graph out of literals? I hope I am making sense here... (wouldn't be the first time I didn't.) The graph will be used by the parser at runtime as part of a compiler. And although this is about C++, I would be happy with a C solution as well. Thank you very much in advance.

My C++ is rusty as well, so I probably don't know the best solution for this. But since nobody else stepped forward...

You are right that allocating all nodes in one block would give you the best locality. However, if you dynamically allocate the graph at program start, chances are that your heap allocations will also cluster closely together.

To allocate all nodes in a single memory block, two possibilities come to mind: create and populate a std::vector<> at startup (with the drawback that the graph information is then in memory twice), or use a static array initializer such as "Node graph[] = { ... };".

For either approach, the biggest obstacle is that you want to create your graph out of heterogeneous objects. One obvious solution is "Don't": you could make your node a superset of all possible fields and distinguish the types with an explicit 'type' member.
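
A minimal sketch of the "superset node" idea, combined with the static array initializer mentioned above. The names `NodeType`, `Node`, `kGraph` and `follow`, the three sample nodes, and the choice to store each edge as an array index are all assumptions made for illustration; none of them come from the question.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical node kinds; a real grammar would have many more.
enum class NodeType { Keyword, Identifier, End };

// One struct carries every field that any node kind might need.
struct Node {
    NodeType type;
    const char* text;   // used by Keyword nodes, nullptr otherwise
    std::size_t next;   // index of the node that may follow this one
};

// The whole graph is one contiguous, read-only block built from literals.
constexpr Node kGraph[] = {
    {NodeType::Keyword,    "let",   1},
    {NodeType::Identifier, nullptr, 2},
    {NodeType::End,        nullptr, 2},  // the end node points at itself
};

constexpr std::size_t kGraphSize = sizeof(kGraph) / sizeof(kGraph[0]);

// Follows the single outgoing edge of a node.
const Node& follow(const Node& node) {
    assert(node.next < kGraphSize);
    return kGraph[node.next];
}
```

Because the array is `constexpr`, nothing is allocated or copied at startup; the cost is that every node carries fields it may not use.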

If you want to keep the various node classes, you will have to use multiple arrays/vectors: one for each type.
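
One way this could look, as a rough sketch: each node type gets its own static array, and an edge names both the target array and the index within it. `Kind`, `NodeRef`, `WordNode`, `IdentNode` and the sample entries are hypothetical names chosen for this example.

```cpp
#include <cstddef>

// Which array an edge points into.
enum class Kind { Word, Ident };

// An edge: the target array plus the index inside it.
struct NodeRef {
    Kind kind;
    std::size_t index;
};

struct WordNode {
    const char* chars;
    NodeRef next;
};

struct IdentNode {
    NodeRef next;
};

// One contiguous array per node type.
constexpr WordNode kWords[] = {
    {"let", {Kind::Ident, 0}},  // "let" is followed by an identifier
    {"=",   {Kind::Ident, 1}},  // "=" is followed by another identifier
};

constexpr IdentNode kIdents[] = {
    {{Kind::Word, 1}},          // the first identifier is followed by "="
    {{Kind::Word, 0}},          // the second one loops back to "let"
};

// Follows one edge and returns the edge stored in the node it lands on.
constexpr NodeRef next_of(NodeRef ref) {
    return ref.kind == Kind::Word ? kWords[ref.index].next
                                  : kIdents[ref.index].next;
}
```

The node types stay distinct and compact, but every traversal step has to branch on `kind` to pick the right array.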

Either way, the connections between the nodes will initially have to be defined in terms of array indices (Node[3] is followed by Node[10]). For better parsing performance you could, of course, create direct object pointers from these indices at program startup.
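
The index-to-pointer step might be sketched as follows. `NodeSpec`, `LinkedNode`, `kSpec` and `build_linked_graph` are hypothetical names, and the three-node cycle is made-up sample data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Static description of the graph: edges are array indices.
struct NodeSpec {
    int id;
    std::size_t next;
};

constexpr NodeSpec kSpec[] = { {10, 1}, {20, 2}, {30, 0} };
constexpr std::size_t kCount = sizeof(kSpec) / sizeof(kSpec[0]);

// Runtime form of the same graph: edges are direct pointers.
struct LinkedNode {
    int id;
    const LinkedNode* next;
};

// Builds the pointer-linked copy once, at program startup.
std::vector<LinkedNode> build_linked_graph() {
    std::vector<LinkedNode> nodes(kCount);  // one allocation, never resized
    for (std::size_t i = 0; i < kCount; ++i) {
        assert(kSpec[i].next < kCount);
        nodes[i].id = kSpec[i].id;
        nodes[i].next = &nodes[kSpec[i].next];
    }
    return nodes;
}
```

The pointers stay valid only as long as the vector's buffer is not reallocated or copied; moving the vector is fine, since a move hands over the same buffer.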

I would not put literal strings into any node ('word' in your case): the recognition of keywords, identifiers and other lexical elements should be done in a lexer module separate from the parser. I think it would also help to distinguish in terminology between the tokens that the lexer generates from the program's input and the grammar graph nodes that your program uses to parse that input.
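
To illustrate the separation, here is a small lexer sketch that turns source text into classified lexemes before any grammar node is consulted. The keyword set ("let", "if"), the four lexeme kinds and the names `LexKind`, `Lexeme`, `is_keyword` and `lex` are assumptions for this example only.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

enum class LexKind { Keyword, Identifier, Number, Symbol };

struct Lexeme {
    LexKind kind;
    std::string text;
};

// Hypothetical keyword set.
bool is_keyword(const std::string& word) {
    return word == "let" || word == "if";
}

// Splits the source into lexemes, skipping whitespace.
std::vector<Lexeme> lex(const std::string& src) {
    std::vector<Lexeme> out;
    std::size_t i = 0;
    while (i < src.size()) {
        const unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) { ++i; continue; }
        const std::size_t start = i;
        if (std::isalpha(c) || c == '_') {
            // Word: letters, digits and underscores.
            while (i < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_'))
                ++i;
            const std::string word = src.substr(start, i - start);
            out.push_back({is_keyword(word) ? LexKind::Keyword : LexKind::Identifier, word});
        } else if (std::isdigit(c)) {
            // Number: a run of digits.
            while (i < src.size() && std::isdigit(static_cast<unsigned char>(src[i])))
                ++i;
            out.push_back({LexKind::Number, src.substr(start, i - start)});
        } else {
            // Anything else is a one-character symbol.
            ++i;
            out.push_back({LexKind::Symbol, src.substr(start, 1)});
        }
    }
    return out;
}
```

With this split, a grammar node only needs to say "a keyword goes here" or "an identifier goes here"; it never compares characters itself.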

I hope this helps.

I don't see how you will define a "graph" of tokens that defines the syntax of any practical programming language, especially if the relation between tokens is "allowed-to-follow".

The usual way to represent the grammar of a programming language is Backus-Naur Form (BNF) or an extended version of it, termed "EBNF".

If you wanted to represent an EBNF ("as an immutable graph"), this SO answer discusses how to do that in C#. The ideas have direct analogs in C++.
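
As a rough C++ illustration (not a port of the linked C# answer, whose details are not reproduced here), grammar rules can be written as `constexpr` objects that point at each other, so the whole graph is composed of literals. The `Rule` layout, the tiny grammar `greeting = ("hello" | "hi"), "world"` and the naive `match` function with ordered choice are all assumptions for this sketch.

```cpp
#include <cstddef>
#include <cstring>

enum class RuleKind { Terminal, Sequence, Choice };

struct Rule {
    RuleKind kind;
    const char* text;             // Terminal only
    const Rule* const* children;  // Sequence and Choice only
    std::size_t child_count;
};

// greeting = ("hello" | "hi"), "world"
constexpr Rule kHello{RuleKind::Terminal, "hello", nullptr, 0};
constexpr Rule kHi{RuleKind::Terminal, "hi", nullptr, 0};
constexpr Rule kWorld{RuleKind::Terminal, "world", nullptr, 0};
constexpr const Rule* kWordChoices[] = {&kHello, &kHi};
constexpr Rule kWord{RuleKind::Choice, nullptr, kWordChoices, 2};
constexpr const Rule* kGreetingParts[] = {&kWord, &kWorld};
constexpr Rule kGreeting{RuleKind::Sequence, nullptr, kGreetingParts, 2};

// Tries to match `rule` starting at tokens[pos].
// Returns the index just past the match, or -1 on failure.
int match(const Rule& rule, const char* const* tokens, int count, int pos) {
    switch (rule.kind) {
    case RuleKind::Terminal:
        return (pos < count && std::strcmp(tokens[pos], rule.text) == 0) ? pos + 1 : -1;
    case RuleKind::Sequence:
        // Every child must match, one after the other.
        for (std::size_t i = 0; i < rule.child_count && pos >= 0; ++i)
            pos = match(*rule.children[i], tokens, count, pos);
        return pos;
    case RuleKind::Choice:
        // Ordered choice: the first alternative that matches wins.
        for (std::size_t i = 0; i < rule.child_count; ++i) {
            const int end = match(*rule.children[i], tokens, count, pos);
            if (end >= 0) return end;
        }
        return -1;
    }
    return -1;
}
```

Walking the rules directly like this is easy to write but does no lookahead or table-driven dispatch.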

The bad news is that most parsing engines can't use the EBNF directly, because it is simply too inefficient in practice. It is hard to build an efficient parser using the direct representation of the grammar rules; this is why people invented parser generators. So the need to put these rules into a memory structure at all, let alone an "efficient" one, is unclear unless you intend to write a parser generator.

Finally, even if you do pack the grammar information somehow optimally, it probably won't make an ounce of difference in actual performance. Most of a parser's time is spent grouping characters into lexemes, sometimes even just suppressing whitespace.

I don't think many small allocations of the tokens will be a bottleneck; if they are, you can always use a memory pool.
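
For reference, a memory pool can be as simple as the fixed-capacity bump allocator sketched below. `PoolToken`, `TokenPool` and the capacity of 64 are hypothetical; the sketch assumes the pooled type is trivially destructible, so no destructors ever need to run.

```cpp
#include <cstddef>
#include <new>

// A token type simple enough that it never needs a destructor call.
struct PoolToken {
    int id;
    const PoolToken* next;
};

// Hands out tokens back to back from one fixed block of storage.
class TokenPool {
public:
    // Returns nullptr when the pool is exhausted.
    PoolToken* create(int id, const PoolToken* next) {
        if (used_ == kCapacity) return nullptr;
        void* slot = storage_ + used_ * sizeof(PoolToken);
        ++used_;
        return new (slot) PoolToken{id, next};  // placement new into the block
    }

    std::size_t used() const { return used_; }

private:
    static constexpr std::size_t kCapacity = 64;
    alignas(PoolToken) unsigned char storage_[kCapacity * sizeof(PoolToken)];
    std::size_t used_ = 0;
};
```

Allocation is then one bounds check and one increment, and consecutive tokens sit next to each other in memory.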

On to the problem: since all tokens have similar data (a pointer to the next token, and perhaps some enum value for the kind of token we're dealing with), you could put that shared data in one std::vector. This will be contiguous data in memory, and very efficient to loop over.

While looping, you retrieve the kind of information you need. I bet the tokens themselves would ideally contain only "actions" (member functions), such as: if the previous and next tokens are numbers, and I'm a plus sign, we should add the numbers together.

So, the data is stored in one central place; the tokens are allocated (but might not actually contain much data themselves) and work on the data in that central place. This is actually a data-oriented design.

The vector could look like:

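#include <cstddef>
#include <vector>

class token;

enum token_id { NUMBER, PLUS_SIGN }; // some enum? (placeholder values)

struct TokenData
{
    token *previous, *current, *next;
    token_id id;
    // ... more data that is similar
};

std::vector<TokenData> token_data;

class token
{
public:
    token(std::vector<TokenData> *data, std::size_t i): token_data(data), index(i)
    {
    }

    virtual ~token()
    {
    }

    // the "action" each kind of token implements
    virtual void do_work() = 0;

    TokenData &data()
    {
        return (*token_data)[index];
    }

    const TokenData &data() const
    {
        return (*token_data)[index];
    }

private:
    std::vector<TokenData> *token_data; // the central vector
    std::size_t index;                  // this token's slot in it
};

// class plus_sign: public token
// if (data().previous->data().id == NUMBER && data().next->data().id == NUMBER)

void process_tokens()
{
    for (std::size_t i = 0; i < token_data.size(); i++)
    {
        token_data[i].current->do_work();
    }
}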

It's an idea.
