How to write data in Avro using the C++ interface when a field is nullable?

Question

First, I conducted a search for this question. I found an answer for the C interface and one for Java, but not one for C++. Unfortunately, the methods invoked in the C example don't exist in the C++ API, so one can't merely mimic the answer provided in that Stack Overflow discussion.

I am attempting something that should be rather simple. Yet after an hour or two I have only managed to get closer to an answer and still haven't found one. In the interest of simplicity, I reduced the record that I am attempting to write to only 1 field. That field is a string that can be null. In Avro this means that the field is optional. The null aspect of the field is accomplished through an Avro union, where the convention is that the null type comes first in the schema for that field.

What I've learned thus far from a considerable amount of trial and error:

- You need an encoder and decoder within a templated codec_traits struct for the record you want to write. This is typically defined in a header somewhere.
- If loading the schema from a file, which I am doing, then you need that schema defined in JSON format in a separate file.
- In your C++ code, you declare an avro::DataFileWriter using the schema that you load, along with a record from the aforementioned header. You then populate a local record with your data and invoke the write() method.

Should be simple enough. Yet not so much. For the particulars per the above list, the following is the code that I am currently using:

The header:

    #ifndef RECURSIVE_HH
    #define RECURSIVE_HH
    
    #include "Specific.hh"
    #include "Encoder.hh"
    #include "Decoder.hh"
    
    namespace recursive_record
    {
       struct recursive_data
       {
          std::string   fstring;
    
       };
    }
    
    namespace avro
    {
       template<> struct codec_traits<recursive_record::recursive_data>
       {
          static void encode( Encoder& e, const recursive_record::recursive_data& v )
          {
             avro::encode( e, v.fstring );
    
          }
    
          static void decode( Decoder& d, recursive_record::recursive_data& v )
          {
             avro::decode( d, v.fstring );
    
          }
       };
    }
    
    #endif /* RECURSIVE_HH */

The JSON schema file:

    {
        "type": "record",
        "name": "Root",
        "fields": [
            {
                "name": "fstring",
                "type": [
                    "null",
                    "string"
                ]
            }
        ]
    }

The main C++ file (note that I have snipped the file for brevity, so some of the included headers aren't used, or rather seen, in the following code):

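    #include <exception>   // std::exception
    #include <fstream>     // std::ifstream, used by loadSchema()
    #include <string>

    #include "recursive.h"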
    #include "Encoder.hh"
    #include "Decoder.hh"
    #include "Generic.hh"
    #include "GenericDatum.hh"
    #include "ValidSchema.hh"
    #include "DataFile.hh"
    #include "Types.hh"
    #include "Compiler.hh"
    #include "Stream.hh"
    
    avro::ValidSchema loadSchema(const char* filename)
    {
        std::ifstream ifs(filename);
        avro::ValidSchema result;
        avro::compileJsonSchema(ifs, result);
        return result;
    }
    
    
    int main( int argc, char** argv )
    {
       /**********************************************************************************
                                  AVRO WRITER EXAMPLE
       **********************************************************************************/
       try
       {
          //Filename definitions skipped for brevity
    
          avro::ValidSchema          recursiveSchema = loadSchema( schemaFilename );
          avro::DataFileWriter<recursive_record::recursive_data>   dfw( filename, recursiveSchema );
          recursive_record::recursive_data       record;
          record.fstring = std::string("First string");
    
          dfw.write( record );
          dfw.close();
    
       }
       catch( const std::exception& e )
       {
          // Log a message
          return -1;
    
       }
    }

"So what's the problem?" you might ask. Well, the file is actually written successfully, at least in that the code doesn't crash and an Avro data file is produced. So far, so good. However, if you attempt to read that file, then you receive the following error:

    AVRO read error: vector::_M_range_check: __n (which is 12) >= this->size() (which is 2)

Wha-??? Yeah. Been working on this all afternoon.

After considerable experimentation, I discovered that the problem was due to the nullable aspect of the field. I also noticed that if I removed the nullable option from the schema, so that the schema becomes this:

    {
        "type": "record",
        "name": "Root",
        "fields": [
            {
                "name": "fstring",
                "type": "string"
            }
        ]
    }

and I change nothing else, then the new Avro data file is not only written successfully, but it is read successfully too, thus:

    [rh6lgn01][1881] MY_EXAMPLES/generate_recursive$ recursive
    schema=recursive.json
    file=./DATA/recursive.avro
    recursiveSchema valid = true
    ReadFile(): Type = record
    ProcessRecord(): New record found.  Field count = 1
    ProcessRecord(): {
    ProcessRecord():   Field 0: type = string
    ProcessDatum():   Field 0: value = First string (length= 12)
    ProcessRecord(): }
    rowCount = 1
    
    AVRO Writing and Reading Complete
    [rh6lgn01][1882] MY_EXAMPLES/generate_recursive$

I had some hope when I read the Java question. One answer noted that, in Java, there is a @Nullable annotation that you can associate with a field in a record. Here is a link to that question: Storing null values in avro files

There is of course no such mechanism in the C++ language. I did find in the Types.hh header the following line of code that somehow seemed related:

    /// define a type to identify Null in template functions
    struct AVRO_DECL Null { };

However, I couldn't make heads or tails of how to use it in a similar fashion. So I'm either missing something or it has a different purpose. I fear the former but suspect the latter.

And this is a link to the Stack Overflow question for the C interface, along with its answer, for completeness: Write nullable item to avro record in Avro C

I am using version 1.9.2 of the Avro C++ library, running on a GNU/Linux box (not that it should matter, but for completeness).

I will continue to prod and seek an answer, but if anyone has done this previously and can shed some light, I would appreciate the feedback.

Thanks!

Answer 1

Alright, after toying with this until the wee hours of the morning and all day today, I finally figured it out. So I thought I'd post an answer to my own question, in the event that someone else might be searching for the same information. Although I'll try to be brief, if you aren't into detail I'd suggest that you discontinue reading now.

In the end I discovered that there are two approaches one can take to resolve this issue. Both yield the same result, which is the ability to write data into a field/column in an Avro data file where that field has been declared as optional in the schema. That is, it has the "null union" attached to its type. I will begin with the approach that is most closely related to the one in my original question. I will then provide an alternative solution and conclude with an observation or two. Note that in both of these approaches, the JSON schema remains unchanged from what you read in my initial post; see that post for its content. The only items that changed were the header and the code body.

So, the first approach. As with my first attempt, this approach involves a custom encoder and decoder (as shown in the header file in my original post), a JSON schema (mine was in a separate file), and then the main body of code. To keep things short, the problem lay in the header, as I suspected. To fix that, avoid writing that header yourself for anything beyond the most rudimentary scenarios, such as those shown in the examples that come with the Avro C++ distribution. Rather, let the Avro tool named "avrogencpp" do the heavy lifting of creating the custom encoder/decoder. The reason I recommend that choice is simply that the code avrogencpp produces in that header is convoluted, to say the least. Once you read it and understand it, the content makes sense, but for a record with more than a few fields the length becomes rather unwieldy for a human. Thus, let machines do what they do best. Anyway, this was the command I used:

    avrogencpp -i recursive.json -o recursive.h -n recursive_namespace

The result was a header that, nestled in its innards, had a struct definition named "Root", which matched the name of my top-level, or outermost, record as defined in the unchanged schema (no coincidence). With that, I could write the following in the main body of code:

      avro::ValidSchema          recursiveSchema = loadSchema( schemaFilename );
      avro::DataFileWriter<recursive_namespace::Root>   dfw( filename, recursiveSchema );
      recursive_namespace::Root  record;
      // snipped for brevity
      record.fstring.set_string( "String set via direct record value assignment" );
      dfw.write( record );
      dfw.close();

This succeeds, as seen in the output:

    [rh6lgn01][2174] MY_EXAMPLES/generate_recursive$ recursive
    schema=recursive.json
    file=./DATA/recursive.avro
    recursiveSchema valid = 1
    ReadFile(): Enter
    ReadFile(): Type = record
    ProcessRecord(): New record found.  Field count = 1
    ProcessRecord(): {
    ProcessRecord():   Field 0: type = string
    ProcessDatum():   Field 0: value = String set via direct record value assignment (length = 45)
    ProcessRecord(): }
    rowCount = 1
    -----------------------

    AVRO Writing and Reading Complete
    [rh6lgn01][2175] MY_EXAMPLES/generate_recursive$

And so that's that. Now to the second approach. This uses the GenericDatum class and is similar to the problem and answer shown in this Stack Overflow discussion:

How to read data from AVRO file using C++ interface?

In a way, one could argue that this approach has the benefit that you don't need a custom encoder/decoder and thus don't need the avrogencpp tool either. While that is true, I must admit to wondering about the performance of the generic "interface" in Avro. It just seems like it might be a tad slower than the direct route. However, it can read any file and is thus more flexible. I digress. Back to the solution. The only code you need is in the main body. Granted, what I am about to present is snipped to the bare essentials in order to demonstrate the approach. In real life you would need to flesh it out to include other types, etc. However, it conveys the idea, which is all you need. And this is it:

      avro::DataFileWriter<avro::GenericDatum>   writer( filename, schema );
      avro::GenericDatum    datum( schema );

      if( avro::AVRO_RECORD == datum.type() )
      {
         avro::GenericRecord  &record = datum.value<avro::GenericRecord>();
         for( uint32_t i = 0; i < record.fieldCount(); i++ )
         {
            avro::GenericDatum &fieldDatum = record.fieldAt( i );

            // So if the datum is a union, then it's likely that
            // the datum is an optional field.  We'd need to flesh
            // this out considerably to ensure that this was indeed
            // the case, but for brevity reasons, this will work:
            if( true == fieldDatum.isUnion() )
            {
                // Assuming the well-known Avro convention of the null
                // being first in the optional "syntax", then merely
                // jump to the second field which has the "actual type"
                // that the field/column is supposed to represent.
                // Again, this is in dire need of fleshing-out...
                fieldDatum.selectBranch( 1 );
                switch( fieldDatum.type() )
                {
                    case avro::AVRO_STRING:
                    {
                       std::string &newValue = fieldDatum.value<std::string>();
                       newValue = "New string set via switching branches in the union";
                       break;
                    }
                }
            }
         }
         // Write the record once, after every field has been populated.
         writer.write( datum );
      }
      writer.close();

This variant produces the following:

    [rh6lgn01][2177] MY_EXAMPLES/generate_recursive$ recursive
    schema=recursive.json
    file=./DATA/recursive.avro
    Top level was a record
    The record had 1 fields.
    Field datum was a union = true
    Field datum 0 was a union.  Current branch = 0
    Field datum 0 is now a string.  Current branch = 1
    ReadFile(): Enter
    ReadFile(): Type = record
    ProcessRecord(): New record found.  Field count = 1
    ProcessRecord(): {
    ProcessRecord():   Field 0: type = string
    ProcessDatum():   Field 0: value = New string set via switching branches in the union (length = 50)
    ProcessRecord(): }
    rowCount = 1
    -----------------------

    AVRO Writing and Reading Complete
    [rh6lgn01][2178] MY_EXAMPLES/generate_recursive$

And so it is a satisfactory solution as well.

For me, I'll likely go with the latter approach, as it just somehow seems "cleaner." That said, I think the more correct reason is that I already use the generic "interface" to read Avro files, so using it again for writing seems more congruent. In addition, I prefer the second approach because it doesn't require avrogencpp. YMMV.

I hope this answer helps someone in the future. Best of luck!

Jerry

Solution 1 (2020-07-09 05:29:19)