Protocol Buffers problem: multiple serializations into one binary file

Question

I am getting some weird behaviour from protobuf binary file I/O. I am pre-processing a text corpus into a protobuf intermediary file. My serialization class looks as follows:

  class pb_session_printer
  {
  public:
    pb_session_printer(std::string & filename)
      : out(filename.c_str(), std::fstream::out | std::fstream::trunc |
                              std::fstream::binary)
      {}

    void print_batch(std::vector<session> & pb_sv)
    {
      boost::lock_guard<boost::mutex> lock(m);

      BOOST_FOREACH(session & s, pb_sv)
      {
        std::cout << out.tellg() << ":";
        s.SerializeToOstream(&out);
        out.flush();
        std::cout << s.session_id() << ":" << s.action_size() << std::endl;
      }
      exit(0);
    }

    std::fstream out;
    boost::mutex m;
  };

A snippet of the output looks like this:

0:0:8
132:1:8
227:2:6
303:3:6
381:4:19
849:5:9
1028:6:2
1048:7:18
1333:8:28
2473:9:24

The first field is the stream offset before each write. It grows steadily, so serialization appears to be proceeding as normal.

When I run my loading program:

int main()
{
  std::fstream in_file("out_file", std::fstream::in | std::ios::binary);
  session s;

  std::cout << in_file.tellg() << std::endl;
  s.ParseFromIstream(&in_file);
  std::cout << in_file.tellg() << std::endl;
  std::cout << s.session_id() << std::endl;

  s.ParseFromIstream(&in_file);
}

I get:

0
-1
111
libprotobuf ERROR google/protobuf/message_lite.cc:123] Can't parse message of type 
"session" because it is missing required fields: session_id

session_id 111 is an entry towards the end of the stream. I clearly don't understand the semantics of the library's binary I/O facilities. Please help.

Answer 1

If you write multiple protobuf messages to a single file, you need to write the size of each message followed by the message itself, and read them back separately (so without ParseFromIstream, as Cat Plus Plus mentioned). Once you have read a message's bytes into a buffer, you can parse them with ParseFromArray.

Your file would look like this (the spaces are just for readability):

size protobuf size protobuf size protobuf etc.
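The size-then-message layout can be sketched with standard streams alone. This is only an illustration under stated assumptions: the helper names write_framed, read_framed and read_all_framed are hypothetical, the 4-byte little-endian prefix is one possible encoding (the answer does not say how the size should be stored), and plain std::string payloads stand in for serialized messages.

```cpp
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Writes one record: a 4-byte little-endian length, then the payload bytes.
// Assumption: the payload is an already serialized message.
bool write_framed(std::ostream& out, const std::string& payload)
{
  if (payload.size() > 0xFFFFFFFFu) return false;  // too large for the prefix
  const std::uint32_t n = static_cast<std::uint32_t>(payload.size());
  const char prefix[4] = {
    static_cast<char>(n & 0xFF),
    static_cast<char>((n >> 8) & 0xFF),
    static_cast<char>((n >> 16) & 0xFF),
    static_cast<char>((n >> 24) & 0xFF)
  };
  out.write(prefix, 4);
  out.write(payload.data(), static_cast<std::streamsize>(payload.size()));
  return static_cast<bool>(out);
}

// Reads one record into `payload`. Returns false at end of input or when
// the record is truncated. A corrupt prefix can request a very large
// allocation; real code would cap `n`.
bool read_framed(std::istream& in, std::string& payload)
{
  unsigned char prefix[4];
  if (!in.read(reinterpret_cast<char*>(prefix), 4)) return false;
  const std::uint32_t n =
      static_cast<std::uint32_t>(prefix[0]) |
      (static_cast<std::uint32_t>(prefix[1]) << 8) |
      (static_cast<std::uint32_t>(prefix[2]) << 16) |
      (static_cast<std::uint32_t>(prefix[3]) << 24);
  payload.resize(n);
  if (n == 0) return true;  // empty record, nothing more to read
  return static_cast<bool>(in.read(&payload[0], n));
}

// Reads records until the input is exhausted.
std::vector<std::string> read_all_framed(std::istream& in)
{
  std::vector<std::string> records;
  std::string payload;
  while (read_framed(in, payload)) records.push_back(payload);
  return records;
}
```

With protobuf, the writer would presumably fill the payload with s.SerializeToString(&payload) before calling write_framed, and the reader would hand each payload to ParseFromString, or to ParseFromArray with payload.data() and its size. Both streams should be opened in binary mode, as in the question.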

Answer 2

Message::ParseFromIstream is documented to consume the entire input. Since you're serialising a sequence of messages of the same type, you can just create a new message with a repeated field of that type, and work with that.
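The 0 followed by -1 in the question's output is what standard streams report when a reader pulls the input all the way to its end: the final short read sets failbit, and tellg() returns -1 while the stream is in a failed state. The sketch below shows that with an in-memory stream. The drain function is a hypothetical stand-in for a parser that consumes everything, not protobuf's actual implementation.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Stand-in for a parser that consumes the entire input: keeps calling
// read() until the stream fails at end of input. Returns bytes consumed.
std::size_t drain(std::istream& in)
{
  char chunk[64];
  std::size_t total = 0;
  // The last, short read sets eofbit and failbit but still reports its
  // byte count through gcount(), so those bytes are counted too.
  while (in.read(chunk, sizeof chunk) || in.gcount() > 0)
    total += static_cast<std::size_t>(in.gcount());
  return total;
}

struct drain_report {
  long long before;   // tellg() before reading
  long long after;    // tellg() once the input is exhausted
  std::size_t bytes;  // bytes consumed by drain()
};

drain_report drain_and_report(const std::string& data)
{
  std::istringstream in(data, std::ios::in | std::ios::binary);
  drain_report r;
  std::streamoff off = in.tellg();
  r.before = static_cast<long long>(off);
  r.bytes = drain(in);
  off = in.tellg();  // failbit is set here, so this yields -1
  r.after = static_cast<long long>(off);
  return r;
}
```

For a 200-byte input, drain_and_report reports a position of 0 before reading, 200 bytes consumed, and -1 afterwards. This matches the question's symptoms: the first parse swallowed every serialized session in the file, and the second parse had nothing left to read.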

Answer 1 (score 4, accepted): 2011-12-01 19:01:15

Answer 2 (score 3): 2011-12-01 17:53:57
