
Are there C++ equivalents for the Protocol Buffers delimited I/O functions in Java?

I'm trying to read/write multiple Protocol Buffers messages from files, in both C++ and Java. Google suggests writing length prefixes before the messages, but there's no way to do that by default (that I could see).

However, the Java API in version 2.1.0 received a set of "Delimited" I/O functions which apparently do that job:

parseDelimitedFrom
mergeDelimitedFrom
writeDelimitedTo

Are there C++ equivalents? And if not, what's the wire format for the size prefixes the Java API attaches, so I can parse those messages in C++?


Update:

These now exist in google/protobuf/util/delimited_message_util.h as of v3.3.0.

I'm a bit late to the party here, but the below implementations include some optimizations missing from the other answers and will not fail after 64MB of input (though it still enforces the 64MB limit on each individual message, just not on the whole stream).

(I am the author of the C++ and Java protobuf libraries, but I no longer work for Google. Sorry that this code never made it into the official lib. This is what it would look like if it had.)

bool writeDelimitedTo(
    const google::protobuf::MessageLite& message,
    google::protobuf::io::ZeroCopyOutputStream* rawOutput) {
  // We create a new coded stream for each message.  Don't worry, this is fast.
  google::protobuf::io::CodedOutputStream output(rawOutput);

  // Write the size.
  const int size = message.ByteSize();
  output.WriteVarint32(size);

  uint8_t* buffer = output.GetDirectBufferForNBytesAndAdvance(size);
  if (buffer != NULL) {
    // Optimization:  The message fits in one buffer, so use the faster
    // direct-to-array serialization path.
    message.SerializeWithCachedSizesToArray(buffer);
  } else {
    // Slightly-slower path when the message is multiple buffers.
    message.SerializeWithCachedSizes(&output);
    if (output.HadError()) return false;
  }

  return true;
}

bool readDelimitedFrom(
    google::protobuf::io::ZeroCopyInputStream* rawInput,
    google::protobuf::MessageLite* message) {
  // We create a new coded stream for each message.  Don't worry, this is fast,
  // and it makes sure the 64MB total size limit is imposed per-message rather
  // than on the whole stream.  (See the CodedInputStream interface for more
  // info on this limit.)
  google::protobuf::io::CodedInputStream input(rawInput);

  // Read the size.
  uint32_t size;
  if (!input.ReadVarint32(&size)) return false;

  // Tell the stream not to read beyond that size.
  google::protobuf::io::CodedInputStream::Limit limit =
      input.PushLimit(size);

  // Parse the message.
  if (!message->MergeFromCodedStream(&input)) return false;
  if (!input.ConsumedEntireMessage()) return false;

  // Release the limit.
  input.PopLimit(limit);

  return true;
}
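The framing these helpers produce is just a varint length followed by the message bytes, so it can be exercised without protobuf at all. Here's a hedged, protobuf-free sketch (the function names are my own, not protobuf's) that round-trips varint-delimited byte strings over standard iostreams:

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Write one record: a base-128 varint length prefix, then the payload bytes.
void writeDelimitedRecord(std::ostream& out, const std::string& payload) {
    uint32_t n = static_cast<uint32_t>(payload.size());
    while (n >= 0x80) {
        out.put(static_cast<char>(n | 0x80));  // continuation bit set
        n >>= 7;
    }
    out.put(static_cast<char>(n));             // final byte, high bit clear
    out.write(payload.data(), payload.size());
}

// Read one record; returns false on clean EOF, a malformed prefix,
// or a truncated payload.
bool readDelimitedRecord(std::istream& in, std::string* payload) {
    uint32_t size = 0;
    int shift = 0;
    while (true) {
        int c = in.get();
        if (c == std::char_traits<char>::eof() || shift > 28) return false;
        size |= static_cast<uint32_t>(c & 0x7F) << shift;
        if (!(c & 0x80)) break;  // high bit clear: last byte of the varint
        shift += 7;
    }
    payload->resize(size);
    in.read(&(*payload)[0], size);
    return static_cast<uint32_t>(in.gcount()) == size;
}
```

Reading in a loop until readDelimitedRecord returns false gives the stop-on-EOF behavior you'd want when consuming a whole file of messages.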

Okay, so I haven't been able to find top-level C++ functions implementing what I need, but some spelunking through the Java API reference turned up the following, inside the MessageLite interface:

void writeDelimitedTo(OutputStream output)
/*  Like writeTo(OutputStream), but writes the size of 
    the message as a varint before writing the data.   */

So the Java size prefix is a (Protocol Buffers) varint!

Armed with that information, I went digging through the C++ API and found the CodedStream header, which has these:

bool CodedInputStream::ReadVarint32(uint32 * value)
void CodedOutputStream::WriteVarint32(uint32 value)
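For reference, the varint encoding those two functions speak is simple enough to sketch by hand. The following self-contained illustration (the function names are mine, not protobuf's) packs 7 payload bits per byte, least-significant group first, with the high bit set on every byte except the last:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Encode a 32-bit value as a base-128 varint.
std::vector<uint8_t> encodeVarint32(uint32_t value) {
    std::vector<uint8_t> out;
    while (value >= 0x80) {
        out.push_back(static_cast<uint8_t>(value) | 0x80);  // more bytes follow
        value >>= 7;
    }
    out.push_back(static_cast<uint8_t>(value));             // last byte
    return out;
}

// Decode a varint from the front of a buffer; returns the number of bytes
// consumed, or 0 if the buffer is truncated or the varint is too long.
size_t decodeVarint32(const std::vector<uint8_t>& buf, uint32_t* value) {
    uint32_t result = 0;
    int shift = 0;
    for (size_t i = 0; i < buf.size() && i < 5; ++i) {
        result |= static_cast<uint32_t>(buf[i] & 0x7F) << shift;
        if (!(buf[i] & 0x80)) {
            *value = result;
            return i + 1;
        }
        shift += 7;
    }
    return 0;
}
```

For example, 300 encodes as the two bytes 0xAC 0x02: the low 7 bits with the continuation bit set, then the remaining bits.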

Using those, I should be able to roll my own C++ functions that do the job.

They should really add this to the main Message API, though; it's missing functionality considering Java has it, and so does Marc Gravell's excellent protobuf-net C# port (via SerializeWithLengthPrefix and DeserializeWithLengthPrefix).

I solved the same problem using CodedOutputStream/ArrayOutputStream to write the message (with the size) and CodedInputStream/ArrayInputStream to read the message (with the size).

For example, the following pseudo-code writes the message size followed by the message:

const unsigned bufLength = 256;
unsigned char buffer[bufLength];
Message protoMessage;

google::protobuf::io::ArrayOutputStream arrayOutput(buffer, bufLength);
google::protobuf::io::CodedOutputStream codedOutput(&arrayOutput);

codedOutput.WriteLittleEndian32(protoMessage.ByteSize());
protoMessage.SerializeToCodedStream(&codedOutput);

When writing, you should also check that your buffer is large enough to fit the message (including the size). And when reading, you should check that your buffer contains a whole message (including the size).
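Those size checks can be sketched without protobuf as well. The following hedged example (the names are mine) mirrors the fixed 4-byte little-endian prefix that WriteLittleEndian32 produces, and refuses to unframe a buffer that doesn't hold a whole message:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Frame a payload with a fixed 4-byte little-endian length prefix.
std::vector<uint8_t> frameLE32(const std::string& payload) {
    std::vector<uint8_t> out;
    uint32_t n = static_cast<uint32_t>(payload.size());
    for (int i = 0; i < 4; ++i)
        out.push_back(static_cast<uint8_t>(n >> (8 * i)));  // low byte first
    out.insert(out.end(), payload.begin(), payload.end());
    return out;
}

// Extract one framed payload; returns false if the buffer does not
// contain the full prefix plus the full message.
bool unframeLE32(const std::vector<uint8_t>& buf, std::string* payload) {
    if (buf.size() < 4) return false;  // prefix itself is incomplete
    uint32_t n = 0;
    for (int i = 0; i < 4; ++i)
        n |= static_cast<uint32_t>(buf[i]) << (8 * i);
    if (buf.size() < 4u + n) return false;  // message is truncated
    payload->assign(buf.begin() + 4, buf.begin() + 4 + n);
    return true;
}
```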

It definitely would be handy if they added convenience methods to the C++ API similar to those provided by the Java API.

IstreamInputStream is very fragile to EOFs and other errors that easily occur when it is used together with std::istream. After such an error the protobuf streams are permanently damaged and any already-used buffer data is destroyed. There is proper support for reading from traditional streams in protobuf.

Implement google::protobuf::io::CopyingInputStream and use it together with CopyingInputStreamAdaptor. Do the same for the output variants.

In practice a parsing call ends up in google::protobuf::io::CopyingInputStream::Read(void* buffer, int size), where a buffer is given. The only thing left to do is read into it somehow.

Here's an example for use with Asio synchronized streams (SyncReadStream / SyncWriteStream):

#include <google/protobuf/io/zero_copy_stream_impl_lite.h>

using namespace google::protobuf::io;


template <typename SyncReadStream>
class AsioInputStream : public CopyingInputStream {
    public:
        AsioInputStream(SyncReadStream& sock);
        int Read(void* buffer, int size);
    private:
        SyncReadStream& m_Socket;
};


template <typename SyncReadStream>
AsioInputStream<SyncReadStream>::AsioInputStream(SyncReadStream& sock) :
    m_Socket(sock) {}


template <typename SyncReadStream>
int
AsioInputStream<SyncReadStream>::Read(void* buffer, int size)
{
    std::size_t bytes_read;
    boost::system::error_code ec;
    bytes_read = m_Socket.read_some(boost::asio::buffer(buffer, size), ec);

    if(!ec) {
        return bytes_read;
    } else if (ec == boost::asio::error::eof) {
        return 0;
    } else {
        return -1;
    }
}


template <typename SyncWriteStream>
class AsioOutputStream : public CopyingOutputStream {
    public:
        AsioOutputStream(SyncWriteStream& sock);
        bool Write(const void* buffer, int size);
    private:
        SyncWriteStream& m_Socket;
};


template <typename SyncWriteStream>
AsioOutputStream<SyncWriteStream>::AsioOutputStream(SyncWriteStream& sock) :
    m_Socket(sock) {}


template <typename SyncWriteStream>
bool
AsioOutputStream<SyncWriteStream>::Write(const void* buffer, int size)
{   
    boost::system::error_code ec;
    m_Socket.write_some(boost::asio::buffer(buffer, size), ec);
    return !ec;
}

Usage:

AsioInputStream<boost::asio::ip::tcp::socket> ais(m_Socket); // Where m_Socket is an instance of boost::asio::ip::tcp::socket
CopyingInputStreamAdaptor cis_adp(&ais);
CodedInputStream cis(&cis_adp);

Message protoMessage;
uint32_t msg_size;

/* Read message size */
if(!cis.ReadVarint32(&msg_size)) {
    // Handle error
}

/* Make sure not to read beyond limit of message */
CodedInputStream::Limit msg_limit = cis.PushLimit(msg_size);
if(!protoMessage.ParseFromCodedStream(&cis)) {
    // Handle error
}

/* Remove limit */
cis.PopLimit(msg_limit);

Here you go:

#include <google/protobuf/io/zero_copy_stream_impl.h>
#include <google/protobuf/io/coded_stream.h>

#include <cassert>
#include <fstream>
#include <iostream>
#include <string>

using namespace google::protobuf::io;

class FASWriter 
{
    std::ofstream mFs;
    OstreamOutputStream *_OstreamOutputStream;
    CodedOutputStream *_CodedOutputStream;
public:
    FASWriter(const std::string &file) : mFs(file,std::ios::out | std::ios::binary)
    {
        assert(mFs.good());

        _OstreamOutputStream = new OstreamOutputStream(&mFs);
        _CodedOutputStream = new CodedOutputStream(_OstreamOutputStream);
    }

    inline void operator()(const ::google::protobuf::Message &msg)
    {
        _CodedOutputStream->WriteVarint32(msg.ByteSize());

        if ( !msg.SerializeToCodedStream(_CodedOutputStream) )
            std::cout << "SerializeToCodedStream error " << std::endl;
    }

    ~FASWriter()
    {
        delete _CodedOutputStream;
        delete _OstreamOutputStream;
        mFs.close();
    }
};

class FASReader
{
    std::ifstream mFs;

    IstreamInputStream *_IstreamInputStream;
    CodedInputStream *_CodedInputStream;
public:
    FASReader(const std::string &file) : mFs(file,std::ios::in | std::ios::binary)
    {
        assert(mFs.good());

        _IstreamInputStream = new IstreamInputStream(&mFs);
        _CodedInputStream = new CodedInputStream(_IstreamInputStream);      
    }

    template<class T>
    bool ReadNext()
    {
        T msg;
        uint32_t size;

        bool ret;
        if ( ret = _CodedInputStream->ReadVarint32(&size) )
        {   
            CodedInputStream::Limit msgLimit = _CodedInputStream->PushLimit(size);
            if ( ret = msg.ParseFromCodedStream(_CodedInputStream) )
            {
                _CodedInputStream->PopLimit(msgLimit);      
                std::cout << "FASReader ReadNext: " << msg.DebugString() << std::endl;
            }
        }

        return ret;
    }

    ~FASReader()
    {
        delete _CodedInputStream;
        delete _IstreamInputStream;
        mFs.close();
    }
};

I ran into the same issue in both C++ and Python.

For the C++ version, I used a mix of the code Kenton Varda posted in this thread and the code from the pull request he sent to the protobuf team (because the version posted here doesn't handle EOF, while the one he sent to GitHub does).

#include <google/protobuf/message_lite.h>
#include <google/protobuf/io/zero_copy_stream.h>
#include <google/protobuf/io/coded_stream.h>


bool writeDelimitedTo(const google::protobuf::MessageLite& message,
    google::protobuf::io::ZeroCopyOutputStream* rawOutput)
{
    // We create a new coded stream for each message.  Don't worry, this is fast.
    google::protobuf::io::CodedOutputStream output(rawOutput);

    // Write the size.
    const int size = message.ByteSize();
    output.WriteVarint32(size);

    uint8_t* buffer = output.GetDirectBufferForNBytesAndAdvance(size);
    if (buffer != NULL)
    {
        // Optimization:  The message fits in one buffer, so use the faster
        // direct-to-array serialization path.
        message.SerializeWithCachedSizesToArray(buffer);
    }

    else
    {
        // Slightly-slower path when the message is multiple buffers.
        message.SerializeWithCachedSizes(&output);
        if (output.HadError())
            return false;
    }

    return true;
}

bool readDelimitedFrom(google::protobuf::io::ZeroCopyInputStream* rawInput, google::protobuf::MessageLite* message, bool* clean_eof)
{
    // We create a new coded stream for each message.  Don't worry, this is fast,
    // and it makes sure the 64MB total size limit is imposed per-message rather
    // than on the whole stream.  (See the CodedInputStream interface for more
    // info on this limit.)
    google::protobuf::io::CodedInputStream input(rawInput);
    const int start = input.CurrentPosition();
    if (clean_eof)
        *clean_eof = false;


    // Read the size.
    uint32_t size;
    if (!input.ReadVarint32(&size))
    {
        if (clean_eof)
            *clean_eof = input.CurrentPosition() == start;
        return false;
    }
    // Tell the stream not to read beyond that size.
    google::protobuf::io::CodedInputStream::Limit limit = input.PushLimit(size);

    // Parse the message.
    if (!message->MergeFromCodedStream(&input)) return false;
    if (!input.ConsumedEntireMessage()) return false;

    // Release the limit.
    input.PopLimit(limit);

    return true;
}

And here is my Python 2 implementation:

from google.protobuf.internal import encoder
from google.protobuf.internal import decoder

#I had to implement this because the tools in google.protobuf.internal.decoder
#read from a buffer, not from a file-like object
def readRawVarint32(stream):
    mask = 0x80 # (1 << 7)
    raw_varint32 = []
    while 1:
        b = stream.read(1)
        #eof
        if b == "":
            break
        raw_varint32.append(b)
        if not (ord(b) & mask):
            #we found a byte starting with a 0, which means it's the last byte of this varint
            break
    return raw_varint32

def writeDelimitedTo(message, stream):
    message_str = message.SerializeToString()
    delimiter = encoder._VarintBytes(len(message_str))
    stream.write(delimiter + message_str)

def readDelimitedFrom(MessageType, stream):
    raw_varint32 = readRawVarint32(stream)
    message = None

    if raw_varint32:
        size, _ = decoder._DecodeVarint32(raw_varint32, 0)

        data = stream.read(size)
        if len(data) < size:
            raise Exception("Unexpected end of file")

        message = MessageType()
        message.ParseFromString(data)

    return message

#In place version that takes an already built protobuf object
#In my tests, this is around 20% faster than the other version 
#of readDelimitedFrom()
def readDelimitedFrom_inplace(message, stream):
    raw_varint32 = readRawVarint32(stream)

    if raw_varint32:
        size, _ = decoder._DecodeVarint32(raw_varint32, 0)

        data = stream.read(size)
        if len(data) < size:
            raise Exception("Unexpected end of file")

        message.ParseFromString(data)

        return message
    else:
        return None

It might not be the best-looking code and I'm sure it can be refactored a fair bit, but at least that should show you one way to do it.

Now the big problem: it's SLOW.

Even when using the C++ implementation of python-protobuf, it's one order of magnitude slower than pure C++. I have a benchmark where I read 10M protobuf messages of ~30 bytes each from a file. It takes ~0.9s in C++, and 35s in Python.

One way to make it a bit faster would be to re-implement the varint decoder to make it read from a file and decode in one go, instead of reading from a file and then decoding as this code currently does. (Profiling shows that a significant amount of time is spent in the varint encoder/decoder.) But needless to say, that alone is not enough to close the gap between the Python version and the C++ version.

Any idea to make it faster is very welcome :)

Just for completeness, I post here an up-to-date version that works with the master version of protobuf and Python 3.

For the C++ version, it is sufficient to use the utils in delimited_message_util.h; here is an MWE:

#include <google/protobuf/io/zero_copy_stream_impl.h>
#include <google/protobuf/util/delimited_message_util.h>

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#include <deque>
#include <iostream>
#include <string>

template <typename T>
bool writeManyToFile(std::deque<T> messages, std::string filename) {
    int outfd = open(filename.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
    google::protobuf::io::FileOutputStream fout(outfd);

    bool success;
    for (auto msg: messages) {
        success = google::protobuf::util::SerializeDelimitedToZeroCopyStream(
            msg, &fout);
        if (! success) {
            std::cout << "Writing Failed" << std::endl;
            break;
        }
    }
    fout.Close();
    close(outfd);
    return success;
}

template <typename T>
std::deque<T> readManyFromFile(std::string filename) {
    int infd = open(filename.c_str(), O_RDONLY);

    google::protobuf::io::FileInputStream fin(infd);
    bool keep = true;
    bool clean_eof = true;
    std::deque<T> out;

    while (keep) {
        T msg;
        keep = google::protobuf::util::ParseDelimitedFromZeroCopyStream(
            &msg, &fin, nullptr);
        if (keep)
            out.push_back(msg);
    }
    fin.Close();
    close(infd);
    return out;
}

For the Python3 version, building on @fireboot's answer, the only thing that needed modification is the decoding of raw_varint32:

def getSize(raw_varint32):
    # raw_varint32 is a list of single-byte reads; accumulate 7 bits per byte,
    # least-significant group first
    result = 0
    shift = 0
    for b in raw_varint32:
        result |= ((ord(b) & 0x7f) << shift)
        shift += 7
    return result

def readDelimitedFrom(MessageType, stream):
    raw_varint32 = readRawVarint32(stream)
    message = None

    if raw_varint32:
        size = getSize(raw_varint32)

        data = stream.read(size)
        if len(data) < size:
            raise Exception("Unexpected end of file")

        message = MessageType()
        message.ParseFromString(data)

    return message

Was also looking for a solution for this. Here's the core of our solution, assuming some Java code wrote many MyRecord messages with writeDelimitedTo into a file. Open the file and loop, doing:

if(someCodedInputStream->ReadVarint32(&bytes)) {
  CodedInputStream::Limit msgLimit = someCodedInputStream->PushLimit(bytes);
  if(myRecord->ParseFromCodedStream(someCodedInputStream)) {
    //do your stuff with the parsed MyRecord instance
  } else {
    //handle parse error
  }
  someCodedInputStream->PopLimit(msgLimit);
} else {
  //maybe end of file
}

Hope it helps.

Working with an Objective-C version of Protocol Buffers, I ran into this exact issue. When sending from the iOS client to a Java-based server that uses parseDelimitedFrom, which expects the length as the first byte, I needed to call writeRawByte on the CodedOutputStream first. (Note that a single raw byte is only a valid varint length prefix for messages shorter than 128 bytes; larger messages need a full varint.) Posting here to hopefully help others that run into this issue. While working through it, one would think that Google's protobufs would come with a simple flag which does this for you...

    Request* request = [rBuild build];

    [self sendMessage:request];
} 


- (void) sendMessage:(Request *) request {

    //** get length
    NSData* n = [request data];
    uint8_t len = [n length];

    PBCodedOutputStream* os = [PBCodedOutputStream streamWithOutputStream:outputStream];
    //** prepend it to message, such that Request.parseDelimitedFrom(in) can parse it properly
    [os writeRawByte:len];
    [request writeToCodedOutputStream:os];
    [os flush];
}

Since I'm not allowed to write this as a comment to Kenton Varda's answer above: I believe there is a bug in the code he posted (as well as in other answers which have been provided). The following code:

...
google::protobuf::io::CodedInputStream input(rawInput);

// Read the size.
uint32_t size;
if (!input.ReadVarint32(&size)) return false;

// Tell the stream not to read beyond that size.
google::protobuf::io::CodedInputStream::Limit limit =
    input.PushLimit(size);
...

sets an incorrect limit because it does not take into account the size of the varint32 which has already been read from input. This can result in data loss/corruption as additional bytes are read from the stream which may be part of the next message. The usual way of handling this correctly is to delete the CodedInputStream used to read the size and create a new one for reading the payload:

...
uint32_t size;
{
  google::protobuf::io::CodedInputStream input(rawInput);

  // Read the size.
  if (!input.ReadVarint32(&size)) return false;
}

google::protobuf::io::CodedInputStream input(rawInput);

// Tell the stream not to read beyond that size.
google::protobuf::io::CodedInputStream::Limit limit =
    input.PushLimit(size);
...

You can use getline for reading a string from a stream, using the specified delimiter:

istream& getline ( istream& is, string& str, char delim );

(defined in the <string> header)
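A minimal sketch of that overload (note it only suits text framing; a binary protobuf payload may itself contain the delimiter byte, so it's no substitute for a length prefix):

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Split a stream on a single-character delimiter with std::getline.
std::vector<std::string> splitStream(std::istream& in, char delim) {
    std::vector<std::string> fields;
    std::string field;
    while (std::getline(in, field, delim)) {
        fields.push_back(field);
    }
    return fields;
}
```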
