
Can't deserialize Protobuf (2.6.1) data using elephant-bird and Hive in AWS

I am not able to deserialize protobuf data that contains a repeated string field using elephant-bird 4.14 with Hive. This seems to be because the code that protoc generates for repeated string fields depends on Protobuf 2.6 APIs that are not present in Protobuf 2.5. When I run my Hive queries on an AWS EMR cluster, Hive uses the Protobuf 2.5 jar that is bundled with AWS Hive. Even after adding the Protobuf 2.6 jar explicitly, I am not able to get rid of this error. I want to know how I can make Hive use the Protobuf 2.6 jar that I add explicitly.

Below are the Hive queries used:

    add jar s3://gam.test/hive-jars/protobuf-java-2.6.1.jar;
    add jar s3://gam.test/hive-jars/GAMDataModel-1.0.jar;
    add jar s3://gam.test/hive-jars/GAMCoreModel-1.0.jar;
    add jar s3://gam.test/hive-jars/GAMAccessLayer-1.1.jar;
    add jar s3://gam.test/hive-jars/RodbHiveStorageHandler-0.12.0-jarjar-final.jar;
    add jar s3://gam.test/hive-jars/elephant-bird-core-4.14.jar;
    add jar s3://gam.test/hive-jars/elephant-bird-hive-4.14.jar;
    add jar s3://gam.test/hive-jars/elephant-bird-hadoop-compat-4.14.jar;
    add jar s3://gam.test/hive-jars/protobuf-java-2.6.1.jar;
    add jar s3://gam.test/hive-jars/GamProtoBufHiveDeserializer-1.0-jarjar.jar;
    drop table GamRelationRodb;

    CREATE EXTERNAL TABLE GamRelationRodb
    row format serde "com.amazon.hive.serde.GamProtobufDeserializer"
    with serdeproperties("serialization.class"= 
 "com.amazon.gam.rodb.model.RepeatedRelationshipWrapperProto$RepeatedRelationshipWrapper")
    STORED BY 'com.amazon.rodb.hadoop.hive.RodbHiveStorageHandler' TBLPROPERTIES 
    ("file.name" = 'GAM_Relationship',"file.path" ='s3://pathtofile/');

    select * from GamRelationRodb limit 10;

Below are the Protobuf message definitions:

    message RepeatedRelationshipWrapper {
        repeated relationship.Relationship relationships = 1;
    }

    message Relationship {
        required RelationshipType type = 1;
        repeated string ids = 2;
    }

    enum RelationshipType {
        UKNOWN_RELATIONSHIP_TYPE = 0;
        PARENT = 1;
        CHILD = 2;
    }

Below is the runtime exception thrown while running the query:

    Exception in thread "main" java.lang.NoSuchMethodError: com.google.protobuf.LazyStringList.getUnmodifiableView()Lcom/google/protobuf/LazyStringList;
    at com.amazon.gam.model.RelationshipProto$Relationship.<init>(RelationshipProto.java:215)
    at com.amazon.gam.model.RelationshipProto$Relationship.<init>(RelationshipProto.java:137)
    at com.amazon.gam.model.RelationshipProto$Relationship$1.parsePartialFrom(RelationshipProto.java:239)
    at com.amazon.gam.model.RelationshipProto$Relationship$1.parsePartialFrom(RelationshipProto.java:234)
    at com.google.protobuf.CodedInputStream.readMessage(CodedInputStream.java:309)
    at com.amazon.gam.rodb.model.RepeatedRelationshipWrapperProto$RepeatedRelationshipWrapper.<init>(RepeatedRelationshipWrapperProto.java:126)
    at com.amazon.gam.rodb.model.RepeatedRelationshipWrapperProto$RepeatedRelationshipWrapper.<init>(RepeatedRelationshipWrapperProto.java:72)
    at com.amazon.gam.rodb.model.RepeatedRelationshipWrapperProto$RepeatedRelationshipWrapper$1.parsePartialFrom(RepeatedRelationshipWrapperProto.java:162)
    at com.amazon.gam.rodb.model.RepeatedRelationshipWrapperProto$RepeatedRelationshipWrapper$1.parsePartialFrom(RepeatedRelationshipWrapperProto.java:157)
    at com.amazon.gam.rodb.model.RepeatedRelationshipWrapperProto$RepeatedRelationshipWrapper$Builder.mergeFrom(RepeatedRelationshipWrapperProto.java:495)
    at com.amazon.gam.rodb.model.RepeatedRelationshipWrapperProto$RepeatedRelationshipWrapper$Builder.mergeFrom(RepeatedRelationshipWrapperProto.java:355)
    at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:337)
    at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:267)
    at com.google.protobuf.AbstractMessageLite$Builder.mergeFrom(AbstractMessageLite.java:170)
    at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:882)
    at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:267)
    at com.twitter.elephantbird.mapreduce.io.ProtobufConverter.fromBytes(ProtobufConverter.java:66)
    at com.twitter.elephantbird.hive.serde.ProtobufDeserializer.deserialize(ProtobufDeserializer.java:59)
    at com.amazon.hive.serde.GamProtobufDeserializer.deserialize(GamProtobufDeserializer.java:63)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:502)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:428)
    at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:146)
    at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2098)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:252)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:183)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:399)
    at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:776)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:714)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:641)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
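
The method named in the NoSuchMethodError, com.google.protobuf.LazyStringList.getUnmodifiableView(), was added in protobuf-java 2.6 and is not present in the 2.5.0 jar that ships with Hive on EMR, so the trace suggests the bundled 2.5 jar is still the one being loaded despite the add jar statements. As a quick diagnostic (a sketch added here for illustration, not part of the original post), a reflection check run with the same classpath as the Hive CLI shows which jar actually provides the protobuf classes:

    // ProtobufVersionCheck.java -- diagnostic sketch; run it on the cluster with the
    // same classpath the Hive CLI uses so it sees the same protobuf jar.
    public class ProtobufVersionCheck {
        public static void main(String[] args) throws Exception {
            Class<?> lazyStringList = Class.forName("com.google.protobuf.LazyStringList");

            // Which jar did the class actually come from?
            System.out.println("Loaded from: "
                    + lazyStringList.getProtectionDomain().getCodeSource().getLocation());

            // getUnmodifiableView() exists in protobuf-java 2.6.x but not in 2.5.0.
            try {
                lazyStringList.getMethod("getUnmodifiableView");
                System.out.println("getUnmodifiableView() present -> protobuf-java 2.6.x or later");
            } catch (NoSuchMethodException e) {
                System.out.println("getUnmodifiableView() missing -> protobuf-java 2.5.x or older");
            }
        }
    }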

Protobuf is a brittle library. It may be wire-format compatible across 2.x versions, but the classes generated by protoc will only link against the protobuf JAR of exactly the same version as the protoc compiler that generated them.
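
The RelationshipProto constructor frames in the stack trace are exactly this kind of breakage: protoc 2.6 emits parsing code for a repeated string field that calls a method which only exists in the 2.6 runtime jar. The following is a simplified sketch of what that generated code does (not the real RelationshipProto.java; it needs protobuf-java on the classpath to compile):

    import com.google.protobuf.LazyStringArrayList;
    import com.google.protobuf.LazyStringList;

    // Simplified sketch of the code protoc 2.6 generates for `repeated string ids = 2;`
    // in the message's parsing constructor.
    public class GeneratedCodeSketch {
        private LazyStringList ids_ = LazyStringArrayList.EMPTY;

        void addId(String id) {
            ids_ = new LazyStringArrayList(ids_);
            ids_.add(id);
        }

        void finishParsing() {
            // getUnmodifiableView() was added in protobuf-java 2.6.0. Code generated and
            // compiled against 2.6 but run against the 2.5.0 jar fails right here with
            // NoSuchMethodError, which is what the stack trace above shows.
            ids_ = ids_.getUnmodifiableView();
        }
    }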

This means, fundamentally, that you cannot update protobuf except by choreographing the upgrade across all dependencies. The Great Protobuf Upgrade of 2013 was when Hadoop, HBase, Hive and the rest moved up, and since then everyone has been frozen at v2.5, probably for the entire life of the Hadoop 2.x codeline, unless it all gets shaded away or Java 9 hides the problem.

We are more scared of protobuf updates than of upgrades to Guava and Jackson, as the latter only break every single library, not the wire format.

Watch HADOOP-13363 for the topic of a protobuf 2.x upgrade, and HDFS-11010 on the question of a move up to protobuf 3 in Hadoop trunk. The latter is messy, as it changes the wire format, breaks the protobuf-JSON marshalling, and other things.

It's best just to conclude, "binary compatibility of protobuf code has been found lacking", and stick to protobuf 2.5. Sorry.

You could take the entire stack of libraries you want to use and rebuild them with an updated protoc compiler and a matching protobuf.jar, with any other patches you need applied. I would only recommend that to the bold, but I am curious about the outcome. If you do try this, let us know how it worked out.
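
If you do attempt it, the usual trick for keeping a private protobuf 2.6 from clashing with the 2.5 copy on Hive's classpath is to relocate it into your own package with jarjar or the shade plugin, rewriting both the protobuf classes and the generated message classes that reference them (the question's jars already appear to be jarjar-processed). Below is a rough sketch of how one might verify that both copies coexist after such a rebuild; the relocated package name is purely hypothetical:

    // RelocationCheck.java -- sketch only; assumes the serde jar was rebuilt with
    // protobuf 2.6 relocated into a private package. The package name
    // "com.amazon.gam.shaded.protobuf" is hypothetical.
    public class RelocationCheck {
        public static void main(String[] args) throws Exception {
            // Hive's bundled protobuf 2.5 keeps its original package name ...
            Class<?> stock = Class.forName("com.google.protobuf.LazyStringList");
            // ... while the relocated 2.6 copy lives under a different one, so the two
            // no longer collide and the generated code links against the relocated jar.
            Class<?> relocated = Class.forName("com.amazon.gam.shaded.protobuf.LazyStringList");

            System.out.println(stock.getProtectionDomain().getCodeSource().getLocation());
            System.out.println(relocated.getProtectionDomain().getCodeSource().getLocation());
        }
    }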

Further reading: fear of dependencies.
