
Read Snappy compressed Hive RCFile in Apache Pig

I'm trying to read Hive files in Pig using HiveColumnarLoader: http://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/storage/HiveColumnarLoader.html

The files are binary and begin with the strings RCF, SnappyCodec and hive.io.rcfile.column.number. They are also partitioned across multiple directories (like /day=20140701).

However, a simple script that loads, groups and counts the rows prints nothing. If I add ILLUSTRATE like this:

rows = LOAD ... using HiveColumnarLoader ...;
ILLUSTRATE rows;

I get an error like this:

2014-07-17 14:16:43,086 [main] ERROR org.apache.pig.pen.AugmentBaseDataVisitor - No (valid) input data found!
java.lang.RuntimeException: No (valid) input data found!
    at org.apache.pig.pen.AugmentBaseDataVisitor.visit(AugmentBaseDataVisitor.java:583)
    at org.apache.pig.newplan.logical.relational.LOLoad.accept(LOLoad.java:229)
    at org.apache.pig.pen.util.PreOrderDepthFirstWalker.depthFirst(PreOrderDepthFirstWalker.java:82)
    at org.apache.pig.pen.util.PreOrderDepthFirstWalker.walk(PreOrderDepthFirstWalker.java:66)
    at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
    at org.apache.pig.pen.ExampleGenerator.getExamples(ExampleGenerator.java:180)
    at org.apache.pig.PigServer.getExamples(PigServer.java:1180)
...

I'm not sure whether this is caused by the Snappy compression or by a problem with the schema I specified (I copied it from the output of DESCRIBE on the Hive table).

Could anyone please confirm that HiveColumnarLoader works with Snappy-compressed files, or propose another approach?

Thanks in advance!

Have you tried the HCatLoader?

rows = LOAD 'tablename' using org.apache.hcatalog.pig.HCatLoader();
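
With HCatLoader the schema and partition layout come from the Hive metastore, so you don't have to copy the schema into the script by hand. Here is a rough sketch of the load-and-count script from the question. The table name is the same placeholder as above, and the `day` partition column and its string type are assumptions based on the /day=20140701 directories, so adjust them to your table. Newer Hive releases ship the loader as org.apache.hive.hcatalog.pig.HCatLoader instead.

```pig
-- Sketch only: 'tablename' is a placeholder, and the 'day' partition
-- column (assumed to be a string) is inferred from the directory names.
rows = LOAD 'tablename' USING org.apache.hcatalog.pig.HCatLoader();

-- A filter on the partition column placed right after LOAD lets HCatalog
-- read only the matching partition directories.
one_day = FILTER rows BY day == '20140701';

-- Group everything into a single bag and count it.
-- Note: COUNT skips tuples whose first field is null.
grouped = GROUP one_day ALL;
counted = FOREACH grouped GENERATE COUNT(one_day);
DUMP counted;
```

Start Pig with the -useHCatalog flag (for example `pig -useHCatalog count.pig`, where count.pig is a hypothetical file name) so the HCatalog jars are on the classpath. Reading the Snappy-compressed data should still require the Snappy codec to be available on the cluster nodes.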
