简体   繁体   中英

Not able to load data into pig on EMR from S3 bucket (parquet file)

I want to load the data from s3 bucket in Pig on EMR and my source file format is parquet:

Below command i have used:

A = LOAD 's3://test-1/icted/emp_db/emp_tb' 
USING parquet.pig.ParquetLoader(header__change_seq:chararray,header__change_oper:chararray,header__change_mask:chararray,header__stream_position:chararray,header__operation:chararray,header__transaction_id:chararray,header__timestamp:chararray,policylangaccessind_afi:chararray,loadcommandid:double,previousgroupid:double,enddate:chararray,assignedbyuserid:double,dstcd_afi:chararray);

I am not able to load data getting below are the error:

      ERROR pig.PigServer: exception during parsing: Error during parsing. <file test.pig, line 20, column 2>  mismatched input 'header__change_seq' expecting RIGHT_PAREN

Need help on this.

Couple of things:

You should be using the full class path on emr org.apache.parquet.pig.ParquetLoader() ; no need to pass it a schema, the parquet reader will infer it for you.

Make sure you are using a version of the pig code that is compatible with the version of the parquet file (parquet tools can be used to find the version of parquet used)

Just try to use most recent version https://mvnrepository.com/artifact/org.apache.parquet/parquet-pig-bundle/1.10.0

REGISTER parquet-pig-bundle-1.10.0.jar;

Missing ''

A = LOAD 's3://test-1/icted/emp_db/emp_tb' USING parquet.pig.ParquetLoader('header__change_seq:chararray,header__change_oper:chararray,header__change_mask:chararray,header__stream_position:chararray,header__operation:chararray,header__transaction_id:chararray,header__timestamp:chararray,policylangaccessind_afi:chararray,loadcommandid:double,previousgroupid:double,enddate:chararray,assignedbyuserid:double,dstcd_afi:chararray');

Or

Missing as after ParquetLoader

A = LOAD 's3://test-1/icted/emp_db/emp_tb' USING parquet.pig.ParquetLoader AS (header__change_seq:chararray,header__change_oper:chararray,header__change_mask:chararray,header__stream_position:chararray,header__operation:chararray,header__transaction_id:chararray,header__timestamp:chararray,policylangaccessind_afi:chararray,loadcommandid:double,previousgroupid:double,enddate:chararray,assignedbyuserid:double,dstcd_afi:chararray);

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM