
Unable to read data from Parquet Hive table via Spark 1.6

I am trying to read data from a Hive table stored in Parquet format, using the MapR distribution. After reading the data, any operation on the DataFrame (e.g. df.show(3)) throws java.lang.ArrayIndexOutOfBoundsException: 7. If the table storage is changed to ORC, it works.

Also, I am reading from tables in a shared cluster, so I cannot change anything in the source table.

The Hive table structure:

CREATE TABLE employee_p(
  emp_id bigint,
  f_name string,
  l_name string,
  sal double)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\u0001'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'maprfs:/user/hive/warehouse/sptest.db/employee_p'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='true',
  'numFiles'='1',
  'numRows'='4',
  'rawDataSize'='16',
  'totalSize'='699',
  'transient_lastDdlTime'='1550203019')

The Java code:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.hive.HiveContext;

    // The warehouse location is passed in as the first application argument.
    String warehouseLocation = args[0];
    String query1 = "select emp_id, f_name, l_name, sal from sptest.employee_p";

    SparkConf conf = new SparkConf().setAppName("Parquet Table");
    JavaSparkContext jsc = new JavaSparkContext(conf);
    HiveContext hc = new HiveContext(jsc);

    // Lazy: this only builds the query plan, no data is read yet.
    DataFrame df = hc.sql(query1);

    df.printSchema();
    df.show(10); // First action: the Parquet files are actually read here.

The job submit command:

    # Note: spark-submit options must precede the application JAR;
    # anything after the JAR is passed to main() as application arguments.
    $SPARK_HOME/bin/spark-submit --class com.app.hive.FetchFromParquetTable \
    --master yarn --deploy-mode cluster \
    --conf "spark.sql.parquet.writeLegacyFormat=true" \
    --conf "spark.sql.parquet.filterPushdown=false" \
    --queue myqueue \
    ${APP_HOME}/SparkTest-0.0.1-SNAPSHOT.jar maprfs:/user/hive/warehouse

The exception:

19/02/14 21:08:23 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 2, lgpbd1523.gso.aexp.com): java.lang.ArrayIndexOutOfBoundsException: 7
        at org.apache.parquet.bytes.BytesUtils.bytesToLong(BytesUtils.java:250)
        at org.apache.parquet.column.statistics.LongStatistics.setMinMaxFromBytes(LongStatistics.java:50)
        at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:255)
        at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:550)
        at org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:527)
        at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:430)
        at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
        at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:157)
        at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
        at org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:180)
        at org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:126)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)

I guess you are facing this problem while reading the table; you can confirm it from your Spark UI. Since Spark is lazy, the table is only actually read when you trigger an action, and that is the point where this exception is thrown.
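A minimal sketch of that behaviour, reusing hc and query1 from the question: nothing fails until the first action forces the Parquet files to be scanned.

    // Building the plan and printing the schema only touch
    // metastore metadata, so both of these calls succeed.
    DataFrame df = hc.sql(query1);
    df.printSchema();

    // The first action triggers the actual scan; the Parquet footer
    // and its column statistics are decoded here, and this is where
    // the ArrayIndexOutOfBoundsException surfaces.
    df.count();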

I am facing the same issue while reading a table created by Hive with Parquet and Snappy compression, using Spark 2.1.0 on a MapR distribution.

Can you try changing the emp_id datatype from bigint to string?
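For what it's worth, BytesUtils.bytesToLong reads 8 bytes (indices 0 through 7), so the ArrayIndexOutOfBoundsException: 7 suggests the min/max statistics stored for emp_id are shorter than the 8 bytes a bigint requires, i.e. the file's physical type does not match the table's declared type. A sketch of this suggestion as a Hive CTAS, assuming Hive itself can still read the file and that you are allowed to write a copy elsewhere (sptest.employee_p_str is a hypothetical name):

    -- Hypothetical copy of the table with emp_id cast to string.
    CREATE TABLE sptest.employee_p_str STORED AS PARQUET AS
    SELECT CAST(emp_id AS string) AS emp_id, f_name, l_name, sal
    FROM sptest.employee_p;

Since the source table in the question lives in a shared cluster and cannot be altered, pointing the Spark job at such a copy is about the least invasive way to test this.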
