
Reading orc file of Hive managed tables in pyspark

I am trying to read the ORC files of a managed Hive table using the PySpark code below.

spark.read.format('orc').load('hive managed table path')

When I print the schema of the fetched DataFrame, it is as follows:

root
 |-- operation: integer (nullable = true)
 |-- originalTransaction: long (nullable = true)
 |-- bucket: integer (nullable = true)
 |-- rowId: long (nullable = true)
 |-- currentTransaction: long (nullable = true)
 |-- row: struct (nullable = true)
 |    |-- col1: float (nullable = true)
 |    |-- col2: integer (nullable = true)
 |-- partition_by_column: date (nullable = true)

Now I am not able to parse this data or do any manipulation on the DataFrame. When applying an action like show(), I get an error saying:

java.lang.IllegalArgumentException: Include vector the wrong length

Did someone face the same issue? If yes, can you please suggest how to resolve it?

It's a known issue.

You get that error because you're trying to read a Hive ACID table, but Spark does not yet support reading ACID table files directly.

Maybe you can export your Hive table to plain (non-transactional) ORC files and then read them with Spark, or try alternatives like the Hive JDBC connector as described here.

As I am not sure about the versions, you can try other ways to load the ORC file.

Using SQLContext

val df = sqlContext.read.format("orc").load(orcfile)

OR

val df = spark.read.option("inferSchema", true).orc("filepath")

OR Spark SQL (recommended)

import spark.sql
sql("SELECT * FROM table_name").show()
