
Reading orc file of Hive managed tables in pyspark

I am trying to read the ORC files of a managed Hive table using the PySpark code below.

spark.read.format('orc').load('hive managed table path')

When I print the schema of the fetched DataFrame, it is as follows:

root
 |-- operation: integer (nullable = true)
 |-- originalTransaction: long (nullable = true)
 |-- bucket: integer (nullable = true)
 |-- rowId: long (nullable = true)
 |-- currentTransaction: long (nullable = true)
 |-- row: struct (nullable = true)
 |    |-- col1: float (nullable = true)
 |    |-- col2: integer (nullable = true)
 |-- partition_by_column: date (nullable = true)

Now I am not able to parse this data or do any manipulation on the DataFrame. When applying an action like show(), I get an error saying:

java.lang.IllegalArgumentException: Include vector the wrong length

Did someone face the same issue? If yes, can you please suggest how to resolve it?

It's a known issue.

You get that error because you're trying to read a Hive ACID table, but Spark does not yet support reading ACID table files directly.

Maybe you can export your Hive table to plain (non-transactional) ORC files and then read them with Spark, or try alternatives like the Hive JDBC connector as described here.

As I am not sure about the versions, you can try other ways to load the ORC file.

Using SQLContext

val df = sqlContext.read.format("orc").load(orcfile)

OR

val df = spark.read.option("inferSchema", true).orc("filepath")

OR Spark SQL (recommended)

import spark.sql
sql("SELECT * FROM table_name").show()
