
spark-shell: load an existing Hive table by partition?

In spark-shell, how do I load an existing Hive table, but only one of its partitions?

val df = spark.read.format("orc").load("mytable")

I'm looking for a way to load only one particular partition of this table.

Thanks!

There is no direct way to do this with spark.read.format, but you can use a where condition:

val df = spark.read.format("orc").load("mytable").where(yourparitioncolumn)

Nothing is loaded until you perform an action, since load (pointing to your ORC file location) is just a function on DataFrameReader, as shown below; it doesn't read any data until an action is triggered.

See DataFrameReader:

 def load(paths: String*): DataFrame = {
...
}

In the code above (spark.read ... .where), the where call just records the filter condition; specifying it still doesn't load any data immediately. :-)

When you call an action such as df.count, the filter on your partition column is applied to the ORC data paths, so only the matching partition directories are scanned.
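
For example, you can check in spark-shell that the partition filter is pushed down to the file scan. This is a rough sketch: the path /warehouse/mytable and the partition column dt are placeholders, not from the question, and pruning only applies if the directory layout is Hive-style (e.g. dt=2020-01-01).

// Lazy: nothing is read at this point.
val df = spark.read.format("orc").load("/warehouse/mytable")

// Still lazy: the filter on the partition column is only recorded.
val partDf = df.where("dt = '2020-01-01'")

// explain() shows PartitionFilters in the FileScan node, confirming that
// only the matching partition directory will be scanned when an action runs.
partDf.explain()

// The actual read (of just that partition) happens here.
println(partDf.count())
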

There is no function in the Spark API to load only a partition directory, but there is another way around this: a partition directory is nothing but a column you can filter on in a where clause. You can write a simple SQL query with the partition column in the where clause, and it will read data only from that partition directory. See if that works for you.

val df = spark.sql("SELECT * FROM mytable WHERE <partition_col_name> = <expected_value>")
