
Spark: load a textfile-based Hive table as a DataFrame (Scala)

Suppose we store a table as a textfile in Hive. The table has two columns: id and groupid.

The HDFS storage layout looks like this (groupid is also the partition column):

../groupid=1/1
../groupid=2/2
../groupid=3/3
...

Each textfile (1, 2, 3, ...) stores a list of ids.

For example, the content of file 1 is:

123
2358
3456
... 

Is it possible to read this table as a DataFrame?

The resulting DataFrame should be:

groupid | id
1       | 123
1       | 2358
1       | 3456
2       | ...
2       | ...
3       | ...
...     | ...

Using spark-sql is not an option, because the table has a massive number of partitions.

By default, Spark recognizes Hive-style partitioning as soon as you pass a basePath as an option. Assuming your groupid directories are located under "/AA/BB/CC", you can list the records:

val basePath = "/AA/BB/CC"
// read every partition directory; with basePath set, Spark adds groupid as a column
val df = spark.read.option("basePath", basePath).csv(basePath + "/group*")
df.show()
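To get exactly the groupid | id layout from the question, the discovered partition column just needs to be combined with a rename of the default CSV column. Below is a minimal self-contained sketch under some assumptions not in the original answer: the app name, the local master (for testing), the rename of Spark's default unnamed CSV column _c0 to id, and the cast of id to long.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ReadPartitionedIds {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-partitioned-ids") // hypothetical app name
      .master("local[*]")              // for local testing; drop on a cluster
      .getOrCreate()

    val basePath = "/AA/BB/CC" // root of the partitioned table, as in the answer

    // basePath tells Spark where partition discovery starts, so the
    // Hive-style directory names (groupid=1, groupid=2, ...) become a
    // regular "groupid" column. Each line of each text file becomes one
    // row with a single CSV column, which Spark names _c0 by default.
    val df = spark.read
      .option("basePath", basePath)
      .csv(basePath + "/groupid=*")
      .withColumnRenamed("_c0", "id")
      .withColumn("id", col("id").cast("long")) // assumption: ids are numeric
      .select("groupid", "id")

    df.printSchema()
    df.show()

    spark.stop()
  }
}

The rename is needed because the files have no header row; the cast is optional and only matters if you want id as a numeric type rather than the string the CSV reader infers.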
