Suppose we store a table as a text file in Hive. The table has two columns: id and groupid.
The HDFS storage path looks like this (groupid is also the partition column):
../groupid=1/1
../groupid=2/2
../groupid=3/3
...
Each text file (1, 2, 3, ...) stores a list of ids.
For example, the content of file 1 is:
123
2358
3456
...
Is it possible to read this table as a DataFrame?
The resulting DataFrame should be:
groupid | id
1 | 123
1 | 2358
1 | 3456
2 | ...
2 | ...
3 | ...
... | ...
Querying it through spark-sql is not an option, because the table has a massive number of partitions.
By default, Spark recognizes Hive-style partitioning as soon as you pass a basePath option. Assuming your groupid directories are located under "/AA/BB/CC", you can read the records like this:
val basePath = "/AA/BB/CC"
// groupid is discovered as a partition column; the file content lands in _c0
val df = spark.read.option("basePath", basePath).csv(basePath + "/groupid=*")
df.show()