
Spark: load a textfile-based Hive table as a DataFrame (Scala)

Suppose we store a table as a textfile in Hive. The table has two columns: id and groupid.

The HDFS storage layout looks like this (groupid is also the partition column):

../groupid=1/1
../groupid=2/2
../groupid=3/3
...

Each textfile (1, 2, 3, ...) stores a list of ids.

For example, the content of file 1 is:

123
2358
3456
... 

Is it possible to read this table as a DataFrame?

The resulting DataFrame should be:

groupid | id
1       | 123
1       | 2358
1       | 3456
2       | ...
2       | ...
3       | ...
...     | ...

Using spark-sql is not an option, because the table has a massive number of partitions.

By default, Spark recognizes Hive-style partitioning as soon as you pass a basePath as an option. Assuming your groupid directories are located under "/AA/BB/CC", you can list the records:

val basePath = "/AA/BB/CC"
// read every partition directory; with basePath set, Spark adds groupid as a column
val df = spark.read.option("basePath", basePath).csv(basePath + "/group*")
df.show()
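To get exactly the groupid | id layout from the question, the discovered partition column just needs to be combined with a rename of the default CSV column. Below is a minimal self-contained sketch under some assumptions not in the original answer: the app name, the local master (for testing), the rename of Spark's default unnamed CSV column _c0 to id, and the cast of id to long.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ReadPartitionedIds {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-partitioned-ids") // hypothetical app name
      .master("local[*]")              // for local testing; drop on a cluster
      .getOrCreate()

    val basePath = "/AA/BB/CC" // root of the partitioned table, as in the answer

    // basePath tells Spark where partition discovery starts, so the
    // Hive-style directory names (groupid=1, groupid=2, ...) become a
    // regular "groupid" column. Each line of each text file becomes one
    // row with a single CSV column, which Spark names _c0 by default.
    val df = spark.read
      .option("basePath", basePath)
      .csv(basePath + "/groupid=*")
      .withColumnRenamed("_c0", "id")
      .withColumn("id", col("id").cast("long")) // assumption: ids are numeric
      .select("groupid", "id")

    df.printSchema()
    df.show()

    spark.stop()
  }
}

The rename is needed because the files have no header row; the cast is optional and only matters if you want id as a numeric type rather than the string the CSV reader infers.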
