I'm new to Spark. I'm loading a huge CSV file into a DataFrame using the code below:
Dataset<Row> df = sqlContext.read().format("com.databricks.spark.csv").schema(customSchema)
.option("delimiter", "|").option("header", true).load(inputDataPath);
After loading the CSV data into the DataFrame, I want to iterate through each row and, based on some columns, query a PostgreSQL DB (performing some geometry operation). Later I want to merge some fields returned from the DB with the DataFrame records. What's the best way to do this, considering the huge number of records? Any help appreciated. I'm using Java.
As @mck also pointed out, the best way is to use a join. With Spark you can read an external JDBC table using the DataFrame API, for example:
val props = Map(....)
spark.read.format("jdbc").options(props).load()
See the DataFrameReader Scaladoc for more options and for which properties and values you need to set.

Then use a join to merge the fields.
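Since the question is in Java, here is a minimal sketch of the same approach with the Java API. The JDBC URL, credentials, table name, and the join key column `id` are all assumptions; substitute your own values:

```java
import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CsvJdbcJoin {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("csv-jdbc-join")
                .getOrCreate();

        // CSV DataFrame as in the question (path and schema are placeholders).
        Dataset<Row> csvDf = spark.read()
                .format("csv")
                .option("delimiter", "|")
                .option("header", true)
                .load("/path/to/input");

        // Read the PostgreSQL table through JDBC. Connection details below
        // are placeholders -- use your own host, database, and credentials.
        Properties props = new Properties();
        props.setProperty("user", "dbuser");
        props.setProperty("password", "dbpass");
        props.setProperty("driver", "org.postgresql.Driver");

        Dataset<Row> pgDf = spark.read()
                .jdbc("jdbc:postgresql://localhost:5432/mydb", "geometry_table", props);

        // Join on the shared key column ("id" here is hypothetical) so that
        // the DB fields are merged into the CSV records, then drop the
        // duplicate key column from the right side.
        Dataset<Row> merged = csvDf
                .join(pgDf, csvDf.col("id").equalTo(pgDf.col("id")), "left")
                .drop(pgDf.col("id"));

        merged.show();
        spark.stop();
    }
}
```

Spark will distribute the join across the cluster, which scales far better than issuing one PostgreSQL query per row from a loop.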