
Spark read CSV file using Data Frame and query from PostgreSQL DB

I'm new to Spark. I'm loading a huge CSV file into a DataFrame using the code given below:

Dataset<Row> df = sqlContext.read().format("com.databricks.spark.csv").schema(customSchema)
                .option("delimiter", "|").option("header", true).load(inputDataPath);

After loading the CSV data into the DataFrame, I want to iterate through each row and, based on some columns, query a PostgreSQL DB (performing some geometry operations). Later I want to merge some fields returned from the DB with the DataFrame records. What's the best way to do this, considering the huge number of records? Any help is appreciated. I'm using Java.

As @mck also pointed out, the best way is to use a join. With Spark you can read an external JDBC table using the DataFrame API, for example:

val props = Map(....)
spark.read.format("jdbc").options(props).load()

See the DataFrameReader Scaladoc for more options and for the properties and values you need to set.
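Since you're in Java, a minimal sketch of the same JDBC read might look like the one below; the SparkSession setup, connection URL, database, table name (geo_table), and credentials are placeholder assumptions, not values from your setup:

import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("csv-postgres-join").getOrCreate();

// Connection properties for PostgreSQL (placeholder credentials)
Properties props = new Properties();
props.setProperty("user", "postgres");
props.setProperty("password", "secret");
props.setProperty("driver", "org.postgresql.Driver");

// Read the PostgreSQL table as a DataFrame; URL and table name are assumptions
Dataset<Row> geoDf = spark.read()
        .jdbc("jdbc:postgresql://localhost:5432/mydb", "geo_table", props);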

Then use a join to merge the fields, as sketched below.
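In Java the merge could look roughly like this; the join key "id" and the returned "geometry" column are assumptions about your schema:

// Join the CSV DataFrame with the JDBC-backed DataFrame on an assumed key column "id",
// then keep all CSV columns plus the geometry field returned from PostgreSQL.
Dataset<Row> merged = df.join(geoDf, df.col("id").equalTo(geoDf.col("id")))
        .select(df.col("*"), geoDf.col("geometry"));

This way Spark performs one distributed join instead of issuing a separate query per row, which is what makes the approach scale to a huge number of records.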
