I'm new to Spark. I'm loading a huge CSV file into a DataFrame using the code below:
Dataset<Row> df = sqlContext.read().format("com.databricks.spark.csv").schema(customSchema)
.option("delimiter", "|").option("header", true).load(inputDataPath);
After loading the CSV data into the DataFrame, I want to iterate through each row and, based on some columns, query a PostgreSQL DB (performing some geometry operation). Later I want to merge some fields returned from the DB with the DataFrame records. What's the best way to do this, considering the huge number of records? Any help appreciated. I'm using Java.
As @mck also pointed out, the best way is to use a join. With Spark you can read an external JDBC table using the DataFrame API, for example:
val props = Map(....)
spark.read.format("jdbc").options(props).load()
See the DataFrameReader Scaladoc for more options and for which properties and values you need to set.

Then use a join to merge the fields.
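Since the question is in Java, here is a minimal sketch of the same approach with the Java API. The JDBC URL, credentials, table name, and the join key column `id` are all assumptions; substitute your own values:

```java
import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CsvJdbcJoin {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("csv-jdbc-join")
                .getOrCreate();

        // CSV DataFrame as in the question (path and schema are placeholders).
        Dataset<Row> csvDf = spark.read()
                .format("csv")
                .option("delimiter", "|")
                .option("header", true)
                .load("/path/to/input");

        // Read the PostgreSQL table through JDBC. Connection details below
        // are placeholders -- use your own host, database, and credentials.
        Properties props = new Properties();
        props.setProperty("user", "dbuser");
        props.setProperty("password", "dbpass");
        props.setProperty("driver", "org.postgresql.Driver");

        Dataset<Row> pgDf = spark.read()
                .jdbc("jdbc:postgresql://localhost:5432/mydb", "geometry_table", props);

        // Join on the shared key column ("id" here is hypothetical) so that
        // the DB fields are merged into the CSV records, then drop the
        // duplicate key column from the right side.
        Dataset<Row> merged = csvDf
                .join(pgDf, csvDf.col("id").equalTo(pgDf.col("id")), "left")
                .drop(pgDf.col("id"));

        merged.show();
        spark.stop();
    }
}
```

Spark will distribute the join across the cluster, which scales far better than issuing one PostgreSQL query per row from a loop.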