Scala/Spark: How to do an outer join based on common columns
I have 2 data frames:
The first data frame contains temperature information.
The second data frame contains precipitation information.
I read the files and created the data frames as follows:
val dataRecordsTemp = sc.textFile(tempFile).map { rec =>
  val splittedRec = rec.split("\\s+")
  Temparature(splittedRec(0), splittedRec(1), splittedRec(2), splittedRec(3), splittedRec(4))
}.map { x => Row.fromSeq(x.getDataFields()) }

val headerFieldsForTemp = Seq("YEAR", "MONTH", "DAY", "MAX_TEMP", "MIN_TEMP")
val schemaTemp = StructType(headerFieldsForTemp.map { f => StructField(f, StringType, nullable = true) })
val dfTemp = session.createDataFrame(dataRecordsTemp, schemaTemp)
  .orderBy(desc("year"), desc("month"), desc("day"))

println("Printing temperature data ...............................")
dfTemp.select("YEAR", "MONTH", "DAY", "MAX_TEMP", "MIN_TEMP").take(10).foreach(println)
val dataRecordsPrecip = sc.textFile(precipFile).map { rec =>
  val splittedRec = rec.split("\\s+")
  Precipitation(splittedRec(0), splittedRec(1), splittedRec(2), splittedRec(3), splittedRec(4), splittedRec(5))
}.map { x => Row.fromSeq(x.getDataFields()) }

val headerFieldsForPrecipitation = Seq("YEAR", "MONTH", "DAY", "PRECIPITATION", "SNOW", "SNOW_COVER")
val schemaPrecip = StructType(headerFieldsForPrecipitation.map { f => StructField(f, StringType, nullable = true) })
val dfPrecip = session.createDataFrame(dataRecordsPrecip, schemaPrecip)
  .orderBy(desc("year"), desc("month"), desc("day"))

println("Printing precipitation data ...............................")
dfPrecip.select("YEAR", "MONTH", "DAY", "PRECIPITATION", "SNOW", "SNOW_COVER").take(10).foreach(println)
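The snippets above call Temparature(...), Precipitation(...) and getDataFields(), which are not shown in the question. A minimal sketch of what those case classes might look like, inferred purely from the usage above (the field names and the getDataFields helper are assumptions):

```scala
// Hypothetical case classes matching the five/six whitespace-split fields.
// Field names are guesses based on the column headers used in the schemas.
case class Temparature(year: String, month: String, day: String,
                       maxTemp: String, minTemp: String) {
  // Returns the fields in schema order so Row.fromSeq lines up with headerFieldsForTemp.
  def getDataFields(): Seq[String] = Seq(year, month, day, maxTemp, minTemp)
}

case class Precipitation(year: String, month: String, day: String,
                         precipitation: String, snow: String, snowCover: String) {
  def getDataFields(): Seq[String] = Seq(year, month, day, precipitation, snow, snowCover)
}
```

Keeping everything as String matches the all-StringType schemas in the question, though numeric fields could just as well be typed and formatted later.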
I need to join the two datasets on the common columns (year, month, day). The input files have headers, and the output should have a header as well. The first file contains temperature information, for example:
year month day min-temp max-temp
2017 12 13 13 25
2017 12 16 25 32
2017 12 25 34 56
The second file contains precipitation information, for example:
year month day precipitation snow snow-cover
2018 7 6 0.00 0.0 0
2017 12 13 0.04 0.0 0
2017 12 16 0.4 0.04 1
My expected output should be (sorted by date in ascending order, with a blank wherever a value is not found):
year month day min-temp max-temp precipitation snow snow-cover
2017 12 13 13 25 0.04 0.0 0
2017 12 16 25 32 0.4 0.04 1
2017 12 25 34 56
2018 7 6 0.00 0.0 0
Can I get some help with this in Scala?
You need an outer join of the two datasets, and then you can order the result, like this:
import org.apache.spark.sql.functions._
dfTemp
  .join(dfPrecip, Seq("year", "month", "day"), "outer")
  .orderBy(desc("year"), desc("month"), desc("day"))
  .na.fill("")
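One caveat: the schemas declare every column as StringType, so orderBy compares lexicographically, and month "7" sorts after month "12". A self-contained illustration of the pitfall with plain Scala collections (no Spark needed):

```scala
// Month values as strings, as they appear under the all-StringType schema.
val months = Seq("12", "7", "1")

// Lexicographic ordering: '7' > '1' at the first character, so "7" sorts after "12".
val asStrings = months.sorted
println(asStrings) // List(1, 12, 7)

// Cast to Int first for the intended calendar ordering.
val asInts = months.map(_.toInt).sorted
println(asInts) // List(1, 7, 12)
```

In Spark the corresponding fix would be to cast before sorting, e.g. .orderBy(col("year").cast("int"), col("month").cast("int"), col("day").cast("int")).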
If you don't need blank values and can work with null, you can drop the .na.fill("").
Hope this helps!