![](/img/trans.png)
[英]Spark Scala dynamic comparing of all similar dataframe columns after JOIN
[英]Why outer join does not preserve all mentioned columns after data frame join in scala spark?
我有兩個要在其上執行外部聯接的數據框。 數據框1數據集是這樣的
Source.organizationId|^|Source.sourceId|^|FilingDateTime|^|SourceTypeCode|^|DocumentId|^|Dcn|^|DocFormat|^|StatementDate|^|IsFilingDateTimeEstimated|^|ContainsPreliminaryData|^|CapitalChangeAdjustmentDate|^|CumulativeAdjustmentFactor|^|ContainsRestatement|^|FilingDateTimeUTCOffset|^|ThirdPartySourceCode|^|ThirdPartySourcePriority|^|SourceTypeId|^|ThirdPartySourceCodeId|^|FFAction|!|
4295876791|^|162|^|2017-08-10T06:01:00Z|^|YUH|^|44604379|^|yo00196838|^|PDFNTV|^|2017-06-30T00:00:00Z|^|False|^|False|^|2017-06-30T00:00:00Z|^|1.00000|^|False|^|540|^|SS |^|1|^|3013057|^|1000716240|^|I|!|
4295877415|^|167|^|2005-03-31T00:00:00Z|^|ESGWEB|^||^||^||^|2005-03-31T00:00:00Z|^|True|^|False|^||^||^|False|^|0|^|ESG|^||^|1002198005|^||^|I|!|
4295877415|^|168|^|2010-03-31T00:00:00Z|^|ESGWEB|^||^||^||^|2010-03-31T00:00:00Z|^|True|^|False|^||^||^|False|^|0|^|ESG|^||^|1002198005|^||^|I|!|
4295877415|^|169|^|2007-03-31T00:00:00Z|^|ESGWEB|^||^||^||^|2007-03-31T00:00:00Z|^|True|^|False|^||^||^|False|^|0|^|ESG|^||^|1002198005|^||^|I|!|
4295877415|^|170|^|2014-12-31T00:00:00Z|^|ESGWEB|^||^||^||^|2014-12-31T00:00:00Z|^|True|^|False|^||^||^|False|^|0|^|ESG|^||^|1002198005|^||^|I|!|
4295877415|^|171|^|2012-12-31T00:00:00Z|^|ESGWEB|^||^||^||^|2012-12-31T00:00:00Z|^|True|^|False|^||^||^|False|^|0|^|ESG|^||^|1002198005|^||^|I|!|
4295877415|^|172|^|2009-03-31T00:00:00Z|^|ESGWEB|^||^||^||^|2009-03-31T00:00:00Z|^|True|^|False|^||^||^|False|^|0|^|ESG|^||^|1002198005|^||^|I|!|
4295877415|^|194|^|2015-03-30T00:00:00Z|^|ESGWEB|^||^||^||^|2013-01-31T00:00:00Z|^|True|^|False|^||^||^|False|^|0|^|ESG|^||^|1002198005|^||^|I|!|
4295877415|^|195|^|2008-05-06T00:00:00Z|^|ESGWEB|^||^||^||^|2008-01-31T00:00:00Z|^|True|^|False|^||^||^|False|^|0|^|ESG|^||^|1002198005|^||^|I|!|
4295877415|^|214|^|2012-03-08T00:00:00Z|^|ESGWEB|^||^||^||^|2011-01-31T00:00:00Z|^|True|^|False|^||^||^|False|^|0|^|ESG|^||^|1002198005|^||^|I|!|
4295877415|^|215|^|2004-06-30T00:00:00Z|^|ESGWEB|^||^||^||^|2004-01-01T00:00:00Z|^|True|^|False|^||^||^|False|^|0|^|ESG|^||^|1002198005|^||^|I|!|
4295877415|^|216|^|2012-06-25T00:00:00Z|^|ESGWEB|^||^||^||^|2011-01-31T00:00:00Z|^|True|^|False|^||^||^|False|^|0|^|ESG|^||^|1002198005|^||^|I|!|
4295877415|^|217|^|2014-01-14T00:00:00Z|^|ESGWEB|^||^||^||^|2014-01-05T00:00:00Z|^|True|^|False|^||^||^|False|^|0|^|ESG|^||^|1002198005|^||^|I|!|
4295877415|^|218|^|2008-05-09T00:00:00Z|^|ESGWEB|^||^||^||^|2007-01-31T00:00:00Z|^|True|^|False|^||^||^|False|^|0|^|ESG|^||^|1002198005|^||^|I|!|
4295877415|^|219|^|2010-12-09T00:00:00Z|^|ESGWEB|^||^||^||^|2011-01-31T00:00:00Z|^|True|^|False|^||^||^|False|^|0|^|ESG|^||^|1002198005|^||^|I|!|
4295877415|^|220|^|2011-06-29T00:00:00Z|^|ESGWEB|^||^||^||^|2011-01-31T00:00:00Z|^|True|^|False|^||^||^|False|^|0|^|ESG|^||^|1002198005|^||^|I|!|
4295877415|^|221|^|2013-06-29T00:00:00Z|^|ESGWEB|^||^||^||^|2014-01-05T00:00:00Z|^|True|^|False|^||^||^|False|^|0|^|ESG|^||^|1002198005|^||^|I|!|
4295877415|^|222|^|2015-02-23T00:00:00Z|^|ESGWEB|^||^||^||^|2014-01-05T00:00:00Z|^|True|^|False|^||^||^|False|^|0|^|ESG|^||^|1002198005|^||^|I|!|
4295877415|^|223|^|2013-05-31T00:00:00Z|^|ESGWEB|^||^||^||^|2014-01-05T00:00:00Z|^|True|^|False|^||^||^|False|^|0|^|ESG|^||^|1002198005|^||^|I|!|
4295877415|^|224|^|2012-03-20T00:00:00Z|^|ESGWEB|^||^||^||^|2011-01-31T00:00:00Z|^|True|^|False|^||^||^|False|^|0|^|ESG|^||^|1002198005|^||^|I|!|
4295877415|^|229|^|2015-12-31T00:00:00Z|^|ESGWEB|^||^||^||^|2015-12-31T00:00:00Z|^|True|^|False|^||^|1.00000|^|False|^|0|^|ATD|^||^|1002198005|^||^|I|!|
數據框2如下
DataPartition_1|^|TimeStamp|^|Source.organizationId|^|Source.sourceId|^|FilingDateTime_1|^|SourceTypeCode_1|^|DocumentId_1|^|Dcn_1|^|DocFormat_1|^|StatementDate_1|^|IsFilingDateTimeEstimated_1|^|ContainsPreliminaryData_1|^|CapitalChangeAdjustmentDate_1|^|CumulativeAdjustmentFactor_1|^|ContainsRestatement_1|^|FilingDateTimeUTCOffset_1|^|ThirdPartySourceCode_1|^|ThirdPartySourcePriority_1|^|SourceTypeId_1|^|ThirdPartySourceCodeId_1|^|FFAction_1
SelfSourcedPublic|^|1512723204932|^|4295859031|^|59|^|2017-04-04T18:00:00+00:00|^|10Q|^|null|^|null|^|null|^|2017-03-31T00:00:00+00:00|^|false|^|false|^|2017-03-31T00:00:00+00:00|^|1.00000|^|false|^|-360|^|SS|^|1|^|3011836|^|1000716240|^|I|!|
這是兩個數據框的架構
第一個架構根
|-- Source_organizationId: long (nullable = true)
|-- Source_sourceId: integer (nullable = true)
|-- FilingDateTime: string (nullable = true)
|-- SourceTypeCode: string (nullable = true)
|-- DocumentId: integer (nullable = true)
|-- Dcn: string (nullable = true)
|-- DocFormat: string (nullable = true)
|-- StatementDate: string (nullable = true)
|-- IsFilingDateTimeEstimated: boolean (nullable = true)
|-- ContainsPreliminaryData: boolean (nullable = true)
|-- CapitalChangeAdjustmentDate: string (nullable = true)
|-- CumulativeAdjustmentFactor: string (nullable = true)
|-- ContainsRestatement: boolean (nullable = true)
|-- FilingDateTimeUTCOffset: integer (nullable = true)
|-- ThirdPartySourceCode: string (nullable = true)
|-- ThirdPartySourcePriority: integer (nullable = true)
|-- SourceTypeId: integer (nullable = true)
|-- ThirdPartySourceCodeId: integer (nullable = true)
|-- FFAction: string (nullable = true)
|-- DataPartition: string (nullable = true)
Second schema root
|-- DataPartition_1: string (nullable = true)
|-- Source_organizationId: long (nullable = true)
|-- Source_sourceId: integer (nullable = true)
|-- FilingDateTime_1: string (nullable = true)
|-- SourceTypeCode_1: string (nullable = true)
|-- DocumentId_1: string (nullable = true)
|-- Dcn_1: string (nullable = true)
|-- DocFormat_1: string (nullable = true)
|-- StatementDate_1: string (nullable = true)
|-- IsFilingDateTimeEstimated_1: boolean (nullable = true)
|-- ContainsPreliminaryData_1: boolean (nullable = true)
|-- CapitalChangeAdjustmentDate_1: string (nullable = true)
|-- CumulativeAdjustmentFactor_1: string (nullable = true)
|-- ContainsRestatement_1: boolean (nullable = true)
|-- FilingDateTimeUTCOffset_1: integer (nullable = true)
|-- ThirdPartySourceCode_1: string (nullable = true)
|-- ThirdPartySourcePriority_1: integer (nullable = true)
|-- SourceTypeId_1: integer (nullable = true)
|-- ThirdPartySourceCodeId_1: integer (nullable = true)
|-- FFAction_1: string (nullable = true)
現在,當我執行外部聯接時,在輸出字段中缺少幾列
這是示例輸出
Source.organizationId|^|Source.sourceId|^|FilingDateTime|^|SourceTypeCode|^|DocumentId|^|Dcn|^|DocFormat|^|StatementDate|^|IsFilingDateTimeEstimated|^|ContainsPreliminaryData|^|CapitalChangeAdjustmentDate|^|CumulativeAdjustmentFactor|^|ContainsRestatement|^|FilingDateTimeUTCOffset|^|ThirdPartySourceCode|^|ThirdPartySourcePriority|^|SourceTypeId|^|ThirdPartySourceCodeId|^|FFAction|!|
4295877415|^|217|^|2014-01-14T00:00:00Z|^|ESGWEB|^||^||^||^|2014-01-05T00:00:00Z|^|true|^|false|^||^||^|false|^|0|^|ESG|^|1002198005|^|I|!|
4295877415|^|171|^|2012-12-31T00:00:00Z|^|ESGWEB|^||^||^||^|2012-12-31T00:00:00Z|^|true|^|false|^||^||^|false|^|0|^|ESG|^|1002198005|^|I|!|
4295877415|^|167|^|2005-03-31T00:00:00Z|^|ESGWEB|^||^||^||^|2005-03-31T00:00:00Z|^|true|^|false|^||^||^|false|^|0|^|ESG|^|1002198005|^|I|!|
4295877415|^|219|^|2010-12-09T00:00:00Z|^|ESGWEB|^||^||^||^|2011-01-31T00:00:00Z|^|true|^|false|^||^||^|false|^|0|^|ESG|^|1002198005|^|I|!|
所以這里在第一個數據幀中空白的地方缺少ThirdPartySourceCodeId
和ThirdPartySourcePriority
,例如在第一個數據的第二行中。
在第一個數據幀中有19列,但在輸出中我只有17列。
這是生成輸出的完整代碼
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.{ SparkConf, SparkContext }
import java.sql.{Date, Timestamp}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.functions.input_file_name
import org.apache.spark.sql.functions.regexp_extract
import org.apache.spark.sql.functions._
val get_cus_val = spark.udf.register("get_cus_val", (filePath: String) => filePath.split("\\.")(3))
val df = sqlContext.read.format("csv").option("header", "true").option("delimiter", "|").option("inferSchema","true").load("s3://trfsmallfffile/FinancialSource/MAIN")
val df1With_ = df.toDF(df.columns.map(_.replace(".", "_")): _*)
val column_to_keep = df1With_.columns.filter(v => (!v.contains("^") && !v.contains("!") && !v.contains("_c"))).toSeq
val df1result = df1With_.select(column_to_keep.head, column_to_keep.tail: _*)
val df1resultFinal=df1result.withColumn("DataPartition", get_cus_val(input_file_name))
val df1resultFinalwithTimestamp=df1resultFinal
.withColumn("FilingDateTime",date_format(col("FilingDateTime"), "yyyy-MM-dd'T'HH:mm:ss'Z'"))
.withColumn("StatementDate",date_format(col("StatementDate"), "yyyy-MM-dd'T'HH:mm:ss'Z'"))
.withColumn("CapitalChangeAdjustmentDate",date_format(col("CapitalChangeAdjustmentDate"), "yyyy-MM-dd'T'HH:mm:ss'Z'"))
.withColumn("CumulativeAdjustmentFactor", format_number(col("CumulativeAdjustmentFactor"), 5))
val df2 = sqlContext.read.format("csv").option("header", "true").option("delimiter", "|").option("inferSchema","true").load("s3://trfsmallfffile/FinancialSource/INCR")
val df2With_ = df2.toDF(df2.columns.map(_.replace(".", "_")): _*)
val df2column_to_keep = df2With_.columns.filter(v => (!v.contains("^") && !v.contains("!") && !v.contains("_c"))).toSeq
val df2result = df2With_.select(df2column_to_keep.head, df2column_to_keep.tail: _*)
val df2resultTimestamp=df2result
.withColumn("FilingDateTime_1",date_format(col("FilingDateTime_1"), "yyyy-MM-dd'T'HH:mm:ss'Z'"))
.withColumn("StatementDate_1",date_format(col("StatementDate_1"), "yyyy-MM-dd'T'HH:mm:ss'Z'"))
.withColumn("CapitalChangeAdjustmentDate_1",date_format(col("CapitalChangeAdjustmentDate_1"), "yyyy-MM-dd'T'HH:mm:ss'Z'"))
.withColumn("CumulativeAdjustmentFactor_1", format_number(col("CumulativeAdjustmentFactor_1"), 5))
import org.apache.spark.sql.expressions._
val windowSpec = Window.partitionBy("Source_organizationId", "Source_sourceId").orderBy($"TimeStamp".cast(LongType).desc)
val latestForEachKey = df2resultTimestamp.withColumn("rank", rank().over(windowSpec)).filter($"rank" === 1).drop("rank", "TimeStamp")
val dfMainOutput = df1resultFinalwithTimestamp.join(latestForEachKey, Seq("Source_organizationId", "Source_sourceId"), "outer")
.select($"Source_organizationId", $"Source_sourceId",
when($"FilingDateTime_1".isNotNull, $"FilingDateTime_1").otherwise($"FilingDateTime").as("FilingDateTime"),
when($"SourceTypeCode_1".isNotNull, $"SourceTypeCode_1").otherwise($"SourceTypeCode").as("SourceTypeCode"),
when($"DocumentId_1".isNotNull, $"DocumentId_1").otherwise($"DocumentId").as("DocumentId"),
when($"Dcn_1".isNotNull, $"Dcn_1").otherwise($"Dcn").as("Dcn"),
when($"DocFormat_1".isNotNull, $"DocFormat_1").otherwise($"DocFormat").as("DocFormat"),
when($"StatementDate_1".isNotNull, $"StatementDate_1").otherwise($"StatementDate").as("StatementDate"),
when($"IsFilingDateTimeEstimated_1".isNotNull, $"IsFilingDateTimeEstimated_1").otherwise($"IsFilingDateTimeEstimated").as("IsFilingDateTimeEstimated"),
when($"ContainsPreliminaryData_1".isNotNull, $"ContainsPreliminaryData_1").otherwise($"ContainsPreliminaryData").as("ContainsPreliminaryData"),
when($"CapitalChangeAdjustmentDate_1".isNotNull, $"CapitalChangeAdjustmentDate_1").otherwise($"CapitalChangeAdjustmentDate").as("CapitalChangeAdjustmentDate"),
when($"CumulativeAdjustmentFactor_1".isNotNull, $"CumulativeAdjustmentFactor_1").otherwise($"CumulativeAdjustmentFactor").as("CumulativeAdjustmentFactor"),
when($"ContainsRestatement_1".isNotNull, $"ContainsRestatement_1").otherwise($"ContainsRestatement").as("ContainsRestatement"),
when($"FilingDateTimeUTCOffset_1".isNotNull, $"FilingDateTimeUTCOffset_1").otherwise($"FilingDateTimeUTCOffset").as("FilingDateTimeUTCOffset"),
when($"ThirdPartySourceCode_1".isNotNull, $"ThirdPartySourceCode_1").otherwise($"ThirdPartySourceCode").as("ThirdPartySourceCode"),
when($"ThirdPartySourcePriority_1".isNotNull, $"ThirdPartySourcePriority_1").otherwise($"ThirdPartySourcePriority").as("ThirdPartySourcePriority"),
when($"SourceTypeId_1".isNotNull, $"SourceTypeId_1").otherwise($"SourceTypeId").as("SourceTypeId"),
when($"ThirdPartySourceCodeId_1".isNotNull, $"ThirdPartySourceCodeId_1").otherwise($"ThirdPartySourceCodeId").as("ThirdPartySourceCodeId"),
when($"FFAction_1".isNotNull, concat(col("FFAction_1"), lit("|!|"))).otherwise(concat(col("FFAction"), lit("|!|"))).as("FFAction"),
when($"DataPartition_1".isNotNull, $"DataPartition_1").otherwise($"DataPartition").as("DataPartition"))
.filter(!$"FFAction".contains("D"))
val dfMainOutputFinal = dfMainOutput.na.fill("").select($"DataPartition",concat_ws("|^|", dfMainOutput.schema.fieldNames.filter(_ != "DataPartition").map(c => col(c)): _*).as("concatenated"))
val headerColumn = df.columns.filter(v => (!v.contains("^") && !v.contains("_c"))).toSeq
val header = headerColumn.dropRight(1).mkString("", "|^|", "|!|")
val dfMainOutputFinalWithoutNull = dfMainOutputFinal.withColumn("concatenated", regexp_replace(col("concatenated"), "null", "")).withColumnRenamed("concatenated", header)
dfMainOutputFinalWithoutNull.repartition(1).write.partitionBy("DataPartition")
.format("csv")
.option("nullValue", "")
.option("delimiter", ";")
.option("header", "true")
.option("codec", "gzip")
.save("s3://trfsmallfffile/FinancialSource/output")
integerType
之后,您的這兩列ThirdPartySourceCodeId
和ThirdPartySourcePriority
均為integerType
,因此na.fill("")
不適用於它們,因此當您使用concat
,所有null
integer
數值都將被filtered
掉。
解決你的問題是兩列撒入stringType
申請前na.fill
所以改變
val dfMainOutputFinal = dfMainOutput.na.fill("").select($"DataPartition",concat_ws("|^|", dfMainOutput.schema.fieldNames.filter(_ != "DataPartition").map(c => col(c)): _*).as("concatenated"))
至
val dfMainOutputFinal = dfMainOutput.withColumn("ThirdPartySourcePriority", $"ThirdPartySourcePriority".cast(StringType)).withColumn("ThirdPartySourceCodeId", $"ThirdPartySourceCodeId".cast(StringType)).na.fill("").select($"DataPartition",concat_ws("|^|", dfMainOutput.schema.fieldNames.filter(_ != "DataPartition").map(c => col(c)): _*).as("concatenated"))
應該解決你的問題
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.