How to mirror schema in a Spark data frame
I want to create a data frame which has the schema of another data frame.
This is my code where I want to create the mirror schema:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.{ SparkConf, SparkContext }
import java.sql.{Date, Timestamp}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.{udf, explode, col, lit}
val getFFActionParent = udf { (FFAction: String) =>
if (FFAction=="Insert") "I|!|"
else if (FFAction=="Overwrite") "I|!|"
else "D|!|"
}
val getFFActionChild = udf { (FFAction: String) =>
if (FFAction=="Insert") "I|!|"
else if (FFAction=="Overwrite") "O|!|"
else "D|!|"
}
val dfContentEnvelope = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "env:ContentEnvelope").load("s3://trfsmallfffile/XML")
val dfContentItem = dfContentEnvelope.withColumn("column1", explode(dfContentEnvelope("env:Body.env:ContentItem"))).select("column1.*")
val FinancialSourceDF=dfContentItem.select($"env:Data.sr:Source.*",getFFActionParent($"_action").as("FFAction")).filter($"env:Data.sr:Source._organizationId".isNotNull).drop("sr:Auditors")
val FinancialSourceDFFinal=dfContentItem.select($"env:Data.sr:Source._organizationId".as("organizationId"), $"env:Data.sr:Source._sourceId".as("sourceId"), $"env:Data.sr:Source.*",getFFActionParent($"_action").as("FFAction")).filter($"env:Data.sr:Source._organizationId".isNotNull).drop("sr:Auditors")
The schema will be mirrored from the data frame below:
val rdd1 = sc.textFile("s3://trfsmallfffile/FinancialSource/INCR")
val header1 = rdd1.filter(_.contains("Source.organizationId")).map(line => line.split("\\|\\^\\|")).first()
val MirrorSchemaschema1 = StructType(header1.map(cols => StructField(cols.replace(".", "_"), StringType)).toSeq)
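The string handling behind that header-to-schema mapping can be sketched without Spark; the short header value below is a hypothetical sample, not the real file contents:

```scala
// Build mirror-schema field names from a "|^|"-delimited header line.
// The regex escapes "|" and "^", which are special characters in split's pattern,
// and "." in column names is replaced with "_" as in the question.
val header = "Source.organizationId|^|Source.sourceId|^|FFAction|!|"
val fieldNames = header.split("\\|\\^\\|").map(_.replace(".", "_"))
// fieldNames: Array(Source_organizationId, Source_sourceId, FFAction|!|)
```

Each resulting name would then become a `StructField(name, StringType)` exactly as in the line above.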
Here I am trying to create the mirror schema, but I am not able to create it:
val FinancialSourceSaveDF=FinancialSourceDF.select(FinancialSourceDF.columns.filter(x => !x.equals("sr:Auditors")).map(x => col(x).as(x.replace("_", "Source.").replace("sr:", ""))): _*)
FinancialSourceSaveDF.select(MirrorSchemaschema1.fieldNames.map(col): _*).show(false)
NOTE: The columns in the FinancialSourceSaveDF data frame may change, i.e. sometimes all the columns will be present and sometimes only some of them. Null values should be filled in for the columns that are not present.
After you get dfContentItem, you can simply select the elements you require from the Source struct and rename the columns using alias, as below:
val dfContentItem = dfContentEnvelope.withColumn("column1", explode(dfContentEnvelope("env:Body.env:ContentItem"))).select("column1.*")
val dfType=dfContentItem.select($"env:Data.sr:Source.*", $"_action".as("FFAction|!|"))
val temp = dfType.select(dfType.columns.filter(x => !x.equals("sr:Auditors")).map(x => col(x).as(x.replace("_", "Source_").replace("sr:", ""))): _*)
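The rename rule inside that select can be checked on plain strings; the column names below are hypothetical samples chosen to exercise both replacements:

```scala
// XML attribute columns start with "_" and become "Source_"-prefixed;
// the "sr:" namespace prefix is stripped; other names pass through unchanged.
val rawCols = Seq("_organizationId", "_sourceId", "sr:Dcn", "FilingDateTime")
val renamed = rawCols.map(_.replace("_", "Source_").replace("sr:", ""))
// renamed: Seq(Source_organizationId, Source_sourceId, Dcn, FilingDateTime)
```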
You can use the udf functions you've mentioned as you prefer.
Once you have all the columns required to match the schema columns read from the header line:
Source.organizationId|^|Source.sourceId|^|FilingDateTime|^|SourceTypeCode|^|DocumentId|^|Dcn|^|DocFormat|^|StatementDate|^|IsFilingDateTimeEstimated|^|ContainsPreliminaryData|^|CapitalChangeAdjustmentDate|^|CumulativeAdjustmentFactor|^|ContainsRestatement|^|FilingDateTimeUTCOffset|^|ThirdPartySourceCode|^|ThirdPartySourcePriority|^|SourceTypeId|^|ThirdPartySourceCodeId|^|FFAction|!|
Now your main problem is that you don't know how many column names might be missing in the new data frame compared to the referenced data frame's schema. For that, you can find the difference between the column names and use the foldLeft function to populate the missing columns with null values:
val schema = MirrorSchemaschema1  // the mirror schema built from the header line above
val diff = schema.fieldNames.diff(temp.schema.fieldNames)
val finaldf = diff.foldLeft(temp){(temp2df, colName) => temp2df.withColumn(colName, lit(null))}
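The diff-plus-foldLeft idea can be sketched on plain Scala collections, independent of Spark; the column names here are placeholders, not the real schema:

```scala
// Reference schema has four columns; the incoming data only has two.
val referenceCols = Seq("a", "b", "c", "d")
val presentCols = Seq("b", "d")

// diff yields the columns missing from the incoming data,
// and foldLeft appends each one (in Spark, via withColumn(..., lit(null))).
val missing = referenceCols.diff(presentCols)      // Seq(a, c)
val finalCols = missing.foldLeft(presentCols)((cols, c) => cols :+ c)
// finalCols now contains every reference column, though out of order
```

That is why the final `select(schema.fieldNames.map(col): _*)` is still needed: it reorders the columns back to the reference schema's order.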
Then you can just select to get the mirror schema, as below:
finaldf.select(schema.fieldNames.map(col): _*).show(false)
You should get output like:
+---------------------+---------------+-----------------------+--------------+----------+----+---------+-----------------------+-------------------------+-----------------------+---------------------------+--------------------------+-------------------+-----------------------+--------------------+------------------------+------------+----------------------+-----------+
|Source_organizationId|Source_sourceId|FilingDateTime |SourceTypeCode|DocumentId|Dcn |DocFormat|StatementDate |IsFilingDateTimeEstimated|ContainsPreliminaryData|CapitalChangeAdjustmentDate|CumulativeAdjustmentFactor|ContainsRestatement|FilingDateTimeUTCOffset|ThirdPartySourceCode|ThirdPartySourcePriority|SourceTypeId|ThirdPartySourceCodeId|FFAction|!||
+---------------------+---------------+-----------------------+--------------+----------+----+---------+-----------------------+-------------------------+-----------------------+---------------------------+--------------------------+-------------------+-----------------------+--------------------+------------------------+------------+----------------------+-----------+
|4295906830 |344 |20171111T17:00:00+00:00|10K |null |null|null |20171030T00:00:00+00:00|false |false |20171030T00:00:00+00:00 |1.0 |false |300 |SS |1 |3011835 |1000716240 |Overwrite |
+---------------------+---------------+-----------------------+--------------+----------+----+---------+-----------------------+-------------------------+-----------------------+---------------------------+--------------------------+-------------------+-----------------------+--------------------+------------------------+------------+----------------------+-----------+
I hope the answer is helpful.