How to mirror schema in a Spark data frame
I want to create a data frame which has the schema of another data frame.
This is my code where I want to create the mirror schema:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.{ SparkConf, SparkContext }
import java.sql.{Date, Timestamp}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.{udf, explode, col, lit}
val getFFActionParent = udf { (FFAction: String) =>
if (FFAction=="Insert") "I|!|"
else if (FFAction=="Overwrite") "I|!|"
else "D|!|"
}
val getFFActionChild = udf { (FFAction: String) =>
if (FFAction=="Insert") "I|!|"
else if (FFAction=="Overwrite") "O|!|"
else "D|!|"
}
val dfContentEnvelope = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "env:ContentEnvelope").load("s3://trfsmallfffile/XML")
val dfContentItem = dfContentEnvelope.withColumn("column1", explode(dfContentEnvelope("env:Body.env:ContentItem"))).select("column1.*")
val FinancialSourceDF=dfContentItem.select($"env:Data.sr:Source.*",getFFActionParent($"_action").as("FFAction")).filter($"env:Data.sr:Source._organizationId".isNotNull).drop("sr:Auditors")
val FinancialSourceDFFinal=dfContentItem.select($"env:Data.sr:Source._organizationId".as("organizationId"), $"env:Data.sr:Source._sourceId".as("sourceId"), $"env:Data.sr:Source.*",getFFActionParent($"_action").as("FFAction")).filter($"env:Data.sr:Source._organizationId".isNotNull).drop("sr:Auditors")
The schema will be mirrored from the data frame below:
val rdd1 = sc.textFile("s3://trfsmallfffile/FinancialSource/INCR")
val header1 = rdd1.filter(_.contains("Source.organizationId")).map(line => line.split("\\|\\^\\|")).first()
val MirrorSchemaschema1 = StructType(header1.map(cols => StructField(cols.replace(".", "_"), StringType)).toSeq)
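The string handling behind that header-to-schema mapping can be sketched without Spark; the short header value below is a hypothetical sample, not the real file contents:

```scala
// Build mirror-schema field names from a "|^|"-delimited header line.
// The regex escapes "|" and "^", which are special characters in split's pattern,
// and "." in column names is replaced with "_" as in the question.
val header = "Source.organizationId|^|Source.sourceId|^|FFAction|!|"
val fieldNames = header.split("\\|\\^\\|").map(_.replace(".", "_"))
// fieldNames: Array(Source_organizationId, Source_sourceId, FFAction|!|)
```

Each resulting name would then become a `StructField(name, StringType)` exactly as in the line above.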
Here I am trying to create the mirror schema, but I am not able to create it:
val FinancialSourceSaveDF=FinancialSourceDF.select(FinancialSourceDF.columns.filter(x => !x.equals("sr:Auditors")).map(x => col(x).as(x.replace("_", "Source.").replace("sr:", ""))): _*)
FinancialSourceSaveDF.select(MirrorSchemaschema1.fieldNames.map(col): _*).show(false)
NOTE: The columns in the FinancialSourceSaveDF data frame may change, i.e. sometimes all the columns will be present and sometimes only some of them. Null values should be filled in for the columns that are not present.
After you get dfContentItem, you can simply select the elements you require from the Source struct and rename the columns using alias, as below:
val dfContentItem = dfContentEnvelope.withColumn("column1", explode(dfContentEnvelope("env:Body.env:ContentItem"))).select("column1.*")
val dfType=dfContentItem.select($"env:Data.sr:Source.*", $"_action".as("FFAction|!|"))
val temp = dfType.select(dfType.columns.filter(x => !x.equals("sr:Auditors")).map(x => col(x).as(x.replace("_", "Source_").replace("sr:", ""))): _*)
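The rename rule inside that select can be checked on plain strings; the column names below are hypothetical samples chosen to exercise both replacements:

```scala
// XML attribute columns start with "_" and become "Source_"-prefixed;
// the "sr:" namespace prefix is stripped; other names pass through unchanged.
val rawCols = Seq("_organizationId", "_sourceId", "sr:Dcn", "FilingDateTime")
val renamed = rawCols.map(_.replace("_", "Source_").replace("sr:", ""))
// renamed: Seq(Source_organizationId, Source_sourceId, Dcn, FilingDateTime)
```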
You can use the udf functions you've mentioned as you prefer.
Once you have all the columns required to match the schema columns read from the header line:
Source.organizationId|^|Source.sourceId|^|FilingDateTime|^|SourceTypeCode|^|DocumentId|^|Dcn|^|DocFormat|^|StatementDate|^|IsFilingDateTimeEstimated|^|ContainsPreliminaryData|^|CapitalChangeAdjustmentDate|^|CumulativeAdjustmentFactor|^|ContainsRestatement|^|FilingDateTimeUTCOffset|^|ThirdPartySourceCode|^|ThirdPartySourcePriority|^|SourceTypeId|^|ThirdPartySourceCodeId|^|FFAction|!|
Now your main problem is that you don't know how many column names might be missing in the new data frame compared to the referenced data frame's schema. For that, you can find the difference between the column names and use the foldLeft function to populate the missing columns with null values:
val schema = MirrorSchemaschema1  // the mirror schema built from the header line above
val diff = schema.fieldNames.diff(temp.schema.fieldNames)
val finaldf = diff.foldLeft(temp){(temp2df, colName) => temp2df.withColumn(colName, lit(null))}
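The diff-plus-foldLeft idea can be sketched on plain Scala collections, independent of Spark; the column names here are placeholders, not the real schema:

```scala
// Reference schema has four columns; the incoming data only has two.
val referenceCols = Seq("a", "b", "c", "d")
val presentCols = Seq("b", "d")

// diff yields the columns missing from the incoming data,
// and foldLeft appends each one (in Spark, via withColumn(..., lit(null))).
val missing = referenceCols.diff(presentCols)      // Seq(a, c)
val finalCols = missing.foldLeft(presentCols)((cols, c) => cols :+ c)
// finalCols now contains every reference column, though out of order
```

That is why the final `select(schema.fieldNames.map(col): _*)` is still needed: it reorders the columns back to the reference schema's order.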
Then you can just select to get the mirror schema, as below:
finaldf.select(schema.fieldNames.map(col): _*).show(false)
You should get output like:
+---------------------+---------------+-----------------------+--------------+----------+----+---------+-----------------------+-------------------------+-----------------------+---------------------------+--------------------------+-------------------+-----------------------+--------------------+------------------------+------------+----------------------+-----------+
|Source_organizationId|Source_sourceId|FilingDateTime |SourceTypeCode|DocumentId|Dcn |DocFormat|StatementDate |IsFilingDateTimeEstimated|ContainsPreliminaryData|CapitalChangeAdjustmentDate|CumulativeAdjustmentFactor|ContainsRestatement|FilingDateTimeUTCOffset|ThirdPartySourceCode|ThirdPartySourcePriority|SourceTypeId|ThirdPartySourceCodeId|FFAction|!||
+---------------------+---------------+-----------------------+--------------+----------+----+---------+-----------------------+-------------------------+-----------------------+---------------------------+--------------------------+-------------------+-----------------------+--------------------+------------------------+------------+----------------------+-----------+
|4295906830 |344 |20171111T17:00:00+00:00|10K |null |null|null |20171030T00:00:00+00:00|false |false |20171030T00:00:00+00:00 |1.0 |false |300 |SS |1 |3011835 |1000716240 |Overwrite |
+---------------------+---------------+-----------------------+--------------+----------+----+---------+-----------------------+-------------------------+-----------------------+---------------------------+--------------------------+-------------------+-----------------------+--------------------+------------------------+------------+----------------------+-----------+
I hope the answer is helpful.