How to write withColumnRenamed for all columns and join two different schemas with custom partitioning in a Spark DataFrame

Question

Hi, I have two text files that I have to join to create a single file. I am using DataFrames in Spark to do this.

Both text files have the same structure except for a few fields.

Now I have to create the two DataFrames and join them.

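Question 1: How can we join two DataFrames when one of them has some extra fields? For example, my schema has TimeStamp as its first field, but my first DataFrame has no TimeStamp field.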

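Question 2: In my code I have to rename all the columns so that I can select them after the join. I have 29 columns, so I would have to write the rename call 29 times. Is there a way to do this without writing it out repeatedly?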

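Question 3: After the join, I have to save the output split by a field. For example, if StatementTypeCode is BAL, all records belonging to BAL should go into one file, the same as a custom partitioner in MapReduce.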

This is what I tried: latestForEachKey.write.partitionBy("StatementTypeCode"). I hope that is correct.
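
A partitioned write along those lines could look like the sketch below. It is only an illustration, assuming Spark 2.x or later, a local SparkSession, a tiny made-up DataFrame named sample, and a hypothetical temporary output directory instead of the S3 bucket. Note that partitionBy on a DataFrameWriter produces one subdirectory per distinct value (with the partition column encoded in the directory name rather than in the rows), not literally one file; repartitioning on the same column first is one way to end up with a single part file in each directory.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("partitionBy-sketch").getOrCreate()
import spark.implicits._

// Made-up rows standing in for the joined result.
val sample = Seq(
  ("org1", "1", "BAL"),
  ("org1", "2", "BAL"),
  ("org2", "3", "INC")
).toDF("LineItem_organizationId", "LineItem_lineItemId", "StatementTypeCode")

// Hypothetical output location: a fresh path under a temp directory.
val outDir = Files.createTempDirectory("statement-out").resolve("by-type").toString

sample
  .repartition($"StatementTypeCode") // rows with the same code end up in the same task
  .write
  .partitionBy("StatementTypeCode")  // one subdirectory per distinct value
  .option("delimiter", "|")
  .csv(outDir)

// Names of the partition subdirectories that were written.
val partitionDirs = new java.io.File(outDir).listFiles.filter(_.isDirectory).map(_.getName).sorted
```

With this sample data the output directory should contain the subdirectories StatementTypeCode=BAL and StatementTypeCode=INC.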

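I know I have asked a lot of questions in one post. I am learning Spark with Scala, so I am running into problems with every piece of syntax and every concept. I hope my question is clear.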

Here is the code I am running right now.

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    import org.apache.spark.{ SparkConf, SparkContext }
    import java.sql.{ Date, Timestamp }
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{ StructType, StructField, StringType, DoubleType, IntegerType, TimestampType }
    // rank and when are used further down, so they are imported here as well
    import org.apache.spark.sql.functions.{ rank, udf, when }

    val schema = StructType(Array(
    StructField("TimeStamp", StringType),
    StructField("LineItem_organizationId", StringType),
    StructField("LineItem_lineItemId", StringType),
    StructField("StatementTypeCode", StringType),
    StructField("LineItemName", StringType),
    StructField("LocalLanguageLabel", StringType),
    StructField("FinancialConceptLocal", StringType),
    StructField("FinancialConceptGlobal", StringType),
    StructField("IsDimensional", StringType),
    StructField("InstrumentId", StringType),
    StructField("LineItemLineItemName", StringType),
    StructField("PhysicalMeasureId", StringType),
    StructField("FinancialConceptCodeGlobalSecondary", StringType),
    StructField("IsRangeAllowed", StringType),
    StructField("IsSegmentedByOrigin", StringType),
    StructField("SegmentGroupDescription", StringType),
    StructField("SegmentChildDescription", StringType),
    StructField("SegmentChildLocalLanguageLabel", StringType),
    StructField("LocalLanguageLabel_languageId", StringType),
    StructField("LineItemName_languageId", StringType),
    StructField("SegmentChildDescription_languageId", StringType),
    StructField("SegmentChildLocalLanguageLabel_languageId", StringType),
    StructField("SegmentGroupDescription_languageId", StringType),
    StructField("SegmentMultipleFundbDescription", StringType),
    StructField("SegmentMultipleFundbDescription_languageId", StringType),
    StructField("IsCredit", StringType),
    StructField("FinancialConceptLocalId", StringType),
    StructField("FinancialConceptGlobalId", StringType),
    StructField("FinancialConceptCodeGlobalSecondaryId", StringType),
    StructField("FFAction", StringType)))


    val textRdd1 = sc.textFile("s3://trfsdisu/SPARK/Main.txt")
    val rowRdd1 = textRdd1.map(line => Row.fromSeq(line.split("\\|\\^\\|", -1)))
    var df1 = sqlContext.createDataFrame(rowRdd1, schema).drop("index")

    val textRdd2 = sc.textFile("s3://trfsdisu/SPARK/Incr.txt")
    val rowRdd2 = textRdd2.map(line => Row.fromSeq(line.split("\\|\\^\\|", -1)))
    var df2 = sqlContext.createDataFrame(rowRdd2, schema)

    // df2.show(false)

    import org.apache.spark.sql.expressions._
    val windowSpec = Window.partitionBy("LineItem_organizationId", "LineItem_lineItemId").orderBy($"TimeStamp".cast(TimestampType).desc)

    // Keep only the latest row per key, then rename the columns that clash with df1
    val latestForEachKey = df2.withColumn("rank", rank().over(windowSpec)).filter($"rank" === 1).drop("rank", "TimeStamp")
      .withColumnRenamed("StatementTypeCode", "StatementTypeCode_1")
      .withColumnRenamed("LineItemName", "LineItemName_1")
      .withColumnRenamed("FFAction", "FFAction_1")

    // This is where I need help: the withColumnRenamed part above

    val df3 = df1.join(latestForEachKey, Seq("LineItem_organizationId", "LineItem_lineItemId"), "outer")
      .select($"LineItem_organizationId", $"LineItem_lineItemId",
        when($"StatementTypeCode_1".isNotNull, $"StatementTypeCode_1").otherwise($"StatementTypeCode").as("StatementTypeCode"),
        when($"LineItemName_1".isNotNull, $"LineItemName_1").otherwise($"LineItemName").as("LineItemName"),
        when($"FFAction_1".isNotNull, $"FFAction_1").otherwise($"FFAction").as("FFAction"))
      .filter(!$"FFAction".contains("D"))

    df3.show()
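
The rename step flagged in the code can be handled without one withColumnRenamed call per column. The following is a minimal sketch rather than a drop-in replacement, assuming Spark 2.x or later, a local SparkSession, and two tiny made-up DataFrames (main and incr) standing in for df1 and latestForEachKey. It folds over the non-key column names to add the _1 suffix, then builds the select list with coalesce, which returns its first non-null argument and so behaves like the when(...isNotNull...).otherwise(...) pattern.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{ coalesce, col }

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("rename-sketch").getOrCreate()
import spark.implicits._

val keys = Seq("LineItem_organizationId", "LineItem_lineItemId")

// Made-up stand-ins for df1 and latestForEachKey, cut down to five columns.
val main = Seq(
  ("org1", "1", "BAL", "Cash", "I"),
  ("org1", "2", "INC", "Revenue", "I")
).toDF("LineItem_organizationId", "LineItem_lineItemId", "StatementTypeCode", "LineItemName", "FFAction")

val incr = Seq(
  ("org1", "1", "BAL", "Cash and equivalents", "U"),
  ("org1", "3", "CAS", "Capex", "I")
).toDF("LineItem_organizationId", "LineItem_lineItemId", "StatementTypeCode", "LineItemName", "FFAction")

// Every column that is not part of the join key gets the "_1" suffix.
val valueCols = incr.columns.filterNot(c => keys.contains(c))
val incrRenamed = valueCols.foldLeft(incr)((df, c) => df.withColumnRenamed(c, c + "_1"))

// Prefer the incremental value when present, otherwise keep the main value.
val merged = valueCols.map(c => coalesce(col(c + "_1"), col(c)).as(c))

val result = main
  .join(incrRenamed, keys, "outer")
  .select(keys.map(col) ++ merged: _*)
  .filter(!$"FFAction".contains("D"))
```

Because the column list is derived from incr.columns, the same code covers all 29 columns without naming them individually.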

Answer 1

The schema part can be solved like this:

    val df1 = sqlContext.createDataFrame(rowRdd1, new StructType(schema.tail.toArray))
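
To make that one-liner concrete, here is a small sketch, assuming a trimmed three-field schema, made-up input lines in place of the real files, and a local SparkSession in place of sqlContext. It builds the first DataFrame from the schema minus its leading TimeStamp field, so both DataFrames still share the join keys.

```scala
import org.apache.spark.sql.{ Row, SparkSession }
import org.apache.spark.sql.types.{ StringType, StructField, StructType }

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("schema-sketch").getOrCreate()

// Trimmed version of the schema from the question.
val schema = StructType(Array(
  StructField("TimeStamp", StringType),
  StructField("LineItem_organizationId", StringType),
  StructField("LineItem_lineItemId", StringType)))

// StructType is a Seq[StructField], so tail drops the first field (TimeStamp).
val schemaNoTimestamp = StructType(schema.tail)

// Made-up lines using the |^| delimiter from the question.
val mainLines = spark.sparkContext.parallelize(Seq("org1|^|1", "org1|^|2"))
val incrLines = spark.sparkContext.parallelize(Seq("2017-01-01 00:00:00|^|org1|^|1"))

val toRow = (line: String) => Row.fromSeq(line.split("\\|\\^\\|", -1))
val df1 = spark.createDataFrame(mainLines.map(toRow), schemaNoTimestamp)
val df2 = spark.createDataFrame(incrLines.map(toRow), schema)
```

This only works because TimeStamp is the first field of the schema; if the extra field sat elsewhere, filtering the fields by name would be the safer choice.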
