How to write withColumnRenamed for all columns and join two different schema in custom partition in spark data frame

Hi, I have two text files, and I have to join them to create a single, unique one. I have used a data frame in Spark to achieve that.

Both text files have the same structure except for some fields.

Now I have to create a data frame from each file and join the two data frames.

Question 1: How do I join two data frames when one of them has some extra fields? For example, my schema has TimeStamp as the first field, but my first DataFrame does not have a TimeStamp field.
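
A minimal sketch of one way the schemas could be lined up before the join, assuming the missing field can simply be null on the side that lacks it (df1WithTimeStamp is just an illustrative name; df1 refers to the DataFrame defined in the code further down):

    import org.apache.spark.sql.functions.lit
    import org.apache.spark.sql.types.StringType

    // Illustrative only: give the first DataFrame an all-null TimeStamp column
    // so both sides expose the same set of columns before the join.
    val df1WithTimeStamp = df1.withColumn("TimeStamp", lit(null).cast(StringType))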

Question 2: In my code I have to rename all the columns in order to select them after the join, and since I have 29 columns I have to write the rename function 29 times. Is there any way to do this without writing it so many times?
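
One common way to avoid writing withColumnRenamed 29 times is to fold it over the column names; a rough sketch, assuming every column other than the join keys should get a "_1" suffix:

    // Rename every column except the join keys by appending "_1",
    // instead of calling withColumnRenamed 29 times by hand.
    val joinKeys = Seq("LineItem_organizationId", "LineItem_lineItemId")
    val renamedDf2 = df2.columns.foldLeft(df2) { (df, colName) =>
      if (joinKeys.contains(colName)) df
      else df.withColumnRenamed(colName, colName + "_1")
    }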

Question 3: After joining, I have to save the output based on some field. For example, if StatementTypeCode is BAL, then all records belonging to BAL should go to one file, similar to a custom partitioner in MapReduce.

This is what I have tried: latestForEachKey.write.partitionBy("StatementTypeCode"). I hope it is correct.
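
For reference, a minimal sketch of how partitionBy is typically used when writing out; the output path and format below are placeholders, not part of the original job:

    // Each distinct StatementTypeCode value ends up in its own sub-directory,
    // e.g. .../StatementTypeCode=BAL/. Path and format here are placeholders.
    latestForEachKey.write
      .partitionBy("StatementTypeCode")
      .mode("overwrite")
      .format("json")
      .save("s3://your-bucket/output/")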

I know I have asked many questions in one post. I am learning Spark with Scala, so I am running into issues with every bit of syntax and every concept. I hope my questions are clear.

Here is the code for what I am doing right now.

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    import org.apache.spark.{ SparkConf, SparkContext }
    import java.sql.{ Date, Timestamp }
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{ StructType, StructField, StringType, DoubleType, IntegerType, TimestampType }
    import org.apache.spark.sql.functions.{ udf, rank, when }

    val schema = StructType(Array(
    StructField("TimeStamp", StringType),
    StructField("LineItem_organizationId", StringType),
    StructField("LineItem_lineItemId", StringType),
    StructField("StatementTypeCode", StringType),
    StructField("LineItemName", StringType),
    StructField("LocalLanguageLabel", StringType),
    StructField("FinancialConceptLocal", StringType),
    StructField("FinancialConceptGlobal", StringType),
    StructField("IsDimensional", StringType),
    StructField("InstrumentId", StringType),
    StructField("LineItemLineItemName", StringType),
    StructField("PhysicalMeasureId", StringType),
    StructField("FinancialConceptCodeGlobalSecondary", StringType),
    StructField("IsRangeAllowed", StringType),
    StructField("IsSegmentedByOrigin", StringType),
    StructField("SegmentGroupDescription", StringType),
    StructField("SegmentChildDescription", StringType),
    StructField("SegmentChildLocalLanguageLabel", StringType),
    StructField("LocalLanguageLabel_languageId", StringType),
    StructField("LineItemName_languageId", StringType),
    StructField("SegmentChildDescription_languageId", StringType),
    StructField("SegmentChildLocalLanguageLabel_languageId", StringType),
    StructField("SegmentGroupDescription_languageId", StringType),
    StructField("SegmentMultipleFundbDescription", StringType),
    StructField("SegmentMultipleFundbDescription_languageId", StringType),
    StructField("IsCredit", StringType),
    StructField("FinancialConceptLocalId", StringType),
    StructField("FinancialConceptGlobalId", StringType),
    StructField("FinancialConceptCodeGlobalSecondaryId", StringType),
    StructField("FFFFAction", StringType)))


    val textRdd1 = sc.textFile("s3://trfsdisu/SPARK/Main.txt")
    val rowRdd1 = textRdd1.map(line => Row.fromSeq(line.split("\\|\\^\\|", -1)))
    var df1 = sqlContext.createDataFrame(rowRdd1, schema).drop("index")

    val textRdd2 = sc.textFile("s3://trfsdisu/SPARK/Incr.txt")
    val rowRdd2 = textRdd2.map(line => Row.fromSeq(line.split("\\|\\^\\|", -1)))
    var df2 = sqlContext.createDataFrame(rowRdd2, schema)

    // df2.show(false)

    import org.apache.spark.sql.expressions._
    val windowSpec = Window.partitionBy("LineItem_organizationId", "LineItem_lineItemId").orderBy($"TimeStamp".cast(TimestampType).desc)

    val latestForEachKey = df2.withColumn("rank", rank().over(windowSpec))
      .filter($"rank" === 1)
      .drop("rank", "TimeStamp")
      .withColumnRenamed("StatementTypeCode", "StatementTypeCode_1")
      .withColumnRenamed("LineItemName", "LineItemName_1")
      .withColumnRenamed("FFAction", "FFAction_1")

    // This is where I need help: the withColumnRenamed part


    val df3 = df1.join(latestForEachKey, Seq("LineItem_organizationId", "LineItem_lineItemId"), "outer")
      .select($"LineItem_organizationId", $"LineItem_lineItemId",
        when($"StatementTypeCode_1".isNotNull, $"StatementTypeCode_1").otherwise($"StatementTypeCode").as("StatementTypeCode"),
        when($"LineItemName_1".isNotNull, $"LineItemName_1").otherwise($"LineItemName").as("LineItemName"),
        when($"FFAction_1".isNotNull, $"FFAction_1").otherwise($"FFAction").as("FFAction"))
      .filter(!$"FFAction".contains("D"))

    df3.show()
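
Building on the rename idea above, the select after the join also does not have to spell out a when/otherwise pair per column: coalesce expresses the same null-preference check, and the column list can be generated. A sketch, assuming every non-key column of latestForEachKey carries a "_1" suffix:

    import org.apache.spark.sql.functions.{ coalesce, col }

    val keyCols = Seq("LineItem_organizationId", "LineItem_lineItemId")
    // For each non-key column prefer the incremental ("_1") value, falling back to the main one.
    val mergedCols = df1.columns.filterNot(keyCols.contains).map { c =>
      coalesce(col(c + "_1"), col(c)).as(c)
    }
    val merged = df1.join(latestForEachKey, keyCols, "outer")
      .select((keyCols.map(col) ++ mergedCols): _*)
      .filter(!col("FFAction").contains("D"))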

The schema part can be solved like this:

    val df1 = sqlContext.createDataFrame(rowRdd1, new StructType(schema.tail.toArray))
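
For completeness, a small sketch of how the two reads would then line up, assuming the only difference between the files is the leading TimeStamp field (mainDf and incrDf are just illustrative names):

    // Main file has no TimeStamp column, so drop the first field from the shared schema.
    val mainDf = sqlContext.createDataFrame(rowRdd1, StructType(schema.tail))
    // Incremental file keeps the full schema, including TimeStamp.
    val incrDf = sqlContext.createDataFrame(rowRdd2, schema)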
