How to use withColumn with a condition for each row in a Scala / Spark data frame

I have a data frame in the following format:

+-----------------+-------------------------+-----------+-----------------------------+---------------------------+----------------------------------+--------------------------------------+--------------------------------+
|DataPartition    |TimeStamp                |FFAction|!||IdentifierValue_effectiveFrom|IdentifierValue_effectiveTo|IdentifierValue_identifierEntityId|IdentifierValue_identifierEntityTypeId|IdentifierValue_identifierTypeId|
+-----------------+-------------------------+-----------+-----------------------------+---------------------------+----------------------------------+--------------------------------------+--------------------------------+
|SelfSourcedPublic|2018-03-05T11:54:18+00:00|I|!|       |1900-01-01T00:00:00+00:00    |9999-12-31T00:00:00+00:00  |4295903126                        |404010                                |320150                          |
+-----------------+-------------------------+-----------+-----------------------------+---------------------------+----------------------------------+--------------------------------------+--------------------------------+

I want to add an extra column based on a condition on the following column:

IdentifierValue_identifierEntityTypeId

The new column, Partition, should be filled in as follows:

if IdentifierValue_identifierEntityTypeId = 1001371402 then Partition = Repno2FundamentalSeries, else if IdentifierValue_identifierEntityTypeId = 404010 then Partition = Repno2Organization

This is what I have tried:

val temp = temp1.withColumn("Partition", when($"IdentifierValue_identifierEntityTypeId" === "404010", 0).otherwise("Repno2FundamentalSeries"))
temp.show(false)

This produces the output below, where the Partition value is 0:

+-----------------+-------------------------+-----------+-----------------------------+---------------------------+----------------------------------+--------------------------------------+--------------------------------+---------+
|DataPartition    |TimeStamp                |FFAction|!||IdentifierValue_effectiveFrom|IdentifierValue_effectiveTo|IdentifierValue_identifierEntityId|IdentifierValue_identifierEntityTypeId|IdentifierValue_identifierTypeId|Partition|
+-----------------+-------------------------+-----------+-----------------------------+---------------------------+----------------------------------+--------------------------------------+--------------------------------+---------+
|SelfSourcedPublic|2018-03-05T11:54:18+00:00|I|!|       |1900-01-01T00:00:00+00:00    |9999-12-31T00:00:00+00:00  |4295903126                        |404010                                |320150                          |0        |
+-----------------+-------------------------+-----------+-----------------------------+---------------------------+----------------------------------+--------------------------------------+--------------------------------+---------+

I am new to Scala, so this may be a basic question.

How do I write when and otherwise for multiple conditions on columns? The code below does not work for me and fails with this error:

Exception in thread "main" java.lang.IllegalArgumentException: otherwise() can only be applied once on a Column previously generated by when()

val dataMain = dataMain1.withColumn(
      "Partition",
      when($"RelationObjectId_relatedObjectType" === "EDInstrument" && $"RelationObjectId_relatedObjectType" === "Fundamental", "Instrument2Fundamental")
        .otherwise(when($"RelationObjectId_relatedObjectType" === "EDInstrument" && $"RelationObjectId_relatedObjectType" === "FundamentalSeries", "Instrument2FundamentalSeries"))
        .otherwise(when($"RelationObjectId_relatedObjectType" === "Organization" && $"RelationObjectId_relatedObjectType" === "Fundamental", "Organization2Fundamental"))
        .otherwise(when($"RelationObjectId_relatedObjectType" === "Organization" && $"RelationObjectId_relatedObjectType" === "FundamentalSeries", "Organization2FundamentalSeries"))
        )

According to the condition you provided, you should change the when condition as shown below.

if IdentifierValue_identifierEntityTypeId = 1001371402 then Partition = Repno2FundamentalSeries, else if IdentifierValue_identifierEntityTypeId = 404010 then Partition = Repno2Organization

df1.withColumn("Partition",
  when($"IdentifierValue_identifierEntityTypeId" === "1001371402", "Repno2FundamentalSeries")
    .otherwise("Repno2Organization")
)

Output:

+-----------------+-------------------------+-----------+-----------------------------+---------------------------+----------------------------------+--------------------------------------+--------------------------------+-----------------------+
|DataPartition    |TimeStamp                |FFAction|!||IdentifierValue_effectiveFrom|IdentifierValue_effectiveTo|IdentifierValue_identifierEntityId|IdentifierValue_identifierEntityTypeId|IdentifierValue_identifierTypeId|Partition              |
+-----------------+-------------------------+-----------+-----------------------------+---------------------------+----------------------------------+--------------------------------------+--------------------------------+-----------------------+
|SelfSourcedPublic|2018-03-05T11:54:18+00:00|I|!|       |1900-01-01T00:00:00+00:00    |9999-12-31T00:00:00+00:00  |4295903126                        |404010                                |320150                          |Repno2Organization     |
+-----------------+-------------------------+-----------+-----------------------------+---------------------------+----------------------------------+--------------------------------------+--------------------------------+-----------------------+
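The snippet above sends every row that is not 1001371402 to Repno2Organization. If both values from the stated condition should be matched explicitly, when calls can be chained. The following is a minimal sketch, assuming a local SparkSession and a small made-up data frame; the object name PartitionByEntityType, the helper addPartition, and the sample value 123456 are illustrative and do not come from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, when}

object PartitionByEntityType {
  // Hypothetical helper: adds "Partition" by matching each value explicitly.
  // No .otherwise is given, so rows matching neither value get null.
  def addPartition(df: DataFrame): DataFrame =
    df.withColumn(
      "Partition",
      when(col("IdentifierValue_identifierEntityTypeId") === "1001371402", "Repno2FundamentalSeries")
        .when(col("IdentifierValue_identifierEntityTypeId") === "404010", "Repno2Organization")
    )

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; a real job would reuse its own session.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("partition-by-entity-type")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows holding only the column the condition depends on.
    val df1 = Seq("1001371402", "404010", "123456")
      .toDF("IdentifierValue_identifierEntityTypeId")

    addPartition(df1).show(false)
    spark.stop()
  }
}
```

Leaving out .otherwise is a deliberate choice here: unmatched rows stay null and are easy to spot, whereas a catch-all default would hide unexpected type ids.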

EDIT:

Here is how you write a nested when:

val dataMain = df.withColumn(
  "Partition",
  when($"RelationObjectId_relatedObjectType" === "EDInstrument" && $"RelationObjectId_relatedObjectType" === "Fundamental", "Instrument2Fundamental")
    .otherwise(
      when($"RelationObjectId_relatedObjectType" === "EDInstrument" && $"RelationObjectId_relatedObjectType" === "FundamentalSeries", "Instrument2FundamentalSeries")
        .otherwise(
          when($"RelationObjectId_relatedObjectType" === "Organization" && $"RelationObjectId_relatedObjectType" === "Fundamental", "Organization2Fundamental")
            .otherwise(
              when($"RelationObjectId_relatedObjectType" === "Organization" && $"RelationObjectId_relatedObjectType" === "FundamentalSeries", "Organization2FundamentalSeries")
            )
        )
    )
)
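The IllegalArgumentException in the question comes from calling .otherwise more than once on the same when column. Besides nesting, the branches can also be written as a flat chain, when(...).when(...).otherwise(...), with a single .otherwise at the end. Note that a condition comparing one column to two different values with && can never be true, so the sketch below assumes the two values live in two separate columns. The column names objectType and relatedObjectType, the "Unknown" fallback, and the object name ChainedWhenSketch are hypothetical and not taken from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, when}

object ChainedWhenSketch {
  // "objectType" and "relatedObjectType" are hypothetical column names.
  // Branches are checked in order; the first matching one wins.
  def addPartition(df: DataFrame): DataFrame =
    df.withColumn(
      "Partition",
      when(col("objectType") === "EDInstrument" && col("relatedObjectType") === "Fundamental", "Instrument2Fundamental")
        .when(col("objectType") === "EDInstrument" && col("relatedObjectType") === "FundamentalSeries", "Instrument2FundamentalSeries")
        .when(col("objectType") === "Organization" && col("relatedObjectType") === "Fundamental", "Organization2Fundamental")
        .when(col("objectType") === "Organization" && col("relatedObjectType") === "FundamentalSeries", "Organization2FundamentalSeries")
        .otherwise("Unknown") // called exactly once; "Unknown" is an assumed default
    )

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("chained-when-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows.
    val df = Seq(
      ("EDInstrument", "Fundamental"),
      ("Organization", "FundamentalSeries"),
      ("Other", "Fundamental")
    ).toDF("objectType", "relatedObjectType")

    addPartition(df).show(false)
    spark.stop()
  }
}
```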

Hope this helps.

An alternative is to use a SQL-style CASE WHEN expression instead of withColumn.

This may be easier to write if you are familiar with SQL.

E.g.:

       val dataMain = dataMain1.selectExpr("*", 
       """CASE WHEN RelationObjectId_relatedObjectType = 'EDInstrument' 
       THEN 'Instrument2Fundamental'
       WHEN cond2 
       THEN value2
       ELSE defaultValue end AS partition""")
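The template above leaves cond2, value2 and defaultValue as placeholders. As a rough illustration, here is one way it might be filled in for the IdentifierValue_identifierEntityTypeId condition from the question. This is only a sketch: the object name CaseWhenSketch, the helper addPartition, the sample rows, and the 'Unknown' default are assumptions, not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CaseWhenSketch {
  // Keeps all existing columns ("*") and appends Partition via a SQL CASE expression.
  def addPartition(df: DataFrame): DataFrame =
    df.selectExpr(
      "*",
      """CASE
        |  WHEN IdentifierValue_identifierEntityTypeId = '1001371402' THEN 'Repno2FundamentalSeries'
        |  WHEN IdentifierValue_identifierEntityTypeId = '404010' THEN 'Repno2Organization'
        |  ELSE 'Unknown'
        |END AS Partition""".stripMargin // 'Unknown' is an assumed default
    )

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("case-when-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows.
    val df = Seq("1001371402", "404010", "123456")
      .toDF("IdentifierValue_identifierEntityTypeId")

    addPartition(df).show(false)
    spark.stop()
  }
}
```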
