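Spark Scala: Add a new column without looping through the same table many times
I want to add a new column named activeIOsAtSite whose value is "Y" or "N".
The value is assigned according to the following condition:
For each row of the table below, look at its SITE_SITE_ID; if that SITE_SITE_ID has a row with:
(APPLICATION_SOURCE == 'CIBASE' AND STANDARD_STATUS == 'ACTIVE')
then the value should be "Y", otherwise "N".
Is there a way to do this without iterating over the same table again and again (once per row)? The table I have is very large, and I need to do this as fast as possible.
Example of the expected result:
I tried something like the following, but I'm not sure it is correct: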
finvInventoryAllDf
  .withColumn(
    "activeIOsAtSite",
    activeIOsAtSiteGenerator(finvInventoryAllDf, col("Site_siteId"))
  )
where activeIOsAtSiteGenerator() is the function in which I check the condition above:
def activeIOsAtSiteGenerator(dataFrame: DataFrame, site_siteId: Column): Column = {
  val count = dataFrame
    .where(col("Site_siteId") === site_siteId)
    .where("InstalledOffer_installedOfferId IS NOT NULL AND InstalledOffer_installedOfferId NOT IN ('','null','NULL') AND UPPER(InstalledOffer_standardStatus) IN ('ACTIVE') AND UPPER(InstalledOffer_applicationSource) IN('CIBASE')")
    .count()
  if (count > 0)
    lit("Y")
  else
    lit("N")
}
You can first groupBy the unique ID, then use collect_set and check whether the collected column contains the combination you mentioned.
import org.apache.spark.sql.functions._

var grouped = df
  .groupBy("SITE_SITE_ID")
  .agg(collect_set(array("APPLICATION_SOURCE", "STANDARD_STATUS")).as("array"))
  .withColumn("indicator",
    expr("transform(array, x -> array_contains(x, 'CIBASE') and array_contains(x, 'ACTIVE'))")
  )
If the order of the two values matters (this variant also compares case-insensitively):
  .withColumn("indicator",
    expr("transform(array, x -> lower(element_at(x, 1)) = 'cibase' and lower(element_at(x, 2)) = 'active')")
  )
What we have at this point:
+------------+------------------------------------+-------------+
|SITE_SITE_ID|array                               |indicator    |
+------------+------------------------------------+-------------+
|si_2        |[[SLOW, STASH]]                     |[false]      | <- make this N
|si_3        |[[MEDIUM, TREE]]                    |[false]      | <- make this N
|si_1        |[[FAST, DISABLED], [CIBASE, ACTIVE]]|[false, true]| <- make this Y (pair found)
+------------+------------------------------------+-------------+
Then we continue:
grouped = grouped
  .withColumn("indicator",
    when(array_contains(col("indicator"), true), "Y").otherwise("N")
  )
  .drop("array")
+------------+---------+
|SITE_SITE_ID|indicator|
+------------+---------+
|si_2        |N        |
|si_3        |N        |
|si_1        |Y        |
+------------+---------+
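collect_set returns an array of arrays, which is why we check each inner array for the combination. We then check whether any element of the resulting array is true (the combination was found): if so, we return Y, otherwise N. Finally, we drop the array column.
Sample of grouped: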
+------------+---------+
|SITE_SITE_ID|indicator|
+------------+---------+
|si_2        |N        |
|si_3        |N        |
|si_1        |Y        |
+------------+---------+
Finally, we join our main table with grouped:
df.join(grouped, Seq("SITE_SITE_ID"))
Final result:
+------------+-----+------------------+---------------+---------+
|SITE_SITE_ID|IR_ID|APPLICATION_SOURCE|STANDARD_STATUS|indicator|
+------------+-----+------------------+---------------+---------+
|si_2        |ir2  |SLOW              |STASH          |N        |
|si_3        |ir3  |MEDIUM            |TREE           |N        |
|si_1        |ir1  |FAST              |DISABLED       |Y        |
|si_1        |ir4  |CIBASE            |ACTIVE         |Y        |
+------------+-----+------------------+---------------+---------+
Good luck!
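As a possible alternative to the groupBy plus join, the same per-site flag can be sketched with a window function, which attaches the result to every row in one pass over the partitioned data. This is only a sketch under assumptions: the object name WindowIndicatorSketch, the helper withIndicator, and the local SparkSession settings are hypothetical, the column names are taken from the sample tables, and the case-insensitive comparison follows the UPPER(...) checks in the question rather than the first array_contains variant.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max, upper, when}

// Hypothetical helper object, not part of the original answer.
object WindowIndicatorSketch {

  // Adds "indicator": "Y" for every row whose site has at least one
  // (CIBASE, ACTIVE) row, "N" otherwise. No separate join is needed.
  def withIndicator(df: DataFrame): DataFrame = {
    // Case-insensitive match; null values make the condition null,
    // which `when` treats as not matched.
    val isMatch =
      upper(col("APPLICATION_SOURCE")) === "CIBASE" &&
        upper(col("STANDARD_STATUS")) === "ACTIVE"

    val bySite = Window.partitionBy("SITE_SITE_ID")

    // 1 for matching rows, 0 otherwise; the per-site maximum is 1
    // exactly when at least one row of that site matches.
    val anyMatch = max(when(isMatch, 1).otherwise(0)).over(bySite)

    df.withColumn("indicator", when(anyMatch === 1, "Y").otherwise("N"))
  }

  def main(args: Array[String]): Unit = {
    // Local session for demonstration only (assumed settings).
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("window-indicator-sketch")
      .getOrCreate()
    import spark.implicits._

    // Same sample rows as in the final result table.
    val df = Seq(
      ("si_2", "ir2", "SLOW", "STASH"),
      ("si_3", "ir3", "MEDIUM", "TREE"),
      ("si_1", "ir1", "FAST", "DISABLED"),
      ("si_1", "ir4", "CIBASE", "ACTIVE")
    ).toDF("SITE_SITE_ID", "IR_ID", "APPLICATION_SOURCE", "STANDARD_STATUS")

    withIndicator(df).show(false)
    spark.stop()
  }
}
```

Whether this is faster than the groupBy and join depends on the data: both approaches shuffle by SITE_SITE_ID, so it is worth comparing the physical plans with explain() on the real table before choosing one.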
@vilalabinot's answer is 100% correct, but I had to adapt it so that the same logic is applied through a UDF:
So we still need to group by SITE_SITE_ID:
val grouped0 = finvInventoryAllDf
  .groupBy("SITE_SITE_ID")
  .agg(
    collect_set(
      array(
        "InstalledOffer_applicationSource",
        "InstalledOffer_standardStatus",
        "InstalledOffer_installedOfferId"
      )
    ).as("array")
  )
But to create the new field, I use a UDF that gives me the Y or N value directly, based on the required condition:
grouped0
  .withColumn("activeIOsAtSite", buildFieldActiveIOsAtSite_UDF(col("array")))
  .drop("array")
The UDF declaration I am using:
val buildFieldActiveIOsAtSite_UDF = udf(buildFieldActiveIOsAtSite _)
And finally, the function behind this UDF:
import scala.collection.mutable
import scala.util.control.Breaks.{break, breakable}

def buildFieldActiveIOsAtSite(rows: mutable.WrappedArray[mutable.WrappedArray[String]]): String = {
  var yesOrNoCondition = "N"
  breakable {
    rows.array.foreach(r => {
      // Null-safe reads; the order matches the array(...) call in the grouping step.
      val installedOffer_applicationSource = Option(r(0)).getOrElse("")
      val installedOffer_standardStatus = Option(r(1)).getOrElse("")
      val InstalledOffer_installedOfferId = Option(r(2)).getOrElse("")
      val yesCondition = InstalledOffer_installedOfferId.nonEmpty &&
        !InstalledOffer_installedOfferId.equalsIgnoreCase("null") &&
        installedOffer_standardStatus.equalsIgnoreCase("active") &&
        installedOffer_applicationSource.equalsIgnoreCase("CIBASE")
      if (yesCondition) {
        // Stop at the first row that satisfies every condition.
        yesOrNoCondition = "Y"
        break()
      }
    })
  }
  yesOrNoCondition
}
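The same check can probably be written more compactly with exists, which also stops at the first matching element, so breakable and the mutable variable are not needed. The following is a minimal sketch, not the code I actually ran: the object name ActiveIOsSketch is hypothetical, and it assumes each inner sequence holds (applicationSource, standardStatus, installedOfferId) in the same order as the array(...) call in the grouping step.

```scala
// Hypothetical object name, used only for this sketch.
object ActiveIOsSketch {

  // Returns "Y" if any inner sequence describes an installed offer with a
  // usable id, status ACTIVE and source CIBASE (case-insensitive), else "N".
  // Assumed element order: (applicationSource, standardStatus, installedOfferId).
  def buildFieldActiveIOsAtSite(rows: Seq[Seq[String]]): String = {
    val found = rows.exists { r =>
      // Null-safe reads: a null element is treated as an empty string.
      val source  = Option(r(0)).getOrElse("")
      val status  = Option(r(1)).getOrElse("")
      val offerId = Option(r(2)).getOrElse("")

      offerId.nonEmpty &&
        !offerId.equalsIgnoreCase("null") &&
        status.equalsIgnoreCase("active") &&
        source.equalsIgnoreCase("CIBASE")
    }
    if (found) "Y" else "N"
  }
}
```

Because the parameter type is Seq[Seq[String]] rather than mutable.WrappedArray, this version should be registrable in the same way, for example with udf(ActiveIOsSketch.buildFieldActiveIOsAtSite _), and it can be unit-tested without a Spark session.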
Now all that remains is a small join with the main DataFrame:
val finvResultOutput1 = finvInventoryAllDf.join(grouped0, Seq("SITE_SITE_ID"))