How do I perform general processing on a Spark StructType in Scala, such as selecting a field by name or iterating over a map/list field?
In a Spark DataFrame, I have a column "instances" of type "ArrayType" with the following schema:
instances[ArrayType]:
  0 [ StructType:
      name [StringType]
      address[StringType]
      experiences[MapType]:
        Company-1[StringType]:
          StructType:
            numYears[IntType]: 5
            grade[IntType]
        Company-2[StringType]:
          StructType:
            numYears[IntType]: 12
            grade[IntType]]
  1 [ StructType:
      name [StringType]
      address[StringType]
      experiences[MapType]:
        Company-1[StringType]:
          StructType:
            numYears[IntType]: 3
            grade[IntType]
        Company-2[StringType]:
          StructType:
            numYears[IntType]: 9
            grade[IntType]]
I need to convert this ArrayType column "instances" into a derived column "totalExperience" of type "MapType"[StringType -> IntType]:
company-1: 8
company-2: 21
Note: (5 + 3 = 8 and 12 + 9 = 21)
Equivalent pseudo-code for this:
totalExperience = Map<String, Int>();
for (instance in instances) {
    for ((currentExperience, numYears) in instance.getExperiences().entries()) {
        if (!totalExperience.contains(currentExperience)) {
            totalExperience.put(currentExperience, 0);
        }
        totalExperience.put(currentExperience, totalExperience.get(currentExperience) + numYears);
    }
}
return totalExperience
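To make the intended aggregation concrete, here is a minimal sketch of the same loop over plain Scala collections, with no Spark involved. The case classes `Experience` and `Instance` and the object `TotalExperiencePlain` are hypothetical stand-ins for the structs in the schema above, and `numYears` is read from each experience struct rather than bound directly from the map entry.

```scala
// Hypothetical plain-Scala stand-ins for the Spark structs in the schema above.
final case class Experience(numYears: Int, grade: Int)
final case class Instance(name: String, address: String, experiences: Map[String, Experience])

object TotalExperiencePlain {
  // Adds up numYears per company across all instances.
  def totalExperience(instances: Seq[Instance]): Map[String, Int] =
    instances.foldLeft(Map.empty[String, Int]) { (totals, instance) =>
      instance.experiences.foldLeft(totals) { case (acc, (company, experience)) =>
        // Start from 0 the first time a company is seen, then accumulate.
        acc.updated(company, acc.getOrElse(company, 0) + experience.numYears)
      }
    }
}
```

With the two sample instances from the schema, this would map Company-1 to 5 + 3 and Company-2 to 12 + 9.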
I have written a UDF for this as follows, but I could not find a way to implement the above pseudo-code in Scala Spark:
private val computeTotalExperience = udf(_ => MapType = (instances: ArrayType) => {
    val totalExperienceByCompany = DataTypes.createMapType(StringType, LongType)
    // How to iterate over "instances" with type "ArrayType"?
    for (instance <- instances) {
        // How to access and iterate over the "experiences" MapType field on instance?
        // Populate totalExperienceByCompany (MapType) with key as "company-1" name
    }
    delayReasons
})
How do I perform this kind of general processing on ListType, MapType, and StructType fields of a Spark DataFrame in a Scala UDF?
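Check the code below. It does not use a UDF: it explodes the array, pulls numYears for each company out of the experiences map, and sums per company with groupBy. Note that the company keys are hard-coded in expr and that the result has one row per company.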
scala> df.printSchema
root
 |-- instances: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- address: string (nullable = true)
 |    |    |-- experiences: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: struct (valueContainsNull = true)
 |    |    |    |    |-- numYears: integer (nullable = true)
 |    |    |    |    |-- grade: string (nullable = true)
 |    |    |-- name: string (nullable = true)
scala> df.show(false)
+-----------------------------------------------------------------------------------------------------------------------------------+
|instances |
+-----------------------------------------------------------------------------------------------------------------------------------+
|[[address_0, [Company-1 -> [5, 1], Company-2 -> [12, 1]], name_0], [address_1, [Company-1 -> [3, 1], Company-2 -> [9, 1]], name_1]]|
+-----------------------------------------------------------------------------------------------------------------------------------+
scala>
val expr = array(
    struct(lit("company-1").as("company"), $"instance.experiences.Company-1.numYears"),
    struct(lit("company-2").as("company"), $"instance.experiences.Company-2.numYears")
)
scala>
df
    .withColumn("instance", explode($"instances"))
    .withColumn("company", explode(expr))
    .select("company.*")
    .groupBy($"company")
    .agg(sum($"numYears").as("numYears"))
    .select(map($"company", $"numYears").as("totalExperience"))
    .show(false)
+-----------------+
|totalExperience |
+-----------------+
|[company-1 -> 8] |
|[company-2 -> 21]|
+-----------------+
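If a UDF is still preferred, as in the question, a minimal sketch could look like the following. It relies on how Spark hands complex types to Scala UDFs: an array of structs arrives as Seq[Row], a map field as scala.collection.Map, and a struct field can be selected by name with getAs or fieldIndex. The object name `TotalExperience` and the helpers `yearsOf` and `sumByCompany` are hypothetical, and treating null structs or null numYears values as 0 is an assumption, not something the question specifies.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

object TotalExperience {

  // Reads numYears from one experience struct; a null struct or null field counts as 0 (assumption).
  def yearsOf(experience: Row): Int =
    if (experience == null) 0
    else {
      val idx = experience.fieldIndex("numYears") // select a struct field by name
      if (experience.isNullAt(idx)) 0 else experience.getInt(idx)
    }

  // Sums the years per company over a flat list of (company, years) pairs.
  def sumByCompany(entries: Seq[(String, Int)]): Map[String, Int] =
    entries.groupBy(_._1).map { case (company, pairs) => company -> pairs.map(_._2).sum }

  // array<struct<...>> is passed in as Seq[Row]; the returned Map[String, Int]
  // becomes a map<string,int> column.
  val computeTotalExperience = udf { (instances: Seq[Row]) =>
    val entries = for {
      instance <- Option(instances).getOrElse(Seq.empty[Row]) if instance != null
      experiences <- Option(instance.getAs[scala.collection.Map[String, Row]]("experiences")).toSeq
      (company, experience) <- experiences.toSeq
    } yield company -> yearsOf(experience)
    sumByCompany(entries)
  }
}
```

It would be applied as df.withColumn("totalExperience", TotalExperience.computeTotalExperience($"instances")). Unlike the explode-based version, this keeps one row per input row with a single map column, and the keys keep their original spelling (Company-1, not company-1).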