
How to apply a function to a column of a Spark DataFrame?

Let's assume that we have a Spark DataFrame

df.getClass
Class[_ <: org.apache.spark.sql.DataFrame] = class org.apache.spark.sql.DataFrame

with the following schema

df.printSchema
root
|-- rawFV: string (nullable = true)
|-- tk: array (nullable = true)
|    |-- element: string (containsNull = true)

Given that each row of the tk column is an array of strings, how can I write a Scala function that returns the number of elements in each row?

You don't have to write a custom function because there is one:

import org.apache.spark.sql.functions.size

df.select(size($"tk"))

If you really want to, you can write a UDF:

import org.apache.spark.sql.functions.udf

val size_ = udf((xs: Seq[String]) => xs.size)

or even create a custom expression, but there is really no point in that.
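
For completeness, applying that UDF looks just like applying the built-in function (this reuses the size_ definition above):

// Apply the UDF to the `tk` column and display the per-row element counts.
df.select(size_($"tk")).show()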

One way is to access the array elements using SQL, as shown below.

df.registerTempTable("tab1")
val df2 = sqlContext.sql("select tk[0], tk[1] from tab1")

df2.show()

To get the size of the array column:

val df3 = sqlContext.sql("select size(tk) from tab1")
df3.show()

If your Spark version is older, you can use HiveContext instead of Spark's SQLContext.
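
As a rough sketch, assuming a Spark 1.x setup where sc is an existing SparkContext and df is built through this context (so the temp table is visible to it):

import org.apache.spark.sql.hive.HiveContext

// Use a HiveContext in place of the plain SQLContext; assumes `sc` is the running SparkContext
// and that `df` was created via this context.
val sqlContext = new HiveContext(sc)
df.registerTempTable("tab1")
sqlContext.sql("select size(tk) from tab1").show()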

I would also try something that traverses the array explicitly, as sketched below.
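
For example, a minimal sketch of traversing the array per row through the RDD API (the column name tk comes from the schema above; null handling is omitted):

// Pull the array out of each Row and count its elements by hand.
val counts = df.rdd.map(row => row.getAs[Seq[String]]("tk").size)
counts.take(5).foreach(println)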
