Custom processing on a column in Apache Spark (Java)

I loaded a JSON document in Spark; roughly, its schema looks like this:

root
 |-- datasetid: string (nullable = true)
 |-- fields: struct (nullable = true)
...
 |    |-- type_description: string (nullable = true)

I extract the field into a column of my DataFrame with:

df = df.withColumn("desc", df.col("fields.type_description"));

That works fine, but type_description's values look like "1 - My description type".

Ideally, I'd like my DataFrame to contain only the textual part, e.g. "My description type". I know how to do that in plain Java, but how do I do it through Spark?

I was hoping for something along the lines of:

df = df.withColumn("desc", df.col("fields.type_description").call(/* some kind of transformation class / method*/));

Thanks!

In general, Spark provides a broad set of SQL functions, ranging from basic string-processing utilities through date/time functions to various statistical summaries. These are part of org.apache.spark.sql.functions. In this particular case you probably want something like this:

import static org.apache.spark.sql.functions.*;

// withColumn returns a new DataFrame, so reassign the result;
// the regex strips the leading "<number> - " prefix.
df = df.withColumn("desc",
  regexp_replace(df.col("fields.type_description"), "^[0-9]*\\s*-\\s*", "")
);

Generally speaking, these functions should be your first choice when working with Spark SQL. They are backed by Catalyst expressions and typically support code generation, which means you can fully benefit from Spark SQL's optimizations.
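One quick way to see this, sketched below on the df from the question: printing the execution plan shows regexp_replace as a built-in expression inside the projection, rather than an opaque function call.

// Print the physical plan; regexp_replace appears as a native
// Catalyst expression, which the optimizer can reason about.
df.withColumn("desc",
    regexp_replace(df.col("fields.type_description"), "^[0-9]*\\s*-\\s*", "")
).explain();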

An alternative, but less efficient, approach is to implement a custom UDF. See for example Creating a SparkSQL UDF in Java outside of SQLContext.
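For completeness, here is a minimal sketch of that approach, assuming Spark 2.x+ with a SparkSession named spark (on older versions, registration goes through sqlContext.udf() instead); the UDF name stripPrefix is arbitrary:

import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

import static org.apache.spark.sql.functions.callUDF;

// Register a plain-Java function under the name "stripPrefix".
spark.udf().register("stripPrefix",
    (UDF1<String, String>) s ->
        s == null ? null : s.replaceFirst("^[0-9]*\\s*-\\s*", ""),
    DataTypes.StringType);

// Apply it by name. Catalyst treats the UDF as a black box, so it
// cannot optimize through it the way it can with regexp_replace.
df = df.withColumn("desc",
    callUDF("stripPrefix", df.col("fields.type_description")));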
