Custom processing on a column in Apache Spark (Java)

I loaded a JSON document in Spark; roughly, its schema looks like this:

root
 |-- datasetid: string (nullable = true)
 |-- fields: struct (nullable = true)
...
 |    |-- type_description: string (nullable = true)

I extract the field into a column of my DataFrame with:

df = df.withColumn("desc", df.col("fields.type_description"));

That works fine, but type_description's values look like "1 - My description type".

Ideally, I'd like my DataFrame to contain only the textual part, e.g. "My description type". I know how to do that in plain Java, but how do I do it through Spark?

I was hoping for something along the lines of:

df = df.withColumn("desc", df.col("fields.type_description").call(/* some kind of transformation class / method*/));

Thanks!

In general, Spark provides a broad set of SQL functions, ranging from basic string-processing utilities through date/time functions to various statistical summaries. These are part of org.apache.spark.sql.functions. In this particular case you probably want something like this:

import static org.apache.spark.sql.functions.*;

// withColumn returns a new DataFrame, so reassign the result;
// the regex strips the leading "<number> - " prefix.
df = df.withColumn("desc",
  regexp_replace(df.col("fields.type_description"), "^[0-9]*\\s*-\\s*", "")
);

Generally speaking, these functions should be your first choice when working with Spark SQL. They are backed by Catalyst expressions and typically support code generation, which means you can fully benefit from Spark SQL's optimizations.
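One quick way to see this, sketched below on the df from the question: printing the execution plan shows regexp_replace as a built-in expression inside the projection, rather than an opaque function call.

// Print the physical plan; regexp_replace appears as a native
// Catalyst expression, which the optimizer can reason about.
df.withColumn("desc",
    regexp_replace(df.col("fields.type_description"), "^[0-9]*\\s*-\\s*", "")
).explain();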

An alternative, but less efficient, approach is to implement a custom UDF. See for example Creating a SparkSQL UDF in Java outside of SQLContext.
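For completeness, here is a minimal sketch of that approach, assuming Spark 2.x+ with a SparkSession named spark (on older versions, registration goes through sqlContext.udf() instead); the UDF name stripPrefix is arbitrary:

import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

import static org.apache.spark.sql.functions.callUDF;

// Register a plain-Java function under the name "stripPrefix".
spark.udf().register("stripPrefix",
    (UDF1<String, String>) s ->
        s == null ? null : s.replaceFirst("^[0-9]*\\s*-\\s*", ""),
    DataTypes.StringType);

// Apply it by name. Catalyst treats the UDF as a black box, so it
// cannot optimize through it the way it can with regexp_replace.
df = df.withColumn("desc",
    callUDF("stripPrefix", df.col("fields.type_description")));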
