
Custom processing on column in Apache Spark (Java)

I loaded a JSON document in Spark. Roughly, its schema looks like:

root
 |-- datasetid: string (nullable = true)
 |-- fields: struct (nullable = true)
...
 |    |-- type_description: string (nullable = true)

I then derive a new column in my DataFrame from it:

df = df.withColumn("desc", df.col("fields.type_description"));

All fine, but the value of type_description looks like: "1 - My description type".

Ideally, I'd like my df to contain only the textual part, e.g. "My description type". I know how to do that on a plain string, but how can I do it through Spark?

I was hoping for something along the lines of:

df = df.withColumn("desc", df.col("fields.type_description").call(/* some kind of transformation class / method*/));

Thanks!

In general, Spark provides a broad set of SQL functions, ranging from basic string processing utilities, through date/time processing functions, to different statistical summaries. These are part of org.apache.spark.sql.functions. In this particular case you probably want something like this:

import static org.apache.spark.sql.functions.*;

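// DataFrames are immutable, so keep the result of withColumn.
df = df.withColumn("desc",
  // Strip a leading "<digits> - " prefix, leaving only the textual part.
  regexp_replace(df.col("fields.type_description"), "^[0-9]*\\s*-\\s*", "")
);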
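
To see what that pattern does before running it on a cluster, here is a minimal plain-Java sketch. It assumes only that Spark's regexp_replace uses Java regular expression syntax and replaces every match. The class and method names (PrefixPatternDemo, stripPrefix) are made up for illustration and are not part of Spark.

```java
import java.util.regex.Pattern;

public class PrefixPatternDemo {
    // Same pattern as in the regexp_replace call: optional digits,
    // optional whitespace, a dash, optional whitespace, anchored at the start.
    static final Pattern NUMERIC_PREFIX = Pattern.compile("^[0-9]*\\s*-\\s*");

    // Hypothetical helper: removes the leading "<digits> - " prefix if present.
    static String stripPrefix(String value) {
        return NUMERIC_PREFIX.matcher(value).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripPrefix("1 - My description type")); // My description type
        System.out.println(stripPrefix("12-Other type"));           // Other type
        System.out.println(stripPrefix("No prefix here"));          // No prefix here
    }
}
```

Because the pattern is anchored with ^, a dash later in the text is left alone. Note that the digits are optional, so a value starting with a bare "- " would also lose that dash.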

Generally speaking, these functions should be your first choice when working with Spark SQL. They are backed by Catalyst expressions and typically provide codegen utilities. This means you can fully benefit from the different Spark SQL optimizations.

An alternative, but less efficient, approach is to implement a custom UDF. See for example "Creating a SparkSQL UDF in Java outside of SQLContext".
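
For comparison, a UDF version could look roughly like the sketch below. It assumes Spark 2.3 or later (the Dataset<Row> API and the functions.udf overload that takes a UDF1 and a return type), which is newer than the SQLContext-era API in the question. The names DescriptionUdf, stripNumericPrefix and withDescription are hypothetical.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.expressions.UserDefinedFunction;
import org.apache.spark.sql.types.DataTypes;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.udf;

public class DescriptionUdf {
    // Plain Java logic (hypothetical helper): drop a leading "<digits> - " prefix.
    // Nulls are passed through, since the column is nullable.
    static String stripNumericPrefix(String value) {
        return value == null ? null : value.replaceFirst("^[0-9]*\\s*-\\s*", "");
    }

    // Wraps the helper in a Spark UDF and applies it to the nested column.
    static Dataset<Row> withDescription(Dataset<Row> df) {
        UserDefinedFunction strip = udf(
            (UDF1<String, String>) DescriptionUdf::stripNumericPrefix,
            DataTypes.StringType);
        return df.withColumn("desc", strip.apply(col("fields.type_description")));
    }
}
```

Unlike regexp_replace, the UDF body is opaque to Catalyst, which is why the built-in function is preferable whenever one fits.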

