
Encoder[Row] in Scala Spark

I'm trying to perform a simple map on a Dataset[Row] (DataFrame) in Spark 2.0.0. Something as simple as this:

val df: Dataset[Row] = ...
df.map { r: Row => r }

But the compiler is complaining that I'm not providing the implicit Encoder[Row] argument to the map function:

not enough arguments for method map: (implicit evidence$7: Encoder[Row]).

Everything works fine if I convert to an RDD first, df.rdd.map { r: Row => r }, but shouldn't there be an easy way to get an Encoder[Row], like there is for tuple types with Encoders.product[(Int, Double)]?

[Note that my Row is dynamically sized in such a way that it can't easily be converted into a strongly-typed Dataset.]

An Encoder needs to know how to pack the elements inside the Row. So you could write your own Encoder[Row] by using the Row's StructType, which determines the elements of your Row at runtime, and by using the corresponding encoders for each field.
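A minimal sketch of that idea, assuming the schema is available at runtime (e.g. from the DataFrame itself) and that the mapping function preserves it, using Spark's built-in RowEncoder:

import org.apache.spark.sql.{DataFrame, Dataset, Encoder, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder

// Sketch: build the Encoder[Row] from the schema known only at runtime.
// Assumes f returns Rows that still match df.schema.
def mapRows(df: DataFrame)(f: Row => Row): Dataset[Row] = {
  implicit val rowEncoder: Encoder[Row] = RowEncoder(df.schema)
  df.map(f) // the implicit Encoder[Row] is now in scope
}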

Or if you know more about the data that goes into the Row, you could use https://github.com/adelbertc/frameless/

Sorry to be a "bit" late. Hopefully this helps someone who is hitting the problem right now. The easiest way to define an encoder is deriving the structure from an existing DataFrame:

val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "name")
val myEncoder = RowEncoder(df.schema)

Such an approach can be useful when you need to alter existing fields of your original DataFrame.
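For instance, a hedged sketch reusing df and myEncoder from above (and assuming the mapping keeps the id/name schema; the required imports are listed a bit further down):

// Uppercase the "name" column while keeping the original schema;
// the derived encoder is passed explicitly to map.
val shouted = df.map { r: Row =>
  Row(r.getInt(0), r.getString(1).toUpperCase)
}(myEncoder)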

If you're dealing with a completely new structure, an explicit definition relying on StructType and StructField is the way to go (as suggested in @Reactormonk's slightly cryptic response).

An example defining the same encoder:

val myEncoder2 = RowEncoder(StructType(
  Seq(StructField("id", IntegerType), 
      StructField("name", StringType)
  )))

Please remember that org.apache.spark.sql._, org.apache.spark.sql.types._ and org.apache.spark.sql.catalyst.encoders.RowEncoder have to be imported.
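For reference, the corresponding import statements (exactly the packages named above):

import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.catalyst.encoders.RowEncoder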

In the specific case where the map function does not change the schema, you can pass in the encoder of the DataFrame itself:

df.map(r => r)(df.encoder)
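The same works for maps that modify values but leave the schema intact; a small sketch, assuming the two-column id/name DataFrame used in the answer above:

// The result has the same schema as df, so df.encoder is a valid Encoder[Row] for it.
val trimmed = df.map { r: Row =>
  Row(r.getInt(0), r.getString(1).trim)
}(df.encoder)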
