
Scala spark + encoder issues

I'm working on a problem where I need to add a new column that holds the combined character length of the values under all columns.

My sample data set:

ItemNumber,StoreNumber,SaleAmount,Quantity, Date
2231      ,  1        , 400      ,  2     , 19/01/2020
2145      ,  3        , 500      ,  10    , 14/01/2020

The expected output would be:

19 20

The ideal output I'm expecting to build is the data frame with a new column Length added:

ItemNumber,StoreNumber,SaleAmount,Quantity, Date      , Length
2231      ,  1        , 400      ,  2     , 19/01/2020, 19
2145      ,  3        , 500      ,  10    , 14/01/2020, 20

My code:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.StructType

val spark = SparkSession.builder()
  .appName("SimpleNewIntColumn").master("local").enableHiveSupport().getOrCreate()

val df = spark.read.option("header", "true").csv("./data/sales.csv")

// Rebuild the schema of the input frame, column by column
var schema = new StructType
df.schema.toList.foreach { each =>
  schema = schema.add(each)
}
val encoder = RowEncoder(schema)

// Sum the character lengths of all values in a row
val charLength = (row: Row) => {
  var len: Int = 0
  row.toSeq.foreach {
    case a: Int    => len = len + a.toString.length
    case a: String => len = len + a.length
  }
  len
}

df.map(row => charLength(row))(encoder) // ERROR - Required Encoder[Int], Found ExpressionEncoder[Row]

df.withColumn("Length", ?)

I have two issues:

1) How do I solve the error "Required Encoder[Int], Found ExpressionEncoder[Row]"?

2) How do I add the output of the charLength function as a new column value? - df.withColumn("Length", ?)

Thank you.

Gurupraveen

If you are just trying to add a column with the total length of each Row:

You can simply cast all the columns to String, concat them, and apply the length function:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType

// Cast every column to String and concatenate them into a single value
val concatCol = concat(df.columns.map(col(_).cast(StringType)): _*)

df.withColumn("Length", length(concatCol))

Output:

+----------+-----------+----------+--------+----------+------+
|ItemNumber|StoreNumber|SaleAmount|Quantity|      Date|Length|
+----------+-----------+----------+--------+----------+------+
|      2231|          1|       400|       2|19/01/2020|    19|
|      2145|          3|       500|      10|14/01/2020|    20|
+----------+-----------+----------+--------+----------+------+
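One caveat with the concat approach: concat returns null if any of its inputs is null, so on data with missing values you may prefer concat_ws("", ...), which skips nulls.

As for the first question: the error appears because the lambda passed to df.map returns an Int, so Spark expects an Encoder[Int], while the code supplies the RowEncoder built for the original schema. Below is a minimal sketch of the row-by-row alternative, assuming you want to keep the charLength logic and append its result as the Length column; the widened schema and the null guard are my own additions, not from the question.

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{IntegerType, StructType}

// Schema of the output rows: the original columns plus the new Length column
val widenedSchema: StructType = df.schema.add("Length", IntegerType)

// Return a Row (not an Int) so that a RowEncoder for the widened schema matches
val withLength = df.map { row =>
  val len = row.toSeq.map {
    case null      => 0                      // guard against null cells (assumption)
    case s: String => s.length
    case other     => other.toString.length  // any non-String columns
  }.sum
  Row.fromSeq(row.toSeq :+ len)
}(RowEncoder(widenedSchema))

Alternatively, if a plain Dataset[Int] of per-row lengths is all you need, import spark.implicits._ brings an implicit Encoder[Int] into scope, and df.map(charLength) then compiles without passing an encoder at all.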
