
Retrieve the array stored in a dataframe column for each row using Scala Spark

I have the following dataframe:

+-------------------+------------------------+
|value              |feeling                 |
+-------------------+------------------------+
|Sam got these marks|[sad, sad, disappointed]|
|I got good marks   |[happy, excited, happy] |
+-------------------+------------------------+

I want to iterate through this dataframe, get the array in the feeling column for each row, and pass that array to a calculation method.

def calculationMethod(arrayValue: Array[String]) = {
  // get average of words
}
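For illustration, one hypothetical reading of "average of words" is the most frequent element of the array; a plain-Scala sketch (the return type and logic are assumptions, since the method body is not given in the question):

```scala
// Hypothetical sketch: interprets the "average" of the words as the
// most frequent element of the array (e.g. [sad, sad, disappointed] -> sad).
def calculationMethod(arrayValue: Seq[String]): String =
  arrayValue
    .groupBy(identity)                       // group identical words together
    .maxBy { case (_, group) => group.size } // pick the largest group
    ._1                                      // keep the word itself

println(calculationMethod(Seq("sad", "sad", "disappointed"))) // sad
```

Note that ties between equally frequent words are resolved arbitrarily here.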

Expected output dataframe:

+-------------------+------------------------+-------+
|value              |feeling                 |average|
+-------------------+------------------------+-------+
|Sam got these marks|[sad, sad, disappointed]|sad    |
|I got good marks   |[happy, excited, happy] |happy  |
+-------------------+------------------------+-------+

I am not sure how to iterate through each row and get the array in the second column so that it can be passed into my method. Also, please note that the dataframe shown in the question is a streaming dataframe.

EDIT 1

val calculateUDF = udf(calculationMethod _)
val editedDataFrame = filteredDataFrame.withColumn("average", calculateUDF(col("feeling")))

def calculationMethod(emojiArray: Seq[String]): DataFrame = {
  val existingSparkSession = SparkSession.builder().getOrCreate()
  import existingSparkSession.implicits._
  val df = emojiArray.toDF("feeling")
  val result = df.selectExpr(
    "feeling",
    "'U+' || trim('0' , string(hex(encode(feeling, 'utf-32')))) as unicode"
  )
  result
}

I'm getting the following error:

Schema for type org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] is not supported

Please note that the initial dataframe mentioned in the question is a streaming dataframe.
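The error occurs because a Scala UDF must return a type Spark can encode as a column value (String, Seq[String], numeric types, case classes, and so on); a Dataset/DataFrame has no such encoder, so building one inside the UDF cannot work. Assuming the goal is the U+ code point of each emoji, a sketch of a UDF-compatible version (plain Scala, no SparkSession inside the function):

```scala
// A UDF must return an encodable type (here Seq[String]),
// not a DataFrame built inside the function.
def toUnicode(emoji: String): String =
  f"U+${emoji.codePointAt(0)}%X" // e.g. "😀" -> "U+1F600"

def calculationMethod(emojiArray: Seq[String]): Seq[String] =
  emojiArray.map(toUnicode)

// Wiring it up would then look like this (Spark API, not run here):
//   val calculateUDF = udf(calculationMethod _)
//   filteredDataFrame.withColumn("unicode", calculateUDF(col("feeling")))
```

This only looks at the first code point of each array element, which matches single-emoji strings like those in the question.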

EDIT 2

This should be the final dataframe that I am expecting

+-------------------+----------+-------------------------+
|value              |feeling   |unicode                  |
+-------------------+----------+-------------------------+
|Sam got these marks|[😀😆😁]  |[U+1F600 U+1F606 U+1F601]|
|I got good marks   |[😄🙃]    |[U+1F604 U+1F643]        |
+-------------------+----------+-------------------------+

You can transform the arrays instead of using a UDF:

val df2 = df.withColumn(
    "unicode", 
    expr("transform(feeling, x -> 'U+' || ltrim('0' , string(hex(encode(x, 'utf-32')))))")
)

df2.show(false)
+-------------------+------------+---------------------------+
|value              |feeling     |unicode                    |
+-------------------+------------+---------------------------+
|Sam got these marks|[😀, 😆, 😁]|[U+1F600, U+1F606, U+1F601]|
|I got good marks   |[😄, 🙃]    |[U+1F604, U+1F643]         |
+-------------------+------------+---------------------------+
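`transform` is a built-in Spark SQL higher-order function (available since Spark 2.4), so this stays a pure column expression and also works on a streaming dataframe. For reference, the SQL lambda corresponds roughly to the following plain-Scala logic (a sketch of the same steps, not Spark API; `UTF-32BE` is used to make the byte order explicit):

```scala
// Plain-Scala trace of the SQL lambda
// 'U+' || ltrim('0', string(hex(encode(x, 'utf-32')))):
def sqlLambda(x: String): String = {
  // encode(x, 'utf-32') then hex(...): UTF-32 big-endian bytes as uppercase hex
  val utf32Hex = x.getBytes("UTF-32BE").map(b => f"${b & 0xFF}%02X").mkString
  // ltrim('0', ...) strips the leading zeros; 'U+' || ... prepends the prefix
  "U+" + utf32Hex.dropWhile(_ == '0')
}

println(sqlLambda("😀")) // U+1F600
```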
