
Spark Scala - Need to iterate over column in dataframe

Given the following dataframe:

+---+----------------+
|id |job_title       |
+---+----------------+
|1  |ceo             |
|2  |product manager |
|3  |surfer          |
+---+----------------+

I want to take the job_title column from the dataframe and create another column, called 'rank', derived from it:

+---+----------------+-------+
|id |job_title       | rank  |
+---+----------------+-------+
|1  |ceo             |c-level|
|2  |product manager |manager|
|3  |surfer          |other  |
+---+----------------+-------+

--- UPDATED ---

What I have tried so far:

def func(col: Column): Column = {
  val cLevel = List("ceo", "cfo")
  val managerLevel = List("manager", "team leader")

  when(col.contains(cLevel), "C-level")
    .otherwise(when(col.contains(managerLevel), "manager").otherwise("other"))
}

Currently I get this error:

type mismatch;
found   : Boolean
required: org.apache.spark.sql.Column

and I think I also have other problems in the code. Sorry, but I'm a beginner with Scala on Spark.

You can use the `when`/`otherwise` built-in functions for that case:

import org.apache.spark.sql.functions._
def func = when(col("job_title").contains("chief") || col("job_title").contains("ceo"), "c-level")
  .otherwise(when(col("job_title").contains("manager"), "manager")
    .otherwise("other"))

and you can call the function using `withColumn`:

df.withColumn("rank", func).show(false)
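The branching above can be sanity-checked without a SparkSession by mirroring it as a plain Scala function (`rankOf` is a hypothetical helper for illustration, not part of the Spark code):

```scala
// Plain-Scala mirror of the when/otherwise chain above: the first
// matching branch wins, everything unmatched falls through to "other".
def rankOf(jobTitle: String): String =
  if (jobTitle.contains("chief") || jobTitle.contains("ceo")) "c-level"
  else if (jobTitle.contains("manager")) "manager"
  else "other"

println(rankOf("ceo"))             // c-level
println(rankOf("product manager")) // manager
println(rankOf("surfer"))          // other
```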

which should give you:

+---+---------------+-------+
|id |job_title      |rank   |
+---+---------------+-------+
|1  |ceo            |c-level|
|2  |product manager|manager|
|3  |surfer         |other  |
+---+---------------+-------+

I hope the answer is helpful.

Updated

I see that you have updated your post with what you tried: you created lists of levels and want to validate against those lists. For that case you will have to write a udf function:

val cLevel = List("ceo","cfo")
val managerLevel = List("manager","team leader")

import org.apache.spark.sql.functions._
def rankUdf = udf((jobTitle: String) => jobTitle match {
  case x if(cLevel.exists(_.contains(x)) || cLevel.exists(x.contains(_))) => "C-Level"
  case x if(managerLevel.exists(_.contains(x)) || managerLevel.exists(x.contains(_))) => "manager"
  case _ => "other"
})

df.withColumn("rank", rankUdf(col("job_title"))).show(false)
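The matching logic inside `rankUdf` can be pulled out as a plain Scala function and checked without Spark. The two `exists` directions make the check symmetric: "ceo" matches the list entry "ceo", and a longer title like "senior team leader" still matches "team leader" (a sketch; `classify` is a hypothetical name):

```scala
val cLevel = List("ceo", "cfo")
val managerLevel = List("manager", "team leader")

// Same match as the udf body: a title matches a level if either string
// contains the other, so both exact and partial titles are caught.
def classify(jobTitle: String): String = jobTitle match {
  case x if cLevel.exists(_.contains(x)) || cLevel.exists(x.contains(_))             => "C-Level"
  case x if managerLevel.exists(_.contains(x)) || managerLevel.exists(x.contains(_)) => "manager"
  case _ => "other"
}

println(classify("product manager")) // manager
println(classify("surfer"))          // other
```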

which should give you your desired output.

val df = sc.parallelize(Seq(
  (1, "ceo"),
  (2, "product manager"),
  (3, "surfer"),
  (4, "Vaquar khan")
)).toDF("id", "job_title")

df.show()
// option 2: register a temp view and use Spark SQL
df.createOrReplaceTempView("user_details")


sqlContext.sql("SELECT job_title, RANK() OVER (ORDER BY id) AS rank FROM user_details").show


val df1 = sc.parallelize(Seq(
  ("ceo", "c-level"),
  ("product manager", "manager"),
  ("surfer", "other"),
  ("Vaquar khan", "Problem solver")
)).toDF("job_title", "ranks")
df1.show()
df1.createOrReplaceTempView("user_rank")


sqlContext.sql("""
  SELECT user_details.id, user_details.job_title, user_rank.ranks
  FROM user_rank
  JOIN user_details ON user_rank.job_title = user_details.job_title
  ORDER BY user_details.id
""").show

Results:

+---+---------------+
| id|      job_title|
+---+---------------+
|  1|            ceo|
|  2|product manager|
|  3|         surfer|
|  4|    Vaquar khan|
+---+---------------+

+---------------+----+
|      job_title|rank|
+---------------+----+
|            ceo|   1|
|product manager|   2|
|         surfer|   3|
|    Vaquar khan|   4|
+---------------+----+

+---------------+--------------+
|      job_title|         ranks|
+---------------+--------------+
|            ceo|       c-level|
|product manager|       manager|
|         surfer|         other|
|    Vaquar khan|Problem solver|
+---------------+--------------+

+---+---------------+--------------+
| id|      job_title|         ranks|
+---+---------------+--------------+
|  1|            ceo|       c-level|
|  2|product manager|       manager|
|  3|         surfer|         other|
|  4|    Vaquar khan|Problem solver|
+---+---------------+--------------+

df: org.apache.spark.sql.DataFrame = [id: int, job_title: string]
df1: org.apache.spark.sql.DataFrame = [job_title: string, ranks: string]
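The lookup-table join can also be mirrored in plain Scala with a `Map`, which is handy for sanity-checking the expected (id, title, rank) rows before running the Spark job (a sketch using the same sample rows as above; unseen titles fall back to "other"):

```scala
val ranks = Map(
  "ceo"             -> "c-level",
  "product manager" -> "manager",
  "surfer"          -> "other",
  "Vaquar khan"     -> "Problem solver"
)
val users = List((1, "ceo"), (2, "product manager"), (3, "surfer"), (4, "Vaquar khan"))

// Join semantics via map lookup; getOrElse plays the role of a
// left join with a default rank for unmatched titles.
val joined = users.map { case (id, title) => (id, title, ranks.getOrElse(title, "other")) }
joined.foreach(println)
```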

https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
