
Concatenate spark data frame column with its rows in Scala

I am trying to build a string by concatenating values from a data frame. For example:

val df = Seq(
  ("20181001","10"),     
  ("20181002","40"),
  ("20181003","50")).toDF("Date","Key")
df.show

The output of the DF looks like this:

+--------+---+
|    Date|Key|
+--------+---+
|20181001| 10|
|20181002| 40|
|20181003| 50|
+--------+---+

Here I want to build a condition based on the values of the data frame, such as: (Date=20181001 and Key=10) or (Date=20181002 and Key=40) or (Date=20181003 and Key=50), and so on. The generated condition will serve as input for another process. The columns in the data frame can be dynamic.

I am trying the snippet below, and it forms the string as needed, but it is static. I am also not sure how it will perform when I have to generate the condition for more than 10 columns. Any suggestion is highly appreciated.

val df = Seq(
  ("20181001","10"),     
  ("20181002","40"),
  ("20181003","50")).toDF("Date","Key")

val colList = df.columns
var cond1 = ""
var finalCond =""
for (row <- df.rdd.collect)
 {
    cond1 = "("
    var pk = row.mkString(",").split(",")(0)
    cond1 = cond1+colList(0)+"="+pk
    var ak = row.mkString(",").split(",")(1)
    cond1 = cond1 +" and " + colList(1)+ "=" +ak +")"
    finalCond = finalCond + cond1 + " or " 
    cond1= ""    
 }
 print("Condition:" +finalCond.dropRight(3))

Condition:(Date=20181001 and Key=10) or (Date=20181002 and Key=40) or (Date=20181003 and Key=50)

Check this DF solution:

scala> val df = Seq(
       |   ("20181001","10"),
       |   ("20181002","40"),
       |   ("20181003","50")).toDF("Date","Key")
df: org.apache.spark.sql.DataFrame = [Date: string, Key: string]

scala> val df2 = df.withColumn("gencond",concat(lit("(Date="), 'Date, lit(" and Key=") ,'Key,lit(")")))
df2: org.apache.spark.sql.DataFrame = [Date: string, Key: string ... 1 more field]


scala> df2.agg(collect_list('gencond)).show(false)
+------------------------------------------------------------------------------------+
|collect_list(gencond)                                                               |
+------------------------------------------------------------------------------------+
|[(Date=20181001 and Key=10), (Date=20181002 and Key=40), (Date=20181003 and Key=50)]|
+------------------------------------------------------------------------------------+
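
To collapse that list into a single condition string, one option (a sketch of the same idea the EDIT1 section below uses) is to wrap the collected list in concat_ws:

// Join the collected conditions with " or " (sketch; same pattern as in EDIT1 below)
df2.agg(concat_ws(" or ", collect_list('gencond)).as("cond")).show(false)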

EDIT1

You can read the columns from the parquet files and just rename them as in this solution. In the final step, replace the names back with the ones from the parquet header. Check this:

scala> val df = Seq(("101","Jack"),("103","wright")).toDF("id","name")  // Original names from parquet
df: org.apache.spark.sql.DataFrame = [id: string, name: string]

scala> val df2= df.select("*").toDF("Date","Key")  // replace it with Date/Key as we used in this question
df2: org.apache.spark.sql.DataFrame = [Date: string, Key: string]

scala> val df3 = df2.withColumn("gencond",concat(lit("(Date="), 'Date, lit(" and Key=") ,'Key,lit(")")))
df3: org.apache.spark.sql.DataFrame = [Date: string, Key: string ... 1 more field]

scala> val df4=df3.agg(collect_list('gencond).as("list"))
df4: org.apache.spark.sql.DataFrame = [list: array<string>]

scala> df4.select(concat_ws(" or ",'list)).show(false)
+----------------------------------------------------+
|concat_ws( or , list)                               |
+----------------------------------------------------+
|(Date=101 and Key=Jack) or (Date=103 and Key=wright)|
+----------------------------------------------------+

scala> val a = df.columns(0)
a: String = id

scala> val b = df.columns(1)
b: String = name

scala>  df4.select(concat_ws(" or ",'list).as("new1")).select(regexp_replace('new1,"Date",a).as("colx")).select(regexp_replace('colx,"Key",b).as("colxy")).show(false)
+--------------------------------------------------+
|colxy                                             |
+--------------------------------------------------+
|(id=101 and name=Jack) or (id=103 and name=wright)|
+--------------------------------------------------+

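As a side note, if the goal is to handle an arbitrary set of columns, a variant of the above (my sketch, not part of the original answer) builds the per-row expression directly from df.columns instead of renaming and then using regexp_replace:

// Sketch: build "(col1=v1 and col2=v2 ...)" generically from whatever columns df has
import org.apache.spark.sql.functions._

val perRow = df.columns
  .map(c => concat(lit(c + "="), col(c)))
  .reduce((a, b) => concat(a, lit(" and "), b))

df.withColumn("gencond", concat(lit("("), perRow, lit(")")))
  .agg(concat_ws(" or ", collect_list("gencond")))
  .show(false)

With the id/name frame above, this should give the same (id=101 and name=Jack) or (id=103 and name=wright) string without the rename round-trip.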

Calling collect pulls the result back to the driver program, so if you have a huge DataFrame you may well run out of memory.

If you are sure you are only dealing with a smallish number of rows, that isn't a problem.

You could do something like:

df.map(row => s"($Date={row.getString(0)} and Key=${row.getString(1)})").collect.mkString("Condition: ", " or ", "")

Output:

res2: String = Condition: (Date=20181001 and Key=10) or (Date=20181002 and Key=40) or (Date=20181003 and Key=50)
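
If the columns need to stay dynamic here as well, one possibility (a sketch; it captures the column names on the driver so the DataFrame itself is not pulled into the closure) is:

// Capture the column names outside the closure, then build the condition per row
val cols = df.columns
df.map(row => cols.zip(row.toSeq).map { case (c, v) => s"$c=$v" }.mkString("(", " and ", ")"))
  .collect.mkString("Condition: ", " or ", "")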

Using a udf, you can handle a variable number of columns as below:

val list=List("Date","Key")

def getCondString(row:Row):String={
    "("+list.map(cl=>cl+"="+row.getAs[String](cl)).mkString(" and ")+")"
  }

val getCondStringUDF=udf(getCondString _)
df.withColumn("row", getCondStringUDF(struct(df.columns.map(df.col(_)):_*))).select("row").rdd.map(_(0).toString()).collect().mkString(" or ")

