
Spark DataFrame for-if loop taking a long time

I have a Spark DF (df):

[image: input DataFrame with start_time, end_time and words columns]

I have to convert it into something like this:

[image: expected output with start_time, end_time and Sentences columns]

Basically, it should detect a new sentence whenever it finds a full stop (".") and start another row.

I have written the following code for this:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("elasticSpark").master("local[*]").config("spark.scheduler.mode", "FAIR").getOrCreate()
import spark.implicits._

val count = df.count.toInt
var emptyDF = Seq.empty[(Int, Int, String)].toDF("start_time", "end_time", "Sentences")
var b = 0
for (a <- 1 to count) {
  // A sentence ends at a full stop or at the last row
  if (df.select("words").head(a)(a - 1).toSeq.head == "." || a == (count - 1)) {
    val myList1 = df.select("words").head(a).toArray.map(_.getString(0))
    val myList = df.select("words").head(a).toArray.map(_.getString(0)).splitAt(b)._2
    val text = myList.mkString(" ")
    val end_time = df.select("end_time").head(a)(a - 1).toSeq.head.toString.toInt
    val start_time = df.select("start_time").head(a)(b).toSeq.head.toString.toInt
    val df1 = spark.sparkContext.parallelize(Seq(start_time)).toDF("start_time")
    val df2 = spark.sparkContext.parallelize(Seq(end_time)).toDF("end_time")
    val df3 = spark.sparkContext.parallelize(Seq(text)).toDF("Sentences")
    val df4 = df1.crossJoin(df2).crossJoin(df3)
    emptyDF = emptyDF.union(df4).toDF
    b = a
  }
}

Though it gives the correct output, it takes ages to complete the iteration, and I have 117 other DFs that I need to run this on.

Is there any way to tune this code, or any other way to achieve the above operation? Any help will be deeply appreciated.

Here is my try. You can use a window to separate the sentences by counting the number of "." occurrences in the following rows.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val w = Window.orderBy("start_time").rowsBetween(Window.currentRow, Window.unboundedFollowing)
val df = Seq((132, 135, "Hi"),
             (135, 135, ","),
             (143, 152, "I"),
             (151, 152, "am"),
             (159, 169, "working"),
             (194, 197, "on"),
             (204, 211, "hadoop"),
             (211, 211, "."),
             (218, 222, "This"),
             (226, 229, "is"), 
             (234, 239, "Spark"),
             (245, 249, "DF"),
             (253, 258, "coding"),
             (258, 258, "."),
             (276, 276, "I")).toDF("start_time", "end_time", "words")
df.withColumn("count", count(when(col("words") === ".", true)).over(w))
  .groupBy("count")
  .agg(min("start_time").as("start_time"), max("end_time").as("end_time"), concat_ws(" ", collect_list("words")).as("Sentences"))
  .drop("count").show(false)

This will give you the following result, though it leaves a space between each word and the "," or "." that follows it:

+----------+--------+-----------------------------+
|start_time|end_time|Sentences                    |
+----------+--------+-----------------------------+
|132       |211     |Hi , I am working on hadoop .|
|218       |258     |This is Spark DF coding .    |
|276       |276     |I                            |
+----------+--------+-----------------------------+
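
If you want to remove that extra space before the punctuation, a minimal follow-up sketch (assuming the grouped result above is stored in a DataFrame named result, a name introduced here for illustration) could use regexp_replace:

import org.apache.spark.sql.functions.{col, regexp_replace}

// Collapse the space that concat_ws inserts before "," and "." so the sentences read naturally
val cleaned = result.withColumn("Sentences", regexp_replace(col("Sentences"), "\\s+([.,])", "$1"))
cleaned.show(false)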
scala> import org.apache.spark.sql.expressions.Window

scala> df.show(false)
+----------+--------+--------+
|start_time|end_time|words   |
+----------+--------+--------+
|132       |135     |Hi      |
|135       |135     |,       |
|143       |152     |I       |
|151       |152     |am      |
|159       |169     |working |
|194       |197     |on      |
|204       |211     |hadoop  |
|211       |211     |.       |
|218       |222     |This    |
|226       |229     |is      |
|234       |239     |Spark   |
|245       |249     |DF      |
|253       |258     |coding  |
|258       |258     |.       |
|276       |276     |I       |
+----------+--------+--------+


scala> val w = Window.orderBy("start_time", "end_time")

scala> df.withColumn("temp", sum(when(lag(col("words"), 1).over(w) === ".", lit(1)).otherwise(lit(0))).over(w))
         .groupBy("temp")
         .agg(min("start_time").alias("start_time"), max("end_time").alias("end_time"), concat_ws(" ", collect_list(trim(col("words")))).alias("sentences"))
         .drop("temp")
         .show(false)
+----------+--------+-----------------------------+
|start_time|end_time|sentences                    |
+----------+--------+-----------------------------+
|132       |211     |Hi , I am working on hadoop .|
|218       |258     |This is Spark DF coding .    |
|276       |276     |I                            |
+----------+--------+-----------------------------+

Here is my approach using a UDF, without a window function.

import org.apache.spark.sql.functions._

val df = Seq((123, 245, "Hi"), (123, 245, "."), (123, 245, "Hi"), (123, 246, "I"), (123, 245, ".")).toDF("start", "end", "words")

// Mutable state shared by the UDF: the counter is bumped on the first row after each "."
var count = 0
var flag = false
val counterUdf = udf((dot: String) => {
  if (flag) {
    count += 1
    flag = false
  }
  if (dot == ".")
    flag = true
  count
})

// Tag every word with its sentence counter, then group by it
val df1 = df.withColumn("counter", counterUdf(col("words")))

val df2 = df1.groupBy("counter")
  .agg(min("start").alias("start"), max("end").alias("end"), concat_ws(" ", collect_list("words")).alias("sentence"))
  .drop("counter")

df2.show()

+-----+---+--------+
|start|end|sentence|
+-----+---+--------+
|  123|246|  Hi I .|
|  123|245|    Hi .|
+-----+---+--------+
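
Since the question mentions 117 other DataFrames that need the same conversion, a small wrapper around the window-based approach could be reused for each of them. This is only a sketch, assuming every DataFrame has the same start_time, end_time and words columns as the example above:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

def toSentences(src: DataFrame): DataFrame = {
  // Sentence id: number of "." seen strictly before the current row
  val w = Window.orderBy("start_time", "end_time")
  src
    .withColumn("temp", sum(when(lag(col("words"), 1).over(w) === ".", lit(1)).otherwise(lit(0))).over(w))
    .groupBy("temp")
    .agg(
      min("start_time").as("start_time"),
      max("end_time").as("end_time"),
      concat_ws(" ", collect_list("words")).as("Sentences"))
    .drop("temp")
}

// Usage: e.g. toSentences(df).show(false), repeated for each DataFrame that needs converting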
