
Spark DataFrame for-if loop taking a long time

I have a Spark DF (df):

[image: input DataFrame with start_time, end_time and words columns]

I have to convert it into something like this:

[image: expected output with start_time, end_time and Sentences columns]

Basically, it should detect a new sentence whenever it finds a full stop (".") and start another row.

I have written the following code for this:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("elasticSpark").master("local[*]").config("spark.scheduler.mode", "FAIR").getOrCreate()
import spark.implicits._

val count = df.count.toInt
var emptyDF = Seq.empty[(Int, Int, String)].toDF("start_time", "end_time", "Sentences")
var b = 0
for (a <- 1 to count) {
  // A sentence ends at a full stop or at the last row
  if (df.select("words").head(a)(a - 1).toSeq.head == "." || a == (count - 1)) {
    val myList1 = df.select("words").head(a).toArray.map(_.getString(0))
    val myList = df.select("words").head(a).toArray.map(_.getString(0)).splitAt(b)._2
    val text = myList.mkString(" ")
    val end_time = df.select("end_time").head(a)(a - 1).toSeq.head.toString.toInt
    val start_time = df.select("start_time").head(a)(b).toSeq.head.toString.toInt
    val df1 = spark.sparkContext.parallelize(Seq(start_time)).toDF("start_time")
    val df2 = spark.sparkContext.parallelize(Seq(end_time)).toDF("end_time")
    val df3 = spark.sparkContext.parallelize(Seq(text)).toDF("Sentences")
    val df4 = df1.crossJoin(df2).crossJoin(df3)
    emptyDF = emptyDF.union(df4).toDF
    b = a
  }
}

Though it gives the correct output, it takes ages to complete the iteration, and I have 117 other DFs that I need to run this on.

Is there any way to tune this code, or any other way to achieve the above operation? Any help will be deeply appreciated.

Here is my try. You can use a window to separate the sentences by counting the number of "." occurrences in the following rows.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val w = Window.orderBy("start_time").rowsBetween(Window.currentRow, Window.unboundedFollowing)
val df = Seq((132, 135, "Hi"),
             (135, 135, ","),
             (143, 152, "I"),
             (151, 152, "am"),
             (159, 169, "working"),
             (194, 197, "on"),
             (204, 211, "hadoop"),
             (211, 211, "."),
             (218, 222, "This"),
             (226, 229, "is"), 
             (234, 239, "Spark"),
             (245, 249, "DF"),
             (253, 258, "coding"),
             (258, 258, "."),
             (276, 276, "I")).toDF("start_time", "end_time", "words")
df.withColumn("count", count(when(col("words") === ".", true)).over(w))
  .groupBy("count")
  .agg(min("start_time").as("start_time"), max("end_time").as("end_time"), concat_ws(" ", collect_list("words")).as("Sentences"))
  .drop("count").show(false)

This will give you the following result, though it leaves a space between each word and the "," or "." that follows it:

+----------+--------+-----------------------------+
|start_time|end_time|Sentences                    |
+----------+--------+-----------------------------+
|132       |211     |Hi , I am working on hadoop .|
|218       |258     |This is Spark DF coding .    |
|276       |276     |I                            |
+----------+--------+-----------------------------+
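
If you want to remove that extra space before the punctuation, a minimal follow-up sketch (assuming the grouped result above is stored in a DataFrame named result, a name introduced here for illustration) could use regexp_replace:

import org.apache.spark.sql.functions.{col, regexp_replace}

// Collapse the space that concat_ws inserts before "," and "." so the sentences read naturally
val cleaned = result.withColumn("Sentences", regexp_replace(col("Sentences"), "\\s+([.,])", "$1"))
cleaned.show(false)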
scala> import org.apache.spark.sql.expressions.Window

scala> df.show(false)
+----------+--------+--------+
|start_time|end_time|words   |
+----------+--------+--------+
|132       |135     |Hi      |
|135       |135     |,       |
|143       |152     |I       |
|151       |152     |am      |
|159       |169     |working |
|194       |197     |on      |
|204       |211     |hadoop  |
|211       |211     |.       |
|218       |222     |This    |
|226       |229     |is      |
|234       |239     |Spark   |
|245       |249     |DF      |
|253       |258     |coding  |
|258       |258     |.       |
|276       |276     |I       |
+----------+--------+--------+


scala> val w = Window.orderBy("start_time", "end_time")

scala> df.withColumn("temp", sum(when(lag(col("words"), 1).over(w) === ".", lit(1)).otherwise(lit(0))).over(w))
         .groupBy("temp")
         .agg(min("start_time").alias("start_time"), max("end_time").alias("end_time"), concat_ws(" ", collect_list(trim(col("words")))).alias("sentences"))
         .drop("temp")
         .show(false)
+----------+--------+-----------------------------+
|start_time|end_time|sentences                    |
+----------+--------+-----------------------------+
|132       |211     |Hi , I am working on hadoop .|
|218       |258     |This is Spark DF coding .    |
|276       |276     |I                            |
+----------+--------+-----------------------------+

Here is my approach using a UDF, without a window function.

import org.apache.spark.sql.functions._

val df = Seq((123, 245, "Hi"), (123, 245, "."), (123, 245, "Hi"), (123, 246, "I"), (123, 245, ".")).toDF("start", "end", "words")

// Mutable state shared by the UDF: the counter is bumped on the first row after each "."
var count = 0
var flag = false
val counterUdf = udf((dot: String) => {
  if (flag) {
    count += 1
    flag = false
  }
  if (dot == ".")
    flag = true
  count
})

// Tag every word with its sentence counter, then group by it
val df1 = df.withColumn("counter", counterUdf(col("words")))

val df2 = df1.groupBy("counter")
  .agg(min("start").alias("start"), max("end").alias("end"), concat_ws(" ", collect_list("words")).alias("sentence"))
  .drop("counter")

df2.show()

+-----+---+--------+
|start|end|sentence|
+-----+---+--------+
|  123|246|  Hi I .|
|  123|245|    Hi .|
+-----+---+--------+
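
Since the question mentions 117 other DataFrames that need the same conversion, a small wrapper around the window-based approach could be reused for each of them. This is only a sketch, assuming every DataFrame has the same start_time, end_time and words columns as the example above:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

def toSentences(src: DataFrame): DataFrame = {
  // Sentence id: number of "." seen strictly before the current row
  val w = Window.orderBy("start_time", "end_time")
  src
    .withColumn("temp", sum(when(lag(col("words"), 1).over(w) === ".", lit(1)).otherwise(lit(0))).over(w))
    .groupBy("temp")
    .agg(
      min("start_time").as("start_time"),
      max("end_time").as("end_time"),
      concat_ws(" ", collect_list("words")).as("Sentences"))
    .drop("temp")
}

// Usage: e.g. toSentences(df).show(false), repeated for each DataFrame that needs converting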
