
Spark Caused by: java.lang.StackOverflowError Window Function?

I believe this is an error caused by the Window Function.

When I apply this script and keep only a few sample rows it works fine, but when I apply it to my whole dataset (only a few GB) it fails with this exception at the last step, when trying to persist to hdfs... The script works when I persist without the Window Function, so the problem must come from that (I have about 325 feature columns being run through a for loop).

Any idea what could be causing this? My goal is to impute the time-series data for every variable in my dataframe via a forward-fill method.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql import Window
import sys
print(spark.version)  # 2.3.0

# sample data
df = spark.createDataFrame([('2019-05-10 7:30:05', '1', '10', '0.5', 'FALSE'),\
                            ('2019-05-10 7:30:10', '2', 'UNKNOWN', '0.24', 'FALSE'),\
                            ('2019-05-10 7:30:15', '3', '6', 'UNKNOWN', 'TRUE'),\
                            ('2019-05-10 7:30:20', '4', '7', 'UNKNOWN', 'UNKNOWN'),\
                            ('2019-05-10 7:30:25', '5', '10', '1.1', 'UNKNOWN'),\
                            ('2019-05-10 7:30:30', '6', 'UNKNOWN', '1.1', 'NULL'),\
                            ('2019-05-10 7:30:35', '7', 'UNKNOWN', 'UNKNOWN', 'TRUE'),\
                            ('2019-05-10 7:30:49', '8', '50', 'UNKNOWN', 'UNKNOWN')], ["date", "id", "v1", "v2", "v3"])

df = df.withColumn("date", F.col("date").cast("timestamp"))

# imputer process / all cols that need to be filled are strings
def stringReplacer(x, y):
    return F.when(x != y, x).otherwise(F.lit(None)) # replace with NULL

def forwardFillImputer(df, cols=[], partitioner="date", value="UNKNOWN"):
  for i in cols:
    window = Window\
    .partitionBy(F.month(partitioner))\
    .orderBy(partitioner)\
    .rowsBetween(-sys.maxsize, 0)

    df = df\
    .withColumn(i, stringReplacer(F.col(i), value))
    fill = F.last(df[i], ignorenulls=True).over(window)
    df = df.withColumn(i,  fill)
  return df
df2 = forwardFillImputer(df, cols=[i for i in df.columns])

# errors here
df2\
.write\
.format("csv")\
.mode("overwrite")\
.option("header", "true")\
.save("test_window_func.csv")
Py4JJavaError: An error occurred while calling o13504.save.
: org.apache.spark.SparkException: Job aborted.
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:224)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:154)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.StackOverflowError
    at org.apache.spark.sql.execution.SparkPlan.prepare(SparkPlan.scala:200)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$prepare$1.apply(SparkPlan.scala:200)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$prepare$1.apply(SparkPlan.scala:200)
    at scala.collection.immutable.List.foreach(List.scala:381)

Possible solution

def forwardFillImputer(df, cols=[], partitioner="date", value="UNKNOWN"):
    window = Window \
     .partitionBy(F.month(partitioner)) \
     .orderBy(partitioner) \
     .rowsBetween(-sys.maxsize, 0)
    imputed_cols = [F.last(stringReplacer(F.col(i), value), ignorenulls=True).over(window).alias(i) 
                    for i in cols]
    missing_cols = [i for i in df.columns if i not in cols]
    return df.select(missing_cols+imputed_cols)

df2 = forwardFillImputer(df, cols=[i for i in df.columns[1:]])

df2.printSchema()
root
 |-- date: timestamp (nullable = true)
 |-- id: string (nullable = true)
 |-- v1: string (nullable = true)
 |-- v2: string (nullable = true)
 |-- v3: string (nullable = true)

df2.show()
+-------------------+---+---+----+-----+
|               date| id| v1|  v2|   v3|
+-------------------+---+---+----+-----+
|2019-05-10 07:30:05|  1| 10| 0.5|FALSE|
|2019-05-10 07:30:10|  2| 10|0.24|FALSE|
|2019-05-10 07:30:15|  3|  6|0.24| TRUE|
|2019-05-10 07:30:20|  4|  7|0.24| TRUE|
|2019-05-10 07:30:25|  5| 10| 1.1| TRUE|
|2019-05-10 07:30:30|  6| 10| 1.1| NULL|
|2019-05-10 07:30:35|  7| 10| 1.1| TRUE|
|2019-05-10 07:30:49|  8| 50| 1.1| TRUE|
+-------------------+---+---+----+-----+

From the stack trace provided, I believe the error originates in the preparation of the execution plan, as it says:

Caused by: java.lang.StackOverflowError
    at org.apache.spark.sql.execution.SparkPlan.prepare(SparkPlan.scala:200)

I believe the reason for this is that you call the .withColumn method twice inside the loop. What .withColumn does in the Spark execution plan is basically a select statement over all columns, with the one column specified in the method changed. With 325 columns, a single iteration calls select over 325 columns twice -> 650 columns passed into the planner. Doing this 325 times, you can see how it creates overhead.
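To see this for yourself, here is a minimal, hypothetical sketch (toy columns, not your data) of the same pattern; printing the plan with explain(True) shows one nested projection per withColumn call:

demo = spark.range(5)
for n in range(5):  # the question's loop does this ~325 times, twice per iteration
    demo = demo.withColumn("c{}".format(n), F.lit(n))
demo.explain(True)  # the parsed/analyzed plan nests one Project per withColumn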

Interestingly though, you do not receive this error on a small sample, where I would otherwise have expected it.

In any case, you can try replacing your forwardFillImputer like this:

def forwardFillImputer(df, cols=[], partitioner="date", value="UNKNOWN"):
    window = Window \
     .partitionBy(F.month(partitioner)) \
     .orderBy(partitioner) \
     .rowsBetween(-sys.maxsize, 0)

    imputed_cols = [F.last(stringReplacer(F.col(i), value), ignorenulls=True).over(window).alias(i) 
                    for i in cols]

    missing_cols = [F.col(i) for i in df.columns if i not in cols]

    return df.select(missing_cols + imputed_cols)

This way you essentially parse only a single select statement into the planner, which should be easier to handle.
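As a quick sanity check (assuming df2 was built with the rewritten function above), comparing the plans of the two versions with explain() should show a much shallower plan for the single-select version:

df2.explain(True)  # one projection over all the window expressions, instead of one nested projection per column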

As a word of warning, Spark generally does not do well with very large numbers of columns, so you might see other strange issues along the way.
