
send_command in java_gateway.py fails for larger payloads

Description of the problem

I am currently trying to run a UDF (function name: mean_of_assets_in_win) over a sliding window (window size: 30, step: 1) on one column of a dataframe called assets_with_yields_df in the code.
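To make the intended computation concrete, here is a toy pandas illustration of a forward-looking rolling mean (made-up values; window size 3 stands in for the real 30):

import pandas as pd

asset = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
# pandas rolls backwards by default, so reverse, roll, then reverse again
# to get a window spanning the current row and the rows after it.
window_means = asset[::-1].rolling(3, min_periods=1).mean()[::-1]
print(window_means.tolist())  # [2.0, 3.0, 4.0, 4.5, 5.0]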

The UDF computes the mean of the float values in a column. Everything worked fine when I tested the logic on a dataset of 800 records in a text file. However, after growing the text file to 500,000 records, the program fails at the very end. The error log is listed below:

WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/E:/BigData/Spark/jars/spark-unsafe_2.12-3.0.3.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
SUCCESS: The process with PID 12304 (child process of PID 32872) has been terminated.
SUCCESS: The process with PID 32872 (child process of PID 16792) has been terminated.
SUCCESS: The process with PID 16792 (child process of PID 26332) has been terminated.
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:64278)
Traceback (most recent call last):
  File "E:\BigData\Spark\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line 1193, in send_command
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "E:\BigData\Spark\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line 1033, in send_command
  File "E:\BigData\Spark\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line 1196, in send_command
py4j.protocol.Py4JNetworkError: Error while sending

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "E:\BigData\Spark\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line 977, in _get_connection
IndexError: pop from an empty deque

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "E:\BigData\Spark\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line 1115, in start
ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:64278)
Traceback (most recent call last):
  File "E:\BigData\Spark\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line 977, in _get_connection
IndexError: pop from an empty deque

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "E:\BigData\Spark\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line 1115, in start
ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:64278)
Traceback (most recent call last):
  File "E:\BigData\Spark\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line 977, in _get_connection
IndexError: pop from an empty deque

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "E:\BigData\Spark\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line 1115, in start
ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:64278)
Traceback (most recent call last):
  File "E:\BigData\Spark\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line 977, in _get_connection
IndexError: pop from an empty deque

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "E:\BigData\Spark\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line 1115, in start
ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it

From my perspective, the relevant part of this long log is the following:

ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:64278)
File "E:\BigData\Spark\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line 1193, in send_command
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host
py4j.protocol.Py4JNetworkError: Error while sending

I checked the code in java_gateway.py: send_command is the function that sends a command to the JVM. A comment there mentions that

self.socket.sendall(command.encode("utf-8"))

will only fail if the remote is closed (for large payloads) or if it sent an RST packet (SO_LINGER). I do not know how to make the remote stay open for large payloads, if that is indeed the problem. Do you know how to do it?
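For what it's worth, that error is generic TCP behavior rather than a py4j payload limit. A minimal, self-contained sketch with plain sockets (arbitrary local port, nothing py4j-specific) reproduces it once the remote side has closed:

import socket
import threading
import time

def server():
    srv = socket.socket()
    srv.bind(("127.0.0.1", 50007))  # arbitrary port chosen for the sketch
    srv.listen(1)
    conn, _ = srv.accept()
    conn.close()  # the "remote" goes away without reading anything
    srv.close()

threading.Thread(target=server, daemon=True).start()
time.sleep(0.2)  # crude startup synchronization, acceptable in a sketch

cli = socket.create_connection(("127.0.0.1", 50007))
try:
    # Early sends may still succeed (they only reach the local kernel
    # buffer); once the peer's RST arrives, sendall raises.
    for _ in range(100):
        cli.sendall(b"x" * 65536)
        time.sleep(0.01)
except (ConnectionResetError, BrokenPipeError) as exc:
    print("send failed after remote close:", exc)

In other words, there is nothing to "open" on the receiving side; the reset suggests the JVM process at the other end had already died, which the "SUCCESS: ... has been terminated" lines above also point to.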

What is also surprising to me is that rerunning the program produced the different error listed below. Maybe that is a hint as to what is wrong...

WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/E:/BigData/Spark/jars/spark-unsafe_2.12-3.0.3.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/06/07 23:15:30 ERROR PythonRunner: Python worker exited unexpectedly (crashed)
java.net.SocketException: Connection reset by peer: socket write error
        at java.base/java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.base/java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:110)
        at java.base/java.net.SocketOutputStream.write(SocketOutputStream.java:150)
        at java.base/java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:81)
        at java.base/java.io.BufferedOutputStream.write(BufferedOutputStream.java:127)
        at java.base/java.io.DataOutputStream.write(DataOutputStream.java:107)
        at java.base/java.io.FilterOutputStream.write(FilterOutputStream.java:108)
        at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:465)
        at org.apache.spark.api.python.PythonRDD$.write$1(PythonRDD.scala:285)
        at org.apache.spark.api.python.PythonRDD$.$anonfun$writeIteratorToStream$1(PythonRDD.scala:295)
        at org.apache.spark.api.python.PythonRDD$.$anonfun$writeIteratorToStream$1$adapted(PythonRDD.scala:295)
        at scala.collection.Iterator.foreach(Iterator.scala:941)
        at scala.collection.Iterator.foreach$(Iterator.scala:941)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
        at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:295)
        at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:607)
        at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:383)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1934)
        at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:218)
22/06/07 23:15:30 ERROR PythonRunner: This may have been caused by a prior exception:
java.net.SocketException: Connection reset by peer: socket write error
        at java.base/java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.base/java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:110)
        at java.base/java.net.SocketOutputStream.write(SocketOutputStream.java:150)
        at java.base/java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:81)
        at java.base/java.io.BufferedOutputStream.write(BufferedOutputStream.java:127)
        at java.base/java.io.DataOutputStream.write(DataOutputStream.java:107)
        at java.base/java.io.FilterOutputStream.write(FilterOutputStream.java:108)
        at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:465)
        at org.apache.spark.api.python.PythonRDD$.write$1(PythonRDD.scala:285)
        at org.apache.spark.api.python.PythonRDD$.$anonfun$writeIteratorToStream$1(PythonRDD.scala:295)
        at org.apache.spark.api.python.PythonRDD$.$anonfun$writeIteratorToStream$1$adapted(PythonRDD.scala:295)
        at scala.collection.Iterator.foreach(Iterator.scala:941)
        at scala.collection.Iterator.foreach$(Iterator.scala:941)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
        at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:295)
        at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:607)
        at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:383)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1934)
        at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:218)
22/06/07 23:15:30 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.net.SocketException: Connection reset by peer: socket write error
        at java.base/java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.base/java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:110)
        at java.base/java.net.SocketOutputStream.write(SocketOutputStream.java:150)
        at java.base/java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:81)
        at java.base/java.io.BufferedOutputStream.write(BufferedOutputStream.java:127)
        at java.base/java.io.DataOutputStream.write(DataOutputStream.java:107)
        at java.base/java.io.FilterOutputStream.write(FilterOutputStream.java:108)
        at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:465)
        at org.apache.spark.api.python.PythonRDD$.write$1(PythonRDD.scala:285)
        at org.apache.spark.api.python.PythonRDD$.$anonfun$writeIteratorToStream$1(PythonRDD.scala:295)
        at org.apache.spark.api.python.PythonRDD$.$anonfun$writeIteratorToStream$1$adapted(PythonRDD.scala:295)
        at scala.collection.Iterator.foreach(Iterator.scala:941)
        at scala.collection.Iterator.foreach$(Iterator.scala:941)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
        at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:295)
        at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:607)
        at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:383)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1934)
        at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:218)
22/06/07 23:15:30 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
Traceback (most recent call last):
  File "F:\Studia\02_Magisterskie\04_Semestr4\03_PSD\01_Projekt\Portfel Inwestycyjny\Portfolio-Monitor\src\so.py", line 47, in <module>
    assets_with_yields_df = spark.createDataFrame(assets_with_yields_rdd).cache()
  File "E:\BigData\Spark\python\pyspark\sql\session.py", line 605, in createDataFrame
    return self._create_dataframe(data, schema, samplingRatio, verifySchema)
  File "E:\BigData\Spark\python\pyspark\sql\session.py", line 628, in _create_dataframe
    rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
  File "E:\BigData\Spark\python\pyspark\sql\session.py", line 425, in _createFromRDD
    struct = self._inferSchema(rdd, samplingRatio, names=schema)
  File "E:\BigData\Spark\python\pyspark\sql\session.py", line 396, in _inferSchema
    first = rdd.first()
  File "E:\BigData\Spark\python\pyspark\rdd.py", line 1464, in first
    rs = self.take(1)
  File "E:\BigData\Spark\python\pyspark\rdd.py", line 1446, in take
    res = self.context.runJob(self, takeUpToNumLeft, p)
  File "E:\BigData\Spark\python\pyspark\context.py", line 1120, in runJob
    sock_info = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
  File "E:\BigData\Spark\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line 1304, in __call__
  File "E:\BigData\Spark\python\pyspark\sql\utils.py", line 128, in deco
    return f(*a, **kw)
  File "E:\BigData\Spark\python\lib\py4j-0.10.9-src.zip\py4j\protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, DESKTOP-5SER7G7, executor driver): java.net.SocketException: Connection reset by peer: socket write error
        at java.base/java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.base/java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:110)
        at java.base/java.net.SocketOutputStream.write(SocketOutputStream.java:150)
        at java.base/java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:81)
        at java.base/java.io.BufferedOutputStream.write(BufferedOutputStream.java:127)
        at java.base/java.io.DataOutputStream.write(DataOutputStream.java:107)
        at java.base/java.io.FilterOutputStream.write(FilterOutputStream.java:108)
        at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:465)
        at org.apache.spark.api.python.PythonRDD$.write$1(PythonRDD.scala:285)
        at org.apache.spark.api.python.PythonRDD$.$anonfun$writeIteratorToStream$1(PythonRDD.scala:295)
        at org.apache.spark.api.python.PythonRDD$.$anonfun$writeIteratorToStream$1$adapted(PythonRDD.scala:295)
        at scala.collection.Iterator.foreach(Iterator.scala:941)
        at scala.collection.Iterator.foreach$(Iterator.scala:941)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
        at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:295)
        at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:607)
        at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:383)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1934)
        at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:218)

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2059)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2008)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2007)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2007)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:973)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:973)
        at scala.Option.foreach(Option.scala:407)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:973)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2239)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2188)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2177)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:775)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2114)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2135)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2154)
        at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:154)
        at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.net.SocketException: Connection reset by peer: socket write error
        at java.base/java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.base/java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:110)
        at java.base/java.net.SocketOutputStream.write(SocketOutputStream.java:150)
        at java.base/java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:81)
        at java.base/java.io.BufferedOutputStream.write(BufferedOutputStream.java:127)
        at java.base/java.io.DataOutputStream.write(DataOutputStream.java:107)
        at java.base/java.io.FilterOutputStream.write(FilterOutputStream.java:108)
        at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:465)
        at org.apache.spark.api.python.PythonRDD$.write$1(PythonRDD.scala:285)
        at org.apache.spark.api.python.PythonRDD$.$anonfun$writeIteratorToStream$1(PythonRDD.scala:295)
        at org.apache.spark.api.python.PythonRDD$.$anonfun$writeIteratorToStream$1$adapted(PythonRDD.scala:295)
        at scala.collection.Iterator.foreach(Iterator.scala:941)
        at scala.collection.Iterator.foreach$(Iterator.scala:941)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
        at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:295)
        at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:607)
        at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:383)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1934)
        at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:218)


F:\Studia\02_Magisterskie\04_Semestr4\03_PSD\01_Projekt\Portfel Inwestycyjny\Portfolio-Monitor>SUCCESS: The process with PID 22172 (child process of PID 25264) has been terminated.
SUCCESS: The process with PID 25264 (child process of PID 33120) has been terminated.
SUCCESS: The process with PID 33120 (child process of PID 26728) has been terminated. 

How did I try to resolve these bugs?

To fix this, I tried changing the Spark session configuration:

  • by changing the size of the driver's memory
  • enabling offHeap with its memory allocated
  • enabling dynamic allocation of executors

The currently enabled options are visible in the code below. Unfortunately, changing the configuration did not help and the problem still exists. I also tried changing the window size, but that did not help either. (A quick way to check which options are actually in effect is shown after this paragraph.)
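As a side note, a quick way to verify which of these options are actually in effect (a sanity check, not a fix) is to dump the effective configuration from the running session created in the code below:

# Sanity check: print the configuration the session actually runs with.
for key, value in spark.sparkContext.getConf().getAll():
    if key.startswith(('spark.driver', 'spark.memory', 'spark.executor')):
        print(key, '=', value)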

QUESTIONS

Do you know how to resolve this? How can I make the remote accept larger payloads?

Code

import os
import findspark
findspark.init(r'E:\BigData\Spark')

from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import pandas_udf 
import pandas as pd
from pyspark.sql.window import Window
import pyspark.sql.functions as F

DATA_DIR = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'data')
DATA_FILE_PATHNAME = os.path.join(DATA_DIR, 'yields_with_rounded_numbers_full_dataset.txt')
METRICS_FILE_PATHNAME = os.path.join(DATA_DIR, 'metrics.txt')
ASSETS_NUM = 6
NUM_OF_RECORDS_PER_ASSET = 500000
NUM_OF_DATA_IN_FULL_PORTFOLIO = ASSETS_NUM * NUM_OF_RECORDS_PER_ASSET
WINDOW_SIZE = 30


def mapper(line):
   # Parse one whitespace-separated line into a Row: an integer ID,
   # six float asset yields, and a constant grouper column.
   fields = line.split()
   return Row(ID=int(fields[0]),
              asset_1=float(fields[1]), asset_2=float(fields[2]),
              asset_3=float(fields[3]), asset_4=float(fields[4]),
              asset_5=float(fields[5]), asset_6=float(fields[6]),
              grouper=1)


if __name__ == "__main__":
   # Spark config
   spark = SparkSession.builder.master('local[*]')\
                               .config('spark.sql.execution.arrow.pyspark.enabled', True) \
                               .config('spark.sql.session.timeZone', 'UTC') \
                               .config('spark.driver.memory', '30g') \
                               .config('spark.eventLog.enabled', 'true') \
                               .config('spark.ui.showConsoleProgress', True) \
                               .config('spark.memory.offHeap.enabled', True) \
                               .config('spark.memory.offHeap.size', '10g') \
                               .config('spark.sql.repl.eagerEval.enabled', True) \
                               .appName("SparkSQL").getOrCreate()
   # Options tried earlier, currently disabled:
   # .config("spark.dynamicAllocation.enabled", "true")
   # .config("spark.dynamicAllocation.minExecutors", "1")
   # .config("spark.dynamicAllocation.maxExecutors", "5")
   # .config("spark.executor.cores", 4)
   # .config("spark.executor.instances", 4)
   # .config("spark.executor.memory", "10g")
   lines = spark.sparkContext.textFile(DATA_FILE_PATHNAME, minPartitions = 20)
   # map values for columns 
   assets_with_yields_rdd = lines.map(mapper)          
   assets_with_yields_df = spark.createDataFrame(assets_with_yields_rdd).cache()

   @pandas_udf("double")
   def mean_of_assets_in_win(asset: pd.Series) -> float:
       # The pd.Series -> scalar type hints make this a grouped-aggregate
       # pandas UDF, which can be applied over a bounded window.
       return asset.mean()

  
   # WINDOWING: sliding window spanning the current row plus the next WINDOW_SIZE-1 rows
   sliding_window = Window.orderBy(F.col("ID")).rowsBetween(Window.currentRow, WINDOW_SIZE-1)
   # NOTE: collect() pulls every windowed result row back into driver memory.
   a1_mean_win = assets_with_yields_df.select(mean_of_assets_in_win('asset_1').over(sliding_window)).collect()

   spark.stop()
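Incidentally, the second traceback fails inside schema inference (createDataFrame -> _inferSchema -> rdd.first()), so every run starts with an extra sampling job over the RDD. Supplying an explicit schema skips that pass; a sketch of what it could look like for this data, with the types assumed from the mapper and Spark 3.x Row field ordering:

from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType

schema = StructType(
    [StructField('ID', IntegerType(), False)]
    + [StructField(f'asset_{i}', DoubleType(), False) for i in range(1, 7)]
    + [StructField('grouper', IntegerType(), False)]
)
assets_with_yields_df = spark.createDataFrame(assets_with_yields_rdd, schema).cache()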

If there is no specific reason to use a custom pandas UDF just to calculate a mean, consider using the avg function from the Spark standard library. In my case, this simplified code takes ~5 seconds to execute with 1 GB of heap memory:

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
import pyspark.sql.functions as F

WINDOW_SIZE = 30
spark = SparkSession.builder.getOrCreate()

assets_with_yields_df = spark.range(500000) \
        .withColumn('asset_1', F.col('id') * 10)

sliding_window = Window.orderBy(F.col("ID")).rowsBetween(Window.currentRow, WINDOW_SIZE-1)
a1_mean_win = assets_with_yields_df.select(F.avg('asset_1').over(sliding_window).alias('windowed_mean'))
a1_mean_win.write.parquet('results', mode='overwrite')
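A note on the design choice above: writing the result out with write.parquet keeps the output distributed, whereas collect() materializes all 500,000 windowed rows in driver memory at once, which is the most likely pressure point behind the original crash.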
