PySpark error: self._sock.recv_into(b) socket.timeout: timed out

Question

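The goal is to classify rows using a UDF. I am using PySpark on Windows.

Simple functions and operations such as filter seem to work.

Any guidance on how to resolve the timeout/socket failure would be helpful (see the error below).

There are no null values in the data.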

from pyspark.sql import functions as F  # needed for F.struct below
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def BreakDown(arr_value):
    start_year = arr_value[0]
    start_month = arr_value[1]
    end_year = arr_value[2]
    end_month = arr_value[3]
    curr_year = arr_value[4]
    curr_month = arr_value[5]
    # Inside the UDF these are plain Python values, so use `and` rather than `&`
    if curr_year == start_year and curr_month >= start_month:
        return 1
    elif curr_year == end_year and curr_month <= end_month:
        return 1
    elif curr_year > start_year and curr_year < end_year:
        return 1
    else:
        return 0

    
udfBreakDown = udf(BreakDown, IntegerType())

temp = temp.withColumn('include', udfBreakDown(F.struct('start_year','start_month','end_year','end_month','curr_year','curr_month')))

PythonException: An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "E:\spark\spark-3.0.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 585, in main
  File "E:\spark\spark-3.0.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\serializers.py", line 593, in read_int
    length = stream.read(4)
  File "C:\ProgramData\Anaconda3\lib\socket.py", line 669, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out

Answer 1

Avoid UDFs whenever a Spark built-in function can do the job. You can rewrite your logic with the when function, like this:

from pyspark.sql import functions as F

def get_include_col():
    c = F.when((F.col("curr_year") == F.col("start_year")) & (F.col("curr_month") >= F.col("start_month")), F.lit(1)) \
        .when((F.col("curr_year") == F.col("end_year")) & (F.col("curr_month") <= F.col("end_month")), F.lit(1)) \
        .when((F.col("curr_year") > F.col("start_year")) & (F.col("curr_year") < F.col("end_year")), F.lit(1)) \
        .otherwise(F.lit(0))
    return c


temp = temp.withColumn('include', get_include_col())
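
To spot-check the result on a handful of rows, the same three rules can be mirrored in plain Python. This is only a sketch: `include_flag` and the sample rows are hypothetical and not part of the original question, and the function is meant for local comparison, not for use as a UDF.

```python
def include_flag(start_year, start_month, end_year, end_month, curr_year, curr_month):
    """Plain-Python mirror of the three rules, for spot-checking only."""
    if curr_year == start_year and curr_month >= start_month:
        return 1
    if curr_year == end_year and curr_month <= end_month:
        return 1
    if start_year < curr_year < end_year:
        return 1
    return 0


# Hypothetical rows: (start_year, start_month, end_year, end_month, curr_year, curr_month)
samples = [
    (2019, 6, 2021, 3, 2019, 8),  # start year, on or after the start month
    (2019, 6, 2021, 3, 2021, 2),  # end year, on or before the end month
    (2019, 6, 2021, 3, 2020, 1),  # strictly between start and end year
    (2019, 6, 2021, 3, 2022, 1),  # after the end year
    (2019, 6, 2021, 3, 2019, 2),  # start year, before the start month
]

flags = [include_flag(*row) for row in samples]
print(flags)
```

The flags can then be compared with the include column for the same rows. Note that with the rules as written, a row whose start and end fall in the same year is included whenever curr_month >= start_month (for example start 2020-03, end 2020-06, current 2020-08 yields 1); whether that is intended depends on the data.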

You can also use functools.reduce to build the when expression dynamically, without writing out every case by hand. For example:

import functools
from pyspark.sql import functions as F

cases = [
    ("curr_year = start_year and curr_month >= start_month", 1),
    ("curr_year = end_year and curr_month <= end_month", 1),
    ("curr_year > start_year and curr_year < end_year", 1)
]

include_col = functools.reduce(
    lambda acc, x: acc.when(F.expr(x[0]), F.lit(x[1])),
    cases,
    F
).otherwise(F.lit(0))

temp = temp.withColumn('include', include_col)
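
As a possible variation on the same idea, the list of cases could instead be folded into a single SQL CASE expression string and handed to F.expr. This is a sketch rather than part of the original answer: `build_case_sql` is a hypothetical helper, and it assumes the conditions are trusted strings written by you, since they are concatenated directly into the expression.

```python
def build_case_sql(cases, default=0):
    """Join (condition, value) pairs into one SQL CASE expression string."""
    branches = " ".join(f"WHEN {cond} THEN {value}" for cond, value in cases)
    return f"CASE {branches} ELSE {default} END"


cases = [
    ("curr_year = start_year and curr_month >= start_month", 1),
    ("curr_year = end_year and curr_month <= end_month", 1),
    ("curr_year > start_year and curr_year < end_year", 1),
]

case_sql = build_case_sql(cases)
print(case_sql)

# With PySpark available, the string could be applied like this:
#   from pyspark.sql import functions as F
#   temp = temp.withColumn('include', F.expr(case_sql))
```

Printing the generated string before passing it to F.expr makes it easy to see exactly which expression Spark will evaluate.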
