
Error while using the DataFrame show method in PySpark

I am trying to show a PySpark DataFrame, and I encounter the following error:

Py4JJavaError: An error occurred while calling o607.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 114.0 failed 4 times, most recent failure: Lost task 0.3 in stage 114.0 (TID 15904, zw02-data-hdp-dn25211.mt, executor 416): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/data5/hadoop/yarn/nm-local-dir/usercache/hadoop-hmart-peisongpa/appcache/application_1634562540530_1814236/container_e37_1634562540530_1814236_01_001496/pyspark.zip/pyspark/worker.py", line 177, in main
    process()
  File "/data5/hadoop/yarn/nm-local-dir/usercache/hadoop-hmart-peisongpa/appcache/application_1634562540530_1814236/container_e37_1634562540530_1814236_01_001496/pyspark.zip/pyspark/worker.py", line 172, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/data5/hadoop/yarn/nm-local-dir/usercache/hadoop-hmart-peisongpa/appcache/application_1634562540530_1814236/container_e37_1634562540530_1814236_01_001496/pyspark.zip/pyspark/serializers.py", line 220, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/data5/hadoop/yarn/nm-local-dir/usercache/hadoop-hmart-peisongpa/appcache/application_1634562540530_1814236/container_e37_1634562540530_1814236_01_001496/pyspark.zip/pyspark/serializers.py", line 138, in dump_stream
    for obj in iterator:
  File "/data5/hadoop/yarn/nm-local-dir/usercache/hadoop-hmart-peisongpa/appcache/application_1634562540530_1814236/container_e37_1634562540530_1814236_01_001496/pyspark.zip/pyspark/serializers.py", line 209, in _batched
    for item in iterator:
  File "<string>", line 1, in <lambda>
  File "/data5/hadoop/yarn/nm-local-dir/usercache/hadoop-hmart-peisongpa/appcache/application_1634562540530_1814236/container_e37_1634562540530_1814236_01_001496/pyspark.zip/pyspark/worker.py", line 71, in <lambda>
    return lambda *a: f(*a)
  File "<ipython-input-12-2ecb67285c3b>", line 5, in <lambda>
  File "<ipython-input-12-2ecb67285c3b>", line 4, in convert_target
TypeError: int() argument must be a string, a bytes-like object or a number, not 'DenseVector'

This is my code, which runs in Jupyter:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, MinMaxScaler
from pyspark.sql import functions as f

df2 = spark.sql(sql_text)
# Assemble the raw column into a vector, then scale it to [0, 1]
assembler = VectorAssembler(inputCols=["targetstep"], outputCol="x_vec")
scaler = MinMaxScaler(inputCol="x_vec", outputCol="targetstep_scaled")
pipeline = Pipeline(stages=[assembler, scaler])
scalerModel = pipeline.fit(df2)
df2 = scalerModel.transform(df2)
# target_udf is defined in another notebook cell (it wraps convert_target from the traceback)
df2 = df2.withColumn('targetstep', target_udf(f.col('targetstep_scaled'))).drop('x_vec')
df2.show()
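
The target_udf referenced above is defined in a notebook cell that is not shown; judging from the traceback (a lambda wrapping convert_target, with int() failing on a DenseVector), it presumably looks roughly like the hypothetical sketch below. The return type is an assumption.

from pyspark.sql import functions as f
from pyspark.sql.types import IntegerType

# Hypothetical reconstruction of the UDF named in the traceback; the real
# definition lives in the notebook cell <ipython-input-12-2ecb67285c3b>.
def convert_target(x):
    # x arrives as a DenseVector (the MinMaxScaler output column), which is
    # what raises the TypeError shown above
    return int(x)

target_udf = f.udf(lambda x: convert_target(x), IntegerType())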

I'm sure that the Pipeline and the withColumn() step are OK, but I don't know why the show method fails.

PySpark DataFrames are lazily evaluated.

When you call .show(), you are asking all of the prior steps to execute, and any one of them may be broken; you just can't see the failure until you call .show(), because until then nothing has actually run.

I would go back to the earlier steps and call .collect() after each operation on the DataFrame. This will at least let you isolate where the bad data was created.
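
As a minimal sketch of that approach (assuming the spark session, sql_text, scalerModel, target_udf, and the functions alias f from the question are in scope), forcing each intermediate DataFrame with an action narrows down which transformation introduces the failure:

# Force each stage so the failing step surfaces immediately,
# instead of everything erroring out only at the final show().
df2 = spark.sql(sql_text)
df2.collect()                                 # does the raw query succeed?

scaled = scalerModel.transform(df2)
scaled.select("targetstep_scaled").show(5)    # inspect the scaler output (a vector column)

converted = scaled.withColumn("targetstep", target_udf(f.col("targetstep_scaled")))
converted.select("targetstep").collect()      # the UDF runs here; this is where the TypeError appears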
