
AWS EMR, python pyspark script in EMR steps

I'm trying to run a very simple pyspark script as a step in AWS EMR. It looks like this:

from pyspark.sql import SparkSession
sc = SparkContext()
df = sc.read.csv("s3://folder1/file.csv",header=True,inferSchema=True)
dd=df.select(df)
write_to = "s3://spark-workflow-test/"
dd.write.csv(write_to, sep = ";", header = True)
sc.stop()

It reads a file from a folder, selects a column, and writes it to another file in a bucket. For some reason it keeps failing and I can't figure out why.

This script works fine in local Spark, but as an EMR step it keeps failing with exitCode=13. Is there a problem in the code or the Spark configuration, or do I need to do something in the console/EMR interface? I really have no clue where to look for a solution.

I think your error is the same as in this issue.

Your Spark context definition seems off. Replace it with:

sc = SparkSession.builder.getOrCreate()
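
For reference, a minimal corrected version of the whole script could look like the sketch below. It keeps the S3 paths from the question; the appName and the selected column are placeholders I've added, not something from the original post. On EMR the session picks up the cluster's YARN master automatically, so don't hard-code a local master.

from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession instead of calling SparkContext() directly
spark = SparkSession.builder.appName("csv-copy").getOrCreate()

# Read the input CSV from S3 (path taken from the question)
df = spark.read.csv("s3://folder1/file.csv", header=True, inferSchema=True)

# Select the column you need; df.columns[0] is just a placeholder
dd = df.select(df.columns[0])

# Write the result back to S3 (path taken from the question)
write_to = "s3://spark-workflow-test/"
dd.write.csv(write_to, sep=";", header=True)

spark.stop()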
