
Using Scala kernel with Spark

I have a problem with accessing data from S3 from Spark. I have spylon-kernel installed for JupyterHub (which is a Scala kernel with Spark framework integration). It uses pyspark. Unfortunately, the newest pyspark still uses hadoop-2.7.3 libraries. When I try to access an S3 bucket in the Frankfurt region, I get the following Java exception:

" com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: xxxxxxxxxx, AWS Error Code: null, AWS Error Message: Bad Request " " com.amazonaws.services.s3.model.AmazonS3Exception:状态代码:400,AWS 服务:Amazon S3,AWS 请求 ID:xxxxxxxxxx,AWS 错误代码:Z37A6259CC6648DFF0BD9A7 AWS 错误消息6648DFF0BD9A7

From my research, it looks like a hadoop 2.7.3 problem. With newer versions (3.1.1) it works well locally, but pyspark uses those hadoop 2.7.3 jars, and it looks like they can't be changed. Can I do something about it? Maybe there is some way to tell pyspark to use hadoop 3.1.1 jars? Or maybe there is another Scala kernel with Spark for JupyterHub that uses spark-shell instead of pyspark?
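Before changing anything, it is worth confirming which Hadoop build the running Spark actually loads. A quick check from the Scala kernel (again assuming the session is available as spark):

// Print the Hadoop and Spark versions the notebook's Spark is really using.
println(org.apache.hadoop.util.VersionInfo.getVersion)   // e.g. 2.7.3
println(spark.version)                                    // Spark version

If this prints 2.7.x, the s3a jars on the classpath are the old ones, regardless of what else is installed on the machine.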

OK, I finally fixed it. I will post an answer, maybe it will be useful for someone:

pip install toree

jupyter toree install --spark_home /path/to/your/spark/ --interpreters=Scala

This one works. :)
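For completeness, a sketch of the S3 read from the Toree Scala notebook once --spark_home points at a Spark distribution bundled with a newer Hadoop. The bucket name is hypothetical, credentials are assumed to come from environment variables, and the endpoint line pins the Frankfurt region so s3a signs requests for eu-central-1:

// Configure s3a for the Frankfurt region and read a hypothetical file.
val conf = spark.sparkContext.hadoopConfiguration
conf.set("fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")   // Frankfurt region endpoint
conf.set("fs.s3a.access.key", sys.env.getOrElse("AWS_ACCESS_KEY_ID", ""))
conf.set("fs.s3a.secret.key", sys.env.getOrElse("AWS_SECRET_ACCESS_KEY", ""))

spark.read.option("header", "true").csv("s3a://my-example-bucket/data.csv").show(5)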
