
How to Connect Python to Spark Session and Keep RDDs Alive

How do I get a small Python script to hook into an existing instance of Spark and do operations on existing RDDs?

I'm in the early stages of working with Spark on Windows 10, trying scripts on a "Local" instance. I'm working with the latest stable build of Spark (Spark 2.0.1 for Hadoop 2.7). I've installed and set environment variables for Hadoop 2.7.3. I'm experimenting with both the Pyspark shell and Visual Studio 2015 Community with Python.

I'm trying to build a large engine, on which I'll run individual scripts to load, massage, format, and access the data. I'm sure there's a normal way to do that; isn't that the point of Spark?

Anyway, here's the experience I have so far. This is generally to be expected. When I build a small Spark script in Python and run it using Visual Studio, the script runs, does its job, and exits. In the process of exiting, it also exits the Spark Context it was using.

So I had the following thought: What if I started a persistent Spark Context in Pyspark and then set my SparkConf and SparkContext in each Python script to connect to that Spark Context? So, looking up online what the defaults are for Pyspark, I tried the following:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("PySparkShell")
sc = SparkContext(conf=conf)

I started Pyspark. In a separate script in Visual Studio, I used this code for the SparkContext. I loaded a text file into an RDD named RDDFromFilename. But I couldn't access that RDD in the Pyspark shell once the script had run.
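Roughly, that separate script looked like this (a minimal sketch; the file path is just a placeholder):

from pyspark import SparkConf, SparkContext

# Same settings as the running Pyspark shell, hoping to attach to its context
conf = SparkConf().setMaster("local[*]").setAppName("PySparkShell")
sc = SparkContext(conf=conf)

# Load a text file into an RDD (placeholder path)
RDDFromFilename = sc.textFile("C:/placeholder/input.txt")
print(RDDFromFilename.count())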

How do I start a persistent Spark Context, create an RDD in it in one Python script, and access that RDD from subsequent Python scripts? Particularly on Windows?

There is no solution for this in Spark itself. You may consider:

I think that out of these only Zeppelin officially supports Windows.

For those who may follow: I've recently discovered SnappyData.

SnappyData is still fairly young and there's a bit of a learning curve, but what it promises to do is make a persistent mutable SQL collection that can be shared between multiple Spark jobs and can be accessed natively as RDDs and DataFrames. It has a job server that you can dump concurrent jobs onto.

It's essentially a combination of a GemFire in-memory database with Spark clusters that are local in the same JVM, so (when I get decent at managing it) I can do large tasks without single-machine bottlenecks to pipe data in and out of Spark, or I can even do live data manipulation while another Spark program is running on the same data.
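Here's a minimal sketch of the idea, assuming SnappyData's SnappySession Python wrapper from its own pyspark build (the import path, table name, and SQL here are my assumptions based on its docs):

from pyspark import SparkConf, SparkContext
# Assumes SnappyData's distribution of pyspark, which ships this module
from pyspark.sql.snappy import SnappySession

conf = SparkConf().setMaster("local[*]").setAppName("SnappySketch")
sc = SparkContext(conf=conf)
snappy = SnappySession(sc)

# One job creates a persistent row table (names and schema are placeholders)
snappy.sql("CREATE TABLE IF NOT EXISTS shared_data (id INT, value STRING) USING row")
snappy.sql("INSERT INTO shared_data VALUES (1, 'hello')")

# A later, separate job can read the same table back as a DataFrame
df = snappy.table("shared_data")
print(df.count())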

I know this is my own answer, but I'm probably not going to mark it as the answer until I get sophisticated enough to have opinions on how well it solves my problems.
