
How to create python threads in pyspark code

I have around 70 hive queries which I am executing in pyspark in sequence. I am looking at ways to improve the runtime by running the hive queries in parallel. I am planning to do this by creating python threads and running the sqlContext.sql calls in the threads. Would this create threads in the driver and improve performance?

I am assuming you do not have any dependency between these hive queries, so they can run in parallel. You can accomplish this with threading, but I am not sure of the benefit in a single-user application: the total amount of resources in your cluster is fixed, so the total time to finish all the queries will be roughly the same, since the Spark scheduler will round-robin across these individual jobs when you multi-thread them.

https://spark.apache.org/docs/latest/job-scheduling.html explains this:

1) Spark by default uses a FIFO scheduler (which you are observing).
2) By threading you can use a "fair" scheduler.
3) In the method that is being threaded, set sc.setLocalProperty("spark.scheduler.pool", <pool_id>).
4) The pool id needs to be different for each thread.

Example use case of threading from a code perspective:

import threading

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# set the spark context to use a fair scheduler mode
conf = SparkConf().setMaster(...).setAppName(...)
conf.set("spark.scheduler.mode", "FAIR")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# threads cannot return values through join(), so collect the
# data frames in a dict keyed by pool id instead
results = {}

# runs a query taking a spark context, pool_id and query..
def runQuery(sc, pool_id, query):
    # each thread submits its query in its own scheduler pool
    sc.setLocalProperty("spark.scheduler.pool", pool_id)
    results[pool_id] = sqlContext.sql(query)

t1 = threading.Thread(target=runQuery, args=(sc, "1", <query1>))
t2 = threading.Thread(target=runQuery, args=(sc, "2", <query2>))

# start the threads...
t1.start()
t2.start()

# wait for the threads to complete and get the returned data frames...
t1.join()
t2.join()
df1 = results["1"]
df2 = results["2"]
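
With around 70 queries, spawning and joining one thread per query by hand gets unwieldy; a thread pool keeps the same idea manageable. Below is a minimal sketch using concurrent.futures, assuming sc and sqlContext are created as above and that queries is a hypothetical list holding the 70 HiveQL strings; run_in_pool and max_workers are illustrative names and values, not anything from the original answer.

from concurrent.futures import ThreadPoolExecutor

# queries is assumed to be a list of ~70 HiveQL strings
queries = [...]

def run_in_pool(indexed_query):
    index, query = indexed_query
    # give each submission its own scheduler pool so the fair
    # scheduler can balance the concurrently running queries
    sc.setLocalProperty("spark.scheduler.pool", str(index))
    return sqlContext.sql(query)

# max_workers caps how many queries are in flight at once;
# 8 is an arbitrary choice, not a tuned value
with ThreadPoolExecutor(max_workers=8) as executor:
    dfs = list(executor.map(run_in_pool, enumerate(queries)))

The heavy lifting still happens on the executors; as noted above, this only changes how the fixed pool of cluster resources is shared across the queries, not how much total work there is.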


As the Spark documentation indicates, you will not observe an improvement in overall throughput; it is suited for multi-user sharing of resources. Hope this helps.
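
If the cluster really is shared between users or groups of queries, the pools referenced above can also be given explicit weights and minimum shares through an allocation file, as described on the same job-scheduling page. A minimal sketch; the file path and pool name are placeholders, and the XML format is the one shown in the Spark docs:

# set on the SparkConf before the SparkContext is created:
# point Spark at an allocation file that defines named pools,
# e.g. <pool name="production"> with <weight> and <minShare> elements
conf.set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")

# later, submit a job into one of the named pools
sc.setLocalProperty("spark.scheduler.pool", "production")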
