
Connection pooling in a streaming PySpark application

What is the proper way of using connection pools in a streaming PySpark application?

I read through https://forums.databricks.com/questions/3057/how-to-reuse-database-session-object-created-in-fo.html and understand that the proper way is to use a singleton in Scala/Java. Is this possible in Python? A small code example would be greatly appreciated. I believe creating a connection per partition will be very inefficient for a streaming application.

Long story short, connection pools are less useful in Python than on the JVM because of PySpark's architecture. Unlike their Scala counterparts, Python executors run as separate processes. This means there is no shared state between executors, and since by default each partition is processed sequentially, you can have only one active connection per interpreter.

Of course, it can still be useful to maintain connections between batches. To achieve that you'll need two things:

  • spark.python.worker.reuse has to be set to true.
  • A way to reference an object between different calls.

The first one is pretty obvious, and the second one is not really Spark specific. You can, for example, use a module-level singleton (you'll find a Spark example in my answer to How to run a function on all Spark workers before processing data in PySpark?) or the Borg pattern.

