
Too many instances of a DoFn in a Dataflow streaming pipeline

I am currently developing a Dataflow streaming pipeline that interacts heavily with Cloud SQL. The pipeline talks to a Postgres instance in Cloud SQL through the Python connector developed by Google. It connects through DoFn functions that inherit from a base DoFn class, "CloudSqlDoFn", which manages a pool of connections via its setup() and teardown() calls. In total, we have 16 DoFns that inherit from this CloudSqlDoFn class.

import apache_beam as beam
from google.cloud.sql.connector import Connector, IPTypes
from sqlalchemy import create_engine

INSTANCE_CONNECTION_NAME = ########
DB_USER = #########
DB_PASS = #########
DB_NAME = #########
POOL_SIZE = 5

class CloudSqlDoFn(beam.DoFn):
    def __init__(
        self,
        local
    ):
        self.local = local

        self.connected_pool = None

        self.instance_connexion_name = INSTANCE_CONNECTION_NAME
        self.db_user = DB_USER
        self.db_pass = DB_PASS
        self.db_name = DB_NAME

        self.pool_size = POOL_SIZE

    def get_conn(self):
        """Create connexion"""

        conn = Connector().connect(
            self.instance_connexion_name,
            "pg8000",
            user=self.db_user,
            password=self.db_pass,
            db=self.db_name,
            ip_type=IPTypes.PRIVATE
        )
        return conn

    def get_pool(self):
        """Create pool of connexion"""
        pool = create_engine(
            "postgresql+pg8000://",
            creator=self.get_conn,
            pool_size=self.pool_size,
            pool_recycle=1800
        )
        return pool

    def setup(self):
        """Open connection or pool of connections to Postgres"""
        self.connected_pool = self.get_pool()
     

    def teardown(self):
        """Close connection to Postgres"""
        self.connected_pool.dispose()

In short, we are facing a typical "backpressure" problem: we receive many "Too Many Requests" errors from the Cloud SQL Admin API (which the connector calls to set up each SQL connection) when too many files arrive at the same time.

RuntimeError: aiohttp.client_exceptions.ClientResponseError: 429, message='Too Many Requests', url=URL('https://sqladmin.googleapis.com/sql/v1beta4/projects/.../instances/db-csql:generateEphemeralCert')

We know this is due to the creation of many DoFn instances, each of which calls setup() and therefore requests its own pool of connections, but we are not able to control the number of connections. We thought that by limiting the maximum number of workers and threads we could force the latency to go up (which would be acceptable), but it seems that other parameters determine the number of instances of a DoFn.

My questions:

  • Aside from the number of threads and workers, what determines the number of instances of a DoFn instantiated at the same time in a streaming Dataflow pipeline?
  • How could we force the system to accept a higher latency/lower freshness so that we don't saturate the Cloud SQL Admin server?

Thank you for your help.

You can make your pool process-level (i.e. attach it to a module-level/global variable, or to the DoFn class itself) and share it among all DoFn instances, which limits the number of connections per process regardless of how many DoFns are instantiated. If you need more than one pool, you can give each DoFn a unique identifier and keep a static map from ids to pools. A sketch of the shared-pool approach follows.
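Here is a minimal sketch of what a process-level shared pool could look like, reusing the module-level constants from the question; the helper names (get_shared_pool, _pool_lock) are illustrative, not part of the Beam or connector APIs:

import threading

import apache_beam as beam
from google.cloud.sql.connector import Connector, IPTypes
from sqlalchemy import create_engine

# Assumes the same module-level constants as in the question:
# INSTANCE_CONNECTION_NAME, DB_USER, DB_PASS, DB_NAME, POOL_SIZE.

_pool = None
_pool_lock = threading.Lock()


def _get_conn():
    # Kept close to the original code: one Connector per connection.
    return Connector().connect(
        INSTANCE_CONNECTION_NAME,
        "pg8000",
        user=DB_USER,
        password=DB_PASS,
        db=DB_NAME,
        ip_type=IPTypes.PRIVATE,
    )


def get_shared_pool():
    """Lazily create one SQLAlchemy pool per SDK process."""
    global _pool
    with _pool_lock:
        if _pool is None:
            _pool = create_engine(
                "postgresql+pg8000://",
                creator=_get_conn,
                pool_size=POOL_SIZE,
                pool_recycle=1800,
            )
    return _pool


class CloudSqlDoFn(beam.DoFn):
    def setup(self):
        # Every DoFn instance in this process reuses the same pool, so the
        # number of Cloud SQL connections is bounded by POOL_SIZE per process.
        self.connected_pool = get_shared_pool()

    # No teardown(): the process-level pool outlives individual DoFn instances.

With this layout, the total number of connections is roughly POOL_SIZE multiplied by the number of SDK processes, rather than by the number of DoFn instances.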

On Dataflow, you can also set the no_use_multiple_sdk_containers experiment to limit the number of SDK processes per worker VM (though this will of course limit the CPU available to other parts of your pipeline as well).
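For completeness, here is an illustrative way to pass these options when launching the pipeline; the worker and thread counts are placeholders, and flag availability can depend on your SDK version:

from apache_beam.options.pipeline_options import PipelineOptions

# Illustrative options only: worker and thread counts are placeholders.
options = PipelineOptions([
    "--runner=DataflowRunner",
    "--streaming",
    "--max_num_workers=4",
    # Run a single SDK process per worker VM instead of one per core.
    "--experiments=no_use_multiple_sdk_containers",
    # Cap the number of threads processing bundles in each SDK process
    # (availability of this flag depends on the SDK version).
    "--number_of_worker_harness_threads=4",
])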
