[英]Start CloudSQL Proxy on Python Dataflow / Apache Beam
我目前正在從事 ETL Dataflow 作業(使用 Apache Beam Python SDK),該作業從 CloudSQL(使用psycopg2
和自定義ParDo
)查詢數據並將其寫入 BigQuery。 我的目標是創建一個數據流模板,我可以使用 Cron 作業從 AppEngine 開始。
我有一個使用 DirectRunner 在本地工作的版本。 為此,我使用 CloudSQL (Postgres) 代理客戶端,以便可以連接到 127.0.0.1 上的數據庫。
當使用帶有自定義命令的 DataflowRunner 在 setup.py 腳本中啟動代理時,作業不會執行。 它堅持重復此日志消息:
Setting node annotation to enable volume controller attach/detach
我的 setup.py 的一部分如下所示:
CUSTOM_COMMANDS = [
['echo', 'Custom command worked!'],
['wget', 'https://dl.google.com/cloudsql/cloud_sql_proxy.linux.amd64', '-O', 'cloud_sql_proxy'],
['echo', 'Proxy downloaded'],
['chmod', '+x', 'cloud_sql_proxy']]
class CustomCommands(setuptools.Command):
"""A setuptools Command class able to run arbitrary commands."""
def initialize_options(self):
pass
def finalize_options(self):
pass
def RunCustomCommand(self, command_list):
print('Running command: %s' % command_list)
logging.info("Running custom commands")
p = subprocess.Popen(
command_list,
stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
# Can use communicate(input='y\n'.encode()) if the command run requires
# some confirmation.
stdout_data, _ = p.communicate()
print('Command output: %s' % stdout_data)
if p.returncode != 0:
raise RuntimeError(
'Command %s failed: exit code: %s' % (command_list, p.returncode))
def run(self):
for command in CUSTOM_COMMANDS:
self.RunCustomCommand(command)
subprocess.Popen(['./cloud_sql_proxy', '-instances=bi-test-1:europe-west1:test-animal=tcp:5432'])
在從sthomp閱讀 Github 上的這個問題和 Stackoverflo 上的討論之后,我在run()
中添加了最后一行作為單獨的 subprocess.Popen subprocess.Popen()
)。 我還嘗試使用subprocess.Popen
的一些參數。
brodin提到的另一個解決方案是允許從每個 IP 地址訪問並通過用戶名和密碼進行連接。 據我了解,他並不認為這是最佳實踐。
預先感謝您的幫助。
!!! 這篇文章底部的解決方法!!!
這些是作業期間發生的錯誤級別的日志:
E EXT4-fs (dm-0): couldn't mount as ext3 due to feature incompatibilities
E Image garbage collection failed once. Stats initialization may not have completed yet: unable to find data for container /
E Failed to check if disk space is available for the runtime: failed to get fs info for "runtime": unable to find data for container /
E Failed to check if disk space is available on the root partition: failed to get fs info for "root": unable to find data for container /
E [ContainerManager]: Fail to get rootfs information unable to find data for container /
E Could not find capacity information for resource storage.kubernetes.io/scratch
E debconf: delaying package configuration, since apt-utils is not installed
E % Total % Received % Xferd Average Speed Time Time Time Current
E Dload Upload Total Spent Left Speed
E
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 3698 100 3698 0 0 25674 0 --:--:-- --:--:-- --:--:-- 25860
#-- HERE IS WHEN setup.py FOR MY JOB IS EXECUTED ---
E debconf: delaying package configuration, since apt-utils is not installed
E insserv: warning: current start runlevel(s) (empty) of script `stackdriver-extractor' overrides LSB defaults (2 3 4 5).
E insserv: warning: current stop runlevel(s) (0 1 2 3 4 5 6) of script `stackdriver-extractor' overrides LSB defaults (0 1 6).
E option = Interval; value = 60.000000;
E option = FQDNLookup; value = false;
E Created new plugin context.
E option = PIDFile; value = /var/run/stackdriver-agent.pid;
E option = Interval; value = 60.000000;
E option = FQDNLookup; value = false;
E Created new plugin context.
在這里您可以找到我的自定義 setup.py 啟動后的所有日志(日志級別:任何;所有日志):
作業日志(我在沒有卡住一段時間后手動取消了作業):
2018-06-08 (08:02:20) Autoscaling is enabled for job 2018-06-07_23_02_20-5917188751755240698. The number of workers will b...
2018-06-08 (08:02:20) Autoscaling was automatically enabled for job 2018-06-07_23_02_20-5917188751755240698.
2018-06-08 (08:02:24) Checking required Cloud APIs are enabled.
2018-06-08 (08:02:24) Checking permissions granted to controller Service Account.
2018-06-08 (08:02:25) Worker configuration: n1-standard-1 in europe-west1-b.
2018-06-08 (08:02:25) Expanding CoGroupByKey operations into optimizable parts.
2018-06-08 (08:02:25) Combiner lifting skipped for step Save new watermarks/Write/WriteImpl/GroupByKey: GroupByKey not fol...
2018-06-08 (08:02:25) Combiner lifting skipped for step Group watermarks: GroupByKey not followed by a combiner.
2018-06-08 (08:02:25) Expanding GroupByKey operations into optimizable parts.
2018-06-08 (08:02:26) Lifting ValueCombiningMappingFns into MergeBucketsMappingFns
2018-06-08 (08:02:26) Annotating graph with Autotuner information.
2018-06-08 (08:02:26) Fusing adjacent ParDo, Read, Write, and Flatten operations
2018-06-08 (08:02:26) Fusing consumer Get rows from CloudSQL tables into Begin pipeline with watermarks/Read
2018-06-08 (08:02:26) Fusing consumer Group watermarks/Write into Group watermarks/Reify
2018-06-08 (08:02:26) Fusing consumer Group watermarks/GroupByWindow into Group watermarks/Read
2018-06-08 (08:02:26) Fusing consumer Save new watermarks/Write/WriteImpl/WriteBundles/WriteBundles into Save new watermar...
2018-06-08 (08:02:26) Fusing consumer Save new watermarks/Write/WriteImpl/GroupByKey/GroupByWindow into Save new watermark...
2018-06-08 (08:02:26) Fusing consumer Save new watermarks/Write/WriteImpl/GroupByKey/Reify into Save new watermarks/Write/...
2018-06-08 (08:02:26) Fusing consumer Save new watermarks/Write/WriteImpl/GroupByKey/Write into Save new watermarks/Write/...
2018-06-08 (08:02:26) Fusing consumer Write to BQ into Get rows from CloudSQL tables
2018-06-08 (08:02:26) Fusing consumer Group watermarks/Reify into Write to BQ
2018-06-08 (08:02:26) Fusing consumer Save new watermarks/Write/WriteImpl/Map(<lambda at iobase.py:926>) into Convert dict...
2018-06-08 (08:02:26) Fusing consumer Save new watermarks/Write/WriteImpl/WindowInto(WindowIntoFn) into Save new watermark...
2018-06-08 (08:02:26) Fusing consumer Convert dictionary list to single dictionary and json into Remove "watermark" label
2018-06-08 (08:02:26) Fusing consumer Remove "watermark" label into Group watermarks/GroupByWindow
2018-06-08 (08:02:26) Fusing consumer Save new watermarks/Write/WriteImpl/InitializeWrite into Save new watermarks/Write/W...
2018-06-08 (08:02:26) Workflow config is missing a default resource spec.
2018-06-08 (08:02:26) Adding StepResource setup and teardown to workflow graph.
2018-06-08 (08:02:26) Adding workflow start and stop steps.
2018-06-08 (08:02:26) Assigning stage ids.
2018-06-08 (08:02:26) Executing wait step start25
2018-06-08 (08:02:26) Executing operation Save new watermarks/Write/WriteImpl/DoOnce/Read+Save new watermarks/Write/WriteI...
2018-06-08 (08:02:26) Executing operation Save new watermarks/Write/WriteImpl/GroupByKey/Create
2018-06-08 (08:02:26) Starting worker pool setup.
2018-06-08 (08:02:26) Executing operation Group watermarks/Create
2018-06-08 (08:02:26) Starting 1 workers in europe-west1-b...
2018-06-08 (08:02:27) Value "Group watermarks/Session" materialized.
2018-06-08 (08:02:27) Value "Save new watermarks/Write/WriteImpl/GroupByKey/Session" materialized.
2018-06-08 (08:02:27) Executing operation Begin pipeline with watermarks/Read+Get rows from CloudSQL tables+Write to BQ+Gr...
2018-06-08 (08:02:36) Autoscaling: Raised the number of workers to 0 based on the rate of progress in the currently runnin...
2018-06-08 (08:02:46) Autoscaling: Raised the number of workers to 1 based on the rate of progress in the currently runnin...
2018-06-08 (08:03:05) Workers have started successfully.
2018-06-08 (08:11:37) Cancel request is committed for workflow job: 2018-06-07_23_02_20-5917188751755240698.
2018-06-08 (08:11:38) Cleaning up.
2018-06-08 (08:11:38) Starting worker pool teardown.
2018-06-08 (08:11:38) Stopping worker pool...
2018-06-08 (08:12:30) Autoscaling: Reduced the number of workers to 0 based on the rate of progress in the currently runni...
堆棧跟蹤:
No errors have been received in this time period.
我終於找到了解決方法。 我的想法是通過 CloudSQL 實例的公共 IP 進行連接。 為此,您需要允許從每個 IP 連接到您的 CloudSQL 實例:
Authorization
”選項卡Add network
並添加0.0.0.0/0
( !!這將允許每個 IP 地址連接到您的實例!! )為了增加流程的安全性,我使用了 SSL 密鑰並且只允許 SSL 連接到實例:
SSL
選項卡Create a new certificate
為您的服務器創建 SSL 證書Create a client certificate
證書為您的客戶端創建 SSL 證書Allow only SSL connections
以拒絕所有無 SSL 連接嘗試之后,我將證書存儲在 Google Cloud Storage 存儲桶中並在連接 Dataflow 作業之前加載它們,即:
import psycopg2
import psycopg2.extensions
import os
import stat
from google.cloud import storage
# Function to wait for open connection when processing parallel
def wait(conn):
while 1:
state = conn.poll()
if state == psycopg2.extensions.POLL_OK:
break
elif state == psycopg2.extensions.POLL_WRITE:
pass
select.select([], [conn.fileno()], [])
elif state == psycopg2.extensions.POLL_READ:
pass
select.select([conn.fileno()], [], [])
else:
raise psycopg2.OperationalError("poll() returned %s" % state)
# Function which returns a connection which can be used for queries
def connect_to_db(host, hostaddr, dbname, user, password, sslmode = 'verify-full'):
# Get keys from GCS
client = storage.Client()
bucket = client.get_bucket(<YOUR_BUCKET_NAME>)
bucket.get_blob('PATH_TO/server-ca.pem').download_to_filename('server-ca.pem')
bucket.get_blob('PATH_TO/client-key.pem').download_to_filename('client-key.pem')
os.chmod("client-key.pem", stat.S_IRWXU)
bucket.get_blob('PATH_TO/client-cert.pem').download_to_filename('client-cert.pem')
sslrootcert = 'server-ca.pem'
sslkey = 'client-key.pem'
sslcert = 'client-cert.pem'
con = psycopg2.connect(
host = host,
hostaddr = hostaddr,
dbname = dbname,
user = user,
password = password,
sslmode=sslmode,
sslrootcert = sslrootcert,
sslcert = sslcert,
sslkey = sslkey)
return con
然后我在自定義ParDo
中使用這些函數來執行查詢。
最小的例子:
import apache_beam as beam
class ReadSQLTableNames(beam.DoFn):
'''
parDo class to get all table names of a given cloudSQL database.
It will return each table name.
'''
def __init__(self, host, hostaddr, dbname, username, password):
super(ReadSQLTableNames, self).__init__()
self.host = host
self.hostaddr = hostaddr
self.dbname = dbname
self.username = username
self.password = password
def process(self, element):
# Connect do database
con = connect_to_db(host = self.host,
hostaddr = self.hostaddr,
dbname = self.dbname,
user = self.username,
password = self.password)
# Wait for free connection
wait_select(con)
# Create cursor to query data
cur = con.cursor(cursor_factory=RealDictCursor)
# Get all table names
cur.execute(
"""
SELECT
tablename as table
FROM pg_tables
WHERE schemaname = 'public'
"""
)
table_names = cur.fetchall()
cur.close()
con.close()
for table_name in table_names:
yield table_name["table"]
管道的一部分可能如下所示:
# Current workaround to query all tables:
# Create a dummy initiator PCollection with one element
init = p |'Begin pipeline with initiator' >> beam.Create(['All tables initializer'])
tables = init |'Get table names' >> beam.ParDo(ReadSQLTableNames(
host = known_args.host,
hostaddr = known_args.hostaddr,
dbname = known_args.db_name,
username = known_args.user,
password = known_args.password))
我希望這個解決方案可以幫助其他有類似問題的人
我設法找到了更好或至少更簡單的解決方案。 在 DoFn 設置功能中使用雲代理設置預連接
class MyDoFn(beam.DoFn):
def setup(self):
os.system("wget https://dl.google.com/cloudsql/cloud_sql_proxy.linux.amd64 -O cloud_sql_proxy")
os.system("chmod +x cloud_sql_proxy")
os.system(f"./cloud_sql_proxy -instances={self.sql_args['cloud_sql_connection_name']}=tcp:3306 &")
2022 年要做的簡單而正確的事情是使用雲 sql 連接器,它將與在 gcloud sql 上運行的 postgres、sqlserver 和 mysql 一起使用。
https://cloud.google.com/sql/docs/mysql/connect-connectors#python_1
https://pypi.org/project/cloud-sql-python-connector/
無需將 IP 列入白名單、手動加載證書或讓您的數據庫完全開放。 您對主機使用此格式:“project:region:instance”
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.