[英]Reading data from MySQL and writing in GCP Bucket using Apache Beam Python API
我正在尝试从 MySQL 数据库(位于 GCP 中)读取数据并将其写入 GCP Bucket。 我想同样使用 Python SDK。 下面是我写的代码。
from __future__ import generators
import apache_beam as beam
import time
import jaydebeapi
import os
import argparse
from google.cloud import bigquery
import logging
import sys
from string import lower
from google.cloud import storage as gstorage
import pandas as pd
from oauth2client.client import GoogleCredentials
print("Import Successful")
class setenv(beam.DoFn):
def process(self,context):
os.system('gsutil cp gs://<MY_BUCKET>/mysql-connector-java-5.1.45.jar /tmp/')
logging.info('Enviornment Variable set.')
class readfromdatabase(beam.DoFn):
def process(self, context):
logging.info('inside readfromdatabase')
database_user='root'
database_password='<DB_PASSWORD>'
database_host='<DB_HOST>'
database_port='<DB_PORT>'
database_db='<DB_NAME>'
logging.info('reached readfromdatabase 1')
jclassname = "com.mysql.jdbc.Driver"
url = ("jdbc:mysql://{0}:{1}/{2}?user={3}&password={4}".format(database_host,database_port, database_db,database_user,database_password))
jars = ["/tmp/mysql-connector-java-5.1.45.jar"]
libs = None
cnx = jaydebeapi.connect(jclassname, url, jars=jars,
libs=libs)
logging.info('Connection Successful..')
cursor = cnx.cursor()
logging.info('Reading Sql Query from the file...')
query = 'select * from employees.employee_details'
logging.info('Query is %s',query)
logging.info('Query submitted to Database..')
for chunk in pd.read_sql(query, cnx, coerce_float=True, params=None, parse_dates=None, columns=None,chunksize=500000):
chunk.apply(lambda x: x.replace(u'\r', u' ').replace(u'\n', u' ') if isinstance(x, str) or isinstance(x, unicode) else x).WriteToText('gs://first-bucket-arpan/output2/')
logging.info("Load completed...")
return list("1")
def run():
try:
pcoll = beam.Pipeline()
dummy= pcoll | 'Initializing..' >> beam.Create(['1'])
logging.info('inside run 1')
dummy_env = dummy | 'Setting up Instance..' >> beam.ParDo(setenv())
logging.info('inside run 2')
readrecords=(dummy_env | 'Processing' >> beam.ParDo(readfromdatabase()))
logging.info('inside run 3')
p=pcoll.run()
logging.info('inside run 4')
p.wait_until_finish()
except:
logging.exception('Failed to launch datapipeline')
raise
def main():
logging.getLogger().setLevel(logging.INFO)
GOOGLE_APPLICATION_CREDENTIALS="gs://<MY_BUCKET>/My First Project-800a97e1fe65.json"
run()
if __name__ == '__main__':
logging.getLogger().setLevel(logging.INFO)
main()
我正在使用以下命令运行: python /home/aarpan_roy/readfromdatabase.py --region $REGION --runner DataflowRunner --project $PROJECT --temp_location gs://$BUCKET/tmp
运行时,下注 output 并且没有创建数据流作业:
INFO:root:inside run 2
INFO:root:inside run 3
INFO:apache_beam.runners.portability.fn_api_runner.translations:==================== <function annotate_downstream_side_inputs at 0x7fc1ea8a27d0> ====================
INFO:apache_beam.runners.portability.fn_api_runner.translations:==================== <function fix_side_input_pcoll_coders at 0x7fc1ea8a28c0> ====================
INFO:apache_beam.runners.portability.fn_api_runner.translations:==================== <function lift_combiners at 0x7fc1ea8a2938> ====================
INFO:apache_beam.runners.portability.fn_api_runner.translations:==================== <function expand_sdf at 0x7fc1ea8a29b0> ====================
INFO:apache_beam.runners.portability.fn_api_runner.translations:==================== <function expand_gbk at 0x7fc1ea8a2a28> ====================
INFO:apache_beam.runners.portability.fn_api_runner.translations:==================== <function sink_flattens at 0x7fc1ea8a2b18> ====================
INFO:apache_beam.runners.portability.fn_api_runner.translations:==================== <function greedily_fuse at 0x7fc1ea8a2b90> ====================
INFO:apache_beam.runners.portability.fn_api_runner.translations:==================== <function read_to_impulse at 0x7fc1ea8a2c08> ====================
INFO:apache_beam.runners.portability.fn_api_runner.translations:==================== <function impulse_to_input at 0x7fc1ea8a2c80> ====================
INFO:apache_beam.runners.portability.fn_api_runner.translations:==================== <function sort_stages at 0x7fc1ea8a2e60> ====================
INFO:apache_beam.runners.portability.fn_api_runner.translations:==================== <function setup_timer_mapping at 0x7fc1ea8a2de8> ====================
INFO:apache_beam.runners.portability.fn_api_runner.translations:==================== <function populate_data_channel_coders at 0x7fc1ea8a2ed8> ====================
INFO:apache_beam.runners.worker.statecache:Creating state cache with size 100
INFO:apache_beam.runners.portability.fn_api_runner.worker_handlers:Created Worker handler <apache_beam.runners.portability.fn_api_runner.worker_handlers.EmbeddedWorkerHandler object at 0x7fc1ea356ad0> for environment ref_Environment_default_environment_1 (beam:env:embedded_python:v1, '')
INFO:apache_beam.runners.portability.fn_api_runner.fn_runner:Running (ref_AppliedPTransform_Initializing../Impulse_3)+((ref_AppliedPTransform_Initializing../FlatMap(<lambda at core.py:2632>)_4)+((ref_AppliedPTransform_Initializing../Map(decode)_6)+((ref_AppliedPTransform_Setting up Instance.._7)+(ref_AppliedPTransform_Processing_8))))
Copying gs://<MY_BUCKET>/mysql-connector-java-5.1.45.jar...
- [1 files][976.4 KiB/976.4 KiB]
Operation completed over 1 objects/976.4 KiB.
INFO:root:Enviornment Variable set.
INFO:root:inside run 4
请帮我解决这个问题,并指导如何使用 Apache Beam Python4814ACED173Z4814A04A02A810D6F33DD956F42794Z 将数据从 MySql 提取到 GCP Bucket
提前致谢。
==================================================== ============================== 大家好,我稍微更改了代码并在导出 GOOGLE_APPLICATION_CREDENTIALS 后从 shell 脚本运行它. 但是,当它执行时,出现以下错误:
WARNING:apache_beam.internal.gcp.auth:Unable to find default credentials to use: File /home/aarpan_roy/script/dataflowerviceaccount.json (pointed by GOOGLE_APPLICATION_CREDENTIALS environment variable) does not exist!
Connecting anonymously.
WARNING:apache_beam.options.pipeline_options:Discarding unparseable args: ['True', '--service_account_name', 'dataflowserviceaccount', '--service_account_key_file', '/home/aarpan_roy/script/dataflowserviceaccount.json']
WARNING:apache_beam.options.pipeline_options:Discarding unparseable args: ['True', '--service_account_name', 'dataflowserviceaccount', '--service_account_key_file', '/home/aarpan_roy/script/dataflowserviceaccount.json']
ERROR:root:Failed to launch datapipeline"
以下是总日志文件:
aarpan_roy@my-dataproc-cluster-m:~/script/util$ sh -x dataflow_runner.sh
+ export GOOGLE_APPLICATION_CREDENTIALS=gs://<MY_BUCKET>/My First Project-800a97e1fe65.json
+ python /home/aarpan_roy/script/util/loadfromdatabase.py --config config.properties --productconfig cts.properties --env dev --sourcetable employee_details --sqlquery /home/aarpan_roy/script/sql/employee_details.sql --connectionprefix d3 --incrementaldate 1900-01-01
/home/aarpan_roy/.local/lib/python2.7/site-packages/apache_beam/__init__.py:82: UserWarning: You are using Apache Beam with Python 2. New releases of Apache Beam will soon support Python 3 only.
'You are using Apache Beam with Python 2. '
/home/aarpan_roy/script/util/
INFO:root:Job Run Id is 22152
INFO:root:Job Name is load-employee-details-20200820
SELECT * FROM EMPLOYEE_DETAILS
WARNING:apache_beam.options.pipeline_options_validator:Option --zone is deprecated. Please use --worker_zone instead.
INFO:apache_beam.runners.portability.stager:Downloading source distribution of the SDK from PyPi
INFO:apache_beam.runners.portability.stager:Executing command: ['/usr/bin/python', '-m', 'pip', 'download', '--dest', '/tmp/tmpEarTQ0', 'apache-beam==2.23.0', '--no-deps', '--no-binary', ':all:']
INFO:apache_beam.runners.portability.stager:Staging SDK sources from PyPI: dataflow_python_sdk.tar
INFO:apache_beam.runners.portability.stager:Downloading binary distribution of the SDK from PyPi
INFO:apache_beam.runners.portability.stager:Executing command: ['/usr/bin/python', '-m', 'pip', 'download', '--dest', '/tmp/tmpEarTQ0', 'apache-beam==2.23.0', '--no-deps', '--only-binary', ':all:', '--python-version', '27', '--implementation', 'cp', '--abi', 'cp27mu', '--platform', 'manylinux1_x86_64']
INFO:apache_beam.runners.portability.stager:Staging binary distribution of the SDK from PyPI: apache_beam-2.23.0-cp27-cp27mu-manylinux1_x86_64.whl
WARNING:root:Make sure that locally built Python SDK docker image has Python 2.7 interpreter.
INFO:root:Using Python SDK docker image: apache/beam_python2.7_sdk:2.23.0. If the image is not available at local, we will try to pull from hub.docker.com
INFO:apache_beam.internal.gcp.auth:Setting socket default timeout to 60 seconds.
INFO:apache_beam.internal.gcp.auth:socket default timeout is 60.0 seconds.
WARNING:apache_beam.internal.gcp.auth:Unable to find default credentials to use: File /home/aarpan_roy/script/dataflowerviceaccount.json (pointed by GOOGLE_APPLICATION_CREDENTIALS environment variable) does not exist!
Connecting anonymously.
INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://<MY_BUCKET>/load-employee-details-20200820.1597916566.568419/pipeline.pb...
INFO:oauth2client.transport:Attempting refresh to obtain initial access_token
INFO:oauth2client.transport:Attempting refresh to obtain initial access_token
INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to gs://<MY_BUCKET>/load-employee-details-20200820.1597916566.568419/pipeline.pb in 0 seconds.
INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://<MY_BUCKET>/load-employee-details-20200820.1597916566.568419/pickled_main_session...
INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to gs://<MY_BUCKET>/load-employee-details-20200820.1597916566.568419/pickled_main_session in 0 seconds.
INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://<MY_BUCKET>/load-employee-details-20200820.1597916566.568419/dataflow_python_sdk.tar...
INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to gs://<MY_BUCKET>/load-employee-details-20200820.1597916566.568419/dataflow_python_sdk.tar in 0 seconds.
INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://<MY_BUCKET>/load-employee-details-20200820.1597916566.568419/apache_beam-2.23.0-cp27-cp27mu-manylinux1_x86_64.whl...
INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to gs://<MY_BUCKET>/load-employee-details-20200820.1597916566.568419/apache_beam-2.23.0-cp27-cp27mu-manylinux1_x86_64.whl in 0 seconds.
WARNING:apache_beam.options.pipeline_options:Discarding unparseable args: ['True', '--service_account_name', 'dataflowserviceaccount', '--service_account_key_file', '/home/aarpan_roy/script/dataflowserviceaccount.json']
WARNING:apache_beam.options.pipeline_options:Discarding unparseable args: ['True', '--service_account_name', 'dataflowserviceaccount', '--service_account_key_file', '/home/aarpan_roy/script/dataflowserviceaccount.json']
ERROR:root:Failed to launch datapipeline
Traceback (most recent call last)
File "/home/aarpan_roy/script/util/loadfromdatabase.py", line 105, in run
p=pcoll.run()
File "/home/aarpan_roy/.local/lib/python2.7/site-packages/apache_beam/pipeline.py", line 521, in run
allow_proto_holders=True).run(False)
File "/home/aarpan_roy/.local/lib/python2.7/site-packages/apache_beam/pipeline.py", line 534, in run
return self.runner.run_pipeline(self, self._options)
File "/home/aarpan_roy/.local/lib/python2.7/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 586, in run_pipeline
self.dataflow_client.create_job(self.job), self)
File "/home/aarpan_roy/.local/lib/python2.7/site-packages/apache_beam/utils/retry.py", line 236, in wrapper
return fun(*args, **kwargs)
File "/home/aarpan_roy/.local/lib/python2.7/site-packages/apache_beam/runners/dataflow/internal/apiclient.py", line 681, in create_job
return self.submit_job_description(job)
File "/home/aarpan_roy/.local/lib/python2.7/site-packages/apache_beam/utils/retry.py", line 236, in wrapper
return fun(*args, **kwargs)
File "/home/aarpan_roy/.local/lib/python2.7/site-packages/apache_beam/runners/dataflow/internal/apiclient.py", line 748, in submit_job_description
response = self._client.projects_locations_jobs.Create(request)
File "/home/aarpan_roy/.local/lib/python2.7/site-packages/apache_beam/runners/dataflow/internal/clients/dataflow/dataflow_v1b3_client.py", line 667, in Create
config, request, global_params=global_params)
File "/home/aarpan_roy/.local/lib/python2.7/site-packages/apitools/base/py/base_api.py", line 731, in _RunMethod
return self.ProcessHttpResponse(method_config, http_response, request)
File "/home/aarpan_roy/.local/lib/python2.7/site-packages/apitools/base/py/base_api.py", line 737, in ProcessHttpResponse
self.__ProcessHttpResponse(method_config, http_response, request))
File "/home/aarpan_roy/.local/lib/python2.7/site-packages/apitools/base/py/base_api.py", line 604, in __ProcessHttpResponse
http_response, method_config=method_config, request=request)
HttpForbiddenError: HttpError accessing <https://dataflow.googleapis.com/v1b3/projects/turing-thought-277215/locations/asia-southeast1/jobs?alt=json>: response: <{'status': '403', 'content-length': '138', 'x-xss-protection': '0', 'x-content-type-options': 'nosniff', 'transfer-encoding': 'chunked', 'vary': 'Origin, X-Origin, Referer', 'server': 'ESF', '-content-encoding': 'gzip', 'cache-control': 'private', 'date': 'Thu, 20 Aug 2020 09:42:47 GMT', 'x-frame-options': 'SAMEORIGIN', 'content-type': 'application/json; charset=UTF-8', 'www-authenticate': 'Bearer realm="https://accounts.google.com/", error="insufficient_scope", scope="https://www.googleapis.com/auth/compute.readonly https://www.googleapis.com/auth/compute https://www.googleapis.com/auth/cloud-platform https://www.googleapis.com/auth/userinfo.email email https://www.googleapis.com/auth/userinfo#email"'}>, content <{
"error": {
"code": 403,
"message": "Request had insufficient authentication scopes.",
"status": "PERMISSION_DENIED"
}
}
>
Traceback (most recent call last):
File "/home/aarpan_roy/script/util/loadfromdatabase.py", line 194, in <module>
args.sourcetable)
File "/home/aarpan_roy/script/util/loadfromdatabase.py", line 142, in main
run()
File "/home/aarpan_roy/script/util/loadfromdatabase.py", line 105, in run
p=pcoll.run()
File "/home/aarpan_roy/.local/lib/python2.7/site-packages/apache_beam/pipeline.py", line 521, in run
allow_proto_holders=True).run(False)
File "/home/aarpan_roy/.local/lib/python2.7/site-packages/apache_beam/pipeline.py", line 534, in run
return self.runner.run_pipeline(self, self._options)
File "/home/aarpan_roy/.local/lib/python2.7/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 586, in run_pipeline
self.dataflow_client.create_job(self.job), self)
File "/home/aarpan_roy/.local/lib/python2.7/site-packages/apache_beam/utils/retry.py", line 236, in wrapper
return fun(*args, **kwargs)
File "/home/aarpan_roy/.local/lib/python2.7/site-packages/apache_beam/runners/dataflow/internal/apiclient.py", line 681, in create_job
return self.submit_job_description(job)
File "/home/aarpan_roy/.local/lib/python2.7/site-packages/apache_beam/utils/retry.py", line 236, in wrapper
return fun(*args, **kwargs)
File "/home/aarpan_roy/.local/lib/python2.7/site-packages/apache_beam/runners/dataflow/internal/apiclient.py", line 748, in submit_job_description
response = self._client.projects_locations_jobs.Create(request)
File "/home/aarpan_roy/.local/lib/python2.7/site-packages/apache_beam/runners/dataflow/internal/clients/dataflow/dataflow_v1b3_client.py", line 667, in Create
config, request, global_params=global_params)
File "/home/aarpan_roy/.local/lib/python2.7/site-packages/apitools/base/py/base_api.py", line 731, in _RunMethod
return self.ProcessHttpResponse(method_config, http_response, request)
File "/home/aarpan_roy/.local/lib/python2.7/site-packages/apitools/base/py/base_api.py", line 737, in ProcessHttpResponse
self.__ProcessHttpResponse(method_config, http_response, request))
File "/home/aarpan_roy/.local/lib/python2.7/site-packages/apitools/base/py/base_api.py", line 604, in __ProcessHttpResponse
http_response, method_config=method_config, request=request)
apitools.base.py.exceptions.HttpForbiddenError: HttpError accessing <https://dataflow.googleapis.com/v1b3/projects/turing-thought-277215/locations/asia-southeast1/jobs?alt=json>: response: <{'status': '403', 'content-length': '138', 'x-xss-protection': '0', 'x-content-type-options': 'nosniff', 'transfer-encoding': 'chunked', 'vary': 'Origin, X-Origin, Referer', 'server': 'ESF', '-content-encoding': 'gzip', 'cache-control': 'private', 'date': 'Thu, 20 Aug 2020 09:42:47 GMT', 'x-frame-options': 'SAMEORIGIN', 'content-type': 'application/json; charset=UTF-8', 'www-authenticate': 'Bearer realm="https://accounts.google.com/", error="insufficient_scope", scope="https://www.googleapis.com/auth/compute.readonly https://www.googleapis.com/auth/compute https://www.googleapis.com/auth/cloud-platform https://www.googleapis.com/auth/userinfo.email email https://www.googleapis.com/auth/userinfo#email"'}>, content <{
"error": {
"code": 403,
"message": "Request had insufficient authentication scopes.",
"status": "PERMISSION_DENIED"
}
}
>
+ echo 1
1
我无法理解我在哪里犯了错误。 请帮助我解决问题。 提前致谢。
我认为您的帐户凭据中可能存在拼写错误
/home/aarpan_roy/script/dataflowerviceaccount.json
应该是/home/aarpan_roy/script/dataflowserviceaccount.json
。 你能检查一下吗?
这里似乎有两个问题:
一方面是身份验证问题,请执行@Jayadeep 提供的测试,以验证是否是凭据名称的问题。
同样,凭据处理可能存在问题,因为您在 Dataproc 实例中运行代码,因此我建议您在Cloud Shell中测试您的代码
另一方面,我发现另一篇文章提到 python 中与云 SQL 的连接似乎不如 Java 透明(使用jdbIO )。
同时,我发现另一个帖子提到了连接到云 SQL 但使用psycopg2
而不是jaydebeapi
的解决方法。 我建议你试试。
import psycopg2
connection = psycopg2.connect(
host = host,
hostaddr = hostaddr,
dbname = dbname,
user = user,
password = password,
sslmode=sslmode,
sslrootcert = sslrootcert,
sslcert = sslcert,
sslkey = sslkey
)
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.