Apache Beam with GCP Cloud Function

I am trying to create a GCP Dataflow job from a GCP Cloud Function. I have deployed a simple Apache Beam function which works fine, but I get a path error when I try to read an Avro file. The same script runs when I run it from my local machine with the parameter --runner DataflowRunner. Some suggestions say I have to do pip install apache-beam[gcp]. I have already done that locally and it works fine. If I try to install it in GCP, the session times out after some time. Below is my code.


# This script reads all Avro files on a path and prints them
import logging
import os

# import the Apache Beam library and pipeline options
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

#Set log level to info
root = logging.getLogger()
root.setLevel(logging.INFO)

PATH = 'gs://mybucket_34545465/cloud_storage_transfer/'

class ComputeWordLengthFn(beam.DoFn):
  def process(self, element):    
    print(element)   
    return [len(element)]

beam_options = PipelineOptions(
    runner='DataflowRunner',
    project='bigqueryproject-34545465',
    job_name='testgcsaccessfromcloudfunction',
    temp_location='gs://temp_34545465/temp',
    region='us-central1')

def hello_pubsub(data, context):
  p = beam.Pipeline(options=beam_options)
  # create a PCollection from the Avro files
  transactions = (p        
            | 'Read all from AVRO' >> beam.io.avroio.ReadFromAvro(PATH + 'avrofile_*'))
  word_lengths = transactions | beam.ParDo(ComputeWordLengthFn())

  print(word_lengths)

  # Run the pipeline
  result = p.run()
  #  wait until pipeline processing is complete
  result.wait_until_finish()

I get the following error:

Traceback (most recent call last):
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/flask/app.py", line 2073, in wsgi_app
    response = self.full_dispatch_request()
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/flask/app.py", line 1518, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/flask/app.py", line 1516, in full_dispatch_request
    rv = self.dispatch_request()
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/flask/app.py", line 1502, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/functions_framework/__init__.py", line 171, in view_func
    function(data, context)
  File "/workspace/main.py", line 46, in hello_pubsub
    | 'Read all from AVRO' >> beam.io.avroio.ReadFromAvro(PATH + 'avrofile_*'))
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/apache_beam/io/avroio.py", line 145, in __init__
    self._source = _create_avro_source(
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/apache_beam/io/avroio.py", line 285, in _create_avro_source
    _FastAvroSource(
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/apache_beam/io/filebasedsource.py", line 126, in __init__
    self._validate()
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/apache_beam/options/value_provider.py", line 193, in _f
    return fnc(self, *args, **kwargs)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/apache_beam/io/filebasedsource.py", line 187, in _validate
    match_result = FileSystems.match([pattern], limits=[1])[0]
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/apache_beam/io/filesystems.py", line 203, in match
    filesystem = FileSystems.get_filesystem(patterns[0])
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/apache_beam/io/filesystems.py", line 103, in get_filesystem
    raise ValueError(
ValueError: Unable to get filesystem from specified path, please use the correct path or ensure the required dependency is installed, e.g., pip install apache-beam[gcp]. Path specified: gs://mybucket_34545465/cloud_storage_transfer/avrofile_*
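This ValueError means Beam could not find a filesystem handler registered for the gs:// scheme, which is what happens when the [gcp] extras of Beam are missing from the runtime that executes the function. In Cloud Functions you cannot pip install into the runtime interactively; dependencies have to be listed in the requirements.txt deployed next to main.py, along the lines of this sketch:

# requirements.txt deployed with the Cloud Function (sketch; no version pin implied by the original post)
apache-beam[gcp]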



The approach of creating the script directly in the Cloud Function is not the correct way to create the Dataflow job. The solution that worked for me is:

  1. Create a Dataflow script locally.
  2. Deploy it as a template, for example (a sketch of the runtime parameter the script needs is shown after the command):

python Dataflow_script_V3.py --runner DataflowRunner --project project-XXXX --staging_location gs://mybucket/staging --temp_location gs://mybucket/temp --region us-central1 --template_location gs://mybucket/templates/templatename
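For the launch call in the Cloud Function below to be able to pass inputfile at runtime, the locally created script has to declare it as a runtime (template) parameter. A minimal sketch of what that part of Dataflow_script_V3.py could look like, assuming a custom options class and an --inputfile argument (both names are illustrative, not from the original post):

# Dataflow_script_V3.py (sketch): expose "inputfile" as a runtime template parameter
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class AvroOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # A ValueProvider argument is resolved at template launch time,
        # which is what lets the Cloud Function pass "inputfile" later.
        parser.add_value_provider_argument(
            '--inputfile',
            type=str,
            help='GCS pattern of the Avro file(s) to read')

def run():
    options = AvroOptions()
    with beam.Pipeline(options=options) as p:
        (p
         | 'Read Avro' >> beam.io.ReadFromAvro(options.inputfile)
         | 'Print' >> beam.Map(print))

if __name__ == '__main__':
    run()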

  3. Create a Cloud Function (a note on its requirements follows the code):
def ProcessAvroFile(event, context):
    """Background Cloud Function triggered by a Cloud Storage upload.

    Args:
        event (dict): Event payload.
        context (google.cloud.functions.Context): Metadata for the event.
    """
    import datetime

    from googleapiclient.discovery import build

    service = build('dataflow', 'v1b3')

    # Set the following variables to your values.
    current_date = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
    JOBNAME = 'processavro' + current_date
    PROJECT = 'project-XXXX'
    BUCKET = 'mybucket'
    TEMPLATE = 'gs://mybucket/templates/templatename'

    # Get the file that triggered the event.
    file = event
    print(f"Processing file: {file['name']}.")
    filename = "gs://myfilebucket/" + file['name']

    BODY = {
        "jobName": JOBNAME,
        "parameters": {
            "inputfile": filename,
        },
        "environment": {
            "tempLocation": "gs://{bucket}/dataflow/temp".format(bucket=BUCKET),
        },
    }

    # Launch a Dataflow job from the template.
    request = service.projects().templates().launch(
        projectId=PROJECT, gcsPath=TEMPLATE, body=BODY)
    response = request.execute()

    return response
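Note that this function drives Dataflow through the REST API (v1b3) via the Google API client rather than through Beam itself, so the function's own requirements.txt only needs the client library. A minimal sketch, assuming main.py has no other imports:

# requirements.txt deployed with the Cloud Function (sketch)
google-api-python-client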
  4. Set a notification trigger on the Cloud Storage bucket so that uploads trigger the function:

gcloud functions deploy ProcessAvroFile --trigger-bucket=myfilebucket
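A fuller version of that deploy command might look like the sketch below; the runtime and region flags are assumptions, not part of the original answer. With --trigger-bucket, the function is invoked on object finalize events, i.e. once each new file in myfilebucket finishes uploading.

gcloud functions deploy ProcessAvroFile \
    --runtime=python38 \
    --region=us-central1 \
    --trigger-bucket=myfilebucket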

