
I want to trigger a Python script using a Cloud Function whenever a specified file is created on Google Cloud Storage

One CSV file is uploaded to the cloud storage every day around 0200 hrs, but sometimes, due to a job failure or system crash, the upload happens very late. So I want to create a cloud function that can trigger my Python bq load script whenever the file is uploaded to the storage.

file_name : seller_data_{date}
bucket name : sale_bucket/

The question lacks enough description of the desired use case and any issues the OP has faced. However, here are a few possible approaches that you might choose from depending on the use case.

  1. The simple way: Cloud Functions with Storage trigger.

This is probably the simplest and most efficient way of running a Python function whenever a file gets uploaded to your bucket. The most basic tutorial is this.
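As a rough sketch (the function name, dataset/table, and the seller_data_ filename prefix are assumptions for illustration, not from the question), a 1st-gen background function triggered by object finalization could look like this:

```python
# main.py -- minimal sketch of a storage-triggered Cloud Function (1st gen).
# Function name, dataset/table, and filename prefix are hypothetical.
from google.cloud import bigquery

def load_to_bq(event, context):
    """Triggered by a google.storage.object.finalize event on the bucket."""
    file_name = event["name"]      # e.g. seller_data_2022-06-22
    bucket_name = event["bucket"]  # e.g. sale_bucket
    if not file_name.startswith("seller_data_"):
        return  # ignore unrelated uploads

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )
    uri = f"gs://{bucket_name}/{file_name}"
    # Kick off the load and wait for it to finish.
    client.load_table_from_uri(uri, "my_dataset.seller_data", job_config=job_config).result()
```

This would be deployed with something along the lines of `gcloud functions deploy load_to_bq --runtime python310 --trigger-resource sale_bucket --trigger-event google.storage.object.finalize`.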

  2. The hard way: App Engine with a few tricks.

Have a basic Flask application hosted on GAE (Standard or Flex), with an endpoint specifically to handle this check that the file exists, download the object, manipulate it, and then do something.

This route can act as a custom HTTP-triggered function; the request could come from a simple curl call, a visit from the browser, a Pub/Sub push, or even another Cloud Function.

Once it receives a GET (or POST) request, it downloads the object into the /tmp dir, processes it, and then does something.
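A minimal sketch of such an endpoint (the route, bucket, and object names are assumptions for illustration):

```python
# app.py -- minimal Flask sketch for GAE; bucket/object names are hypothetical.
import os
from flask import Flask
from google.cloud import storage

app = Flask(__name__)

@app.route("/check-file", methods=["GET", "POST"])
def check_file():
    blob = storage.Client().bucket("sale_bucket").blob("seller_data_2022-06-22")
    if not blob.exists():
        return "file not uploaded yet", 404
    local_path = "/tmp/seller_data.csv"
    blob.download_to_filename(local_path)
    # ... process the CSV / kick off the bq load here ...
    os.remove(local_path)  # always clean /tmp when done
    return "done", 200
```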

The small benefit of GAE over CF is that you can set a minimum of one instance to stay always alive, which means you will not have cold starts or risk the request timing out before the job is done.

  3. The brutal/overkill way: Cloud Run.

A similar approach to App Engine, but with Cloud Run you'll also need to work with a Dockerfile, keep in mind that Cloud Run will scale down to zero when there's no usage, and deal with other minor things that apply to building any application on Cloud Run.

########################################

For all the above approaches, some additional things you might want to achieve are the same:

a) Downloading the object and doing some processing on it:

You will have to download it to the /tmp directory, as that's the directory where both GAE and CF can store temporary files. Cloud Run is a bit different here, but let's not get deep into it as it's overkill by itself.

However, keep in mind that if your file is large you might cause high memory usage.

And ALWAYS clean that directory after you have finished with the file. Also, when opening a file, always use with open ... as ..., since it will also make sure not to keep files open.
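A sketch of that download-process-clean pattern (the helper name and paths are assumed for illustration):

```python
# Hypothetical helper showing the /tmp download, `with open`, and cleanup pattern.
import os
from google.cloud import storage

def process_object(bucket_name: str, object_name: str) -> None:
    tmp_path = os.path.join("/tmp", os.path.basename(object_name))
    storage.Client().bucket(bucket_name).blob(object_name).download_to_filename(tmp_path)
    try:
        with open(tmp_path) as f:  # `with` guarantees the handle is closed
            for line in f:
                ...                # process each row
    finally:
        os.remove(tmp_path)        # ALWAYS clean /tmp afterwards
```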

b) Downloading the latest object in the bucket:

This is a bit tricky and needs some extra custom code. There are many ways to achieve it, but the one I use (though always paying close attention to memory usage) is that upon creation of the object I upload to the bucket, I get the current time and use a regex to transform it into something like results_22_6.
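For illustration, a name like results_22_6 could be derived from the current time like this (plain datetime formatting here rather than the regex approach the answer mentions; the exact scheme is an assumption):

```python
# Hypothetical naming helper: derive an object name from the current time.
from datetime import datetime, timezone

now = datetime.now(timezone.utc)
object_name = f"results_{now.day}_{now.hour}"  # e.g. results_22_6
```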

What happens now is that once I list the objects from my other script, they are already listed in ascending order, so the last element in the list is the latest object.

So basically, what I do then is check whether the filename I have in /tmp is the same as the name of the object[list.length] in the bucket. If yes, do nothing; if no, delete the old one and download the latest one from the bucket.
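A sketch of that check (the bucket name and /tmp layout are assumed; note that list_blobs returns objects in lexicographic name order, so the last entry is the newest only if the naming scheme sorts chronologically):

```python
# Hypothetical sketch: keep /tmp in sync with the newest object in the bucket.
import os
from google.cloud import storage

def sync_latest(bucket_name: str) -> str:
    client = storage.Client()
    names = [blob.name for blob in client.list_blobs(bucket_name)]
    latest = names[-1]  # last element == latest object, per the naming scheme
    local_path = os.path.join("/tmp", latest)
    if not os.path.exists(local_path):
        # Drop any stale copies, then fetch the newest object.
        for stale in os.listdir("/tmp"):
            os.remove(os.path.join("/tmp", stale))
        client.bucket(bucket_name).blob(latest).download_to_filename(local_path)
    return local_path
```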

This might not be optimal, but for me it's kinda preferable.
