
lightweight ETL using Google Cloud Storage and Cloud Functions with Python 3.7

I'm new to GCS and Cloud Functions and would like to understand how I can do a lightweight ETL using these two technologies combined with Python (3.7).

I have a GCS bucket called 'Test_1233' containing 3 files (all structurally identical). When a new file is added to this gcs bucket, I would like the following python code to run, produce an 'output.csv' file, and save it in the same bucket. The code I'm trying to run is below:

import pandas as pd     
import glob 
import os 
import re
import numpy as np


path  = os.getcwd()  
files = os.listdir(path) ## Originally this was intended for finding files in the local directory - I now need this adapted for finding files within gcs(!)

### Loading Files by Variable ###
df   = pd.DataFrame()
data = pd.DataFrame()

for files in glob.glob('gs://test_1233/Test *.xlsx'): ## attempts to find all relevant files within the gcs bucket

    data = pd.read_excel(files,'Sheet1',skiprows=1).fillna(method='ffill') 
    date = re.compile(r'([\.\d]+ - [\.\d]+)').search(files).groups()[0] 
    data['Date'] = date
    data['Start_Date'], data['End_Date'] = data['Date'].str.split(' - ', 1).str
    data['End_Date'] = data['End_Date'].str[:10]
    data['Start_Date'] = data['Start_Date'].str[:10]
    data['Start_Date'] =pd.to_datetime(data['Start_Date'],format ='%d.%m.%Y',errors='coerce') 
    data['End_Date']= pd.to_datetime(data['End_Date'],format ='%d.%m.%Y',errors='coerce')
    df = df.append(data)

df['Product'] = np.where(df['Product'] =='BR: Tpaste Adv Wht 2x120g','ToothpasteWht2x120g',df['Product']) 

## Stores the cleaned data back into the same gcs bucket as a 'csv' file
df.to_csv('Test_Output.csv')

As I'm totally new to this, I'm not sure how I create the correct path to read all the files within the cloud environment (I used to read files from my local directory!).

Any help would be most appreciated.

If you want to download files from somewhere and (temporarily) write them to local files in the Cloud Functions runtime, be sure you read the documentation:

The only writeable part of the filesystem is the /tmp directory, which you can use to store temporary files in a function instance. This is a local disk mount point known as a "tmpfs" volume in which data written to the volume is stored in memory. Note that it will consume memory resources provisioned for the function.

The rest of the file system is read-only and accessible to the function.
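For the /tmp route, a minimal sketch could look like the following. The (event, context) signature and the 'bucket'/'name' keys are what a background function receives from a GCS trigger; the function name and the cleanup step are illustrative assumptions, not anything fixed:

import os
from google.cloud import storage

def on_new_file(event, context):
    """Sketch: triggered by a GCS 'finalize' event, downloads the new
    object into /tmp, the only writable mount in the Cloud Functions
    filesystem."""
    client = storage.Client()
    bucket = client.bucket(event['bucket'])
    blob = bucket.blob(event['name'])

    local_path = os.path.join('/tmp', os.path.basename(event['name']))
    blob.download_to_filename(local_path)

    # ... run the pandas transformations against local_path here ...

    os.remove(local_path)  # /tmp is backed by memory, so free it when done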

Or, you can just read and work with them directly in memory, as the file contents will consume memory either way.
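For example, reading one of your Excel blobs straight into pandas without touching disk might look like this (sketch only; the object name below is a made-up example following your 'Test *.xlsx' pattern):

import io
import pandas as pd
from google.cloud import storage

client = storage.Client()
bucket = client.bucket('test_1233')

# Hypothetical object name matching the asker's naming scheme
blob = bucket.blob('Test 01.01.2019 - 07.01.2019.xlsx')

content = blob.download_as_string()  # returns bytes, held entirely in memory
data = pd.read_excel(io.BytesIO(content), 'Sheet1', skiprows=1)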

You'll need to download the files from Google Cloud Storage into your Cloud Function environment (and upload the result back) using the google-cloud-storage module. See the google-cloud-storage documentation.
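Putting it together, a hedged sketch of your whole ETL as a background function might look like the code below. The transformation steps are lifted from your script; bucket.list_blobs() with a prefix stands in for the glob.glob() call, and the guard at the top stops the function retriggering itself when it writes Test_Output.csv back into the same bucket. The function name and the assumption that the date range lives in the object name are illustrative:

import io
import re
import numpy as np
import pandas as pd
from google.cloud import storage

def etl_on_upload(event, context):
    # Skip the event fired by our own output object, or we loop forever
    if event['name'] == 'Test_Output.csv':
        return

    client = storage.Client()
    bucket = client.bucket('test_1233')

    frames = []
    # list_blobs() with a prefix replaces the local glob.glob() call
    for blob in bucket.list_blobs(prefix='Test '):
        if not blob.name.endswith('.xlsx'):
            continue
        data = pd.read_excel(io.BytesIO(blob.download_as_string()),
                             'Sheet1', skiprows=1).fillna(method='ffill')
        # Assumes the object name contains 'DD.MM.YYYY - DD.MM.YYYY'
        date = re.search(r'([\.\d]+ - [\.\d]+)', blob.name).group(1)
        start, end = date.split(' - ', 1)
        data['Start_Date'] = pd.to_datetime(start[:10], format='%d.%m.%Y',
                                            errors='coerce')
        data['End_Date'] = pd.to_datetime(end[:10], format='%d.%m.%Y',
                                          errors='coerce')
        frames.append(data)

    df = pd.concat(frames, ignore_index=True)
    df['Product'] = np.where(df['Product'] == 'BR: Tpaste Adv Wht 2x120g',
                             'ToothpasteWht2x120g', df['Product'])

    # Upload the result back to the same bucket as a CSV
    bucket.blob('Test_Output.csv').upload_from_string(
        df.to_csv(index=False), content_type='text/csv')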
