Data ingestion to Snowflake from Azure Data Factory
Question: Can anyone help me find a solution for ingesting data from Azure Data Factory into a Snowflake table without using Azure Blob Storage?
Requirements: We currently have a set of customer IDs stored in a Snowflake table. We want to iterate through each customer ID, fetch all customer details from Amazon S3 using a Web API, and write them back to a Snowflake table. The current system uses Azure Databricks (PySpark) to POST a customer ID, GET the related JSON data from S3 via the Web API, parse the JSON to extract the required info, and write it back to Snowflake.
But this process takes at least 3 seconds per record, and we cannot afford that much time for data ingestion: we have a large data volume to process, and running an ADB cluster for a long time costs more.

The solution we are considering is to use Azure Data Factory, instead of the Python Web API, to get the data from the S3 bucket and ingest it into the Snowflake table. Since this is customer data, privacy rules do not allow us to store it in Azure Blob Storage before writing it to Snowflake. Is there any other method that can write it into the Snowflake table directly from S3, or through ADF, without using Blob Storage?
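For context, the per-record flow described in the question might be sketched as below. The endpoint URL and payload field names are hypothetical placeholders; the point is the one-network-round-trip-per-ID pattern, which is why each record costs roughly 3 seconds.

```python
# Hypothetical sketch of the current per-record flow: one Web API round
# trip per customer ID. Endpoint and field names are placeholders.
import json
import urllib.request

API_URL = "https://example.com/customer-details"  # placeholder endpoint

def fetch_customer(customer_id):
    """POST a customer ID to the Web API and return the related JSON
    that the API fetches from S3."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps({"customer_id": customer_id}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# The slow part is the loop: one blocking request per ID, e.g.
# for cid in customer_ids:
#     details = fetch_customer(cid)   # ~3 s per record
#     ...parse and write to Snowflake...
```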
You can create a Databricks notebook, read all the data from S3, and store it temporarily on DBFS, which is destroyed as soon as the cluster terminates.
ADF -> Databricks notebook

Databricks notebook:
Read from S3 -> create a PySpark DataFrame -> filter the data based on your condition -> write to Snowflake
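A minimal sketch of that notebook step is below, assuming the Snowflake Connector for Spark is attached to the cluster (it registers the "snowflake" data source format). All bucket, account, table, and credential values are placeholders, not values from the question.

```python
# Sketch of the Databricks notebook: S3 -> filter by known IDs -> Snowflake.
# Assumes the Snowflake Spark connector on the cluster; all names below
# are placeholders.

SF_OPTIONS = {
    "sfURL": "<account>.snowflakecomputing.com",  # placeholder account
    "sfDatabase": "CUSTOMER_DB",                  # placeholder database
    "sfSchema": "PUBLIC",
    "sfWarehouse": "INGEST_WH",                   # placeholder warehouse
    "sfUser": "<user>",
    "sfPassword": "<password>",
}

def ingest(spark, s3_path, id_table, target_table, options=SF_OPTIONS):
    """Read customer JSON from S3, keep only rows whose CUSTOMER_ID exists
    in the Snowflake ID table, and append the result to the target table.
    Data is held only in cluster memory/DBFS, never in Azure Blob Storage."""
    details = spark.read.json(s3_path)
    ids = (spark.read.format("snowflake")
           .options(**options)
           .option("dbtable", id_table)
           .load())
    filtered = details.join(ids, on="CUSTOMER_ID", how="inner")
    (filtered.write.format("snowflake")
     .options(**options)
     .option("dbtable", target_table)
     .mode("append")
     .save())

# On Databricks, `spark` is predefined in the notebook, e.g.:
# ingest(spark, "s3a://my-bucket/customers/", "CUSTOMER_IDS", "CUSTOMER_DETAILS")
```

Because the join and write both run inside the cluster, nothing is staged in Azure Blob Storage, which addresses the privacy constraint in the question.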
Well, if your data is already on S3, you can just use the COPY INTO command: https://docs.snowflake.com/en/user-guide/data-load-s3.html
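For illustration, the statement could be built and inspected like this. The table name, S3 URL, and credentials are hypothetical, and in practice you would run the resulting SQL via SnowSQL, the Snowflake web UI, or snowflake-connector-python (a named external stage is also an option instead of inline credentials).

```python
# Sketch: build a COPY INTO statement that loads S3 data straight into
# Snowflake, bypassing Azure entirely. All names are placeholders.

def copy_into_sql(table, s3_url, aws_key_id, aws_secret, file_type="JSON"):
    """Return a COPY INTO statement reading directly from an S3 URL."""
    return (
        f"COPY INTO {table}\n"
        f"FROM '{s3_url}'\n"
        f"CREDENTIALS = (AWS_KEY_ID='{aws_key_id}' AWS_SECRET_KEY='{aws_secret}')\n"
        f"FILE_FORMAT = (TYPE = {file_type})"
    )

# Example with placeholder values:
print(copy_into_sql("CUSTOMER_DETAILS", "s3://my-bucket/customers/",
                    "<aws_key_id>", "<aws_secret>"))
```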