
Data ingestion to Snowflake from Azure Data Factory

Question: Can anyone help me find a solution for ingesting data from Azure Data Factory into a Snowflake table without using Azure Blob Storage?

Requirements: We currently have a set of customer IDs stored in a Snowflake table. We want to iterate through each customer ID, fetch all of that customer's details from Amazon S3 via a Web API, and write them back to a Snowflake table. The current system uses Azure Databricks (PySpark) to POST the customer ID, GET the related JSON data from S3 via the Web API, parse the JSON to extract the required information, and write it back to Snowflake. However, this process takes at least 3 seconds per record, which we cannot afford given the large data volume we have to process, and running the ADB cluster for a long time costs more. Our proposed solution is, instead of using the Python Web API, to use Azure Data Factory to get the data from the S3 bucket and ingest it into the Snowflake table. Since this is customer data, privacy rules mean we are not supposed to store it in Azure Blob Storage before writing it to Snowflake. Is there any other method we can use to write it to the Snowflake table directly from S3, or through ADF, without using Blob Storage?

You can create a Databricks notebook that reads all the data from S3 and stores it temporarily on DBFS, which is destroyed as soon as the cluster terminates.

ADF -> Databricks Notebook

Databricks:
Read from S3 -> create a PySpark DataFrame -> filter the data based on your condition -> write to Snowflake
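A minimal PySpark sketch of that flow, assuming the Snowflake Spark connector is available on the Databricks cluster; the S3 path, credentials, and table names below are illustrative placeholders, not values from the question:

```python
# Sketch: read customer JSON from S3, filter it, write it straight to Snowflake.
# Assumes the Snowflake Spark connector is installed on the cluster;
# all paths, credentials, and table names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Read the customer JSON files directly from S3 (no Azure Blob staging).
df = spark.read.json("s3a://my-bucket/customer-details/")  # hypothetical bucket/path

# Filter down to the rows you actually need, e.g. valid customer IDs.
filtered = df.filter(F.col("customer_id").isNotNull())

# Connection options for the Snowflake Spark connector (placeholder values).
sf_options = {
    "sfURL": "myaccount.snowflakecomputing.com",
    "sfUser": "MY_USER",
    "sfPassword": "MY_PASSWORD",
    "sfDatabase": "MY_DB",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "MY_WH",
}

# Append the result to a Snowflake table.
(filtered.write
    .format("snowflake")
    .options(**sf_options)
    .option("dbtable", "CUSTOMER_DETAILS")
    .mode("append")
    .save())
```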

Well, if your data is already on S3, you can just use the COPY INTO command: https://docs.snowflake.com/en/user-guide/data-load-s3.html
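As a sketch of that approach driven from Python with the snowflake-connector-python package (the account, stage, bucket, credentials, and table names are placeholders, and the target table is assumed to have a single VARIANT column for the raw JSON):

```python
# Sketch: load files from S3 into Snowflake with COPY INTO, bypassing Azure Blob Storage.
# All identifiers and credentials below are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="myaccount",       # hypothetical account identifier
    user="MY_USER",
    password="MY_PASSWORD",
    warehouse="MY_WH",
    database="MY_DB",
    schema="PUBLIC",
)

try:
    cur = conn.cursor()
    # One-time setup: an external stage pointing at the S3 bucket.
    cur.execute("""
        CREATE STAGE IF NOT EXISTS customer_s3_stage
        URL = 's3://my-bucket/customer-details/'
        CREDENTIALS = (AWS_KEY_ID = 'MY_KEY' AWS_SECRET_KEY = 'MY_SECRET')
        FILE_FORMAT = (TYPE = JSON)
    """)
    # Load straight from the S3 stage into the table.
    # Assumes CUSTOMER_DETAILS_RAW has a single VARIANT column for the JSON.
    cur.execute("COPY INTO CUSTOMER_DETAILS_RAW FROM @customer_s3_stage")
finally:
    conn.close()
```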
