
List files in a blob storage container using spark activity in Azure Data Factory V2

I would like to know how to connect to and list the files available in a blob storage container using an activity (preferably PySpark) in Azure Data Factory V2.

There are a few ways that could help you:

When you are using HDInsight Hadoop or Spark clusters in Azure, they are automatically pre-configured to access Azure Storage Blobs via the hadoop-azure module, which implements the standard Hadoop FileSystem interface. You can learn more about how HDInsight uses blob storage at https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-use-blob-storage/
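For example, on a cluster where the hadoop-azure module is configured, a PySpark script can list the files in a container through the Hadoop FileSystem API over a wasbs:// URI. This is a minimal sketch, not a definitive implementation; the account name, container name, and key below are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ListBlobFiles").getOrCreate()

# On a pre-configured HDInsight cluster the account key is usually already set;
# elsewhere it has to be supplied explicitly (placeholder values below).
hadoop_conf = spark._jsc.hadoopConfiguration()
hadoop_conf.set(
    "fs.azure.account.key.mystorageaccount.blob.core.windows.net",
    "<storage-account-key>")

# Point the Hadoop FileSystem API at the container root and list its entries.
Path = spark._jvm.org.apache.hadoop.fs.Path
container_uri = "wasbs://mycontainer@mystorageaccount.blob.core.windows.net/"
fs = Path(container_uri).getFileSystem(hadoop_conf)
for status in fs.listStatus(Path(container_uri)):
    print(status.getPath().getName())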

A detailed guide can be found in this blog post: https://blogs.msdn.microsoft.com/arsen/2016/07/13/accessing-azure-storage-blobs-from-spark-1-6-that-is-running-locally/

Another source, which shows how to use the Storage API with Spark, is this slide deck: https://www.slideshare.net/BrajaDas/azure-blob-storage-api-for-scala-and-spark

This Python script allows access to the blobs via a PySpark script run using Azure Data Factory V2.

https://github.com/Azure-Samples/storage-blobs-python-quickstart/blob/master/example.py

However, I had to use

from azure.storage.blob import BlobService

instead of the suggested

from azure.storage.blob import BlockBlobService
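With the legacy SDK's BlobService, listing the files in a container looks roughly like the sketch below; the account name, key, and container name are placeholders, not values from the original post:

from azure.storage.blob import BlobService

# Placeholder credentials -- substitute your own storage account and key.
blob_service = BlobService(
    account_name="mystorageaccount",
    account_key="<storage-account-key>")

# list_blobs returns the blobs in the container; each entry carries a .name.
for blob in blob_service.list_blobs("mycontainer"):
    print(blob.name)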
