
How to mount file as a file object using PySpark in Azure Synapse

I have an Azure storage account (ADLS Gen2) and need to copy files such as config.yaml, text files and gz files into it so that I can reference them inside my code. I have tried the steps listed in https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/synapse-file-mount-api, but what this does is mount a filesystem. If you then reference a file with, for example, yaml_file_test = mssparkutils.fs.head("synfs:/79/myMount/Test2/config.yaml", 100), it returns the file's contents as a preview string, not a file object.
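For context, the mount itself is created roughly as in that doc; in the sketch below the storage account, container and linked-service names are placeholders:

from notebookutils import mssparkutils

# Mount an ADLS Gen2 path through a linked service (all names are placeholders)
mssparkutils.fs.mount(
    "abfss://mycontainer@mystorageaccount.dfs.core.windows.net",
    "/myMount",
    {"linkedService": "MyLinkedService"}
)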

The yaml file contains a lot of variables that are defined to be used throughout the project.

What I'm trying to achieve is, something like below.

import yaml
from yaml.loader import SafeLoader

with open('synfs:/80/myMount/Test2/config.yaml') as f:
    data = yaml.load(f, Loader=SafeLoader)
    print(data)

The problem is that Python's open() doesn't recognise the synfs: path and I get an error: FileNotFoundError: [Errno 2] No such file or directory: 'synfs:/80/myMount/Test2/config.yaml'

I have to access other files too in a similar manner and open them as file objects to traverse and operate on. For example, some libraries such as wordninja expect a "gz" file, not a dataframe, and when I try the same approach I get the above error.
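For illustration, this is the kind of call that needs a real local file path; the .gz path below is a hypothetical placeholder:

import wordninja

# wordninja.LanguageModel expects a local path to a gzipped word-frequency file,
# not a dataframe and not a synfs:/ style URI
lm = wordninja.LanguageModel('/some/local/path/my_words.txt.gz')
print(lm.split('derekanderson'))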

If my approach is not correct, can anyone help with how to actually define global variables inside the Azure Synapse environment, and how to create file objects from Azure storage?

Just to note, I have also tried other methods of reading from storage, like the code below, but the problem is that all of them only return a path to be read into a dataframe.

spark.conf.set("spark.storage.synapse.linkedServiceName", LinkService)
        spark.conf.set("fs.azure.account.oauth.provider.type", "com.microsoft.azure.synapse.tokenlibrary.LinkedServiceBasedTokenProvider")
        print("Connection Setup Successful!")
        return
    except Exception as e:
        print("Connection Setup Failed!- "+str(e))
        return -1

def spark_init(app_name: str = 'Mytest'):
    spark = SparkSession.builder.appName(app_name).getOrCreate()
    sc = spark.sparkContext
    return (spark, sc)

def getStream(streamsetlocation) :

  try:

    spark, sc = spark_init()
    setupConnection(spark,LinkService)
    print(streamsetlocation)
    dfStandardized = spark.read.format("csv").options(header=True).load(streamsetlocation)
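Calling it only ever gives me back a dataframe (the path below is a hypothetical placeholder), never a file object I can hand to other libraries:

df = getStream('abfss://mycontainer@mystorageaccount.dfs.core.windows.net/Test2/some_file.csv')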

Any help would be deeply appreciated.

I could not get the above mount point to read/write binary files, but I used fsspec to write a Python pickle file to Azure Blob Storage and read it back.

import pickle
import fsspec

# TokenLibrary is provided by the Synapse runtime; the empty strings are
# placeholders for the linked service, storage account and container names.
filename = 'final_model.sav'
sas_key = TokenLibrary.getConnectionString('')
storage_account_name = ''
container = ''
fsspec_handle = fsspec.open(f'abfs://{container}/{filename}',
                            account_name=storage_account_name, sas_token=sas_key, mode='wb')
with fsspec_handle.open() as o_file:
    pickle.dump(model, o_file)  # model is the trained object being persisted
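Reading it back is symmetric; a minimal sketch using the same placeholder names as above:

# Load the pickled model back from Blob Storage
fsspec_handle = fsspec.open(f'abfs://{container}/{filename}',
                            account_name=storage_account_name, sas_token=sas_key, mode='rb')
with fsspec_handle.open() as i_file:
    model = pickle.load(i_file)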
