
Read a file as byte string from a Databricks ADLS mount point

The goal is to read a file as a byte string within Databricks from an ADLS mount point.

Confirming the ADLS mount point

First, running dbutils.fs.mounts() confirms that the mount point exists:

... MountInfo(mountPoint='/mnt/ftd', source='abfss://ftd@omitted.dfs.core.windows.net/', encryptionType=''), ...
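When the mount list is long, a small sketch like the one below (not part of the original post, and assuming the /mnt/ftd mount point shown above) can confirm the mount programmatically inside a notebook, where dbutils is available:

# Build a dict of mountPoint -> source from dbutils.fs.mounts()
# and check that the expected mount point is present.
target = '/mnt/ftd'
mounts = {m.mountPoint: m.source for m in dbutils.fs.mounts()}

if target in mounts:
    print(f'{target} is mounted from {mounts[target]}')
else:
    print(f'{target} is not mounted')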

Confirming the existence of the file

The file in question is named TruthTable.csv, and its location has been confirmed using the following command:

dbutils.fs.ls('/mnt/ftd/TruthTable.csv')

which returns:

[FileInfo(path='dbfs:/mnt/ftd/TruthTable.csv', name='TruthTable.csv', size=156)]

Confirming the readability of the file

To confirm that the file can be read, we can run the following snippet.

filePath = '/mnt/ftd/TruthTable.csv'
spark.read.format('csv').option('header','true').load(filePath)

which successfully returns

DataFrame[p: string, q: string, r: string, s: string]
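As an extra sanity check (an addition, not part of the original question), the DataFrame's rows can also be displayed to verify that the data itself is readable, not just the schema:

# Load the CSV through Spark and print a few rows of the truth table.
# Assumes the same filePath as above and a running Databricks cluster.
filePath = '/mnt/ftd/TruthTable.csv'
df = spark.read.format('csv').option('header', 'true').load(filePath)
df.show(5, truncate=False)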

The problem

Since the goal is to read the file as a byte string, the following snippet should succeed; however, it does not.

filePath = '/mnt/ftd/TruthTable.csv'
with open(filePath, 'rb') as fin:
  contents = fin.read()
  print(contents)

Executing the snippet above produces:

FileNotFoundError: [Errno 2] No such file or directory: '/mnt/ftd/TruthTable.csv'

The documentation provided by the Databricks team at https://docs.databricks.com/data/databricks-file-system.html#local-file-apis works only for files found in the /tmp/ folder; however, the requirement is to read a file directly from the mount point.

Please add the /dbfs prefix:

filePath = '/dbfs/mnt/ftd/TruthTable.csv'
with open(filePath, 'rb') as fin:
  contents = fin.read()
  print(contents)

For native Databricks functions (like dbutils), DBFS is used as the default location. When you access the file system directly with Python, you need to prepend /dbfs, which is the default mount directory. Alternatively, you can use 'dbfs:/mnt/ftd/TruthTable.csv' with the Databricks-native APIs. If you use the free Community Edition this will not work at all, as there is no access to the underlying file system; on the Azure, AWS and Google editions it should work.
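To illustrate the distinction, here is a minimal sketch (assuming the same TruthTable.csv path) contrasting the local file API with a Databricks-native utility; dbutils.fs.head is a standard dbutils function that returns the first bytes of a file as a string:

# Local file API (Python's open): needs the /dbfs prefix, because the driver's
# operating system sees DBFS mounted under /dbfs.
with open('/dbfs/mnt/ftd/TruthTable.csv', 'rb') as fin:
    contents = fin.read()   # bytes object, e.g. b'p,q,r,s\n...'

# Databricks-native utilities (dbutils, spark.read): use the dbfs:/ URI
# (or just /mnt/...) instead of the /dbfs/ prefix.
preview = dbutils.fs.head('dbfs:/mnt/ftd/TruthTable.csv', 100)  # first 100 bytes as a string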

I was able to read the file by replacing the s3a:// bucket prefix with the corresponding /dbfs/mnt/ one.

s3a://s3-bucket/lake/output/dept/2022/09/16/20220916_1643_764250.csv
→ /dbfs/mnt/output/dept/2022/09/16/20220916_1643_764250.csv

I used this:

_path = _path.replace('s3a://s3-bucket/lake', '/dbfs/mnt')
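For example, a small helper along these lines (the bucket name and mount layout are assumptions taken from the paths above) makes the translation reusable:

def s3a_to_local(path: str) -> str:
    """Translate an s3a:// URI into the /dbfs/mnt/ path of the corresponding mount.
    Assumes s3-bucket/lake is mounted at /mnt, as in the example above."""
    return path.replace('s3a://s3-bucket/lake', '/dbfs/mnt')

local_path = s3a_to_local('s3a://s3-bucket/lake/output/dept/2022/09/16/20220916_1643_764250.csv')
with open(local_path, 'rb') as fin:   # the local file API works once the path starts with /dbfs
    contents = fin.read()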

Hope it helps.

-ed
