The goal is to read a file as a byte string within Databricks from an ADLS mount point.
First, running dbutils.fs.mounts() confirms that the mount exists:
... MountInfo(mountPoint='/mnt/ftd', source='abfss://ftd@omitted.dfs.core.windows.net/', encryptionType=''), ...
The file in question is titled TruthTable.csv, and its location has been confirmed with the following command:
dbutils.fs.ls('/mnt/ftd/TruthTable.csv')
which returns:
[FileInfo(path='dbfs:/mnt/ftd/TruthTable.csv', name='TruthTable.csv', size=156)]
To confirm that the file can be read, we can run the following snippet:
filePath = '/mnt/ftd/TruthTable.csv'
spark.read.format('csv').option('header','true').load(filePath)
which successfully returns
DataFrame[p: string, q: string, r: string, s: string]
As the goal is to read the file as a byte string, the following snippet should succeed; however, it does not.
filePath = '/mnt/ftd/TruthTable.csv'
with open(filePath, 'rb') as fin:
    contents = fin.read()
print(contents)
Executing the preceding snippet raises:
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/ftd/TruthTable.csv'
The documentation provided by the Databricks team (https://docs.databricks.com/data/databricks-file-system.html#local-file-apis) works only for files found in the /tmp/ folder; however, the requirement is to read a file directly from the mount point.
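For reference, the documented /tmp/ pattern looks roughly like the sketch below (assuming the mount from above): the file is first copied to the driver's local disk with dbutils.fs.cp, then opened with Python's built-in open(). This is exactly the extra copy the question is trying to avoid.

# Sketch of the documented /tmp/ workaround (extra copy to local disk):
dbutils.fs.cp('/mnt/ftd/TruthTable.csv', 'file:/tmp/TruthTable.csv')
with open('/tmp/TruthTable.csv', 'rb') as fin:
    contents = fin.read()
print(contents)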
Add the /dbfs prefix:
filePath = '/dbfs/mnt/ftd/TruthTable.csv'
with open(filePath, 'rb') as fin:
    contents = fin.read()
print(contents)
Native Databricks functions (such as dbutils) use DBFS as the default location. When you access the file system directly with local file APIs, you need to prepend /dbfs, which is the default mount directory. Alternatively, you can use 'dbfs:/mnt/ftd/TruthTable.csv' with the Databricks-native APIs. Note that this will not work at all on the free Community Edition, as it does not expose the underlying file system; on the Azure, AWS, and Google editions it should work.
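A minimal sketch contrasting the two access modes (assuming the mount from the question; dbutils.fs.head is used here only to illustrate the native-API side):

# Databricks-native APIs resolve paths against DBFS by default,
# so both of these refer to the same file:
dbutils.fs.head('/mnt/ftd/TruthTable.csv')
dbutils.fs.head('dbfs:/mnt/ftd/TruthTable.csv')

# Python's built-in open() sees DBFS under the local /dbfs mount directory:
with open('/dbfs/mnt/ftd/TruthTable.csv', 'rb') as fin:
    contents = fin.read()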
I was able to read the file by replacing the s3a:// bucket prefix with the corresponding /dbfs/mnt/ one:
s3a://s3-bucket/lake/output/dept/2022/09/16/20220916_1643_764250.csv
-> /dbfs/mnt/output/dept/2022/09/16/20220916_1643_764250.csv
I used this:
_path = _path.replace('s3a://s3-bucket/lake', '/dbfs/mnt')
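A hedged sketch wrapping that rewrite in a small helper (the bucket name and lake prefix are taken from the example above; adjust them to match your own mount):

def to_dbfs_local(path: str) -> str:
    # Map the s3a bucket/prefix to the corresponding local /dbfs mount path.
    # 's3a://s3-bucket/lake' and '/dbfs/mnt' are the example values above.
    return path.replace('s3a://s3-bucket/lake', '/dbfs/mnt')

_path = to_dbfs_local('s3a://s3-bucket/lake/output/dept/2022/09/16/20220916_1643_764250.csv')
with open(_path, 'rb') as fin:
    contents = fin.read()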
Hope it helps.
-ed