Reading a file as a byte string from a Databricks ADLS mount point

Question

The goal is to read a file from an ADLS mount point as a byte string in Databricks.

Confirming the ADLS mount point

First, dbutils.fs.mounts() confirms that the following mount exists:

... MountInfo(mountPoint='/mnt/ftd', source='abfss://ftd@omitted.dfs.core.windows.net/', encryptionType=''), ...

Confirming the file exists

The file in question is named TruthTable.csv, and its location has been confirmed with the following command:

dbutils.fs.ls('/mnt/ftd/TruthTable.csv')

which returns:

[FileInfo(path='dbfs:/mnt/ftd/TruthTable.csv', name='TruthTable.csv', size=156)]

Confirming the file is readable

To confirm that the file can be read, we can run the following snippet:

filePath = '/mnt/ftd/TruthTable.csv'
spark.read.format('csv').option('header','true').load(filePath)

which successfully returns:

DataFrame[p: string, q: string, r: string, s: string]

Problem

Since the goal is to read the file as a byte string, the following snippet should succeed, but it does not.

filePath = '/mnt/ftd/TruthTable.csv'
with open(filePath, 'rb') as fin:
  contents = fin.read()
  print(contents)

Running the snippet above outputs:

FileNotFoundError: [Errno 2] No such file or directory: '/mnt/ftd/TruthTable.csv'

The documentation provided by the Databricks team at https://docs.databricks.com/data/databricks-file-system.html#local-file-apis only works for files in the /tmp/ folder. The requirement, however, is to read the file directly from the mount point.

Answer 1

Add the /dbfs prefix:

filePath = '/dbfs/mnt/ftd/TruthTable.csv'
with open(filePath, 'rb') as fin:
  contents = fin.read()
  print(contents)

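Databricks-native functions (such as dbutils) use DBFS as their default location. When you access the file system directly, you need to prepend /dbfs, which is the default mount directory. Alternatively, you can use 'dbfs:/mnt/ftd/TruthTable.csv' with the Databricks-native functions. On the free Community Edition this will not work at all, because there is no access to the underlying file system. On the Azure, AWS, and Google editions it should work.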
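The prefix rule described above can be wrapped in a small helper so that the same path string works for both dbutils and open(). This is only a sketch: the names to_local_path and read_bytes are hypothetical, and it assumes DBFS is exposed to the driver under /dbfs as this answer states.

```python
# Assumption: DBFS is exposed to the driver's local file system under /dbfs.
DBFS_FUSE_ROOT = "/dbfs"


def to_local_path(path: str) -> str:
    """Map a DBFS-style path to the path seen by Python's local file APIs."""
    if path.startswith("dbfs:/"):
        path = path[len("dbfs:"):]  # 'dbfs:/mnt/x' -> '/mnt/x'
    if path == DBFS_FUSE_ROOT or path.startswith(DBFS_FUSE_ROOT + "/"):
        return path  # already carries the /dbfs prefix
    return DBFS_FUSE_ROOT + "/" + path.lstrip("/")


def read_bytes(path: str) -> bytes:
    """Read a whole file as a byte string through the /dbfs mount."""
    with open(to_local_path(path), "rb") as fin:
        return fin.read()
```

In a notebook, read_bytes('/mnt/ftd/TruthTable.csv') and read_bytes('dbfs:/mnt/ftd/TruthTable.csv') would then both open /dbfs/mnt/ftd/TruthTable.csv. The helper treats any path that already starts with /dbfs/ as a local path, which is a simplification.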

Answer 2

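I was able to read the file by replacing s3a:// and the bucket prefix with the corresponding /dbfs/mnt/. The first path below is the original and the second is the one that works:

s3a://s3-bucket/lake/output/dept/2022/09/16/20220916_1643_764250.csv
/dbfs/mnt/output/dept/2022/09/16/20220916_1643_764250.csv

I use this: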

_path = _path.replace('s3a://s3-bucket/lake', '/dbfs/mnt')

Hope this helps.
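A slightly stricter variant of the same substitution is sketched below. It only rewrites the prefix when it appears at the start of the URI and raises otherwise, whereas str.replace would also rewrite a match in the middle of the string. The function name s3a_to_local is hypothetical, and the default prefix and mount root are simply the values from this answer; adjust them to your own mount.

```python
def s3a_to_local(uri: str,
                 bucket_prefix: str = "s3a://s3-bucket/lake",
                 mount_root: str = "/dbfs/mnt") -> str:
    """Swap a leading s3a bucket prefix for the local /dbfs/mnt path."""
    if not uri.startswith(bucket_prefix + "/"):
        # Refuse to guess when the URI is not under the expected bucket prefix.
        raise ValueError(f"{uri!r} does not start with {bucket_prefix!r}")
    # Keep everything after the prefix, including its leading slash.
    return mount_root + uri[len(bucket_prefix):]
```

The returned path can then be passed to open(path, 'rb') as in the first answer.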

-ed

2 answers

Answer 1 0 2021-11-17 13:20:55

Answer 2 0 2022-09-16 20:54:41
