
How to use os.walk() in Databricks to calculate directory size in Azure Data Lake

I want to use os.walk() in Databricks to calculate the directory size in Azure Data Lake. The Python version I'm using is 3.

I first used a recursive method to get the directory size, but it fails with an OOM error when the file path is nested deep inside the directory.

Now I'm curious whether os.walk() will work or not.

Any snippet would help.

The recursive function code is below. [It fails on deeper paths, so I need a different solution.]

from dbutils import FileInfo
from typing import List

root_path = "/mnt/ADLS/...../..../"

def discover_size(path: str, verbose: bool = True):
  def loop_path(paths: List[FileInfo], accum_size: float):
    if not paths:
      return accum_size
    else:
      head, tail = paths[0], paths[1:]
      if head.size > 0:
        if verbose:
            <some code>
      else:
            <some code>
  return loop_path(dbutils.fs.ls(path), 0.0)

discover_size(root_path, verbose=True)
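
For reference, the deep recursion can be avoided with an explicit stack, which sidesteps the failure on deeply nested paths. The sketch below is only illustrative: it assumes dbutils is available in the notebook and that the FileInfo entries returned by dbutils.fs.ls expose path, size and isDir(), as on recent Databricks runtimes; discover_size_iterative is a hypothetical name.

def discover_size_iterative(path: str, verbose: bool = True) -> float:
    # explicit stack of FileInfo entries still to visit, instead of recursion
    stack = list(dbutils.fs.ls(path))
    accum_size = 0.0
    while stack:
        head = stack.pop()
        if head.isDir():
            # push the directory's children; deep trees no longer exhaust the call stack
            stack.extend(dbutils.fs.ls(head.path))
        else:
            accum_size += head.size / 1e6  # accumulate size in MB
            if verbose:
                print(f"{head.path}: {head.size / 1e6} MB")
    return accum_size

print(discover_size_iterative(root_path, verbose=True))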

Can you try this and give feedback?

# Python
import sys, os
root = "/dbfs/mnt/rawdata/"
path = os.path.join(root, "targetdirectory")
for path, subdirs, files in os.walk(root):
    for name in files:
        print(os.path.join(path, name))
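
The snippet above only lists the file paths. Since the goal is directory size, here is a minimal follow-up sketch (assuming the same /dbfs FUSE mount path used above is the one to measure) that sums the file sizes with os.path.getsize:

# Python
import os

root = "/dbfs/mnt/rawdata/"
total_bytes = 0
for dirpath, subdirs, files in os.walk(root):
    for name in files:
        full_path = os.path.join(dirpath, name)
        try:
            total_bytes += os.path.getsize(full_path)  # size of this file in bytes
        except OSError:
            pass  # skip files that vanish or cannot be read
print(f"{root}: {total_bytes / 1e6:.2f} MB")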

Or maybe this?

# Python
import sys, os
import pandas as pd

mylist = []
root = '/dbfs/mnt/rawdata/'
path = os.path.join(root, 'targetdirectory')
for path, subdirs, files in os.walk(root):
    for name in files:
        mylist.append(os.path.join(path, name))

print(len(mylist))  # number of file paths collected
df = pd.DataFrame(mylist, columns=['path'])  # name the single column for readability

display(df)

# convert the pandas df to a Spark df
spark_df = spark.createDataFrame(df)
# write the Spark df out as CSV
spark_df.write.csv("/rawdata/final.csv")
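
If a per-directory breakdown is needed rather than a flat list of paths, one variation is to collect (directory, size) pairs and let Spark do the aggregation. This is only a sketch that assumes it runs in a Databricks notebook where spark and display are predefined; the column names are illustrative, not from the original post:

# Python
import os
import pandas as pd
import pyspark.sql.functions as F

rows = []
for dirpath, subdirs, files in os.walk('/dbfs/mnt/rawdata/'):
    for name in files:
        # record which directory the file lives in and how large it is
        rows.append((dirpath, os.path.getsize(os.path.join(dirpath, name))))

pdf = pd.DataFrame(rows, columns=['directory', 'size_bytes'])
spark_sizes = spark.createDataFrame(pdf)
# sum the sizes per directory and show the result
display(spark_sizes.groupBy('directory').agg(F.sum('size_bytes').alias('total_bytes')))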
