
How to use os.walk() in Databricks to calculate directory size in Azure Data Lake

I want to use os.walk() in Databricks to calculate the directory size in Azure Data Lake. The Python version I'm using is 3.

I first used a recursive method to get the directory size, but it fails with an OOM error when the file path is nested deep inside the directory.

Now I'm curious whether os.walk() will work or not.

Any snippet would help.

The recursive function code is below. [It fails on deeper paths, so I need a different solution.]

from dbutils import FileInfo
from typing import List

root_path = "/mnt/ADLS/...../..../"

def discover_size(path: str, verbose: bool = True):
  def loop_path(paths: List[FileInfo], accum_size: float):
    if not paths:
      return accum_size
    else:
      head, tail = paths[0], paths[1:]
      if head.size > 0:
        if verbose:
            <some code>
      else:
            <some code>
  return loop_path(dbutils.fs.ls(path), 0.0)

discover_size(root_path, verbose=True)
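
For reference, the deep recursion can be avoided with an explicit stack, which sidesteps the failure on deeply nested paths. The sketch below is only illustrative: it assumes dbutils is available in the notebook and that the FileInfo entries returned by dbutils.fs.ls expose path, size and isDir(), as on recent Databricks runtimes; discover_size_iterative is a hypothetical name.

def discover_size_iterative(path: str, verbose: bool = True) -> float:
    # explicit stack of FileInfo entries still to visit, instead of recursion
    stack = list(dbutils.fs.ls(path))
    accum_size = 0.0
    while stack:
        head = stack.pop()
        if head.isDir():
            # push the directory's children; deep trees no longer exhaust the call stack
            stack.extend(dbutils.fs.ls(head.path))
        else:
            accum_size += head.size / 1e6  # accumulate size in MB
            if verbose:
                print(f"{head.path}: {head.size / 1e6} MB")
    return accum_size

print(discover_size_iterative(root_path, verbose=True))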

Can you try this and give feedback?

# Python
import sys, os
root = "/dbfs/mnt/rawdata/"
path = os.path.join(root, "targetdirectory")
for path, subdirs, files in os.walk(root):
    for name in files:
        print(os.path.join(path, name))
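
The snippet above only lists the file paths. Since the goal is directory size, here is a minimal follow-up sketch (assuming the same /dbfs FUSE mount path used above is the one to measure) that sums the file sizes with os.path.getsize:

# Python
import os

root = "/dbfs/mnt/rawdata/"
total_bytes = 0
for dirpath, subdirs, files in os.walk(root):
    for name in files:
        full_path = os.path.join(dirpath, name)
        try:
            total_bytes += os.path.getsize(full_path)  # size of this file in bytes
        except OSError:
            pass  # skip files that vanish or cannot be read
print(f"{root}: {total_bytes / 1e6:.2f} MB")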

Or maybe this?

# Python
import sys, os
import pandas as pd

mylist = []
root = '/dbfs/mnt/rawdata/'
path = os.path.join(root, 'targetdirectory')
for path, subdirs, files in os.walk(root):
    for name in files:
        mylist.append(os.path.join(path, name))

print(len(mylist))  # number of file paths collected
df = pd.DataFrame(mylist, columns=['path'])  # name the single column for readability

display(df)

# convert the pandas df to a Spark df
spark_df = spark.createDataFrame(df)
# write the Spark df out as CSV
spark_df.write.csv("/rawdata/final.csv")
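
If a per-directory breakdown is needed rather than a flat list of paths, one variation is to collect (directory, size) pairs and let Spark do the aggregation. This is only a sketch that assumes it runs in a Databricks notebook where spark and display are predefined; the column names are illustrative, not from the original post:

# Python
import os
import pandas as pd
import pyspark.sql.functions as F

rows = []
for dirpath, subdirs, files in os.walk('/dbfs/mnt/rawdata/'):
    for name in files:
        # record which directory the file lives in and how large it is
        rows.append((dirpath, os.path.getsize(os.path.join(dirpath, name))))

pdf = pd.DataFrame(rows, columns=['directory', 'size_bytes'])
spark_sizes = spark.createDataFrame(pdf)
# sum the sizes per directory and show the result
display(spark_sizes.groupBy('directory').agg(F.sum('size_bytes').alias('total_bytes')))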
