How can I use os.walk in Databricks to calculate directory size in Azure Data Lake? The Python version I'm using is 3.
I first tried a recursive method to get the directory size, but it fails with an OOM error when the file paths are nested deep inside the directory.
Now I'm curious whether os.walk() will work. Any snippet would help.
The recursive function code is below (it fails on deeper paths, so I need a different solution):
from dbutils import FileInfo
from typing import List

root_path = "/mnt/ADLS/...../..../"

def discover_size(path: str, verbose: bool = True):
    def loop_path(paths: List[FileInfo], accum_size: float):
        if not paths:
            return accum_size
        else:
            head, tail = paths[0], paths[1:]
            if head.size > 0:
                if verbose:
                    <some code>
            else:
                <some code>
    return loop_path(dbutils.fs.ls(path), 0.0)

discover_size(root_path, verbose=True)
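Or could the recursion be avoided entirely with an explicit stack? A minimal sketch of that idea is below; it assumes dbutils.fs.ls reports directories with a trailing "/" in their path (which is how I understand the API; please correct me if not):

# Python
def discover_size_iterative(path: str) -> float:
    """Sum file sizes under `path` without recursion."""
    total = 0.0
    stack = [path]                       # directories still to visit
    while stack:
        current = stack.pop()
        for info in dbutils.fs.ls(current):
            if info.path.endswith("/"):  # assumption: directory paths end in "/"
                stack.append(info.path)
            else:
                total += info.size       # file size in bytes
    return total

print(discover_size_iterative(root_path))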
Can you try this and give feedback?
# Python
import sys, os
root = "/dbfs/mnt/rawdata/"
path = os.path.join(root, "targetdirectory")
for path, subdirs, files in os.walk(root):
    for name in files:
        print(os.path.join(path, name))
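That only prints the file paths, though. Since the goal is directory size, the same walk could sum os.path.getsize over each file instead; a sketch, assuming the /dbfs FUSE mount is available on the driver:

# Python
import os

root = "/dbfs/mnt/rawdata/"
total_bytes = 0
for path, subdirs, files in os.walk(root):
    for name in files:
        # accumulate each file's size in bytes
        total_bytes += os.path.getsize(os.path.join(path, name))

print(total_bytes / (1024 ** 3), "GiB")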
Or maybe this?
# Python
import sys, os
import pandas as pd
mylist = []
root = '/dbfs/mnt/rawdata/'
path = os.path.join(root, 'targetdirectory')
for path, subdirs, files in os.walk(root):
    for name in files:
        mylist.append(os.path.join(path, name))

print(len(mylist))
df = pd.DataFrame(mylist)
display(df)
# convert pandas df to Spark df
spark_df = spark.createDataFrame(df)
# write df out as CSV (Spark writes a directory of part files)
spark_df.write.csv("/rawdata/final.csv")
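And if a per-directory breakdown is wanted rather than a flat file list, the list could carry (directory, size) pairs and pandas can aggregate them; a sketch under the same /dbfs assumption as above:

# Python
import os
import pandas as pd

rows = []
for path, subdirs, files in os.walk('/dbfs/mnt/rawdata/'):
    for name in files:
        # record (directory, file size in bytes) for each file
        rows.append((path, os.path.getsize(os.path.join(path, name))))

df = pd.DataFrame(rows, columns=['directory', 'bytes'])
# total bytes per directory, largest first
sizes = df.groupby('directory')['bytes'].sum().sort_values(ascending=False)
display(sizes.reset_index())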