In my PySpark code, I need to count all avro files created in any sub-directories of a given directory after a specified timestamp and store that count in a variable.
Any recommendations or examples of how to accomplish this in PySpark would be much appreciated!
The following demonstrates how you can get the count of avro files created after a specified timestamp.
In my example, the sub-directories sub1 and sub2 contain the files. I have mounted my storage account in my Databricks workspace. You can use the following code to get the required result.
Create a string path pointing to your directory. Use os.listdir() to list all the contents of the directory (in this case, the sub-directories).
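import os

directory_name = 'dir/'
path_to_directory = "/dbfs/mnt/data/" + directory_name
# List the contents of the directory (here, the sub-directories)
list_of_sub_directories = os.listdir(path_to_directory)
# Build the full path of each sub-directory
sub_directory_paths = [os.path.join(path_to_directory, sub_directory) for sub_directory in list_of_sub_directories]
print(sub_directory_paths)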
['/dbfs/mnt/data/dir/sub1', '/dbfs/mnt/data/dir/sub2']
Use os.listdir() again to get the contents of the sub-directories and build the full paths for those files as well, collecting them all in a list.
file_paths = []
for directory in sub_directory_paths:
    # Prefix each filename with its sub-directory to get a full path
    file_paths.extend([os.path.join(directory, filename) for filename in os.listdir(directory)])
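The listing above only looks one level below the directory. If the avro files can sit at any depth, a minimal sketch using os.walk from the standard library could collect the paths instead. The function name list_files_recursively is hypothetical and not part of the original answer.

```python
import os

def list_files_recursively(root_directory):
    """Return the full path of every file under root_directory, at any depth.

    Hypothetical helper for illustration; not part of the original answer.
    """
    file_paths = []
    # os.walk yields (current directory, its sub-directories, its files)
    for current_dir, _sub_dirs, filenames in os.walk(root_directory):
        for filename in filenames:
            file_paths.append(os.path.join(current_dir, filename))
    return file_paths
```

Calling list_files_recursively("/dbfs/mnt/data/dir/") would then stand in for the two os.listdir() steps, assuming the mount path used in this example.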
For each file ending with .avro, use os.stat() to get its creation time and compare it with the specified timestamp to count the avro files created after it.
from datetime import datetime

files_created_after_time = datetime(2022, 6, 29, 16, 45, 0)
#print(files_created_after_time)
count = 0
files_required = []
for file in file_paths:
    if file.endswith('.avro'):
        file_stats = os.stat(file)
        # Note: on Unix, st_ctime is the time of the last metadata change, not strictly the creation time
        file_created_date = datetime.fromtimestamp(file_stats.st_ctime)
        if file_created_date > files_created_after_time:
            count += 1
            files_required.append(file)
print("Number of avro files created after " + str(files_created_after_time) + " is: " + str(count))
print("The files are: ", files_required)
You can follow this example and make the necessary changes to achieve the desired output.
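Since the question asks for the count to be stored in a variable, the filtering step could also be wrapped in a small function. This is only a sketch: the name count_files_created_after and its parameters are hypothetical, and it relies on st_ctime just as the code above does, which on Unix reflects the last metadata change rather than a true creation time.

```python
import os
from datetime import datetime

def count_files_created_after(file_paths, cutoff, suffix=".avro"):
    """Return (count, matching paths) for files ending in suffix whose
    st_ctime is later than cutoff.

    Hypothetical helper for illustration; cutoff is a naive local datetime.
    """
    matching = [
        path for path in file_paths
        if path.endswith(suffix)
        # st_ctime: creation time on Windows, last metadata change on Unix
        and datetime.fromtimestamp(os.stat(path).st_ctime) > cutoff
    ]
    return len(matching), matching
```

For example, count, files_required = count_files_created_after(file_paths, datetime(2022, 6, 29, 16, 45, 0)) would leave the count in a plain Python variable that the rest of the PySpark job can use.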