
Counting files in a directory and its sub-directories created after a specified timestamp in PySpark

In my PySpark code, I need to count all avro files created in any sub-directories of a given directory after a specified timestamp and store that count in a variable.

Any recommendations or examples of how to accomplish this in PySpark would be much appreciated!

The following demonstrates how to count the avro files created after a specified timestamp.

  • I have the following folder structure in my storage account.

[Image: folder structure of the directory]

  • The sub-directories sub1 and sub2 have the files as shown below.

[Image: contents of sub1]

[Image: contents of sub2]

  • I have mounted my storage account in a Databricks workspace. You can use the following code to get the required result.

  • Create a string path pointing to your directory. Use os.listdir() to list the contents of the directory (in this case, the sub-directories).

import os

directory_name = 'dir/'
path_to_directory = "/dbfs/mnt/data/" + directory_name
list_of_sub_directories = os.listdir(path_to_directory)
  • Concatenate the path to the directory with the names of the sub-directories.
sub_directory_paths = [path_to_directory+sub_directory for sub_directory in list_of_sub_directories]
print(sub_directory_paths)

['/dbfs/mnt/data/dir/sub1', '/dbfs/mnt/data/dir/sub2']
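
If the directory also held loose files next to its sub-directories, calling os.listdir() on each entry afterwards would raise an error. The following is a minimal sketch of one way to guard against that, not part of the original answer. The helper name list_sub_directories is hypothetical, and a temporary directory stands in for the mounted storage path.

```python
import os
import tempfile


def list_sub_directories(path_to_directory):
    """Return full paths of the immediate sub-directories, sorted by name."""
    return sorted(
        os.path.join(path_to_directory, name)
        for name in os.listdir(path_to_directory)
        if os.path.isdir(os.path.join(path_to_directory, name))
    )


# Demo on a throwaway directory (assumption: stands in for /dbfs/mnt/data/dir/).
with tempfile.TemporaryDirectory() as demo_root:
    os.mkdir(os.path.join(demo_root, "sub1"))
    os.mkdir(os.path.join(demo_root, "sub2"))
    # A loose file at the top level that must not be treated as a directory.
    open(os.path.join(demo_root, "notes.txt"), "w").close()
    demo_sub_directories = [
        os.path.basename(path) for path in list_sub_directories(demo_root)
    ]

print(demo_sub_directories)
```

Using os.path.join() also removes the need for the trailing slash in directory_name.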
  • Use os.listdir() to get the contents of the sub-directories and build the full paths for those files as well, collecting them all in a single list.
file_paths = []
for directory in sub_directory_paths:
    file_paths.extend([directory+'/'+filename for filename in os.listdir(directory)])
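
The loop above reaches only files that sit directly inside sub1 and sub2. Since the question asks about any sub-directory, a recursive walk may be a better fit. The following is a sketch using os.walk() from the standard library. The helper name collect_file_paths and the demo file names are assumptions made for illustration.

```python
import os
import tempfile


def collect_file_paths(root_directory, suffix=".avro"):
    """Walk root_directory recursively and return paths ending with suffix."""
    matches = []
    for current_directory, _sub_directories, filenames in os.walk(root_directory):
        for filename in filenames:
            if filename.endswith(suffix):
                matches.append(os.path.join(current_directory, filename))
    return sorted(matches)


# Demo on a throwaway tree with one nested level (hypothetical file names).
with tempfile.TemporaryDirectory() as demo_root:
    os.makedirs(os.path.join(demo_root, "sub1", "deeper"))
    for relative in ("sub1/a.avro", "sub1/deeper/b.avro", "sub1/c.txt"):
        open(os.path.join(demo_root, *relative.split("/")), "w").close()
    # Report paths relative to the root, with forward slashes on every platform.
    demo_matches = [
        os.path.relpath(path, demo_root).replace(os.sep, "/")
        for path in collect_file_paths(demo_root)
    ]

print(demo_matches)
```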
  • Now create a timestamp (to get the files created after it). Loop through the list of all files, keep the ones ending with .avro, use os.stat() to read each file's creation time, and compare it with the timestamp to count the avro files created after it. Note that on Linux, which Databricks clusters run on, st_ctime is the time of the last metadata change rather than a true creation time, so treat it as an approximation.
from datetime import datetime

# Only files created after this (naive, local-time) timestamp are counted.
files_created_after_time = datetime(2022, 6, 29, 16, 45, 0)

count = 0
files_required = []

for file in file_paths:
    if file.endswith('.avro'):
        file_stats = os.stat(file)
        # On Linux, st_ctime is the last metadata change, not the true creation time.
        file_created_date = datetime.fromtimestamp(file_stats.st_ctime)
        if file_created_date > files_created_after_time:
            count += 1
            files_required.append(file)

print("Number of avro files created after " + str(files_created_after_time) + " is: " + str(count))
print("The files are:", files_required)
  • Output:

[Image: output of the code above]

You can follow this example and adapt it to your needs. As written, it only inspects files one level below the given directory.
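
One possible adaptation is to compare against the modification time (st_mtime) instead of st_ctime. This is a different criterion from the one used in the answer, swapped in here because a file's modification time can be set explicitly, which makes the comparison easy to check. The following is a sketch under that assumption. The helper name count_files_modified_after and the demo files are hypothetical.

```python
import os
import tempfile
from datetime import datetime


def count_files_modified_after(file_paths, cutoff, suffix=".avro"):
    """Count paths ending with suffix whose st_mtime is later than cutoff.

    cutoff is a naive datetime interpreted in the machine's local time zone,
    consistent with datetime.fromtimestamp() as used in the answer.
    """
    cutoff_epoch = cutoff.timestamp()
    selected = [
        path
        for path in file_paths
        if path.endswith(suffix) and os.stat(path).st_mtime > cutoff_epoch
    ]
    return len(selected), selected


cutoff = datetime(2022, 6, 29, 16, 45, 0)

with tempfile.TemporaryDirectory() as demo_root:
    old_file = os.path.join(demo_root, "old.avro")
    new_file = os.path.join(demo_root, "new.avro")
    other_file = os.path.join(demo_root, "new.txt")
    for path in (old_file, new_file, other_file):
        open(path, "w").close()
    # Force known times: one hour before the cutoff for old_file, one hour after for the rest.
    os.utime(old_file, (cutoff.timestamp() - 3600,) * 2)
    os.utime(new_file, (cutoff.timestamp() + 3600,) * 2)
    os.utime(other_file, (cutoff.timestamp() + 3600,) * 2)
    demo_count, demo_files = count_files_modified_after(
        [old_file, new_file, other_file], cutoff
    )
    demo_names = [os.path.basename(path) for path in demo_files]

print(demo_count, demo_names)
```

Comparing epoch seconds directly avoids building a datetime object for every file. Whether modification time is an acceptable stand-in for creation time depends on how the files are written to storage.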

