Using tqdm on a for loop inside a function to check progress

Question

I'm iterating over a large group of files in a directory tree using a for loop.

While doing so, I want to monitor progress with a progress bar in the console, so I decided to use tqdm.

Currently, my code looks like this:

for dirPath, subdirList, fileList in tqdm(os.walk(target_dir)):
    sleep(0.01)
    dirName = dirPath.split(os.path.sep)[-1]
    for fname in fileList:
        *****

Output:

Scanning Directory....
43it [00:23, 11.24 it/s]

My problem is that it does not show a progress bar, only a running count. I want to know how to use tqdm properly and to better understand how it works. Are there also any alternatives to tqdm that could be used here?

Answer 1

You can't show a percentage complete unless you know what "complete" means.

While os.walk is running, it doesn't know how many files and folders it's going to end up iterating over: os.walk returns a generator, which has no __len__. It would have to look all the way down the directory tree, enumerating all the files and folders, in order to count them. In other words, os.walk would have to do all of its work twice to tell you how many items it's going to produce, which is inefficient.
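
You can check this yourself. A minimal sketch, where a temporary directory stands in for target_dir (an assumption made only for the demo):

```python
import os
import tempfile

# A throwaway directory stands in for target_dir (assumption for this demo).
with tempfile.TemporaryDirectory() as tmp:
    walker = os.walk(tmp)

    # os.walk returns a generator, and generators do not define __len__.
    has_len = hasattr(walker, "__len__")

    # Calling len() on it therefore raises TypeError.
    try:
        len(walker)
        len_error = None
    except TypeError as exc:
        len_error = exc

print(has_len)
print(len_error)
```

Without a length, tqdm has nothing to divide by, so it can only show a running count and a rate.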

If you're dead set on showing a progress bar, you could spool the data into an in-memory list: list(os.walk(target_dir)). I don't recommend this. If you're traversing a large directory tree, this could consume a lot of memory. Worse, if followlinks is True and you have a cyclic directory structure (with children linking to their parents), it could loop forever until you run out of RAM.

Answer 2

As explained in the documentation, this is because tqdm needs a total to measure progress against. Depending on what you do with your files, you can use either the file count or the total file size.
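
For the size-based variant, here is a rough sketch that advances a single bar by the number of bytes in each file. The helper names iter_files and process_by_size are made up for this illustration, and the temporary directory only stands in for your real tree; total, unit and unit_scale are regular tqdm parameters.

```python
import os
import tempfile

from tqdm import tqdm


def iter_files(folder):
    """Yield the full path of every file below folder (hypothetical helper)."""
    for dirpath, _dirs, files in os.walk(folder):
        for name in files:
            yield os.path.join(dirpath, name)


def process_by_size(folder):
    """Advance one bar by each file's size; return the number of bytes handled."""
    # First pass: add up the sizes only, no list of paths is kept.
    total_bytes = sum(os.path.getsize(path) for path in iter_files(folder))

    handled = 0
    with tqdm(total=total_bytes, unit="B", unit_scale=True) as pbar:
        # Second pass: do the real work and report bytes instead of items.
        for path in iter_files(folder):
            size = os.path.getsize(path)
            # ... real work on path would go here ...
            pbar.update(size)
            handled += size
    return handled


# Demo on a throwaway tree: one 10-byte file and one 20-byte file.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    for rel, size in (("a.bin", 10), (os.path.join("sub", "b.bin"), 20)):
        with open(os.path.join(tmp, rel), "wb") as fh:
            fh.write(b"\0" * size)
    handled_bytes = process_by_size(tmp)

print(handled_bytes)
```

Counting bytes tends to give a steadier ETA than counting files when the file sizes vary a lot, at the cost of one extra getsize call per file.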

Other answers suggest converting the os.walk() generator into a list so that it has a length (__len__). However, this can cost a lot of memory, depending on the total number of files you have.

Another possibility is to precompute: first walk the whole file tree once and count the files (without keeping the list of files, just the count), then walk it again and pass the precomputed count to tqdm:

import os
from time import sleep
from tqdm import tqdm

def walkdir(folder):
    """Walk through every file in a directory tree"""
    for dirpath, dirs, files in os.walk(folder):
        for filename in files:
            yield os.path.abspath(os.path.join(dirpath, filename))

# Precomputing the file count (no total yet, so this bar is only a counter)
filescount = 0
for _ in tqdm(walkdir(target_dir)):
    filescount += 1

# Computing for real
for filepath in tqdm(walkdir(target_dir), total=filescount):
    sleep(0.01)
    # etc...

Notice that I defined a wrapper function over os.walk: since you are working on files and not on directories, it's better to define a function that iterates over files rather than over directories.

You can get the same result without the walkdir wrapper, but it is a bit more complicated, because you have to resume from the last progress bar state after each subfolder is traversed:

# Precomputing
filescount = 0
for dirPath, subdirList, fileList in tqdm(os.walk(target_dir)):
    filescount += len(fileList)

# Computing for real
last_state = 0
for dirPath, subdirList, fileList in os.walk(target_dir):
    sleep(0.01)
    dirName = dirPath.split(os.path.sep)[-1]
    for fname in tqdm(fileList, total=filescount, initial=last_state):
        pass  # do whatever you want here...
    # Update last state to resume the progress bar
    last_state += len(fileList)

Answer 3

This happens because tqdm doesn't know how long the result of os.walk will be: os.walk returns a generator, so len can't be called on it. You can fix this by converting os.walk(target_dir) to a list first:

for dirPath, subdirList, fileList in tqdm(list(os.walk(target_dir))):

From the documentation of the tqdm module:

len(iterable) is used if possible. As a last resort, only basic progress statistics are displayed (no ETA, no progressbar).

But len(os.walk(target_dir)) isn't possible, so there is no ETA or progress bar.

As Benjamin pointed out, using list does use some memory, but in my test it was not much: a spooled directory tree of ~190,000 files caused Python to use about 65 MB of memory with this code on my Windows 10 machine.
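
If you want a rough number for your own tree before committing to this, something like the following could help. It is only a sketch: walk_list_memory is a made-up helper name, the temporary directory is a stand-in for your real path, and tracemalloc reports Python-level allocations, which is not the same thing as the process memory figure quoted above.

```python
import os
import tempfile
import tracemalloc


def walk_list_memory(folder):
    """Return (number of directories, peak traced bytes) for list(os.walk(folder))."""
    tracemalloc.start()
    entries = list(os.walk(folder))  # the whole walk is held in memory here
    _current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return len(entries), peak


# Demo on a throwaway tree with a single subdirectory.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    n_dirs, peak_bytes = walk_list_memory(tmp)

print(n_dirs, peak_bytes)
```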

Answer 4

Here is a more succinct way of precomputing the number of files and then providing a status bar on the files:

import os
from tqdm import tqdm

file_count = sum(len(files) for _, _, files in os.walk(folder))  # First pass: count the files
with tqdm(total=file_count) as pbar:  # One bar with a known total
    for root, dirs, files in os.walk(folder):  # Second pass: walk the directory tree
        for name in files:
            pbar.update(1)  # Increment the progress bar
            # Process the file in the walk

Answer 5

You can show progress over the files in each directory with tqdm this way. Note that a separate bar is drawn for every directory that os.walk visits.

import os
from tqdm import tqdm

target_dir = os.path.join(os.getcwd(), "..Your path name")  # it has 212 files
for r, d, f in os.walk(target_dir):
    for file in tqdm(f, total=len(f)):
        filepath = os.path.join(r, file)
        # f'Your operation on file..{filepath}'

20%|████████████████████ | 42/212 [05:07<17:58, 6.35s/it]

This gives you a progress bar like the one above.
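
If the per-directory bars pile up on screen, one option is to label each bar and clear it once its directory is done. A minimal sketch, using tqdm's desc and leave parameters; process_tree and the process callback are hypothetical names introduced for the example, and the temporary directory is only demo data.

```python
import os
import tempfile

from tqdm import tqdm


def process_tree(target_dir, process=lambda path: None):
    """Show one labelled bar per directory; return the number of files handled."""
    handled = 0
    for root, _dirs, files in os.walk(target_dir):
        # desc labels the bar with the directory name; leave=False clears
        # the bar when the directory is finished.
        label = os.path.basename(root) or root
        for name in tqdm(files, desc=label, leave=False):
            process(os.path.join(root, name))
            handled += 1
    return handled


# Demo on a throwaway tree with three empty files.
seen = []
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    for rel in ("a.txt", os.path.join("sub", "b.txt"), os.path.join("sub", "c.txt")):
        open(os.path.join(tmp, rel), "w").close()
    handled_files = process_tree(tmp, process=seen.append)

print(handled_files)
```

This still gives no overall percentage. For that you need a total for the whole tree, as in the precomputing answers.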

Answer 6

Here is my solution to a similar problem:

import os
import time
import tqdm

for root, dirs, files in os.walk(local_path):
    # One bar per directory, sized by the number of files in it
    for fname in tqdm.tqdm(files):
        time.sleep(0.1)
        full_fname = os.path.join(root, fname)

6 answers

solution1
7 2016-03-13 11:26:31

solution2
3 2016-05-15 01:34:10

solution3
2 2016-03-13 11:25:07

solution4
1 2020-11-27 16:08:17

solution5
1 2021-05-17 15:32:20

solution6
0 2016-08-01 19:22:13
