
`os.path.getsize()` slow on Network Drive (Python, Windows)

I have a program that iterates over several thousand PNG files on an SMB shared network drive (a 2TB Samsung 970 Evo+) and adds up their individual file sizes. Unfortunately, it is very slow. After profiling the code, it turns out 90% of the execution time is spent on one function:

filesize += os.path.getsize(png)

where png is the path to a single PNG file (one of the several thousand), inside a for loop that iterates over the list returned by glob.glob() (which, for comparison, accounts for 7.5% of the execution time).
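For reference, here is a minimal sketch of the loop described above (the full code is at the pastebin link below; the share path here is a placeholder):

import glob
import os

# placeholder path; the real path points at the SMB share
pngs = glob.glob(r"\\server\share\*.png")

filesize = 0
for png in pngs:
    filesize += os.path.getsize(png)  # ~90% of the runtime is spent here
print(filesize)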

[profiler screenshot]

The code can be found here: https://pastebin.com/SsDCFHLX

Clearly there is something about obtaining the filesize over the network that is extremely slow, but I'm not sure what. Is there any way I can improve the performance? It takes just as long using filesize += os.stat(png).st_size too.

When the PNG files are stored on the computer locally, the speed is not an issue. It specifically becomes a problem when the files are stored on another machine that I access over the local network with a gigabit ethernet cable. Both are running Windows 10.

[2022-08-21 Update]

I tried it again with a 10 gigabit network connection this time and noticed something interesting. The very first time I run the code on the network share, the profiler looks like this:

[profiler screenshot: first run against the network share]

but if I run it again afterward, glob() takes up significantly less time while getsize() is about the same:

[profiler screenshot: second run against the network share]

if I instead run this code with the PNG files stored on a local NVMe drive (WD SN750) rather than a network drive, here's what the profiler looks like:

[profiler screenshot: run against the local NVMe drive]

It seems like once it is run a second time on the network share, something has been cached that lets glob() run much faster there, at around the same speed as on the local NVMe drive. But getsize() remains extremely slow, at about a tenth of its local speed.

Can somebody help me understand these two points:

  • Why is getsize() so much slower on the network share? Is there something that can be done to speed it up?
  • Why is glob() slow the first time on the network share but not when I run it again immediately afterward?

You can try getting the path from pathlib:

from pathlib import Path

# Build paths inside the project like this: BASE_DIR / 'subdir'.
BASE_DIR = Path(__file__).resolve().parent.parent

If this does not help, read more about the pathlib module.
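For what it's worth, a minimal sketch of how pathlib could be applied to the task in the question (the share path is a placeholder). Note that Path.stat() makes the same underlying call as os.path.getsize(), so on its own it is unlikely to be any faster:

from pathlib import Path

# placeholder for the network share root
base = Path(r"\\server\share")

# Path.glob() finds the files; stat().st_size reads each file's size
total = sum(p.stat().st_size for p in base.glob("*.png"))
print(total)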

I don't know why getsize() is as slow as it is over the network; however, to speed it up, you could try calling it concurrently:

import os
from multiprocessing.pool import ThreadPool

def get_total_filesize_concurrently(paths):
    total = 0

    # 10 worker threads issue getsize() calls in parallel, so the
    # network round trips overlap instead of running back to back
    with ThreadPool(10) as pool:
        for size in pool.imap_unordered(os.path.getsize, paths):
            total += size

    return total


print(get_total_filesize_concurrently([
    r"E:\Path\To\File.txt",
    r"E:\Path\To\File2.txt",
    r"E:\Path\To\File3.txt",
    ...
]))

You can also play around with the number of threads defined in ThreadPool(10) to potentially increase performance even further.
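An equivalent sketch using concurrent.futures from the standard library, which makes the worker count an explicit, tunable parameter (the paths are placeholders):

import os
from concurrent.futures import ThreadPoolExecutor

def total_size_threaded(paths, workers=10):
    # executor.map applies getsize() across the thread pool and
    # yields the results in input order
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return sum(executor.map(os.path.getsize, paths))

print(total_size_threaded(
    [r"E:\Path\To\File.txt", r"E:\Path\To\File2.txt"],
    workers=20,
))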

If you are looking for performance, it is better to try a more native API. win32file comes from the pywin32 package:

import os
import win32file
import time

# Your directory
SRC_DIR = r'\\pc\share'

def GetFileSize(fullpath: str) -> int:
    # Open the file for reading; share mode 0 means exclusive access
    fhandle = win32file.CreateFile(
        fullpath,
        win32file.GENERIC_READ,
        0,
        None,
        win32file.OPEN_EXISTING,
        0,
        None,
    )
    try:
        size = win32file.GetFileSize(fhandle)
    finally:
        fhandle.Close()  # close the handle explicitly
    return size

# You should try in different order too:
# for func in (GetFileSize, os.path.getsize):
for func in (os.path.getsize, GetFileSize):
    filesizes = 0
    tstart = time.perf_counter()  # wall-clock time; process_time() would not count time spent blocked on network I/O
    for dirpath, dirs, filenames in os.walk(SRC_DIR):
        for fname in filenames:
            filesizes += func(os.path.join(dirpath, fname))
    print(f'{func.__name__}: {filesizes} in {time.perf_counter() - tstart}')

In my case, GetFileSize runs almost twice as fast as os.path.getsize.
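For comparison, a standard-library approach that can avoid the per-file open entirely: on Windows, os.scandir() gets each entry's metadata from the directory enumeration itself, so DirEntry.stat() normally needs no extra system call per file, and the sizes come back in bulk with the directory listing. A minimal sketch (the share path is a placeholder):

import os

def total_size_scandir(root):
    total = 0
    stack = [root]
    while stack:
        with os.scandir(stack.pop()) as it:
            for entry in it:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    # on Windows, stat() reuses data cached from the
                    # directory enumeration: no extra round trip per file
                    total += entry.stat().st_size
    return total

print(total_size_scandir(r'\\pc\share'))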

Using GetFileSizeEx. You cannot have code with fewer syscalls.

This is a trimmed down code from this gist: https://gist.github.com/Pagliacii/774ed5d3ea78a36cdb0754be6a25408d

from ctypes import windll, wintypes
from ctypes import Structure, c_longlong, byref, POINTER, WinError

class LARGE_INTEGER(Structure):
    _fields_ = [('QuadPart', c_longlong)]

CreateFileW = windll.kernel32.CreateFileW
CreateFileW.argtypes = (
    wintypes.LPCWSTR,
    wintypes.DWORD,
    wintypes.DWORD,
    wintypes.LPVOID,
    wintypes.DWORD,
    wintypes.DWORD,
    wintypes.HANDLE)
CreateFileW.restype = wintypes.HANDLE

CloseHandle = windll.kernel32.CloseHandle
CloseHandle.argtypes = (wintypes.HANDLE,)
CloseHandle.restype = wintypes.BOOL

GetFileSizeEx = windll.kernel32.GetFileSizeEx
GetFileSizeEx.argtypes = (wintypes.HANDLE, POINTER(LARGE_INTEGER))
GetFileSizeEx.restype = wintypes.BOOL

# Win32 constants
GENERIC_READ = 0x80000000
OPEN_EXISTING = 3
FILE_SHARE_DISABLE = 0  # share mode 0: exclusive access
FILE_ATTRIBUTE_NORMAL = 0x80


INVALID_HANDLE_VALUE = wintypes.HANDLE(-1).value


def get_filesize_win32(path):
    size = LARGE_INTEGER()
    handle = CreateFileW(path,
        GENERIC_READ,
        FILE_SHARE_DISABLE,
        None, OPEN_EXISTING,
        FILE_ATTRIBUTE_NORMAL,
        None)
    # CreateFileW signals failure with INVALID_HANDLE_VALUE, not NULL
    if handle == INVALID_HANDLE_VALUE:
        raise WinError()

    try:
        if not GetFileSizeEx(handle, byref(size)):
            raise WinError()
        return size.QuadPart
    finally:
        CloseHandle(handle)

if __name__ == '__main__':
    print(get_filesize_win32(__file__))
