A better way to retrieve a directory tree from an NFS-mounted directory with 50,000+ files

I've been brought in to work on an existing CMS and File Management web application that provides a merchant with a management interface for their online webshops. The management application is developed in PHP.

When website users view the webshops, the page assets (mainly images in nested folder paths) are referenced directly from the webshops' HTML and are served directly from a web server that is separate from the CMS system.

But in order to list, search, and navigate the files (i.e. the File Management part), the CMS application needs to be able to access the directory structure of the files and folders.

So we are using Linux NFS mounts from the CMS server to the document file server. This works fairly well if the number of files in any specific merchant's directory tree is not too large (fewer than 10,000). However, some merchants have more than 100,000 files in a nested directory tree. Walking a tree of this size to get just the directory structure can take more than 120 seconds.

Retrieving just the list of files in any one directory is quite fast, but the problem comes when we try to identify which of these "files" are actually directory entries, so we can recurse down the tree.

It seems that the PHP functions to check the file type (either calling "is_dir" on each filepath retrieved with "readdir" or "scandir", or using "glob" with the flag GLOB_ONLYDIR) work on each file individually, not in bulk. So thousands and thousands of NFS commands are being sent. From my research so far, it seems that this is a limitation of NFS, not of PHP.

A stripped down class showing just the function in question:

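class clImagesDocuments {

    public $dirArr;

    // Collects every directory below $dir into $this->dirArr,
    // keyed by its path relative to $dir.
    function getDirsRecursive( $dir ) {

        if ( !is_dir( $dir )) {
            return false;
        }

        // glob() can return false on error; treat that as "no subdirectories".
        $subDirs = glob( $dir . "/*", GLOB_ONLYDIR );
        if ( $subDirs === false ) {
            $subDirs = array();
        }

        if ( !isset( $this->dirArr )) {
            // Top-level call: seed the list with the first level of directories.
            $this->dirArr = $subDirs;
        } else {
            // Nested call: append this level and return to the top-level loop.
            $this->dirArr = array_merge( $this->dirArr, $subDirs );
            return false;
        }

        // count() is re-evaluated on every pass, so directories appended by the
        // nested calls are visited too (a breadth-first walk).
        for( $i = 0; $i < count( $this->dirArr ); $i ++) {
            $this->getDirsRecursive( $this->dirArr [$i] );
        }

        // Re-key the list by the path relative to $dir.
        $tempDir = array();
        foreach ( $this->dirArr as $path ) {
            $indexArr = explode( $dir, $path, 2 );
            $tempDir[$indexArr[1]] = $path;
        }

        $this->dirArr = $tempDir;
    }
}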

Executing the same PHP code to retrieve the directory tree locally on the document file server is much, much faster (2 or 3 orders of magnitude), presumably because the local filesystem is caching the directory structure. I am forced to conclude that my problem is due to NFS.

I'm considering writing a simple webapp which will run on the file document webserver and provide realtime lookups of the directory structure via an API.
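As a rough sketch of what such a lookup might do on the file server itself, the function below walks a tree with PHP's SPL iterators and returns the directory paths relative to the root. The function name and the example path are hypothetical, and the HTTP layer (routing, authentication, caching) is left out entirely.

```php
<?php
// Sketch: collect every directory below $root, as paths relative to $root.
// Meant to run on the file server, where the type checks hit the local
// filesystem instead of going over NFS.
function listDirectories(string $root): array
{
    $root = rtrim($root, '/');
    $dirs = [];
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS),
        RecursiveIteratorIterator::SELF_FIRST // yield directories as well as files
    );
    foreach ($iterator as $info) {
        if ($info->isDir()) {
            $dirs[] = substr($info->getPathname(), strlen($root));
        }
    }
    sort($dirs);
    return $dirs;
}

// Hypothetical endpoint body; the path is an assumption for illustration:
// header('Content-Type: application/json');
// echo json_encode(listDirectories('/srv/documents/some-merchant'));
```

The CMS would then make one HTTP request per tree instead of one NFS round trip per directory entry.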

I'd appreciate any thoughts or suggestions.

An alternative solution: prefix all directory names with some string. Then, when you get the list of files, you can tell which entries are actually directories by checking whether their names start with that string. That way you avoid is_dir() completely.
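A minimal sketch of that check, assuming a hypothetical "d_" prefix. It only works if every directory is created with the prefix and no regular file ever uses it.

```php
<?php
// Sketch: separate directories from files by naming convention alone,
// so no per-entry is_dir() call (and no extra NFS request) is needed.
// The "d_" prefix is a hypothetical convention, not something PHP or NFS provides.
function splitByPrefix(array $names, string $prefix = 'd_'): array
{
    $dirs = [];
    $files = [];
    foreach ($names as $name) {
        if ($name === '.' || $name === '..') {
            continue; // skip the self and parent entries returned by scandir()
        }
        if (strncmp($name, $prefix, strlen($prefix)) === 0) {
            $dirs[] = $name;
        } else {
            $files[] = $name;
        }
    }
    return ['dirs' => $dirs, 'files' => $files];
}

// Entries shaped like the output of scandir() on one directory.
$entries = ['.', '..', 'd_banners', 'd_products', 'logo.png'];
$split = splitByPrefix($entries);
```

Recursing then means calling scandir() only on the names in `$split['dirs']`.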

Old question, but current problem for me.

One solution:

On your server, or better on the storage server (much, much faster), run tree (https://linux.die.net/man/1/tree) with -X (XML output), either on every directory or once on the top directory, and send the output to a ".dirStructure.xml" file (with a leading . so you can exclude it from listings).

e.g. tree -x -f -q -s -D --dirsfirst -X

Then make your script load this structure and use it to display tree structure. You can make this file for every merchant or one global one and just traverse it to find merchant.
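Loading that file could look roughly like the sketch below. It assumes the XML consists of nested `<directory name="...">` elements, which is what tree typically emits with -X, and that the SimpleXML extension is available. The function name and the sample paths are made up for illustration.

```php
<?php
// Sketch: pull all directory paths out of XML produced by `tree -X`.
// Assumption: directories appear as nested <directory name="..."> elements.
function dirsFromTreeXml(string $xml): array
{
    $doc = simplexml_load_string($xml);
    if ($doc === false) {
        return []; // unparsable input: report no directories
    }
    $dirs = [];
    // "//directory" matches <directory> elements at any depth.
    foreach ($doc->xpath('//directory') as $node) {
        $dirs[] = (string) $node['name'];
    }
    return $dirs;
}

// Hand-written sample in the assumed shape (with -f, names are full paths).
$sample = <<<XML
<tree>
  <directory name="/srv/shop">
    <directory name="/srv/shop/images">
      <file name="/srv/shop/images/a.jpg"/>
    </directory>
    <file name="/srv/shop/index.html"/>
  </directory>
</tree>
XML;

$dirs = dirsFromTreeXml($sample);
```

In the real script the string would come from `file_get_contents()` on the generated file, so the CMS reads one file over NFS instead of checking every entry.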

You can run it via cron every minute, or create an API that triggers it on the storage machine.

You can update this xml when changing files.

No need for a database.

You can also monitor changes to the directory on the storage side and recreate the XML every time something changes. See "How to monitor a complete directory tree for changes in Linux?": https://superuser.com/questions/181517
