
A better way to retrieve a directory tree from a 50,000+ files NFS mounted directory

I've been brought in to work on an existing CMS and file-management web application that provides merchants with a management interface for their online webshops. The management application is developed in PHP.

When website users are viewing the webshops, the page assets (mainly images in nested folder paths) are referenced directly from the HTML of the webshops and are served directly from a web server which is separate from the CMS system.

But in order to list, search, and allow navigation of the files (i.e. the file-management part), the CMS application needs to be able to access the file/folder directory structure.

So we are mounting the document file server on the CMS server via Linux NFS. This works fairly well as long as the number of files in any specific merchant's directory tree is not too large (&lt;10000). However, some merchants have more than 100000 files in a nested directory tree. Walking a tree of that size just to get the directory structure can take more than 120 seconds.

Retrieving just the list of files in any one directory is quite fast, but the problem comes when we try to identify which of these "files" are actually directory entries, so we can recurse down the tree.

It seems that the PHP functions for checking the file type (either calling "is_dir" on each filepath retrieved with "readdir" or "scandir", or using "glob" with the GLOB_ONLYDIR flag) operate on each file individually, not in bulk. So there are thousands upon thousands of NFS calls being sent. From my research so far, this appears to be a limitation of NFS, not of PHP.

A stripped-down class showing just the function in question:

class clImagesDocuments {

    public $dirArr;

    function getDirsRecursive( $dir ) {

        if ( !is_dir( $dir )) {
            return false;
        }

        // First call: seed the list with the top-level subdirectories;
        // recursive calls just merge their results into it and return.
        if ( !isset( $this->dirArr )) {
            $this->dirArr = glob( $dir . "/*", GLOB_ONLYDIR );
        } else {
            $this->dirArr = array_merge( $this->dirArr, glob( $dir . "/*", GLOB_ONLYDIR ) );
            return false;
        }

        for( $i = 0; $i < sizeof( $this->dirArr ); $i ++) {
            $this->getDirsRecursive( $this->dirArr [$i] );
        }

        // Re-key the flat list by each path relative to the starting directory.
        $tempDir = array();
        for( $i = 0; $i < sizeof( $this->dirArr ); $i ++) {
            $indexArr = explode( $dir, $this->dirArr [$i] );
            $tempDir[$indexArr[1]] = $this->dirArr [$i];
        }

        $this->dirArr = $tempDir;
    }
}

Executing the same PHP code to retrieve the directory tree locally on the file document server is much, much faster (2 or 3 orders of magnitude), presumably because the local filesystem is caching the directory structure. This leads me to believe that my problem is due to NFS.

I'm considering writing a simple webapp which will run on the file document web server and provide real-time lookups of the directory structure via an API.
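A minimal sketch of what such a lookup endpoint could do, assuming a hypothetical document root and a `listSubdirs()` helper (both names are illustrative, not part of the existing application). Because it runs on the document server itself, the directory check is a local stat, and the CMS makes one HTTP call instead of thousands of NFS calls:

```php
<?php
// Sketch: return the subdirectory names of $rel under $base, or null if the
// resolved path escapes the document root (basic path-traversal guard).
function listSubdirs(string $base, string $rel): ?array
{
    $baseReal = realpath($base);
    $path     = realpath($base . '/' . $rel);
    if ($baseReal === false || $path === false
        || strncmp($path, $baseReal, strlen($baseReal)) !== 0) {
        return null;
    }
    // glob() with GLOB_ONLYDIR is cheap here: the filesystem is local.
    return array_map('basename', glob($path . '/*', GLOB_ONLYDIR));
}

// Hypothetical endpoint usage (document root path is an assumption):
//   $dirs = listSubdirs('/var/www/documents', $_GET['dir'] ?? '');
//   if ($dirs === null) { http_response_code(400); exit; }
//   header('Content-Type: application/json');
//   echo json_encode($dirs);
```

The CMS would then call this endpoint once per directory level (or once per tree, with a recursive variant) rather than issuing per-file `is_dir` checks over NFS.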

I'd appreciate any thoughts or suggestions.

An alternative solution: you can prefix all directory names with some marker string, and when you get the list of files you can tell which entries are actually directories by checking whether they contain that string. You can completely avoid is_dir() that way.
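For illustration, a sketch of that filtering, assuming a hypothetical naming convention where every directory name starts with "d_" (the prefix and function name are inventions for this example, not part of the answer):

```php
<?php
// Sketch: given a directory listing (e.g. from scandir() or glob()),
// keep only the entries whose basename starts with the agreed prefix.
// No is_dir() call is made, so no extra NFS round-trips per entry.
function filterDirsByPrefix(array $entries, string $prefix = 'd_'): array
{
    $len = strlen($prefix);
    return array_values(array_filter(
        $entries,
        function (string $name) use ($prefix, $len): bool {
            return strncmp(basename($name), $prefix, $len) === 0;
        }
    ));
}
```

The trade-off is that the convention has to be enforced everywhere directories are created, and existing trees would need a one-off rename.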

Old question, but a current problem for me.

One solution:

On your server, or better on the storage server (much, much faster), run tree (https://linux.die.net/man/1/tree) with -X (XML output) on every directory, or once on the top directory, and send the output to a ".dirStructure.xml" file (named with a leading dot so you can ignore it in listings).

e.g. tree -x -f -q -s -D --dirsfirst -X

Then make your script load this structure and use it to display the tree. You can generate this file for every merchant, or one global file that you traverse to find the merchant.
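A possible sketch of loading that file in PHP with SimpleXML. tree's -X output nests `<directory name="...">` elements inside a `<tree>` root; the `.dirStructure.xml` path and the `collectDirs()` helper name are assumptions for this example:

```php
<?php
// Sketch: walk the XML produced by `tree -X` and collect every directory
// path it describes, without touching the filesystem at all.
function collectDirs(SimpleXMLElement $node, string $base, array &$out): void
{
    // Only <directory> children matter; <file> elements are skipped.
    foreach ($node->directory as $dir) {
        $path  = ($base === '' ? '' : $base . '/') . (string) $dir['name'];
        $out[] = $path;
        collectDirs($dir, $path, $out);
    }
}

// Hypothetical usage against the generated snapshot file:
//   $xml  = simplexml_load_file('.dirStructure.xml');
//   $dirs = [];
//   collectDirs($xml, '', $dirs);
```

Reading one cached XML file replaces the entire recursive walk over NFS; the snapshot is only as fresh as the last tree run.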

You can run it via cron every minute, or create an API to invoke it on the storage machine.

You can update this XML whenever files change.

No need for a database.

You can also monitor changes to the directory on the storage side and recreate the XML every time something changes: https://superuser.com/questions/181517

EDIT: How to monitor a complete directory tree for changes in Linux?


 