
Directory structure for a file host

I've got a simple file host going that gives files a unique id and just stores them in a directory. I've been told that this will cause problems in the future, and I'm wondering what things I should look out for to make sure it works smoothly into the future and beyond.

Also, is there a performance issue with forcing downloads by sending header information and readfile()? Would it be better to preserve file names and allow users to download directly instead of using a script?

Thanks

The kind of problems you have been told about very likely have to do with the performance impact of piling thousands and thousands of files in the same directory.

To circumvent this, do not store your files directly under one directory, but try to spread them out under subdirectories (buckets).

In order to achieve this, look at the ID (let's say 19873) of the file you are about to store, and store it under <uploads>/73/98/19873_<filename.ext>, where 73 is ID % 100, 98 is (ID / 100) % 100, and so on.

The above guarantees that you will have at most 100 subdirectories under <uploads>, and at most 100 further subdirectories underneath <uploads>/*. This will thin out the number of files per directory at the leaves significantly.

Two levels of subdirectories are typical enough, and represent a good balance between not wasting too much time resolving directory or file names to inodes, both in breadth (what happens when you have too many filenames to look through in the same directory, although modern filesystems such as ext3 will be very efficient here) and in depth (what happens when you have to go 20 subdirectories deep looking for your file). You may also elect to use larger or smaller values (10, 1000) instead of 100. Two levels with modulo 100 would be ideal for between 100k and 5M files.

Employ the same technique to calculate the full path of a file on the filesystem given the ID of a file that needs to be retrieved.
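
Here is a minimal PHP sketch of that scheme, assuming PHP 7+ (for intdiv); the uploads directory and example filename are illustrative, not part of the original answer:

```php
<?php
// Illustrative only: derive the bucketed path from the file's ID.
function bucketPath(int $id, string $filename, string $uploadsDir = '/var/www/uploads'): string
{
    $level1 = $id % 100;               // 19873 -> 73
    $level2 = intdiv($id, 100) % 100;  // 19873 -> 98
    return sprintf('%s/%02d/%02d/%d_%s', $uploadsDir, $level1, $level2, $id, $filename);
}

// Storing: create the bucket directories on demand, then move the upload in.
$target = bucketPath(19873, 'filename.ext');
if (!is_dir(dirname($target))) {
    mkdir(dirname($target), 0755, true); // recursive mkdir creates both levels
}
// move_uploaded_file($_FILES['file']['tmp_name'], $target);

// Retrieval recomputes exactly the same path from the ID
// (plus the stored original filename, e.g. looked up in a database).
echo bucketPath(19873, 'filename.ext'); // /var/www/uploads/73/98/19873_filename.ext
```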

Your first question really depends on the type of file system you are using. I'll assume ext3 without any journaling optimizations when answering.

First, yes, many files in one place could cause a problem when the number of files exceeds the system ARG_MAX. In other words, rm -rf * would quit while complaining about too many arguments. You might consider having directories A-Z / a-z and parking the files appropriately based on the value of the leftmost byte in their unique names.
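
A tiny PHP sketch of that first-character bucketing (the base directory is an assumption, and it presumes unique names that start with an alphanumeric character):

```php
<?php
// Illustrative only: pick the bucket from the leftmost byte of the name.
function firstCharBucket(string $uniqueName, string $base = '/var/www/uploads'): string
{
    $bucket = $uniqueName[0]; // e.g. 'a'...'z' or 'A'...'Z'
    return "$base/$bucket/$uniqueName";
}

echo firstCharBucket('a1b2c3d4.dat'); // /var/www/uploads/a/a1b2c3d4.dat
```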

Also, try to avoid processes that will open all of those files in a short period of time... cron jobs like 'updatedb' will cause problems once you really start filling up. Likewise, try to keep those directories out of the scope of commands like 'find'.

That leads to the other potential issue: buffers. How frequently are these files accessed? If there were 300 files in a given directory, would all of them be accessed at least once per 30 minutes? If so, you'll likely want to turn up the /proc/sys/vm/vfs_cache_pressure setting so that Linux reclaims more memory and makes it available to PHP/Apache/etc.
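
For example (the value 150 is purely illustrative; vfs_cache_pressure defaults to 100, and higher values make the kernel reclaim dentry/inode caches more aggressively):

```
# Set at runtime (the file lives under /proc/sys/vm/):
sysctl -w vm.vfs_cache_pressure=150

# Or persist it across reboots in /etc/sysctl.conf:
# vm.vfs_cache_pressure = 150
```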

Finally, regarding readfile... I would suggest just using a direct download link. This avoids PHP having to stay alive during the course of the download.

Also, is there a performance issue with forcing downloads by sending header information and readfile()?

Yes, if you do it naively. A good file download script should (see the sketch after this list):

  • stream long files to avoid filling memory
  • support ETags and Last-Modified request/response headers to ensure caches continue to work
  • come up with reasonable Expires/Cache-Control settings
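
A rough PHP sketch of such a script, assuming PHP 7+. The $storageDir layout, the id parameter, and the placeholder download name are assumptions for illustration; a production script would also want input validation and Range support.

```php
<?php
// Illustrative only: a cache-aware, streaming download script.
$storageDir = '/var/www/realfiles';   // assumed storage location
$id   = (int) ($_GET['id'] ?? 0);     // assumed request parameter
$path = "$storageDir/$id.dat";

if (!is_file($path)) {
    http_response_code(404);
    exit;
}

$mtime = filemtime($path);
$size  = filesize($path);
$etag  = '"' . md5($id . '-' . $mtime . '-' . $size) . '"';

// Honour conditional requests so upstream caches keep working.
$ifNoneMatch     = $_SERVER['HTTP_IF_NONE_MATCH'] ?? '';
$ifModifiedSince = $_SERVER['HTTP_IF_MODIFIED_SINCE'] ?? '';
if ($ifNoneMatch === $etag ||
    ($ifModifiedSince !== '' && strtotime($ifModifiedSince) >= $mtime)) {
    http_response_code(304);
    exit;
}

header('ETag: ' . $etag);
header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $mtime) . ' GMT');
header('Cache-Control: public, max-age=86400'); // tune to taste
header('Expires: ' . gmdate('D, d M Y H:i:s', time() + 86400) . ' GMT');
header('Content-Type: application/octet-stream');
header('Content-Length: ' . $size);
// The real filename would come from your database; placeholder here.
header('Content-Disposition: attachment; filename="download.bin"');

// Stream in chunks rather than buffering the whole file: with output
// buffering active, a bare readfile() can pull the entire file into memory.
$fp = fopen($path, 'rb');
while (!feof($fp)) {
    echo fread($fp, 8192);
    flush();
}
fclose($fp);
```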

It still won't be as fast as the web server (which is typically written in C and heavily optimised for serving files, maybe even using OS kernel features for it), but it'll be much better.

Would it be better to preserve file names and allow users to download directly instead of using a script?

It would perform better, yes, but getting the security right is a challenge. See here for some discussion.

A compromise is to use a rewrite, so that the URL looks something like:

http://www.example.com/files/1234/Lovely_long_filename_that_can_contain_any_Unicode_character.zip

But it gets redirected internally to:

http://www.example.com/realfiles/1234.dat

and served (quickly) by the web server.
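
A minimal sketch of that internal mapping, assuming Apache with mod_rewrite and an .htaccess at the document root (the /files and /realfiles paths come from the example URLs above):

```
RewriteEngine On
# Internal rewrite, not an external redirect: the browser keeps the nice URL.
RewriteRule ^files/([0-9]+)/.+$ realfiles/$1.dat [L]
```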

If you're likely to have thousands of files, you should spread them among many subdirectories.

I suggest keeping the original filename, though you might need to mangle it to guarantee uniqueness. This helps when you are diagnosing problems.
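
A small PHP sketch of one way to do that mangling (the sanitising rule is an assumption; prefixing with the unique id guarantees uniqueness while keeping the name readable for diagnostics):

```php
<?php
// Illustrative only: keep the original name, made unique and filesystem-safe.
function mangleFilename(int $id, string $original): string
{
    $safe = preg_replace('/[^A-Za-z0-9._-]/', '_', $original);
    return $id . '_' . $safe;
}

echo mangleFilename(19873, 'quarterly report (final).pdf');
// 19873_quarterly_report__final_.pdf
```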

In my opinion, I suggest using some script to control abuse. I also suggest preserving file names, unless your script creates an index in a database relating them to their original state. You could also write a script with some Rewrite magic on it, adding another layer of security by not exposing the real name behind it (your unique ID) to the end user.
