
What is the best way to store downloaded files?

Original post: 2009-08-30 20:01:19. Tags: c#, caching, filesystems, webpage, flat-file

Sorry for the bad title.

I'm saving web pages. I currently use one XML file as an index. Each element contains the file's creation date (UTC) and the full URL (with query string and whatnot). The headers go in a separate file with a similar name plus a special extension.

However, at around 40k files (including header files), the XML is now 3.5 MB. Until recently I was reading the XML file, adding the new entry, and saving it again each time. Now I keep it in memory and save it every once in a while.

When I request a page, the URL is looked up using XPath on the XML file; if there is an entry, the file path is returned.

The directory structure is .\\www.host.com/randomFilename.randext

So I am looking for a better way.

I'm thinking:

One XML file per domain (incl. subdomains). But I feel this might be a hassle.
Using SVN. I just tested it, but I have no experience with large repositories. I would execute svn add "path to file" for every download, and commit when I'm done.
Creating a custom file system, where I can then include everything I want, e.g. POST data.
Generating a filename from the URL and somehow flattening the query string, but long query strings might be rejected by the OS. And if I keep it with the headers, I still need to keep track of multiple files mapped to each different query string. Hassle. And I don't want it to execute too slowly either.
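
One possible way around the filename length limit in the last option, sketched in Python for brevity: hash the full URL so that every query string maps to a fixed-length name. The function name `path_for_url`, the `.page` extension, and the use of SHA-1 are illustrative assumptions, not part of the current setup.

```python
import hashlib
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def path_for_url(url: str, root: str = ".") -> PurePosixPath:
    """Map a URL to <root>/<host>/<sha1-of-full-url>.page (hypothetical layout)."""
    host = urlsplit(url).hostname or "_nohost"
    # Hash the whole URL, query string included, so every distinct
    # query string gets its own fixed-length (40 hex chars) filename.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return PurePosixPath(root) / host / f"{digest}.page"


# A 5000-character query string still yields a short filename.
a = path_for_url("http://www.host.com/search?q=" + "x" * 5000)
b = path_for_url("http://www.host.com/search?q=other")
print(a.parent, len(a.name), a != b)
```

The hash is one-way, so an index is still needed to get from a stored file back to its URL.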

Multiple program instances will perform read/write operations, on different computers.

If I follow the directory/file method, I could in theory add a layer in between so that it uses DotNetZip on the fly. But then again, there is the query string problem.

I'm just looking for direction or experience here.

What I also want is the ability to keep a history of these files, so the local file is not overwritten and I can pick which version (by date) I want. That's why I tried SVN.

2 answers

I would recommend either a relational database or a version control system.

You might want to use SQL Server 2008's new FILESTREAM feature to store the files themselves in the database.

I would use two data stores, one for the raw files and another for indexes.

To store the flat files, I think Berkeley DB is a good choice. The key can be generated by MD5 or another hash function, and you can also compress the content of each file to save some disk space.
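
A minimal sketch of that idea, in Python for brevity. It uses the standard library's `dbm.dumb` module as a stand-in for Berkeley DB (an assumption made only so the example runs anywhere), and the helper names `url_key`, `put_page`, and `get_page` are made up for illustration.

```python
import dbm.dumb
import hashlib
import os
import tempfile
import zlib
from typing import Optional


def url_key(url: str) -> bytes:
    """Fixed-length key: hex MD5 of the full URL (query string included)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest().encode("ascii")


def put_page(db, url: str, body: bytes) -> bytes:
    """Store the compressed body under the hashed URL and return the key."""
    key = url_key(url)
    db[key] = zlib.compress(body)  # compress to save disk space
    return key


def get_page(db, url: str) -> Optional[bytes]:
    """Return the decompressed body, or None if the URL was never stored."""
    key = url_key(url)
    if key not in db:
        return None
    return zlib.decompress(db[key])


# Demo against a throwaway store in a temporary directory.
store_path = os.path.join(tempfile.mkdtemp(), "pages")
with dbm.dumb.open(store_path, "c") as db:
    put_page(db, "http://www.host.com/a?x=1", b"<html>hello</html>" * 100)
    page = get_page(db, "http://www.host.com/a?x=1")
    missing = get_page(db, "http://www.host.com/never-fetched")
```

Keying by URL alone keeps only the latest copy of a page. To keep history, the key would also need something like the fetch time.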

For indexes, you can use a relational database or a more sophisticated text search engine like Lucene.
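
As a rough sketch of the relational index, here is a Python example with the built-in `sqlite3` module standing in for whichever database is chosen. The table name `page_index`, its columns, and the helpers `record_fetch` and `lookup` are assumptions for illustration. Storing one row per fetch also covers the "pick a version by date" requirement from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a real index would live in a file or DB server
conn.execute(
    """
    CREATE TABLE page_index (
        url         TEXT NOT NULL,
        fetched_utc TEXT NOT NULL,   -- ISO 8601, so string order == time order
        store_key   TEXT NOT NULL,   -- key or path of the raw file in the other store
        PRIMARY KEY (url, fetched_utc)
    )
    """
)


def record_fetch(url, fetched_utc, store_key):
    """Add one row per download; older versions are never overwritten."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO page_index VALUES (?, ?, ?)",
            (url, fetched_utc, store_key),
        )


def lookup(url, as_of_utc=None):
    """Return the store key of the newest version, or the newest at/before as_of_utc."""
    if as_of_utc is None:
        row = conn.execute(
            "SELECT store_key FROM page_index WHERE url = ? "
            "ORDER BY fetched_utc DESC LIMIT 1",
            (url,),
        ).fetchone()
    else:
        row = conn.execute(
            "SELECT store_key FROM page_index WHERE url = ? AND fetched_utc <= ? "
            "ORDER BY fetched_utc DESC LIMIT 1",
            (url, as_of_utc),
        ).fetchone()
    return row[0] if row else None


record_fetch("http://www.host.com/a?x=1", "2009-08-01T10:00:00Z", "key-v1")
record_fetch("http://www.host.com/a?x=1", "2009-08-30T20:00:00Z", "key-v2")
latest = lookup("http://www.host.com/a?x=1")
older = lookup("http://www.host.com/a?x=1", as_of_utc="2009-08-15T00:00:00Z")
```

SQLite is used here only because it needs no server. With multiple program instances on different computers, a client/server database would probably be the safer choice.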