What is the best way to store downloaded files?
Sorry for the bad title.
I'm saving web pages. I currently use one XML file as an index. Each element contains the file's creation date (UTC) and the full URL (with query string and whatnot). The headers are stored in a separate file with a similar name but a special extension appended.
However, now that I'm at 40k files (including the header files), the XML is 3.5 MB. Until recently I would read the XML file, add the new entry, and save it again each time. Now I keep it in memory and save it every once in a while.
When I request a page, the URL is looked up using XPath on the XML file; if there is an entry, the file path is returned.
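To illustrate the lookup described above, here is a minimal Python sketch. The index layout is an assumption: the element name `entry` and the attributes `created`, `url`, and `path` are hypothetical, as are the sample URLs and file names. Because the index is already held in memory, the sketch builds a dictionary once instead of running an XPath scan for every request.

```python
import xml.etree.ElementTree as ET

# Hypothetical index layout; element and attribute names are assumptions.
INDEX_XML = """
<index>
  <entry created="2010-01-01T12:00:00Z" url="http://www.host.com/a?x=1" path="www.host.com/k3j2.dat"/>
  <entry created="2010-01-02T08:30:00Z" url="http://www.host.com/b" path="www.host.com/p9q1.dat"/>
</index>
"""

def build_lookup(xml_text):
    """Parse the index once and map each full URL to its stored file path."""
    root = ET.fromstring(xml_text)
    return {entry.get("url"): entry.get("path") for entry in root.iter("entry")}

lookup = build_lookup(INDEX_XML)
print(lookup.get("http://www.host.com/a?x=1"))    # path of the stored file
print(lookup.get("http://www.host.com/missing"))  # None when there is no entry
```

A dictionary lookup stays fast as the index grows, whereas an XPath query over 40k elements walks the whole document each time.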
The directory structure is .\\www.host.com/ randomFilename.randext
So I am looking for a better way.
I'm thinking:
Multiple program instances will perform read/write operations, on different computers.
If I follow the directory/file method, I could in theory add a layer in between so that it uses DotNetZip on the fly. But then again, there is the query string to deal with.
I'm just looking for direction or experience here.
What I also want is the ability to keep a history of these files, so that the local file is not overwritten and I can pick which version (by date) I want. That's why I tried SVN.
I would recommend either a relational database or a version control system.
You might want to use SQL Server 2008's new FILESTREAM feature to store the files themselves in the database.
I would use two data stores, one for the raw files and another for the indexes.
To store the flat files, I think Berkeley DB is a good choice. The key can be generated with MD5 or another hash function, and you can also compress the file content to save some disk space.
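A minimal sketch of that idea in Python, under stated assumptions: a plain dictionary stands in for the key-value store (Berkeley DB or any other store with a mapping-style interface could take its place), the key is the MD5 of the page content rather than of the URL, and zlib is one possible compressor. The helper names `put` and `get` are hypothetical.

```python
import hashlib
import zlib

def put(store, content: bytes) -> str:
    """Compress the content and store it under the MD5 hex digest of the raw bytes."""
    key = hashlib.md5(content).hexdigest()
    store[key] = zlib.compress(content)
    return key

def get(store, key: str) -> bytes:
    """Fetch and decompress the content stored under the given key."""
    return zlib.decompress(store[key])

# A dict is only a stand-in for the real key-value store (assumption).
store = {}
page = b"<html><body>" + b"hello " * 200 + b"</body></html>"
key = put(store, page)

assert get(store, key) == page
print(key, len(page), len(store[key]))  # key, raw size, compressed size
```

Hashing the content means two identical downloads share one stored copy, which may help when keeping many versions of a page that rarely changes. Hashing the URL instead would be the choice if each URL should map to exactly one record.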
For the indexes, you can use a relational database or a more sophisticated text search engine such as Lucene.
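As a rough sketch of the relational option, the index could hold one row per download, which also gives the per-date history the question asks for. Everything here is an assumption for illustration: the table name `pages`, its columns, the helpers `record` and `version_at`, and the sample keys. SQLite is used only to keep the example self-contained; with writers on several computers, a client/server database would be the more likely fit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pages (
        url        TEXT NOT NULL,
        fetched_at TEXT NOT NULL,  -- ISO 8601 UTC, so string order equals time order
        blob_key   TEXT NOT NULL,  -- key of the raw file in the other data store
        PRIMARY KEY (url, fetched_at)
    )
""")

def record(url, fetched_at, blob_key):
    """Add one downloaded version; earlier versions of the URL are kept."""
    with conn:
        conn.execute("INSERT INTO pages VALUES (?, ?, ?)", (url, fetched_at, blob_key))

def version_at(url, as_of="9999-12-31T23:59:59Z"):
    """Return the key of the newest version fetched at or before as_of, or None."""
    row = conn.execute(
        "SELECT blob_key FROM pages WHERE url = ? AND fetched_at <= ? "
        "ORDER BY fetched_at DESC LIMIT 1",
        (url, as_of),
    ).fetchone()
    return row[0] if row else None

record("http://www.host.com/a?x=1", "2010-01-01T12:00:00Z", "key-v1")
record("http://www.host.com/a?x=1", "2010-02-01T12:00:00Z", "key-v2")

print(version_at("http://www.host.com/a?x=1"))                          # latest version
print(version_at("http://www.host.com/a?x=1", "2010-01-15T00:00:00Z"))  # version as of mid-January
```

Storing the full URL as a column sidesteps the query-string problem, since the URL never has to be turned into a file name.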