Check physical existence of files referenced in DB table

We have one rather large table containing document info together with file paths pointing to files on the file system. After a couple of years we noticed that there are files on disk which are not referenced in the DB table, and vice versa.

Since I'm currently learning Clojure, I thought it would be nice to write a small utility which can find the diff between the DB and the file system. Naturally, since I'm a beginner, I got stuck: there are more than 600,000 documents, and obviously I need a more performant and less memory-consuming solution :)

My first idea was to generate a flattened list of all files in the filesystem tree and compare it with the list from the DB: if a file referenced in the DB doesn't exist, put it in a separate "non-existing" list, and if a file exists on the HDD but not in the DB, move it to some dump directory.

Any ideas?

As a sketch, here's how you could check the filesystem against the database, in chunks of whatever size you're happy with:

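(->> (file-seq (java.io.File. "/"))   ; lazily walk the whole tree
     (remove (memfn isDirectory))      ; keep files only
     (partition-all 20)                ; chunks of 20; unlike partition, keeps a final short chunk
     (map (fn [files] (printf "Checking %d files against db...\n" (count files))))
     (take 2))                         ; just the first two chunks, for demonstration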

(Checking 20 files against db...
Checking 20 files against db...
nil nil)

Instead of using printf, do some kind of database check against each chunk of files.
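
For instance, the printf step could be swapped for a function that asks the database which paths in a chunk it knows about and keeps the rest. The following is only a minimal sketch: `known-paths`, `unreferenced`, `file-paths` and `fake-db` are made-up names, and the DB lookup is faked with an in-memory set so that the code runs on its own. A real `known-paths` would presumably run one query per chunk against your table.

```clojure
(require '[clojure.set :as set])

;; Stand-in for the real table; hypothetical data for illustration only.
(def fake-db #{"/docs/a.pdf" "/docs/b.pdf"})

(defn known-paths
  "Hypothetical DB lookup: returns the subset of `paths` present in the table.
  A real version would issue one query per chunk instead."
  [paths]
  (set/intersection fake-db (set paths)))

(defn unreferenced
  "Lazily yields the paths from `paths` that `lookup` does not report as known,
  asking about `chunk-size` paths at a time."
  [lookup chunk-size paths]
  (->> paths
       (partition-all chunk-size)
       (mapcat (fn [chunk]
                 (let [known (lookup chunk)]
                   (remove known chunk))))))

(defn file-paths
  "Lazy seq of the absolute paths of all regular files under `root`."
  [root]
  (->> (file-seq (java.io.File. root))
       (filter #(.isFile ^java.io.File %))
       (map #(.getAbsolutePath ^java.io.File %))))
```

Something like `(unreferenced known-paths 500 (file-paths "/some/root"))` would then give a lazy seq of files that are on disk but not in the DB, where the root path and the chunk size of 500 are placeholders. Since everything is lazy, only about one chunk needs to be in memory at a time, as long as you don't hold on to the head of the result.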

I would suggest one of three options, depending on whether you care more about performance or memory:

  1. Memory intensive: Use a recursive method calling File.listFiles to put all the files into a list. Then compare the list against your DB.

  2. IO intensive: Recursively check each file against the DB, one at a time.

  3. Intermediate: Read all the files in one dir and compare them against the DB. Recurse into any sub-dirs and repeat. This makes the same number of filesystem IO calls as option 1, but holds only one branch plus one dir's worth of file paths in memory at any one time.
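
Option 3 might look roughly like the following in Clojure. This is only a sketch under assumptions: `check-tree` and the `check-dir!` callback are made-up names, and the callback is where the per-directory comparison against the DB would go. It does not guard against symlink cycles.

```clojure
(defn check-tree
  "Walks `dir` depth-first. For each directory, passes the paths of the files
  directly inside it to `check-dir!`, then recurses into the sub-directories.
  Only the listings along the current branch are held in memory."
  [^java.io.File dir check-dir!]
  (let [entries (seq (.listFiles dir)) ; nil if dir is unreadable or not a directory
        files   (filter #(.isFile ^java.io.File %) entries)
        subdirs (filter #(.isDirectory ^java.io.File %) entries)]
    (when (seq files)
      ;; one callback invocation (e.g. one DB query) per directory
      (check-dir! (mapv #(.getPath ^java.io.File %) files)))
    (doseq [d subdirs]
      (check-tree d check-dir!))))
```

Called as, say, `(check-tree (java.io.File. "/some/root") println)`, it prints one vector of paths per directory; replace `println` with a function that looks those paths up in the DB. The root path here is a placeholder.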
