
Efficient design to store lookup table for files in directories

Let's say I have three directories dir1, dir2 & dir3, with thousands of files in each. Each file has a unique name with no pattern.

Now, given a filename, I need to find which of the three directories it's in. My first thought was to create a dictionary with the filename as key and the directory as the value, like this:

{'file1':'dir1', 
 'file2':'dir3',
 'file3':'dir1', ... }

But seeing as there are only three unique values, this seems a bit redundant and takes up space.

Is there a better way to implement this? What if I can compromise on space but need faster lookup?

A simple way to solve this is to query the file system directly instead of caching all the filenames in a dict. This will save a lot of space, and will probably be fast enough if there are only a few hundred directories to search.

Here is a simple function that does that:

import os

def find_directory(filename, directories):
    # Return the first directory containing the file, or None.
    for directory in directories:
        path = os.path.join(directory, filename)
        if os.path.exists(path):
            return directory
    return None
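
For example, a hypothetical call using the directory names from the question:

directory = find_directory('file2', ['dir1', 'dir2', 'dir3'])
# -> 'dir3' if file2 lives there, or None if no directory contains it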

On my Linux system, when searching around 170 directories, it takes about 0.3 seconds to do the first search, and then only about 0.002 seconds thereafter. This is because the OS does file-caching to speed up repeated searches. But note that if you used a dict to do this caching in Python, you'd still have to pay a similar initial cost.
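
If you want to reproduce this kind of measurement, here is a rough sketch (the directory names and filename are placeholders, and the cold/warm difference only shows up against a real directory tree):

import time

directories = [f"dir{i}" for i in range(170)]  # placeholder names

for attempt in ("cold", "warm"):
    start = time.perf_counter()
    find_directory("some_file", directories)
    elapsed = time.perf_counter() - start
    print(f"{attempt} lookup: {elapsed:.4f}s")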

Of course, the subsequent dict lookups would be faster than querying the file system directly. But do you really need that extra speed? To me, two thousandths of a second seems easily "fast enough" for most purposes. And you get the extra benefit of never needing to refresh the file cache (because the OS does it for you).

PS:

I should probably point out that the above timings are worst-case: that is, I dropped all the system file-caches first, and then searched for a filename that was in the last directory.
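
For reference, dropping the caches on Linux can be done roughly like this (an assumption about my test setup rather than part of the answer; it is Linux-specific and requires root):

import subprocess

# Flush dirty pages to disk, then ask the kernel to evict the page
# cache plus dentries and inodes (Linux-only, requires root).
subprocess.run(["sync"], check=True)
with open("/proc/sys/vm/drop_caches", "w") as f:
    f.write("3")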

You can store the index as a dict of sets. It might be more memory-efficient.

index = {
    "dir1": {"f1", "f2", "f3", "f4"},
    "dir2": {"f3", "f4"},
    "dir3": {"f5", "f6", "f7"},
}

filename = "f4"
for dirname, files in index.items():
    if filename in files:
        print(dirname)
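
If it helps, here is a minimal sketch of building such an index from the file system (assuming the directories actually exist on disk):

import os

def build_index(directories):
    # Map each directory to the set of filenames it contains.
    return {d: set(os.listdir(d)) for d in directories}

index = build_index(["dir1", "dir2", "dir3"])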

With thousands of files, you'll barely see any difference between this method and your inverted index.

Also, repeated strings in Python can be interned to save memory. Sometimes CPython interns short strings itself.
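
To illustrate, sys.intern guarantees that equal strings share a single object, so an index that repeats the same directory name many times keeps only one copy of it:

import sys

a = sys.intern("dir1")
b = sys.intern("dir1")
assert a is b  # both names refer to the same interned string object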
