Shortening long URLs with a hash?

I've got a file cache, with the files being downloaded from different URLs. I'd like to save each file under the name of its URL. These names can be quite long though, and I'm on a device using a FAT32 file system, so the long names are eating up resources well before I run out of actual disk space.

I'm looking for a way to shorten the filenames and have gotten suggestions to hash the strings. But I'm not sure if hashes are guaranteed to be unique for two different strings. It would be bad if I accidentally fetched the wrong image because two different URLs came up with the same hash value.

Thanks

You could generate a UUID for each URL and use it as the file name.

UUIDs are unique (or "practically unique") and are 36 characters long, so I guess the file name wouldn't be a problem.

As of version 5, the JDK ships with a class to generate UUIDs (java.util.UUID). You could use randomly generated UUIDs if there's a way to associate them with the URLs, or you could use name-based UUIDs. Name-based UUIDs are always the same for the same input, so the following is always true:

String url = ...
UUID urlUuid = UUID.nameUUIDFromBytes(url.getBytes());
assertTrue(urlUuid.equals(UUID.nameUUIDFromBytes(url.getBytes())));
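
For instance, a name-based UUID can be used directly as the cache file name. A minimal sketch (the cache directory and URL below are made-up examples, not from the question):

import java.io.File;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class UuidCacheName {
    public static void main(String[] args) {
        // Hypothetical URL; any string works as input.
        String url = "http://example.com/images/some/very/long/path/picture.png";
        // Name-based (type 3) UUID: the same URL always yields the same UUID.
        UUID uuid = UUID.nameUUIDFromBytes(url.getBytes(StandardCharsets.UTF_8));
        // 36-character name, regardless of how long the URL is.
        File cached = new File("/cache", uuid.toString());
        System.out.println(cached.getPath());
    }
}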

There's no (shortening) hash which can guarantee different hashes for each input. It's simply not possible.

The way I usually do it is by saving the original name at the beginning (e.g., the first line) of the cache file. So to find a file in the cache you do it like this:

  • Hash the URL
  • Find the file corresponding to that hash
  • Check the first line: if it matches the full URL, it's the right file
  • The cached content is everything from line two onward

You can also consider saving the URL->file mapping in a database.
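
A minimal sketch of that lookup, assuming SHA-256 hex digests as file names and a flat cache directory (both choices are mine, just for illustration):

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class UrlCache {
    private final File cacheDir = new File("/cache"); // hypothetical location

    // Hash the URL to get the candidate file name.
    static String hash(String url) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(url.getBytes(StandardCharsets.UTF_8));
        return String.format("%064x", new BigInteger(1, digest));
    }

    // Returns the cache file if its first line matches the full URL, else null.
    File lookup(String url) throws IOException, NoSuchAlgorithmException {
        File candidate = new File(cacheDir, hash(url));
        if (!candidate.exists()) {
            return null;
        }
        try (BufferedReader reader = new BufferedReader(new FileReader(candidate))) {
            String firstLine = reader.readLine();
            // The cached payload is everything after the first line.
            return url.equals(firstLine) ? candidate : null;
        }
    }
}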

But I'm not sure if the hashes are guaranteed to be unique for two different strings.

They very much aren't (and cannot be, due to the pigeonhole principle). But if the hash is sufficiently long (at least 64 bits) and well-distributed (ideally a cryptographic hash), then the likelihood of a collision becomes so small that it's not worth worrying about.

As a rough guideline, collisions will become likely once the number of files approaches the square root of the number of possible different hashes (birthday paradox). So for a 64-bit hash (about 11 characters when base-64 encoded), you have about a 50% chance of a single collision once you have around 4 billion files.

You'll have to decide whether that is an acceptable risk. You can reduce the chance of collision by making the hash longer, but of course at some point that will mean the opposite of what you want.
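
A back-of-the-envelope version of that estimate (my own arithmetic, using the rough birthday bound p ≈ n² / 2^(b+1) for n files and a b-bit hash):

public class CollisionEstimate {
    // Rough birthday-paradox bound; only a crude estimate, reasonable for small p.
    static double collisionProbability(double files, int hashBits) {
        return (files * files) / Math.pow(2, hashBits + 1);
    }

    public static void main(String[] args) {
        System.out.println(collisionProbability(4e9, 64));  // ~0.4: a real risk
        System.out.println(collisionProbability(4e9, 128)); // ~2e-20: negligible
    }
}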

Currently, the SHA-1 algorithm is recommended. There are no known ways to intentionally provoke collisions for this algorithm, so you should be safe. Provoking collisions with two pieces of data that have common structure (such as the http:// prefix) is even harder. If you save this stuff after you get an HTTP 200 response, then the URL obviously fetched something, so getting two distinct, valid URLs with the same SHA-1 hash really should not be a concern.

If it's of any reassurance, Git uses it to identify all objects, commits and folders in the source code repository. I've yet to hear of anyone with a collision in the object store.
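
A minimal sketch of that approach in Java (the class name and the choice of a hex-encoded SHA-1 digest as the file name are mine, just for illustration):

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha1FileName {
    // Turn an arbitrarily long URL into a fixed-length, filesystem-safe name.
    static String sha1Hex(String url) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-1")
                .digest(url.getBytes(StandardCharsets.UTF_8));
        return String.format("%040x", new BigInteger(1, digest)); // 40 hex characters
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        System.out.println(sha1Hex("http://example.com/a/very/long/url?with=query&params=1"));
    }
}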

Hashes are not guaranteed to be unique, but the chance of a collision is vanishingly small.

If your hash is, say, 128 bits, then the chance of a collision for any pair of entries is 1 in 2^128. By the birthday paradox, even if you had 10^18 entries in your table, the chance of a collision is still well under 1%, so you don't really need to worry about it. If you are extra paranoid then increase the size of the hash by using SHA-256 or SHA-512.

Obviously you need to make sure that the hashed representation actually takes up less space than the original filename. Base-64 encoded strings represent 6 bits per character so you can do the math to find out if it's even worth doing the hash in the first place.
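
As a quick worked example of that math (the figures here are my own, not from the answer): a b-bit hash needs ceil(b / 6) base-64 characters, so hashing only helps when typical URLs are longer than that.

public class NameLengthMath {
    public static void main(String[] args) {
        int[] hashBits = {64, 128, 160, 256};
        for (int bits : hashBits) {
            int base64Chars = (bits + 5) / 6; // 6 bits per base-64 character, rounded up
            System.out.println(bits + "-bit hash -> about " + base64Chars + " characters");
        }
        // Prints 11, 22, 27 and 43 characters respectively.
    }
}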

If your file system barfs because the names are too long then you can create prefix subdirectories for the actual storage. For example, if a file maps to the hash ABCDE then you can store it as /path/to/A/B/CDE, or maybe /path/to/ABC/DE, depending on what works best for your file system.

Git is a good example of this technique in practice.
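
A minimal sketch of that prefix-subdirectory layout (the two-character split mirrors Git's objects/ directory, but any split that suits your file system works):

import java.io.File;

public class PrefixDirectories {
    // Split a hash like "ab54dce8f1": the first two characters become the
    // subdirectory, the rest becomes the file name, e.g. /path/to/ab/54dce8f1.
    static File cachePath(File root, String hash) {
        return new File(new File(root, hash.substring(0, 2)), hash.substring(2));
    }

    public static void main(String[] args) {
        System.out.println(cachePath(new File("/path/to"), "ab54dce8f1"));
        // prints /path/to/ab/54dce8f1
    }
}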

What you can do is save the files by an index and use an index file to find the location of the actual file.

in the directory you have:

index.txt
file1
file2
...
etc.

and in index.txt you use some data structure to find the filenames efficiently (or replace it with a DB).
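
A minimal sketch of that index-file idea, using java.util.Properties as the "some data structure" (entirely my choice; a database would work just as well):

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Properties;

public class FileIndex {
    private final File dir;
    private final Properties index = new Properties(); // URL -> short file name

    FileIndex(File dir) throws IOException {
        this.dir = dir;
        File indexFile = new File(dir, "index.txt");
        if (indexFile.exists()) {
            try (FileInputStream in = new FileInputStream(indexFile)) {
                index.load(in);
            }
        }
    }

    // Return the short local name for a URL, assigning a new one if needed.
    String fileFor(String url) throws IOException {
        String name = index.getProperty(url);
        if (name == null) {
            name = "file" + (index.size() + 1); // simple sequential name, fine for a sketch
            index.setProperty(url, name);
            try (FileOutputStream out = new FileOutputStream(new File(dir, "index.txt"))) {
                index.store(out, "URL -> file name index");
            }
        }
        return name;
    }
}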

Look at my comment.
One possible solution (there are many) is creating a local file (SQLite? XML? TXT?) in which you store a pair (file_id, file_name), so you can save your downloaded files with their unique ID as the filename.
Just an idea, not the best...
