
Hash: How does it work internally?

This might sound like a very vague question up front, but it is not. I have gone through the hash function description on the wiki, but it did not help me much in understanding the topic.

I am looking for simple answers to a rather complex topic, hashing. Here are my questions:

  1. What do we mean by hashing? How does it work internally?
  2. What algorithm does it follow?
  3. What is the difference between HashMap, HashTable and HashList?
  4. What do we mean by 'Constant Time Complexity', and why do different implementations of the hash give constant-time operations?
  5. Lastly, why are Hash and LinkedList asked about in most interview questions? Is there any specific logic behind it for testing the interviewee's knowledge?

I know my question list is long, but I would really appreciate some clear answers to these questions, as I really want to understand the topic.

  1. Here is a good explanation about hashing. For example, if you want to store the string "Rachel", you apply a hash function to that string to get a memory location: myHashFunction(key: "Rachel" value: "Rachel") --> 10 . The function may return 10 for the input "Rachel", so assuming you have an array of size 100, you store "Rachel" at index 10. If you want to retrieve that element, you just call GetmyHashFunction("Rachel") and it will return 10. Note that in this example the key is "Rachel" and the value is "Rachel", but you could use another value for that key, for example a birth date or an object. Your hash function may return the same memory location for two different inputs; in that case you have a collision. If you are implementing your own hash table, you have to take care of this, for example by using a linked list or other techniques.

  2. Here are some common hash functions used. A good hash function satisfies the following: each key is equally likely to hash to any of the n memory slots, independently of where any other key has hashed to. One of the methods is called the division method. We map a key k into one of n slots by taking the remainder of k divided by n: h(k) = k mod n . For example, if your array size is n = 100 and your key is the integer k = 15, then h(k) = 15 mod 100 = 15 .

  3. Hashtable is synchronized and HashMap is not. HashMap allows null as a key, but Hashtable does not.

  4. The purpose of a hash table is to have O(1) constant time complexity for adding and getting elements. In a linked list of size N, if you want to get the last element you have to traverse the whole list until you reach it, so the complexity is O(N). With a hash table, if you want to retrieve an element you just pass the key, and the hash function tells the table where the desired element is stored. If the hash function is well implemented, this happens in constant time O(1) on average. This means you don't have to traverse all the elements stored in the hash table. You will get the element "instantly".

  5. Of course, a programmer, developer, or computer scientist needs to know about data structures and complexity =)
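
The ideas in points 1, 2 and 4 above can be sketched together as a tiny hash table that resolves collisions with linked lists. This is only an illustration under stated assumptions: the class name ChainedTable and its methods are hypothetical, the integer key k is taken from String.hashCode(), and it is not how java.util.HashMap is actually written.

```java
import java.util.LinkedList;

// Hypothetical minimal hash table with separate chaining (not java.util.HashMap).
class ChainedTable {
    // One key/value pair stored in a bucket.
    private static final class Entry {
        final String key;
        String value;
        Entry(String key, String value) { this.key = key; this.value = value; }
    }

    private final LinkedList<Entry>[] buckets;

    @SuppressWarnings("unchecked")
    ChainedTable(int size) {
        buckets = new LinkedList[size];
        for (int i = 0; i < size; i++) buckets[i] = new LinkedList<>();
    }

    // Division method: h(k) = k mod n, with k taken from String.hashCode().
    // Math.floorMod keeps the result in [0, n) even when hashCode() is negative.
    int indexFor(String key) {
        return Math.floorMod(key.hashCode(), buckets.length);
    }

    void put(String key, String value) {
        LinkedList<Entry> bucket = buckets[indexFor(key)];
        for (Entry e : bucket) {
            if (e.key.equals(key)) { e.value = value; return; } // overwrite existing key
        }
        bucket.add(new Entry(key, value)); // new key; shares the list on a collision
    }

    String get(String key) {
        for (Entry e : buckets[indexFor(key)]) { // scans one bucket, not the whole table
            if (e.key.equals(key)) return e.value;
        }
        return null; // key absent
    }

    public static void main(String[] args) {
        ChainedTable table = new ChainedTable(100);
        table.put("Rachel", "1990-04-12"); // made-up birth date, for illustration only
        System.out.println(table.indexFor("Rachel") + " -> " + table.get("Rachel"));
    }
}
```

A lookup touches only the one bucket chosen by indexFor, which is why the cost stays roughly constant as long as the keys spread evenly over the buckets.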

  1. Hashing means generating a (hopefully) unique number that represents a value.
  2. Different types of values (Integer, String, etc.) use different algorithms to compute a hash code.
  3. HashMap and HashTable are maps; they are a collection of unique keys, each of which is associated with a value.
    Java doesn't have a HashList class. A HashSet is a set of unique values.
  4. Getting an item from a hashtable is constant-time with regard to the size of the table.
    Computing a hash is not necessarily constant-time with regard to the value being hashed.
    For example, computing the hash of a string involves iterating over the string, and isn't constant-time with regard to the size of the string.
  5. These are things that people ought to know.
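
To make points 2 and 4 above concrete, here is a small sketch that recomputes a string's hash code by hand. It relies on the formula documented for String.hashCode() and on the documented fact that an Integer's hash code is its own int value; the class and method names are made up for this example.

```java
class StringHashDemo {
    // Recomputes the value that String.hashCode() is documented to return:
    // s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], using int arithmetic.
    static int manualHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) { // one step per character: O(length)
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(manualHash("Rachel") == "Rachel".hashCode()); // true
        System.out.println(Integer.valueOf(42).hashCode());              // 42
    }
}
```

The loop runs once per character, so the cost of hashing grows with the length of the string even though the table lookup that follows does not depend on the table size.
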
  1. Hashing is transforming a given entity (in Java terms, an object) into some number (or sequence). The hash function is not reversible, i.e. you can't obtain the original object from the hash. Internally it is implemented (for java.lang.Object) by the JVM, typically from some memory address.

  2. The JVM address is an unimportant detail. Each class can override the hashCode() method with its own algorithm. Modern Java IDEs can generate good hashCode methods.

  3. Conceptually, a hashtable and a hashmap are the same thing. They store key-value pairs, where the keys are hashed. Hash lists and hash sets don't store values, only keys.

  4. Constant time means that no matter how many entries there are in the hashtable (or any other hash-based collection), the number of operations needed to find a given object by its key is constant. That is, 1 or close to 1.

  5. This is basic computer science material, and it is assumed that everyone is familiar with it. I think Google has stated that the hashtable is the most important data structure in computer science.
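
As a rough illustration of points 2 and 3 above, the sketch below overrides hashCode() (together with equals(), which hash-based collections also rely on) and then uses the class as a key. Point is a hypothetical class invented for this example, and Objects.hash is just one convenient way to combine fields.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

// Hypothetical value class used only for this illustration.
final class Point {
    final int x;
    final int y;

    Point(int x, int y) { this.x = x; this.y = y; }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return x == p.x && y == p.y;
    }

    // Equal points must report equal hash codes, or hash-based lookups miss them.
    @Override
    public int hashCode() {
        return Objects.hash(x, y);
    }

    public static void main(String[] args) {
        Map<Point, String> labels = new HashMap<>(); // keys are hashed, values are stored
        labels.put(new Point(1, 2), "start");
        System.out.println(labels.get(new Point(1, 2))); // start

        Set<Point> seen = new HashSet<>(); // keys only, no values
        seen.add(new Point(1, 2));
        seen.add(new Point(1, 2)); // equal to the first one, so it is ignored
        System.out.println(seen.size()); // 1
    }
}
```

Without the two overrides, the second new Point(1, 2) would be a different key as far as the map and the set are concerned.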

I'll try to give simple explanations of hashing and of its purpose.

First, consider a simple list. Each operation (insert, find, delete) on such a list would have O(n) complexity, meaning that you have to traverse the whole list (or half of it, on average) to perform the operation.

Hashing is a very simple and effective way of speeding this up: suppose we split the whole list into a set of small lists. Items in one such small list have something in common, and this something can be deduced from the key. For example, given a list of names, we could use the first letter to choose which small list to look in. By partitioning the data by the first letter of the key, we obtain a simple hash that splits the whole list into ~30 smaller lists, so that each operation takes roughly n/30 steps instead of n.

However, the results are not that perfect. First, there are only ~30 lists, and we can't change that number. Second, some letters are used more often than others, so the list for Y or Z will be much smaller than the list for A. For better results, we want a way to partition the items into sets of roughly the same size. How could we solve that? This is where hash functions come in. A hash function can create an arbitrary number of partitions with roughly the same number of items in each. In our example with names, we could use something like

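#define NUMBER_OF_PARTITIONS 64  /* example value; pick any positive bucket count */

/* Polynomial string hash: returns a bucket index in [0, NUMBER_OF_PARTITIONS). */
unsigned int hash(const char *str) {
    unsigned int rez = 0;  /* unsigned, so overflow wraps around and never goes negative */
    for (; *str != '\0'; str++)  /* walk the string once instead of calling strlen each pass */
        rez = rez * 37 + (unsigned char)*str;
    return rez % NUMBER_OF_PARTITIONS;
}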

This gives a fairly even distribution and a configurable number of sets (also called buckets).
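
To see the partitioning in action, the sketch below counts how many names land in each bucket. It is a hypothetical Java counterpart of the polynomial hash above, with made-up sample names; Math.floorMod is used because Java's int arithmetic can overflow to a negative number, which a plain % would turn into a negative index.

```java
class BucketDemo {
    // Java counterpart of the polynomial hash above; floorMod guards against
    // the negative values that int overflow can produce.
    static int bucketOf(String name, int partitions) {
        int rez = 0;
        for (int i = 0; i < name.length(); i++) {
            rez = rez * 37 + name.charAt(i);
        }
        return Math.floorMod(rez, partitions);
    }

    // Counts how many names land in each bucket.
    static int[] bucketSizes(String[] names, int partitions) {
        int[] sizes = new int[partitions];
        for (String name : names) {
            sizes[bucketOf(name, partitions)]++;
        }
        return sizes;
    }

    public static void main(String[] args) {
        String[] names = {"Alice", "Adam", "Anna", "Bob", "Yves", "Zoe"}; // sample data
        System.out.println(java.util.Arrays.toString(bucketSizes(names, 4)));
    }
}
```

Changing the second argument of bucketSizes changes the number of buckets, which is exactly what the first-letter scheme could not do.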

What do we mean by Hashing, how does it work internally?

Hashing is the transformation of a string into a shorter fixed-length value or key that represents the original string. It is not indexing. The heart of hashing is the hash table. It contains an array of items. A hash table computes an index from the data item's key and uses this index to place the data into the array.

What algorithm does it follow?

In simple words, most hash algorithms work on the logic "index = f(key, arrayLength)".
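
Two common shapes for such an f are sketched below, assuming the key's hashCode() is the starting point. The class and method names are hypothetical; the remainder form works for any positive array length, while the bit-mask form is only valid when the length is a power of two.

```java
class IndexDemo {
    // Remainder form: works for any positive array length.
    static int indexByMod(Object key, int arrayLength) {
        return Math.floorMod(key.hashCode(), arrayLength);
    }

    // Bit-mask form: valid only when arrayLength is a power of two.
    static int indexByMask(Object key, int arrayLength) {
        return key.hashCode() & (arrayLength - 1);
    }

    public static void main(String[] args) {
        // For a power-of-two length both forms pick the same slot.
        System.out.println(indexByMod("Rachel", 16) == indexByMask("Rachel", 16)); // true
    }
}
```

The mask keeps only the low bits of the hash code, which is cheaper than a division but ties the table size to powers of two.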

Lastly, why are Hash and LinkedList asked about in most interview questions? Is there any specific logic behind it for testing the interviewee's knowledge?

It's about how good you are at logical reasoning. The hash table is the most important data structure, and every programmer should know it.
