
Most efficient way of implementing a blacklist

I am developing an IP filter and am wondering how I could build a very efficient and fast blacklist filter, using any kind of data structure.

What I want to do is simple: for every incoming or outgoing connection, I have to check the address against a list of blocked IPs.

The IPs are scattered, and the memory use should be linear (not dependent on the number of block lists), because I want to run this on limited systems (homebrew routers).

I have time and could build anything from scratch. The difficulty is not important to me. If you could use anything, what would you do?

Hashtables are the way to go. They have O(1) average complexity for lookup, insertion and deletion! They tend to occupy more memory than trees but are much faster.

Since you are just working with 32-bit integers (you can of course convert an IPv4 address to a 32-bit integer), things will be amazingly simple and fast.
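As a minimal sketch of that idea, here it is in Python, where the built-in `set` is a hash table. The names `ip_to_int`, `blocked` and `is_blocked` are made up for illustration, and the sample addresses are arbitrary:

```python
import ipaddress

def ip_to_int(ip: str) -> int:
    """Convert a dotted-quad IPv4 string to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(ip))

# Python's built-in set is a hash table: membership tests are O(1) on average.
blocked = {ip_to_int(ip) for ip in ("127.0.0.1", "10.1.2.3")}

def is_blocked(ip: str) -> bool:
    """Return True if the address is on the blacklist."""
    return ip_to_int(ip) in blocked
```

On a real router you would more likely write this in C with a hand-rolled table, but the lookup logic is the same.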

You can just use a sorted array. Insertion and removal cost O(n), but lookup is O(log n), and above all the memory is just 4 bytes per IP. The implementation is very simple, perhaps too simple :D
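A rough sketch of the sorted-array approach, using Python's `array` and `bisect` modules. The class name `SortedBlacklist` is hypothetical, and the 4-bytes-per-entry figure assumes a platform where the `"I"` typecode is 4 bytes wide (the common case, but not guaranteed):

```python
from array import array
from bisect import bisect_left
import ipaddress

class SortedBlacklist:
    """Blocked IPv4 addresses kept as a sorted array of unsigned ints."""

    def __init__(self):
        # Typecode "I" is an unsigned int, 4 bytes on common platforms.
        self._ips = array("I")

    def _find(self, ip):
        value = int(ipaddress.IPv4Address(ip))
        return value, bisect_left(self._ips, value)  # O(log n) binary search

    def contains(self, ip):
        value, i = self._find(ip)
        return i < len(self._ips) and self._ips[i] == value

    def add(self, ip):
        value, i = self._find(ip)
        if i == len(self._ips) or self._ips[i] != value:
            self._ips.insert(i, value)  # O(n): shifts the tail of the array

    def remove(self, ip):
        value, i = self._find(ip)
        if i < len(self._ips) and self._ips[i] == value:
            del self._ips[i]  # O(n) as well
```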

Balanced binary trees have O(log n) complexity for lookup, insertion and deletion. A simple binary tree is not sufficient, however: you need an AVL tree or a red-black tree, which can be very annoying and complicated to implement. AVL and red-black trees balance themselves, and we need that because an unbalanced tree has a worst-case lookup complexity of O(n), the same as a simple linked list!

If, instead of single, unique IPs, you need to ban IP ranges, you probably need a Patricia trie, also called a radix tree; these were invented for word dictionaries and for IP dictionaries. However, these trees can be slower if they are not well written or balanced. Hashtables are almost always better for simple lookups! They are too fast to be real :)
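To illustrate range matching, here is a sketch of a plain binary trie keyed on address bits. Note that this is the uncompressed relative of a Patricia trie (no path compression), chosen only because it is short; the class name `PrefixTrie` and the CIDR input format are assumptions for the example:

```python
import ipaddress

class PrefixTrie:
    """Uncompressed binary trie over IPv4 bits; stores blocked CIDR prefixes."""

    def __init__(self):
        # Each node is [child_for_bit_0, child_for_bit_1, is_blocked_prefix].
        self._root = [None, None, False]

    def add(self, cidr):
        net = ipaddress.IPv4Network(cidr)
        value = int(net.network_address)
        node = self._root
        for depth in range(net.prefixlen):
            bit = (value >> (31 - depth)) & 1  # walk from the most significant bit
            if node[bit] is None:
                node[bit] = [None, None, False]
            node = node[bit]
        node[2] = True  # every address below this node is blocked

    def contains(self, ip):
        value = int(ipaddress.IPv4Address(ip))
        node = self._root
        for depth in range(32):
            if node[2]:
                return True  # a blocked prefix covers this address
            node = node[(value >> (31 - depth)) & 1]
            if node is None:
                return False
        return node[2]  # exact /32 match
```

A lookup touches at most 32 nodes regardless of how many ranges are stored; a real Patricia trie shortens that further by collapsing chains of single-child nodes.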

Now about synchronization:

If you fill the blacklist only once at application startup, you can use a plain read-only hashtable (or radix tree), which has no multithreading or locking problems.

If you need to update it only occasionally, I would suggest using reader-writer locks.
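Python's standard library has no reader-writer lock, so the sketch below builds a very simple one on `threading.Condition` just to show the access pattern. `ReadWriteLock` and `SharedBlacklist` are hypothetical names, and this naive version can starve writers under a constant stream of readers; in C you would more likely reach for `pthread_rwlock_t`:

```python
import threading
from contextlib import contextmanager

class ReadWriteLock:
    """Many concurrent readers or one exclusive writer (no writer priority)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writing = False

    @contextmanager
    def read(self):
        with self._cond:
            while self._writing:
                self._cond.wait()
            self._readers += 1
        try:
            yield
        finally:
            with self._cond:
                self._readers -= 1
                if self._readers == 0:
                    self._cond.notify_all()  # wake a waiting writer

    @contextmanager
    def write(self):
        with self._cond:
            while self._writing or self._readers:
                self._cond.wait()
            self._writing = True
        try:
            yield
        finally:
            with self._cond:
                self._writing = False
                self._cond.notify_all()  # wake readers and writers

class SharedBlacklist:
    """Hash set of 32-bit IP values guarded by the lock above."""

    def __init__(self):
        self._lock = ReadWriteLock()
        self._ips = set()

    def contains(self, value):
        with self._lock.read():  # lookups may run in parallel
            return value in self._ips

    def add(self, value):
        with self._lock.write():  # updates exclude everyone else
            self._ips.add(value)
```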

If you need very frequent updates, I would suggest using a concurrent hashtable. Warning: don't write your own; they are very complicated and bug-prone, so find an existing implementation on the web! They rely heavily on the (relatively) new atomic CAS operations of modern processors (CAS means compare-and-swap). These are special instructions, or sequences of instructions, that allow a 32-bit or 64-bit field in memory to be compared and swapped in a single atomic operation, without the need for locking. Using them can be complicated because you have to know your processor, your operating system and your compiler very well, and the algorithm itself is counterintuitive. See http://en.wikipedia.org/wiki/Compare-and-swap for more information about CAS.

Concurrent AVL trees have been invented, but they are so complicated that I really don't know what to say about them :) See, for example, http://hal.inria.fr/docs/00/07/39/31/PDF/RR-2761.pdf

I just found that concurrent radix trees exist: ftp://82.96.64.7/pub/linux/kernel/people/npiggin/patches/lockless/2.6.16-rc5/radix-intro.pdf but they are quite complicated too.

Concurrent sorted arrays don't exist, of course; you need a reader-writer lock for updates.

Consider also that the amount of memory required for a non-concurrent hashtable can be quite small: for each IP you need 4 bytes for the IP plus a pointer. You also need a big array of pointers (or 32-bit integers, with some tricks) whose size should be a prime number greater than the number of items to be stored. Hashtables can of course also resize themselves when required, if you want, but they can also store more items than that prime number, at the cost of slower lookups.
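One possible reading of the "integers with some tricks" layout is a chained hash table that stores array indices instead of pointers. The sketch below is an interpretation, not necessarily what was meant: `IntHashSet`, the `EMPTY` sentinel and the default of 1009 buckets are all illustrative choices, and deletion and resizing are left out:

```python
from array import array

EMPTY = -1  # stands in for a null pointer

class IntHashSet:
    """Chained hash set of 32-bit values using index arrays instead of pointers."""

    def __init__(self, buckets=1009):
        # `buckets` should be a prime larger than the expected item count.
        self._heads = array("l", [EMPTY]) * buckets  # first entry of each chain
        self._keys = array("I")  # stored IPs, 4 bytes each on common platforms
        self._next = array("l")  # index of the next entry in the same chain

    def contains(self, value):
        i = self._heads[value % len(self._heads)]
        while i != EMPTY:
            if self._keys[i] == value:
                return True
            i = self._next[i]
        return False

    def add(self, value):
        if self.contains(value):
            return
        b = value % len(self._heads)
        self._keys.append(value)
        self._next.append(self._heads[b])  # new entry points at the old chain head
        self._heads[b] = len(self._keys) - 1
```

Storing more items than there are buckets still works; the chains simply get longer, which is the slower lookup mentioned above.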

For both trees and hashtables, the space complexity is linear.

I hope this is a multithreaded application and not a multiprocess application (fork). If it is not multithreaded, it is much harder to share a portion of memory in a fast and reliable way.

"Most efficient" is a hard term to quantify. Clearly, if you had unlimited memory, you would have a bin for every IP address and could index into it immediately.

A common tradeoff is using a B-tree type data structure. First-level bins could be preset for the first 8 bits of the IP address, each storing a pointer to, and the size of, a list containing all currently blocked IP addresses that share those 8 bits. This second list would be padded to prevent unnecessary memmove() calls, and possibly sorted. (Having the size and the length of the list in memory allows an in-place binary search on the list, at the slight expense of insertion time.)

For example:

127.0.0.1 =insert=> { 127 :: 1 }
127.0.1.0 =insert=> { 127 :: 1, 256 }
12.0.2.30 =insert=> { 12 :: 542; 127 :: 1, 256 }
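The example above can be reproduced with a short sketch: 256 first-level bins indexed by the first octet, each holding a sorted list of the remaining 24 bits. `TwoLevelBlacklist` and `dump` are made-up names, and Python lists stand in for the padded C arrays (they over-allocate internally, which plays a similar role to the padding):

```python
from bisect import bisect_left
import ipaddress

class TwoLevelBlacklist:
    """First octet selects a bin; each bin is a sorted list of the low 24 bits."""

    def __init__(self):
        self._bins = [[] for _ in range(256)]  # preset first-level bins

    @staticmethod
    def _split(ip):
        value = int(ipaddress.IPv4Address(ip))
        return value >> 24, value & 0xFFFFFF  # (first octet, remaining 24 bits)

    def add(self, ip):
        high, low = self._split(ip)
        bucket = self._bins[high]
        i = bisect_left(bucket, low)
        if i == len(bucket) or bucket[i] != low:
            bucket.insert(i, low)  # keep the bin sorted for binary search

    def contains(self, ip):
        high, low = self._split(ip)
        bucket = self._bins[high]
        i = bisect_left(bucket, low)
        return i < len(bucket) and bucket[i] == low

    def dump(self):
        """Non-empty bins only, in the same shape as the example above."""
        return {high: b for high, b in enumerate(self._bins) if b}
```

After inserting 127.0.0.1, 127.0.1.0 and 12.0.2.30, `dump()` returns `{12: [542], 127: [1, 256]}`.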

The overhead of such a data structure is minimal, and the total storage size is fixed. The worst case, clearly, would be a large number of IP addresses sharing the same highest-order bits.

One way to improve the performance of such a system is to use a Bloom filter. This is a probabilistic data structure, taking up very little memory, in which false positives are possible but false negatives are not.

When you want to look up an IP address, you first check the Bloom filter. If there's a miss, you can allow the traffic right away. If there's a hit, you need to check your authoritative data structure (e.g. a hash table or prefix tree).
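A small sketch of that two-stage lookup follows. The filter size (8192 bits), the number of hash functions (4) and the use of salted SHA-256 are arbitrary choices for illustration, and a plain `set` stands in for the authoritative structure; a real router would want a cheaper hash:

```python
import hashlib

class BloomFilter:
    """Bit array plus k hash functions; may report false positives, never false negatives."""

    def __init__(self, num_bits=8192, num_hashes=4):
        self._num_bits = num_bits
        self._num_hashes = num_hashes
        self._bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value):
        data = value.to_bytes(4, "big")
        for seed in range(self._num_hashes):
            # A different one-byte prefix per hash function (illustrative choice).
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self._num_bits

    def add(self, value):
        for pos in self._positions(value):
            self._bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        return all(self._bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))

def is_blocked(value, bloom, authoritative):
    """Fast path: a Bloom miss proves the address is not blocked."""
    if not bloom.might_contain(value):
        return False
    return value in authoritative  # slow path: confirm the hit
```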

You could also create a small cache of addresses that hit in the Bloom filter but are actually allowed; it is checked after the Bloom filter but before the authoritative data structure.

Basically, the idea is to speed up the fast path (IP address allowed) at the expense of the slow path (IP address denied).

