Hamming numbers by intervals

Here's a somewhat different approach to generating the sequence of Hamming numbers (aka regular numbers, 5-smooth numbers), based on the interval from one number in the sequence to the next. Here's an example plot of said intervals:

[image: plot of the intervals between successive Hamming numbers]

So there is a relatively limited number of discrete intervals separating one number from the next, and the intervals get smaller as H increases. It's often noted that Hamming numbers get sparser as they increase in size, which they do in absolute terms, but in another sense (proportionally) they get closer together.

Basically, as H goes up there is greater opportunity for 2^i*3^j*5^k, where i,j,k are positive or negative integers, to result in a fraction near 1.0.

Turns out that a table of just 119 intervals (i,j,k triples) covers Hamming numbers up to about 10^10000. That's about the first 1.59 trillion Hamming numbers. Such a table (C header file), sorted by the interval size from small to large, is here. Given a Hamming number, to find the next one all that's required is to find the first entry in the table where multiplication (addition of respective exponents) would yield a result with non-negative exponents i, j and k.

E.g., the millionth Hamming number is 2^55*3^47*5^64, which is about 5.1931278e83. The next Hamming number after that is 2^38*3^109*5^29, or about 5.1938179e83. The first appropriate table entry is:

{-17,62,-35}, // 1.000132901540844

So while those numbers are separated by about 7e79, their ratio is 1.000132901540844. Finding the next number requires trying at most 119 entries in the worst case, using only additions and comparisons (no multiplications). Also, at just 3 short ints per entry, the table requires under 1 KB of memory. The algorithm is basically O(1) in memory and O(n) in time, where n is the length of the sequence.
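A minimal sketch of that lookup, not the code linked above. The 119-entry table isn't reproduced in this post, so the snippet assumes a toy 11-entry table that only covers Hamming numbers up to 125. `Triple`, `toy_tab`, `next_hamming` and `nth_hamming` are names made up for illustration.

```c
#include <assert.h>

/* Exponent triple: stands for 2^i * 3^j * 5^k. */
typedef struct { short i, j, k; } Triple;

/* TOY interval table (an assumption for illustration, NOT the 119-entry
   table from the post): every ratio between consecutive Hamming numbers
   up to 125, sorted from the smallest ratio to the largest. */
static const Triple toy_tab[] = {
    {-4,  4, -1},  /* 81/80  = 1.0125  */
    {-3, -1,  2},  /* 25/24 ~= 1.0417  */
    { 4, -1, -1},  /* 16/15 ~= 1.0667  */
    { 0,  3, -2},  /* 27/25  = 1.08    */
    { 1, -2,  1},  /* 10/9  ~= 1.1111  */
    {-3,  2,  0},  /*  9/8   = 1.125   */
    { 1,  1, -1},  /*  6/5   = 1.2     */
    {-2,  0,  1},  /*  5/4   = 1.25    */
    { 2, -1,  0},  /*  4/3  ~= 1.3333  */
    {-1,  1,  0},  /*  3/2   = 1.5     */
    { 1,  0,  0},  /*  2               */
};
enum { TOY_LEN = sizeof toy_tab / sizeof toy_tab[0] };

/* Advance *h to the next Hamming number: take the first (smallest-ratio)
   entry whose exponents, added to h, stay non-negative.
   Returns the index of the table entry that was used. */
int next_hamming(Triple *h)
{
    for (int t = 0; t < TOY_LEN; t++) {
        int i = h->i + toy_tab[t].i;
        int j = h->j + toy_tab[t].j;
        int k = h->k + toy_tab[t].k;
        if (i >= 0 && j >= 0 && k >= 0) {
            h->i = (short)i;
            h->j = (short)j;
            h->k = (short)k;
            return t;
        }
    }
    assert(0 && "unreachable: the entry {1,0,0} always applies");
    return -1;
}

/* n-th Hamming number (1-based) as an exponent triple.
   Only trustworthy for n <= 37 with the toy table (37th number is 125). */
Triple nth_hamming(int n)
{
    Triple h = {0, 0, 0};            /* the number 1 */
    for (int s = 1; s < n; s++)
        next_hamming(&h);
    return h;
}
```

For instance, `nth_hamming(20)` should come back as {2,2,0}, i.e. 36. Past 125 the toy table lacks ratios such as 128/125, so the scan would start skipping numbers; a table like this is only valid inside the range it was built for.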

One way to speed it up: rather than searching the table from the 0th index every time, restrict the search to just those entries that are empirically known to follow the given entry in the given range (n < 1.59e12). Those lists are given in the header file above in the succtab[] struct, e.g.:

{11,{47,55,58,65,66,68,70,72,73,75,76}},

So that particular index is empirically found to be followed by only the 11 indices listed, so those are the only ones searched.
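Here's a rough sketch of the successor-list idea on the same kind of toy table (ratios between consecutive Hamming numbers up to 125), under the assumption that the lists are simply recorded from a first pass with the full scan. `SuccList`, `build_succ` and `next_succ` are hypothetical names, and the layout only mimics the count-plus-indices shape of the succtab[] entry above; this is not the linked implementation.

```c
#include <assert.h>

typedef struct { short i, j, k; } Triple;   /* 2^i * 3^j * 5^k */

/* TOY table (assumption): ratios seen up to 125, smallest first. */
static const Triple toy_tab[] = {
    {-4, 4,-1}, {-3,-1, 2}, { 4,-1,-1}, { 0, 3,-2}, { 1,-2, 1}, {-3, 2, 0},
    { 1, 1,-1}, {-2, 0, 1}, { 2,-1, 0}, {-1, 1, 0}, { 1, 0, 0},
};
enum { TOY_LEN = sizeof toy_tab / sizeof toy_tab[0] };

/* Successor list: how many candidates, then their table indices
   in ascending order (i.e. smallest ratio first). */
typedef struct { int n; int idx[TOY_LEN]; } SuccList;

/* Try table entry t on *h; returns 1 and updates *h if all exponents
   stay non-negative, otherwise returns 0 and leaves *h alone. */
int apply(Triple *h, int t)
{
    int i = h->i + toy_tab[t].i;
    int j = h->j + toy_tab[t].j;
    int k = h->k + toy_tab[t].k;
    if (i < 0 || j < 0 || k < 0)
        return 0;
    h->i = (short)i; h->j = (short)j; h->k = (short)k;
    return 1;
}

/* Plain scan over the whole table; returns the index used. */
int next_full(Triple *h)
{
    for (int t = 0; t < TOY_LEN; t++)
        if (apply(h, t))
            return t;
    assert(0 && "unreachable: the entry {1,0,0} always applies");
    return -1;
}

/* Walk `steps` steps from 1 with the full scan and record, for each
   table index, which indices were observed to follow it. */
void build_succ(SuccList *succ, int steps)
{
    unsigned char seen[TOY_LEN][TOY_LEN] = {{0}};
    Triple h = {0, 0, 0};
    int prev = next_full(&h);
    for (int s = 1; s < steps; s++) {
        int cur = next_full(&h);
        seen[prev][cur] = 1;
        prev = cur;
    }
    for (int p = 0; p < TOY_LEN; p++) {
        succ[p].n = 0;
        for (int c = 0; c < TOY_LEN; c++)
            if (seen[p][c])
                succ[p].idx[succ[p].n++] = c;
    }
}

/* Next step when the previous step used table index `prev`: only the
   recorded successors are tried. Returns the index used, or -1 if
   none applied (which means h has left the recorded range). */
int next_succ(Triple *h, const SuccList *succ, int prev)
{
    for (int n = 0; n < succ[prev].n; n++)
        if (apply(h, succ[prev].idx[n]))
            return succ[prev].idx[n];
    return -1;
}
```

With `build_succ(succ, 36)` (the 36 steps from 1 to 125), the entry for 10/9 (index 4) should end up with four successors instead of eleven candidates. The lists are only as good as the range they were recorded over, which is presumably why the real ones are tied to n < 1.59e12.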

Doing that speeds up the algorithm by a factor of 4 or so; it is implemented here (C code) along with the header file above. Here's a plot of the execution time on an i7-2600 3.4GHz machine:

[image: plot of execution time]

I believe that compares favorably with the state of the art. Is that so?

The Hamming problem is sometimes reduced to just finding the nth Hamming number without generating all the intermediate values. Adapting the above technique to a well-known scheme of just enumerating the Hamming numbers in a band around the desired range gives this plot of execution time:

[image: plot of execution time for finding the nth Hamming number]

So it takes less than 2 seconds to find the 1.59 trillionth Hamming number. The C code for that is here. Does this also compare favorably with the state of the art, at least in the given bounds?

EDIT: the bounds for n (1.59e12, Hamming numbers up to about 10^10000) were chosen based on a specific machine, where it was desired that i,j,k be short ints and that execution speed stay reasonable. Larger tables could be generated, e.g. a table of 200 entries would allow n to be as high as about 1e18 (Hamming numbers up to about 10^85000).

Another question would be how to speed it up further. One potential area: it turns out that some table entries are hit much more often than others, and they have a correspondingly larger list of successors to check. For example, when generating the first 1.59e12 numbers, this entry is hit by fully 46% of the iterations:

{-7470,2791,1312}

It has 23 possible different successors. Perhaps some way of narrowing that down based on other parameters (e.g., the history of the previous entries traversed) would help, although there wouldn't be much room for an expensive operation.

EDIT #2:

For some info about generating the table, there are basically six classes of fractions 2^i*3^j*5^k where i,j,k are positive or negative integers: fractions with only 2, 3 or 5 in the numerator, and fractions with only 2, 3 or 5 in the denominator. E.g., for the class with only 2 in the numerator:

f = 2^i/(3^j*5^k), i > 0 and j,k >= 0

A C program to compute the intervals for this class of fraction is here. For Hamming numbers up to about 10^10000 it runs in a few seconds. It could probably be made more efficient.

Repeating a similar process for the other 5 classes of fractions yields six lists. Sorting them all together by the interval size and removing duplicates yields the complete table.
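For cross-checking a table on a small range, here's a brute-force sketch. It is explicitly not the six-class method described above: it just generates the sequence with the classic three-pointer merge, takes the exponent difference of each consecutive pair, drops duplicates and sorts by ratio. It assumes values fit in 64-bit integers, so it is only usable for tiny ranges; `Ham`, `Interval`, `gen_hamming` and `collect_intervals` are made-up names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct { short i, j, k; } Triple;           /* 2^i * 3^j * 5^k */
typedef struct { uint64_t v; Triple e; } Ham;       /* value + exponents */
typedef struct { Triple t; double ratio; } Interval;

/* qsort comparator: ascending by ratio. */
static int by_ratio(const void *a, const void *b)
{
    double x = ((const Interval *)a)->ratio;
    double y = ((const Interval *)b)->ratio;
    return (x > y) - (x < y);
}

/* Fill h[] with all Hamming numbers <= limit in increasing order
   (at most cap of them, cap >= 1). Returns how many were written. */
size_t gen_hamming(Ham *h, size_t cap, uint64_t limit)
{
    size_t n = 0, p2 = 0, p3 = 0, p5 = 0;
    h[n++] = (Ham){1, {0, 0, 0}};
    while (n < cap) {
        uint64_t a = h[p2].v * 2, b = h[p3].v * 3, c = h[p5].v * 5;
        uint64_t m = a < b ? (a < c ? a : c) : (b < c ? b : c);
        if (m > limit)
            break;
        Ham x;
        if (m == a)      { x = h[p2]; x.e.i++; }
        else if (m == b) { x = h[p3]; x.e.j++; }
        else             { x = h[p5]; x.e.k++; }
        x.v = m;
        h[n++] = x;
        if (m == a) p2++;     /* advance every pointer that produced m, */
        if (m == b) p3++;     /* so duplicates like 6 = 2*3 = 3*2       */
        if (m == c) p5++;     /* are emitted only once                  */
    }
    return n;
}

/* Collect the distinct exponent differences between consecutive
   entries of h[0..n-1] into out[] (at most cap), sorted by ratio. */
size_t collect_intervals(const Ham *h, size_t n, Interval *out, size_t cap)
{
    size_t m = 0;
    for (size_t s = 0; s + 1 < n; s++) {
        Triple d = { (short)(h[s + 1].e.i - h[s].e.i),
                     (short)(h[s + 1].e.j - h[s].e.j),
                     (short)(h[s + 1].e.k - h[s].e.k) };
        size_t q = 0;
        while (q < m && !(out[q].t.i == d.i && out[q].t.j == d.j &&
                          out[q].t.k == d.k))
            q++;
        if (q == m && m < cap) {          /* not seen before: record it */
            out[m].t = d;
            out[m].ratio = (double)h[s + 1].v / (double)h[s].v;
            m++;
        }
    }
    qsort(out, m, sizeof out[0], by_ratio);
    return m;
}
```

Run with a limit of 125, this should give 11 intervals, from {-4,4,-1} (81/80) up to {1,0,0} (2). The catch is that it needs the whole sequence, which is exactly what the class-by-class approach avoids.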

The triples enumeration is ~ n^(2/3), but the sorting of the band is ~ n^(2/3) log(n^(2/3)), i.e. ~ n^(2/3) log n. This obviously doesn't change even with the ~ n^(1/3) band space scheme.
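To illustrate where the ~ n^(2/3) comes from, here's a small sketch of just the counting half of such a scheme: it loops over (j,k) pairs only and counts the admissible i for each pair in one go. It's an assumption-laden toy (plain 64-bit integers instead of logarithms, no band, no selection), and `count_hamming_le` is a made-up name.

```c
#include <assert.h>
#include <stdint.h>

/* Count the Hamming numbers 2^i * 3^j * 5^k that are <= x.
   Only (j,k) pairs are enumerated; for each pair the number of valid
   i is the bit length of floor(x / (3^j * 5^k)). */
uint64_t count_hamming_le(uint64_t x)
{
    uint64_t count = 0;
    if (x == 0)
        return 0;
    for (uint64_t p5 = 1; ; p5 *= 5) {             /* p5  = 5^k       */
        for (uint64_t p35 = p5; ; p35 *= 3) {      /* p35 = 3^j * 5^k */
            uint64_t q = x / p35;                  /* need 2^i <= q   */
            while (q) {                            /* one i per bit   */
                count++;
                q >>= 1;
            }
            if (p35 > x / 3)                       /* next 3^j would exceed x */
                break;
        }
        if (p5 > x / 5)                            /* next 5^k would exceed x */
            break;
    }
    return count;
}
```

The number of (j,k) pairs grows like (log x)^2 while the count n itself grows like (log x)^3, which is the n^(2/3) in question. As a sanity check, `count_hamming_le(125)` should be 37.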

Indeed the empirical complexities are seen in practice as ~ n^0.7.

I am yet to understand your algorithm fully, but the evidence you presented strongly suggests the pure ~ n^(2/3) operation, which would constitute a clear and significant improvement over the previous state of the art, absolutely.

[image]

This would not be so, in my opinion, if it were necessary to generate the whole sequence in order to find the "intervals" (ratios) your algorithm is based on. But since you generate them independently, as your later edit seems to suggest, it's no impediment at all.

Correction: if we're only interested in the nth member of the sequence, then a full sort of the band is not needed; O(n) select-kth-largest algorithms do exist.
