How does arithmetic or elementary function operation latency scale with the number of bits?

Notice that the ratio between 64-bit and 32-bit float ops is different on different hardware. For example, NVidia recently improved 64-bit performance while 32-bit remained unchanged. That made me curious: given a sufficiently wide datapath, what factors determine by how much certain floating point ops have to slow down when you double the number of bits?

For the purpose of this question, assume that you can significantly increase the width of your datapath when you double the number of bits. Not unlimited (otherwise a lookup table would theoretically be possible for any arithmetic function), but wide enough to perform arithmetic operations in parallel on independent bits. Given that, by what factor would doubling the word size slow down the arithmetic operations +, *, /? And what about built-in elementary functions such as log, exp, sin, atan?

EDIT:

Let me explain more clearly what I am asking here.

First of all, it's known that if one theoretically has unlimited circuitry/area, one can compute any mathematical operation on N-bit input(s) in O(log N). All one has to do is create a huge lookup table of size 2^N (for 1-operand functions such as sin(x)) or 2^(2*N) (for 2-operand functions) and look up the desired value using the input as the index. Needless to say, this is completely impractical, and I am not interested in an answer like that. However, it shows that one cannot theoretically prove that any operation would necessarily require more than O(log N) time, given arbitrary width of the datapath.
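Just to put numbers on how impractical that is, here is a quick back-of-the-envelope calculation in Python (the storage format, one N-bit result per entry, is my own assumption purely for illustration):

    # Size of a full lookup table for an N-bit function, assuming each
    # entry stores one N-bit result (illustrative assumption only).
    def table_bytes(n_bits, operands):
        entries = 2 ** (operands * n_bits)   # one entry per possible input
        return entries * n_bits // 8         # each entry holds an n_bits-wide result

    for n in (8, 16, 32, 64):
        print(f"N={n:2d}: 1-operand {table_bytes(n, 1):.3e} bytes, "
              f"2-operand {table_bytes(n, 2):.3e} bytes")

Already at N=32 the 1-operand table is about 17 GB and the 2-operand table is on the order of 7e19 bytes, so this is only a theoretical device.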

Second, it's also known that Omega(log N) is a lower bound even for a relatively simple operation such as an adder. This has to do with how many input bits each output bit depends on, and therefore the depth of the circuit: with bounded-fan-in gates, an output bit that depends on all N input bits must sit at the root of a tree of depth at least log N.

The question really is: given a reasonable bound on the size of the circuit (say, no more than poly(N) gates), what is the asymptotic behavior of the latency of the optimal circuit implementing arithmetic and elementary function operations?

The answer is known to be O(log N) for addition, realized by a carry-lookahead adder. I don't know the answer for multiplication, but I suspect it can be implemented as an O(log N) circuit as well, because multiplication boils down to constant-time Boolean operations (forming the partial products) followed by adding multiple operands, and extending carry lookahead to a multi-operand adder doesn't seem too difficult.
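To make the adder case concrete, here is a minimal software sketch (not a hardware model; the Kogge-Stone prefix scheme is just one well-known way to organize the carry lookahead) showing that the carry network needs only about log2(N) combine levels:

    # Parallel-prefix (Kogge-Stone style) carry-lookahead addition: the number
    # of prefix combine levels grows like log2(n), not n.
    def cla_add(a, b, n):
        g = [(a >> i) & (b >> i) & 1 for i in range(n)]    # generate:  a_i AND b_i
        p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]  # propagate: a_i XOR b_i
        G, P = g[:], p[:]
        d, levels = 1, 0
        while d < n:                        # one pass = one level of the prefix tree
            newG, newP = G[:], P[:]
            for i in range(d, n):
                newG[i] = G[i] | (P[i] & G[i - d])
                newP[i] = P[i] & P[i - d]
            G, P = newG, newP
            d *= 2
            levels += 1
        carry = [0] + G[:n - 1]             # carry into bit i (carry-in is 0)
        s = sum((p[i] ^ carry[i]) << i for i in range(n))
        return s, levels

    for n in (8, 16, 32, 64):
        a, b = 12345 % (1 << n), 54321 % (1 << n)
        s, levels = cla_add(a, b, n)
        assert s == (a + b) % (1 << n)
        print(f"{n:3d} bits -> {levels} prefix levels")   # 3, 4, 5, 6

Doubling the width from 32 to 64 bits adds just one more level to the carry network.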

I have no clue what the asymptotics would be for division and square root.

I am also curious about common elementary functions, such as log, exp, sin, etc.

There are two dimensions in which increasing logic complexity will affect circuit delay. One is impact on pipeline stages, where one or more combinational delays will be "critical paths" constraining the minimum clock period. Almost arbitrarily (albeit with varying amounts of work), you can take a complex circuit and pipeline it in any number of stages. More stages will chop up the logic more, increasing the latency in cycles but also reducing the min clock period, which increases throughput. Note that as you increase stages, you run into diminishing returns, because the pipeline registers have constant overhead. Also, more pipeline stages means that dependent instructions have to wait longer to get their inputs, although that doesn't affect GPUs so much because of the high thread parallelism.
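To illustrate the diminishing returns, here is a toy model (the delay numbers are made up purely for illustration) of a fixed combinational delay split across more and more pipeline stages, each adding a constant register overhead:

    # Toy pipelining model: more stages shorten the clock period (raising
    # throughput) but with diminishing returns, while cycle latency grows.
    COMB_DELAY_NS = 4.0     # total combinational delay of the un-pipelined circuit
    REG_OVERHEAD_NS = 0.2   # setup + clock-to-q overhead per pipeline register

    for stages in (1, 2, 4, 8, 16):
        period = COMB_DELAY_NS / stages + REG_OVERHEAD_NS   # min clock period
        throughput = 1.0 / period                           # results per ns
        latency = stages * period                           # ns until first result
        print(f"{stages:2d} stages: period {period:.2f} ns, "
              f"throughput {throughput:.2f}/ns, latency {latency:.2f} ns")

Going from 1 to 2 stages nearly doubles throughput, but each further doubling helps less and less (throughput can never exceed 1/REG_OVERHEAD_NS), while the total latency in ns keeps climbing.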

Just to get it out of the way, I'll mention that increasing your circuit area will always indirectly affect performance. Larger circuits mean more complex placement and routing, and that's going to mean that combinational delays will not scale linearly with the number of logic gates. We'll ignore that for now.

Doubling the datapath width for some things won't have any impact on combinational delay. For instance, if you have a bit-wise AND operation, every bit is computed independently. So in the abstract, doubling your datapath width will not affect your cycle time.

Now, you're asking about floating point, but a floating point pipeline is going to be composed of integer blocks that do things like add (and subtract), multiply, and shift. I'm going based on memory here, so someone may need to correct me, but here goes.

The delay of a carry look-ahead add or sub unit will generally grow logarithmically with the number of bits, so doubling the datapath width will (again, ignoring the impact of placement and routing) only increase the delay a little bit.

IIRC, a barrel shifter has the same growth rate as add/sub.
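For intuition, here is a small sketch (my own illustration, not a hardware model) of why: a barrel shifter conditionally shifts by successive powers of two, so an N-bit datapath needs only about log2(N) mux stages on the critical path:

    # Barrel shifter sketch: one 2:1 mux layer per power-of-two shift distance,
    # so the number of stages grows like log2(n).
    def barrel_shift_left(x, shamt, n):
        mask = (1 << n) - 1
        d, stages = 1, 0
        while d < n:
            if shamt & d:                  # this stage selects the shifted value
                x = (x << d) & mask
            d <<= 1
            stages += 1
        return x, stages

    val, stages = barrel_shift_left(0b1011, 5, 32)
    print(bin(val), stages)                # 0b101100000, 5 stages for 32 bits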

A multiplier's delay will increase linearly with width because it's more or less a 2D array of full adders, though some optimizations can be made. So if you double the datapath width, I think you'll double the circuit delay. In that case, you may want to pipeline your multiplier into two stages.
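The optimizations I have in mind are carry-save compressor trees (Wallace/Dadda style) for the partial products. Here is a rough sketch (my own, assuming ideal 3:2 compressors and ignoring wiring) of how the number of reduction levels grows with width, which is much slower than the row count of a plain adder array:

    # Carry-save reduction: count the 3:2 compressor levels needed to squeeze
    # n partial-product rows down to 2, after which one log-depth add finishes.
    def csa_levels(rows):
        levels = 0
        while rows > 2:
            rows = 2 * (rows // 3) + rows % 3   # each group of 3 rows becomes 2
            levels += 1
        return levels

    for n in (8, 16, 32, 64, 128):
        print(f"{n:3d}-bit multiply: {csa_levels(n):2d} reduction levels "
              f"(vs. roughly {n} rows in a plain adder array)")

With that kind of tree the depth grows roughly logarithmically (about log base 1.5 of N) rather than linearly.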
