
Optimal frequency of modulo operation in finite field arithmetic implementation

I'm trying to implement finite field arithmetic to use in elliptic curve calculations. Since all that's ever used are arithmetic operations that commute with the modulo operator, I don't see a reason not to delay that operation till the very end. One thing that may happen is that the numbers involved might become (way) too big and impractical/inefficient to work with, but I was wondering if there was a way to determine the optimal conditions/frequency which should trigger a modulo operation in the calculations.

I'm coding in C.

To avoid the complexity of elliptic curve crypto (as I'm unfamiliar with its algorithms), let's assume you're doing temp = (a * b) % M; result = (temp * c) % M, and you're thinking about just doing result = (a * b * c) % M instead.

Let's also assume that you're doing this a lot with the same modulo M ; so you've precomputed "multiples of M" lookup tables, so that your modulo code can use the table to find the highest multiple of "M shifted left by N" that is not greater than the dividend and subtract it from dividend, and repeat that with decreasing values of N until you're left with the quotient.

If your lookup table has 256 entries, the dividend is 4096 bits and the divisor is 2048 bits; then you'd reduce the size of the dividend by 8 bits per iteration, so dividend would become smaller than the divisor (and you'd find the quotient) after no more than 256 "search and subtract" operations.
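The "search and subtract" scheme can be sketched at toy scale, with a 64-bit dividend and a 32-bit modulus standing in for the 4096/2048-bit case. The function name table_mod and the size limits (M below 2^23, x below M << 40, so no shift overflows) are assumptions of this sketch, not part of the answer:

```c
#include <stdint.h>

/* Table-driven "search and subtract" reduction at toy scale:
   a 256-entry table of k*M, consuming 8 bits of the dividend per
   level. Assumes M < 2^23 and x < M << 40 so nothing overflows. */
static uint64_t table_mod(uint64_t x, uint32_t M) {
    uint64_t mult[256];
    for (int k = 0; k < 256; k++)
        mult[k] = (uint64_t)k * M;

    for (int shift = 32; shift >= 0; shift -= 8) {
        /* binary-search the largest k with (k*M << shift) <= x */
        int lo = 0, hi = 255;
        while (lo < hi) {
            int mid = (lo + hi + 1) / 2;
            if ((mult[mid] << shift) <= x) lo = mid; else hi = mid - 1;
        }
        x -= mult[lo] << shift;        /* one "search and subtract" */
    }
    return x;                          /* now x < M */
}
```

Each level leaves the dividend smaller than M shifted left by the current amount, which is why 8 bits come off per iteration.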

For multiplication; it's almost purely "multiply and add digits" for each pair of digits. Eg using uint64_t as a digit, multiplying 2048 bit numbers is multiplying 32 digit numbers and involves 32 * 32 = 1024 of those "multiply and add digits" operations.
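The "multiply and add digits" operation looks like this in C, using unsigned __int128 for the 64x64-bit partial products (a GCC/Clang extension; the function name mul_digits is just illustrative):

```c
#include <stdint.h>

/* Schoolbook multiply of little-endian uint64_t digit arrays:
   r[0..na+nb-1] = a[0..na-1] * b[0..nb-1]. Each inner step is one
   "multiply and add digits" operation, so the total count is
   na * nb of them (32 * 32 = 1024 for 2048-bit operands). */
static void mul_digits(uint64_t *r, const uint64_t *a, int na,
                       const uint64_t *b, int nb) {
    for (int i = 0; i < na + nb; i++)
        r[i] = 0;
    for (int i = 0; i < na; i++) {
        uint64_t carry = 0;
        for (int j = 0; j < nb; j++) {
            /* 64x64 -> 128-bit multiply plus two 64-bit addends
               cannot overflow 128 bits */
            unsigned __int128 t = (unsigned __int128)a[i] * b[j]
                                + r[i + j] + carry;
            r[i + j] = (uint64_t)t;
            carry = (uint64_t)(t >> 64);
        }
        r[i + nb] = carry;
    }
}
```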

Now we can make comparisons. Specifically, assuming a , b , c , M are 2048-bit numbers:

a) the original temp = (a * b) % M; result = (temp * c) % M would be 1024 "multiply and add", then 256 "search and subtract", then 1024 "multiply and add", then 256 "search and subtract". For totals it'd be 2048 "multiply and add" and 512 "search and subtract".

b) the proposed result = (a * b * c) % M would be 1024 "multiply and add", then would be 2048 "multiply and add" (as the result of a*b will be a "twice as big" 4096-bit number), then 512 "search and subtract" (as a*b*c will be twice as big as a*b ). For totals it'd be 3072 "multiply and add" and 512 "search and subtract".

In other words; (assuming lots of assumptions) the proposed result = (a * b * c) % M would be worse, with 50% more "multiply and add" and the exact same "search and subtract".

Of course none of this (the operations you need for elliptic curve crypto, the sizes of your variables, etc) can be assumed to apply for your specific case.

I was wondering if there was a way to determine the optimal conditions/frequency which should trigger a modulo operation in the calculations.

Yes; the way to determine the optimal conditions/frequency is to do similar to what I did above - determine the true costs (in terms of lower level operations, like my "search and subtract" and "multiply and add") and compare them.

In general (regardless of how modulo is implemented, etc) I'd expect you'll find that doing modulo as often as possible is the fastest option (as it reduces the cost of multiplications and also reduces the cost of later/final modulo) for all cases that don't involve addition or subtraction, and that don't fit in simple integers.

If M is a constant, then an alternative for modulo is to multiply by the logical inverse of M. Looking at Polk's comment about 256 bits being a common case, and assuming M is a polynomial of degree 256 with 1-bit coefficients, define the inverse of M to be x^512 / M, which results in a 256-bit "inverse". Call this inverse I. Then for a multiply modulo M:

C = A * B                            ; 512 bit product
Q = (upper 256 bits of C * I)>>256   ; Q = C / M = 256 bit quotient
P = M * Q                            ; 512 bit product
R = lower 256 bits of (C xor P)      ; (A * B)% M   

So this requires 3 extended precision multiplies and one xor.
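The four steps above can be sketched at one-eighth scale: a degree-32 M (here the CRC-32 polynomial, an assumed example), 32-bit A and B, and a software bit-loop standing in for a carryless multiply instruction. The names clmul64, poly_div and gf2_mulmod are just illustrative, and in real use I would be precomputed once:

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* Carryless (GF(2) polynomial) multiply; caller ensures the product
   degree stays below 64. A software stand-in for PCLMULQDQ. */
static uint64_t clmul64(uint64_t a, uint64_t b) {
    uint64_t r = 0;
    while (b) {
        if (b & 1) r ^= a;
        a <<= 1;
        b >>= 1;
    }
    return r;
}

/* GF(2) polynomial long division: returns num div den and leaves
   num mod den in *rem. den_deg is the exact degree of den; assumes
   deg(num) - den_deg < 64 so the quotient fits in a uint64_t. */
static uint64_t poly_div(u128 num, uint64_t den, int den_deg, uint64_t *rem) {
    uint64_t q = 0;
    for (int i = 127; i >= den_deg; i--) {
        if ((num >> i) & 1) {
            num ^= (u128)den << (i - den_deg);
            q |= (uint64_t)1 << (i - den_deg);
        }
    }
    *rem = (uint64_t)num;
    return q;
}

/* The recipe from the text, scaled to degree 32: I = x^64 div M, then
   Q = (upper half of C * I) >> 32, P = M * Q, R = low 32 bits of C ^ P. */
static uint32_t gf2_mulmod(uint32_t A, uint32_t B, uint64_t M) {
    uint64_t unused;
    uint64_t I = poly_div((u128)1 << 64, M, 32, &unused);
    uint64_t C = clmul64(A, B);             /* 63-bit product   */
    uint64_t Q = clmul64(C >> 32, I) >> 32; /* exact quotient   */
    uint64_t P = clmul64(M, Q);
    return (uint32_t)(C ^ P);               /* (A * B) mod M    */
}
```

Unlike integer Barrett reduction, the polynomial version needs no corrective subtraction: with no carries, the estimated quotient is exact.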

If the processor for this code has a carryless multiply, such as X86 PCLMULQDQ, which multiplies two 64 bit operands to produce a 128 bit result, then that could be used as the basis for an extended precision multiply. A basic implementation would need 16 multiplies for a 256 bit by 256 bit multiply to produce a 512 bit product. This could be improved using something like Karatsuba:

https://en.wikipedia.org/wiki/Karatsuba_algorithm

but on current X86, PCLMULQDQ is fast, taking 1 to 3 cycles, so the main issue would be loading the data into the XMM registers, and I'm not sure Karatsuba would save much time.

optimal conditions/frequency which should trigger a modulo operation in the calculations

Standard practice is to replace all actual modulo operations with something else. So the frequency is never. There are different ways to accomplish that:

  • Choose the modulus to be a Mersenne prime or pseudo-Mersenne prime. There is a large repertoire of mathematical tricks to implement arithmetic modulo a (pseudo-)Mersenne prime efficiently, without doing any actual modulo operations. In the context of elliptic curves, the prime-modulus NIST curves are chosen this way and for this reason.
  • Use Barrett reduction. This has the same effect as a real modulo operation, but relies on some precomputation and a precondition on the range of the input to be able to reduce the cost of a modulo-like operation to the cost of a couple of multiplications (plus some supporting operations). Also applicable to polynomial fields.
  • Do arithmetic in Montgomery form.
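For the first bullet, here is what a Mersenne-prime trick looks like in C, using p = 2^31 - 1 as a small assumed example (real curves use much larger primes, but the folding idea is the same):

```c
#include <stdint.h>

/* Folding reduction for the Mersenne prime p = 2^31 - 1: since
   2^31 is congruent to 1 (mod p), the high bits just add onto the
   low bits. Two folds bring any 64-bit x below 2^31 + 7, then one
   conditional subtraction finishes -- no division anywhere. */
static uint32_t mod_mersenne31(uint64_t x) {
    const uint64_t p = 0x7FFFFFFF;           /* 2^31 - 1 */
    x = (x & p) + (x >> 31);                 /* x now < 2^34       */
    x = (x & p) + (x >> 31);                 /* x now <= 2^31 + 6  */
    if (x >= p) x -= p;
    return (uint32_t)x;
}
```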
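For the second bullet, a minimal integer Barrett reduction, sketched for a 32-bit modulus with 2^31 <= M < 2^32 and a 64-bit input (those range assumptions, and the names barrett_mu/barrett_mod, are this sketch's, not a fixed API):

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* Precompute mu = floor(2^64 / M) once; then each reduction is two
   multiplies plus at most one subtraction of M. The estimated
   quotient is off by at most 1, and never too large. */
static uint64_t barrett_mu(uint64_t M) {
    return (uint64_t)(((u128)1 << 64) / M);
}

static uint64_t barrett_mod(uint64_t x, uint64_t M, uint64_t mu) {
    uint64_t q = (uint64_t)(((u128)x * mu) >> 64);  /* q or q-1 */
    uint64_t r = x - q * M;                         /* r < 2M   */
    if (r >= M) r -= M;
    return r;
}
```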
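For the third bullet, a sketch of Montgomery arithmetic with R = 2^64, assuming an odd modulus below 2^63 so the intermediate sums fit in 128 bits (the names minv64, redc and mont_mulmod are illustrative):

```c
#include <stdint.h>

typedef unsigned __int128 u128;

static uint64_t minv64(uint64_t M) {          /* M^-1 mod 2^64, M odd */
    uint64_t inv = 1;
    for (int i = 0; i < 6; i++)               /* Newton doubles precision */
        inv *= 2 - M * inv;
    return inv;
}

/* REDC: returns T * R^-1 mod M for T < M * R -- no division. */
static uint64_t redc(u128 T, uint64_t M, uint64_t Mneg) {
    uint64_t m = (uint64_t)T * Mneg;          /* makes T + m*M = 0 mod 2^64 */
    u128 t = (T + (u128)m * M) >> 64;         /* t < 2M */
    return t >= M ? (uint64_t)(t - M) : (uint64_t)t;
}

/* a * b mod M, showing the to/from-Montgomery-form conversions;
   a long chain of operations would stay in Montgomery form throughout. */
static uint64_t mont_mulmod(uint64_t a, uint64_t b, uint64_t M) {
    uint64_t Mneg = (uint64_t)0 - minv64(M);
    uint64_t aR = (uint64_t)(((u128)a << 64) % M);   /* into the form */
    uint64_t bR = (uint64_t)(((u128)b << 64) % M);
    uint64_t abR = redc((u128)aR * bR, M, Mneg);     /* = a*b*R mod M */
    return redc((u128)abR, M, Mneg);                 /* back out      */
}
```

The conversions cost real divisions, which is why Montgomery form pays off only when many multiplications happen between entering and leaving the form.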

Additionally, and perhaps more in the spirit of your question, a common technique is to do various additions without reducing every time (addition does not significantly change the size of a number). It takes a lot of additions before you need an extra limb in your integers, so a lot of them can be done before it starts to make sense to reduce. For multiplications, unless it's by a small constant it almost always makes sense to reduce immediately afterwards to prevent the numbers from getting much physically larger than they need to be (which would be especially bad if the result was fed into another multiplication).
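As a concrete instance of deferring reductions across additions (with an assumed 32-bit modulus and the helper name sum_mod made up for this sketch):

```c
#include <stddef.h>
#include <stdint.h>

/* Sum n residues (each < M < 2^32), reducing only once at the end:
   a 64-bit accumulator can absorb more than 2^32 such additions
   before it could overflow, so one modulo replaces one per addition. */
static uint32_t sum_mod(const uint32_t *v, size_t n, uint32_t M) {
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += v[i];               /* no per-addition reduction */
    return (uint32_t)(acc % M);
}
```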

Another technique, especially associated with Barrett reduction, is to work, most of the time, in a slightly larger range than [0..N), e.g. [0..2N). This enables skipping the conditional subtraction that Barrett reduction needs in order to fully reduce to the range [0..N), while still using the most important part, the reduction from the range [0..N²) to the range [0..2N).
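Skipping that conditional subtraction is a one-line change to a Barrett step; this sketch assumes a 32-bit modulus with 2^31 <= M < 2^32 and mu = floor(2^64 / M) precomputed (barrett_lazy is an assumed name):

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* "Lazy" Barrett step: the estimated quotient is at most 1 too
   small, so omitting the final conditional subtraction leaves a
   value in [0, 2M) that is still congruent to x mod M -- good
   enough to feed into subsequent intermediate operations. */
static uint64_t barrett_lazy(uint64_t x, uint64_t M, uint64_t mu) {
    uint64_t q = (uint64_t)(((u128)x * mu) >> 64);
    return x - q * M;              /* in [0, 2M) */
}
```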
