Java - Optimize function that contains mostly simple maths

Question

I'm supposed to try and optimize this method for my team, working on a video decoder in Java, though I don't see any good approach to doing so. The function, below, doesn't seem like it could be speed up in any significant amount, since it contains mostly simple addition/subtraction/etc.

void inverseTransform(int macroBlockIndex, int dataBlockIndex) {
    int[] workSpace = new int[64];
    short[] data = new short[64];

    int z1, z2, z3, z4, z5;
    int tmp0, tmp1, tmp2, tmp3;
    int tmp10, tmp11, tmp12, tmp13;

    int pointer = 0;

    for (int index = 8; index > 0; index--) {
        if (dataBlockBuffer[pointer + 8] == 0 && dataBlockBuffer[pointer + 16] == 0 && dataBlockBuffer[pointer + 24] == 0 && dataBlockBuffer[pointer + 32] == 0 && dataBlockBuffer[pointer + 40] == 0 && dataBlockBuffer[pointer + 48] == 0 && dataBlockBuffer[pointer + 56] == 0) {
            int dcValue = dataBlockBuffer[pointer] << PASS1_BITS;

            workSpace[pointer + 0] = dcValue;
            workSpace[pointer + 8] = dcValue;
            workSpace[pointer + 16] = dcValue;
            workSpace[pointer + 24] = dcValue;
            workSpace[pointer + 32] = dcValue;
            workSpace[pointer + 40] = dcValue;
            workSpace[pointer + 48] = dcValue;
            workSpace[pointer + 56] = dcValue;

            pointer++;
            continue;
        }

        z2 = dataBlockBuffer[pointer + 16];
        z3 = dataBlockBuffer[pointer + 48];

        z1 = (z2 + z3) * FIX_0_541196100;
        tmp2 = z1 + z3 * -FIX_1_847759065;
        tmp3 = z1 + z2 * FIX_0_765366865;

        z2 = dataBlockBuffer[pointer];
        z3 = dataBlockBuffer[pointer + 32];

        tmp0 = (z2 + z3) << BITS;
        tmp1 = (z2 - z3) << BITS;

        tmp10 = tmp0 + tmp3;
        tmp13 = tmp0 - tmp3;
        tmp11 = tmp1 + tmp2;
        tmp12 = tmp1 - tmp2;

        tmp0 = dataBlockBuffer[pointer + 56];
        tmp1 = dataBlockBuffer[pointer + 40];
        tmp2 = dataBlockBuffer[pointer + 24];
        tmp3 = dataBlockBuffer[pointer + 8];

        z1 = tmp0 + tmp3;
        z2 = tmp1 + tmp2;
        z3 = tmp0 + tmp2;
        z4 = tmp1 + tmp3;
        z5 = (z3 + z4) * FIX_1_175875602;

        tmp0 = tmp0 * FIX_0_298631336;
        tmp1 = tmp1 * FIX_2_053119869;
        tmp2 = tmp2 * FIX_3_072711026;
        tmp3 = tmp3 * FIX_1_501321110;
        z1 = z1 * -FIX_0_899976223;
        z2 = z2 * -FIX_2_562915447;
        z3 = z3 * -FIX_1_961570560;
        z4 = z4 * -FIX_0_390180644;

        z3 += z5;
        z4 += z5;

        tmp0 += z1 + z3;
        tmp1 += z2 + z4;
        tmp2 += z2 + z3;
        tmp3 += z1 + z4;

        workSpace[pointer + 0] = ((tmp10 + tmp3 + (1 << F1)) >> F2);
        workSpace[pointer + 56] = ((tmp10 - tmp3 + (1 << F1)) >> F2);
        workSpace[pointer + 8] = ((tmp11 + tmp2 + (1 << F1)) >> F2);
        workSpace[pointer + 48] = ((tmp11 - tmp2 + (1 << F1)) >> F2);
        workSpace[pointer + 16] = ((tmp12 + tmp1 + (1 << F1)) >> F2);
        workSpace[pointer + 40] = ((tmp12 - tmp1 + (1 << F1)) >> F2);
        workSpace[pointer + 24] = ((tmp13 + tmp0 + (1 << F1)) >> F2);
        workSpace[pointer + 32] = ((tmp13 - tmp0 + (1 << F1)) >> F2);

        pointer++;
    }

    pointer = 0;

    for (int index = 0; index < 8; index++) {
        z2 = workSpace[pointer + 2];
        z3 = workSpace[pointer + 6];

        z1 = (z2 + z3) * FIX_0_541196100;
        tmp2 = z1 + z3 * -FIX_1_847759065;
        tmp3 = z1 + z2 * FIX_0_765366865;

        tmp0 = (workSpace[pointer + 0] + workSpace[pointer + 4]) << BITS;
        tmp1 = (workSpace[pointer + 0] - workSpace[pointer + 4]) << BITS;

        tmp10 = tmp0 + tmp3;
        tmp13 = tmp0 - tmp3;
        tmp11 = tmp1 + tmp2;
        tmp12 = tmp1 - tmp2;

        tmp0 = workSpace[pointer + 7];
        tmp1 = workSpace[pointer + 5];
        tmp2 = workSpace[pointer + 3];
        tmp3 = workSpace[pointer + 1];

        z1 = tmp0 + tmp3;
        z2 = tmp1 + tmp2;
        z3 = tmp0 + tmp2;
        z4 = tmp1 + tmp3;

        z5 = (z3 + z4) * FIX_1_175875602;

        tmp0 = tmp0 * FIX_0_298631336;
        tmp1 = tmp1 * FIX_2_053119869;
        tmp2 = tmp2 * FIX_3_072711026;
        tmp3 = tmp3 * FIX_1_501321110;
        z1 = z1 * -FIX_0_899976223;
        z2 = z2 * -FIX_2_562915447;
        z3 = z3 * -FIX_1_961570560;
        z4 = z4 * -FIX_0_390180644;

        z3 += z5;
        z4 += z5;

        tmp0 += z1 + z3;
        tmp1 += z2 + z4;
        tmp2 += z2 + z3;
        tmp3 += z1 + z4;

        data[pointer + 0] = (short) ((tmp10 + tmp3) >> F3);
        data[pointer + 7] = (short) ((tmp10 - tmp3) >> F3);
        data[pointer + 1] = (short) ((tmp11 + tmp2) >> F3);
        data[pointer + 6] = (short) ((tmp11 - tmp2) >> F3);
        data[pointer + 2] = (short) ((tmp12 + tmp1) >> F3);
        data[pointer + 5] = (short) ((tmp12 - tmp1) >> F3);
        data[pointer + 3] = (short) ((tmp13 + tmp0) >> F3);
        data[pointer + 4] = (short) ((tmp13 - tmp0) >> F3);

        pointer += 8;
    }
    short[] temp = imageSlice.MacroBlocks[macroBlockIndex].DataBlocks[dataBlockIndex];
    for (int i = 0; i < data.length; i++)
        temp[i] = data[i]; //imageSlice.MacroBlocks[macroBlockIndex].DataBlocks[dataBlockIndex][i] = data[i];
}

Should I combine the basic math if I can, or what would you suggest?

Answer 1

Without specific evidence of performance problems I can see couple of possible improvements (although not sure how much improvement you will gain from them).

Replace repeating calculations with variables. For example pointer+7 repeats few times. You can calculate it once.
Use System.arraycopy() to copy array at the end.

Answer 2

I can't see anything obvious. In addition to what Alex said there are two small suggestions which might possibly help:

1) The long if statement in the first loop has a number of failure conditions. Have you ordered it so the one most likely to fail comes first? With short-circuit evaluation, the earlier you can find a false the less work there is to do to evaluate the whole expression.

2) You are declaring a lot of variables outside the two for-loops, and I can see why you have done that. It is possible that the JVM will be more able to optimise things if you move the declarations inside the two loops so the variables are declared as locally as possible.

For both of these you need to do some timing runs to see if they make a real difference. You might also want to use a profiler to see where the code spends most of its time.

I have one other comment. In lines like:

data[pointer + 7] = (short) ((tmp10 - tmp3) >> F3);

you are using >> rather than >>> to bitshift a possibly negative number. Are you sure that is what you want to do if tmp3 > tmp10?

Answer 3

Like other posters, I think there's very little you can do to optimise this as such.

Just out of desparation, I would possibly try reading all of the values (dataBlockBuffer[x]) into local variables BEFORE the condition, and then changing your condition to:

if ((data0 | data1 | data2 |...) == 0)...

That way, you have potentially fewer branches and where the condition fails (presumably most of the time?) the data is already in local variables ready.

But that said, I don't think you're going to shave very much off.

Another very marginal thing to consider is that wrapping a ByteBuffer around an array allows you to potentially read/write several values at once if you can see a way to do this. (At least when writing 'dcValue' multiple times you could write a long instead in case the JIT compiler doesn't make this optimisation).

On a desktop, I might have considered:

multithreading to process multiple blocks at once (most people have at least dual core these days, but guess maybe not on Android);
looking at the JIT-compiled output and making sure the JIT compiler isn't doing anything overtly silly, and then re-working code accordingly (you can do this with the debug JDK, but don't know if there's an equivalent for Android);
the possibility of nativising, and then using the GPU/whatever judicious pragmas your C compiler might have to compile to particular SSE/equivalent instructions (I'm not an expert in this and don't know how possible it would be in your case).

I would consider the last two as kind of "extreme" options anyway-- with a risk of investing significant time for little gain.

Java - Optimize function that contains mostly simple maths

Question

3 answers

solution1
1 2011-07-11 17:04:53

solution2
1 ACCPTED 2011-07-11 17:54:31

solution3
1 2011-07-11 18:26:40

Java - Optimize function that contains mostly simple maths

Question

3 answers

solution1 1 2011-07-11 17:04:53

solution2 1 ACCPTED 2011-07-11 17:54:31

solution3 1 2011-07-11 18:26:40

solution1
1 2011-07-11 17:04:53

solution2
1 ACCPTED 2011-07-11 17:54:31

solution3
1 2011-07-11 18:26:40