
Code translated to C/NDK/JNI less efficient than Java original

This is my first, not very deep, dive into the NDK.

For performance reasons I wanted to rewrite this code with the NDK. My C file looks like this:

#include <jni.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>
#include <android/log.h>

JNIEXPORT jbyteArray JNICALL
Java_com_company_app_tools_NV21FrameRotator_rotateNV21(JNIEnv *env, jclass thiz,
                                                           jbyteArray data, jbyteArray output,
                                                           jint width, jint height, jint rotation) {
    clock_t start, end;
    double cpu_time_used;
    start = clock();

    jbyte *dataPtr = (*env)->GetByteArrayElements(env, data, NULL);
    jbyte *outputPtr = (*env)->GetByteArrayElements(env, output, NULL);

    unsigned int frameSize = width * height;
    bool swap = rotation % 180 != 0;
    bool xflip = rotation % 270 != 0;
    bool yflip = rotation >= 180;

    for (unsigned int j = 0; j < height; j++) {
        for (unsigned int i = 0; i < width; i++) {
            unsigned int yIn = j * width + i;
            unsigned int uIn = frameSize + (j >> 1u) * width + (i & ~1u);
            unsigned int vIn = uIn + 1;

            unsigned int wOut = swap ? height : width;
            unsigned int hOut = swap ? width : height;
            unsigned int iSwapped = swap ? j : i;
            unsigned int jSwapped = swap ? i : j;
            unsigned int iOut = xflip ? wOut - iSwapped - 1 : iSwapped;
            unsigned int jOut = yflip ? hOut - jSwapped - 1 : jSwapped;

            unsigned int yOut = jOut * wOut + iOut;
            unsigned int uOut = frameSize + (jOut >> 1u) * wOut + (iOut & ~1u);
            unsigned int vOut = uOut + 1;

            outputPtr[yOut] = (jbyte) (0xff & dataPtr[yIn]);
            outputPtr[uOut] = (jbyte) (0xff & dataPtr[uIn]);
            outputPtr[vOut] = (jbyte) (0xff & dataPtr[vIn]);
        }
    }

    (*env)->ReleaseByteArrayElements(env, data, dataPtr, 0);
    (*env)->ReleaseByteArrayElements(env, output, outputPtr, 0);

    end = clock();
    cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;

    char str[32];  /* large enough for any "%f" value logged here; snprintf never overruns it */
    snprintf(str, sizeof(str), "%f", cpu_time_used * 1000);
    __android_log_write(ANDROID_LOG_ERROR, "NV21FrameRotator", str);

    return output;
}

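Both snippets (the linked Java one and the C one above) work well, but when I measure the processing duration on the same device, the Java version takes about 7 ms (logged with Log.i on the Java side) and the C version 12-13 ms. Shouldn't C be faster? Why isn't it? Where is the problem?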

long micros = System.nanoTime() / 1000;
// ~7ms, Java
//data = rotateNV21(inputData, width, height, rotateCameraDegrees);
// ~12-13ms, C
NV21FrameRotator.rotateNV21(inputData, data, width, height, rotateCameraDegrees);
Log.d(TAG, "Last frame processing duration: " + (System.nanoTime() / 1000 - micros) + "µs");

P.S. The Java log sometimes shows a shorter duration than the native clock() measurement in the C file. Sample log:

NV21FrameRotator: 7.942000
NV21RotatorJava: Last frame processing duration: 7403µs
NV21FrameRotator: 7.229000
NV21RotatorJava: Last frame processing duration: 7166µs
NV21FrameRotator: 16.918000
NV21RotatorJava: Last frame processing duration: 20644µs
NV21FrameRotator: 19.594000
NV21RotatorJava: Last frame processing duration: 20479µs
NV21FrameRotator: 9.484000
NV21RotatorJava: Last frame processing duration: 7274µs
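One thing that makes the two logs hard to compare directly: clock() reports processor time used by the program, while System.nanoTime() measures elapsed time. A minimal sketch of an elapsed-time measurement on the native side, assuming CLOCK_MONOTONIC is acceptable as the time base; the helper names elapsed_ms and time_call_ms are hypothetical and not part of the original code:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Milliseconds elapsed between two timestamps taken with clock_gettime(). */
double elapsed_ms(const struct timespec *start, const struct timespec *end) {
    assert(start != NULL && end != NULL);
    double sec = (double) (end->tv_sec - start->tv_sec);
    double nsec = (double) (end->tv_nsec - start->tv_nsec);
    return sec * 1000.0 + nsec / 1.0e6;
}

/* Hypothetical helper: time a single call of `work` on the monotonic clock. */
double time_call_ms(void (*work)(void *), void *arg) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    work(arg);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return elapsed_ms(&t0, &t1);
}
```

Timing several hundred frames and averaging would likely give steadier numbers than individual log lines, since single-frame durations in the log above vary between roughly 7 and 20 ms.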

Edit: compile_commands.json for armeabi-v7a (old device, this is the only ABI I build for)

[
{
  "directory": "...app/.cxx/cmake/basicRelease/armeabi-v7a",
  "command": "...sdk\\ndk\\21.0.6113669\\toolchains\\llvm\\prebuilt\\windows-x86_64\\bin\\clang.exe --target=armv7-none-linux-androideabi21 --gcc-toolchain=...sdk/ndk/21.0.6113669/toolchains/llvm/prebuilt/windows-x86_64 --sysroot=...sdk/ndk/21.0.6113669/toolchains/llvm/prebuilt/windows-x86_64/sysroot -DNV21FrameRotator_EXPORTS  -g -DANDROID -fdata-sections -ffunction-sections -funwind-tables -fstack-protector-strong -no-canonical-prefixes -D_FORTIFY_SOURCE=2 -march=armv7-a -mthumb -Wformat -Werror=format-security  -Oz -DNDEBUG  -fPIC   -o CMakeFiles\\NV21FrameRotator.dir\\NV21FrameRotator.c.o   -c ...app\\src\\main\\cpp\\NV21FrameRotator.c",
  "file": "...app\\src\\main\\cpp\\NV21FrameRotator.c"
}
]

CMakeFile

cmake_minimum_required(VERSION 3.4.1)
add_library(NV21FrameRotator SHARED
    NV21FrameRotator.c)
find_library(log-lib
    log )
target_link_libraries(NV21FrameRotator
    ${log-lib} )

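The overhead of JNI is very high, especially when passing non-POD types or buffers. So calling a JNI function frequently can be much slower than the Java version.

Consider passing a java.nio.ByteBuffer instead, to avoid potential copies of the byte array.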
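A minimal sketch of what the ByteBuffer variant could look like on the native side. It assumes both buffers are direct (created with ByteBuffer.allocateDirect on the Java side); the method name processDirect and the helper process_frame are hypothetical, and the JNI part is guarded so the helper can also be compiled and tested on a desktop machine:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Placeholder for the real per-frame work: raw pointers only, no JNI calls. */
void process_frame(const uint8_t *in, uint8_t *out, size_t len) {
    assert(in != NULL && out != NULL);
    memcpy(out, in, len);
}

#ifdef __ANDROID__
#include <jni.h>

/* Hypothetical entry point: both arguments must be direct ByteBuffers. */
JNIEXPORT jboolean JNICALL
Java_com_company_app_tools_NV21FrameRotator_processDirect(JNIEnv *env, jclass clazz,
                                                          jobject inBuf, jobject outBuf) {
    /* Returns NULL if the object is not a direct buffer. */
    uint8_t *in = (*env)->GetDirectBufferAddress(env, inBuf);
    uint8_t *out = (*env)->GetDirectBufferAddress(env, outBuf);
    if (in == NULL || out == NULL) return JNI_FALSE;

    jlong inCap = (*env)->GetDirectBufferCapacity(env, inBuf);
    jlong outCap = (*env)->GetDirectBufferCapacity(env, outBuf);
    if (inCap < 0 || outCap < inCap) return JNI_FALSE;

    process_frame(in, out, (size_t) inCap);
    return JNI_TRUE;
}
#endif
```

Whether this helps depends on where the frames come from: if they arrive as byte[], filling a direct buffer first adds a copy of its own.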

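  1. Compare the performance of Java and C on a real device; an emulator will not produce reliable results.

  2. Compare the performance of Java and C on a release build; a debug build of the C code is slow, while Java still gets the full JIT (and AOT) optimization.

  3. You may want to look for the best optimization setting for your scenario. By default, a release build uses -Oz. To prefer speed over size, you can add this to your build.gradle:

     android { buildTypes { release { externalNativeBuild.cmake.cFlags "-O3" } } }
  4. Your C code (actually, the original Java code as well) needs some optimization. The major inefficiency (as far as I can see) is that you recalculate each U and V value four times. The easy fix is to split the loop into one pass over the Y plane and one over the UV plane.

  5. Further optimization can avoid the multiplications in the inner loop (they can be removed from the outer loop too, but the effect is negligible):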

#include <jni.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>
#include <android/log.h>

JNIEXPORT jbyteArray JNICALL
Java_com_company_app_tools_NV21FrameRotator_rotateNV21(JNIEnv *env, jclass thiz,
                                                       jbyteArray data, jbyteArray output,
                                                       jint width, jint height, jint rotation) {
    clock_t start, end;
    double cpu_time_used;
    start = clock();

    jbyte *dataPtr = (*env)->GetByteArrayElements(env, data, NULL);
    jbyte *outputPtr = (*env)->GetByteArrayElements(env, output, NULL);

    unsigned int frameSize = width * height;
    bool swap = rotation % 180 != 0;
    bool xflip = rotation % 270 != 0;
    bool yflip = rotation >= 180;

    unsigned int wOut = swap ? height : width;
    unsigned int hOut = swap ? width : height;
    unsigned int yIn = 0;

    for (unsigned int j = 0; j < height; j++) {

        unsigned int iSwapped = swap ? j : 0;
        unsigned int jSwapped = swap ? 0 : j;
        unsigned int iOut = xflip ? wOut - iSwapped - 1 : iSwapped;
        unsigned int jOut = yflip ? hOut - jSwapped - 1 : jSwapped;
        unsigned int yOut = jOut * wOut + iOut;

        for (unsigned int i = 0; i < width; i++) {
            outputPtr[yOut] = dataPtr[yIn];
            if (swap) {
                yOut += yflip ? -wOut : wOut;
            } else {
                yOut += xflip ? -1 : 1;
            }
            yIn++;
        }
    }

    unsigned int uIn = frameSize;

    for (unsigned int j = 0; j < height; j+=2) {

        unsigned int iSwapped = swap ? j : 0;
        unsigned int jSwapped = swap ? 0 : j;
        unsigned int iOut = xflip ? wOut - iSwapped - 1 : iSwapped;
        unsigned int jOut = yflip ? hOut - jSwapped - 1 : jSwapped;
        unsigned int uOut = frameSize + (jOut / 2) * wOut + (iOut & ~1u);

        for (unsigned int i = 0; i < width; i+=2) {
            unsigned int vIn = uIn + 1;
            unsigned int vOut = uOut + 1;

            outputPtr[uOut] = dataPtr[uIn];
            outputPtr[vOut] = dataPtr[vIn];

            if (swap) {
                uOut += yflip ? -wOut : wOut;
            } else {
                uOut += xflip ? -2 : 2;
            }
            uIn += 2;
        }
    }

    (*env)->ReleaseByteArrayElements(env, data, dataPtr, JNI_ABORT);
    (*env)->ReleaseByteArrayElements(env, output, outputPtr, 0);

    end = clock();
    cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;

    __android_log_print(ANDROID_LOG_ERROR, "NV21FrameRotator", "%.1f ms", cpu_time_used * 1000);

    return output;
}
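For comparison, the intermediate step from item 4 (splitting the loop, with the multiplications still in place) might look like the sketch below. It is written as a plain C function on raw pointers rather than as the JNI entry point; the name rotate_nv21_split is hypothetical, and the sketch assumes even width and height, as NV21 frames have, so that all four pixels of a 2x2 block map to the same U/V pair:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

void rotate_nv21_split(const uint8_t *in, uint8_t *out,
                       unsigned width, unsigned height, int rotation) {
    assert(in != NULL && out != NULL);
    unsigned frameSize = width * height;
    bool swap = rotation % 180 != 0;
    bool xflip = rotation % 270 != 0;
    bool yflip = rotation >= 180;
    unsigned wOut = swap ? height : width;
    unsigned hOut = swap ? width : height;

    /* Pass 1: luma plane, one byte per pixel. */
    for (unsigned j = 0; j < height; j++) {
        for (unsigned i = 0; i < width; i++) {
            unsigned iS = swap ? j : i;
            unsigned jS = swap ? i : j;
            unsigned iOut = xflip ? wOut - iS - 1 : iS;
            unsigned jOut = yflip ? hOut - jS - 1 : jS;
            out[jOut * wOut + iOut] = in[j * width + i];
        }
    }

    /* Pass 2: chroma plane, one interleaved pair per 2x2 block,
     * so each pair is copied once instead of four times. */
    for (unsigned j = 0; j < height; j += 2) {
        for (unsigned i = 0; i < width; i += 2) {
            unsigned iS = swap ? j : i;
            unsigned jS = swap ? i : j;
            unsigned iOut = xflip ? wOut - iS - 1 : iS;
            unsigned jOut = yflip ? hOut - jS - 1 : jS;
            unsigned uIn = frameSize + (j >> 1u) * width + i;
            unsigned uOut = frameSize + (jOut >> 1u) * wOut + (iOut & ~1u);
            out[uOut] = in[uIn];
            out[uOut + 1] = in[uIn + 1];
        }
    }
}
```

The index arithmetic is kept identical to the question's code, so the output should match it byte for byte under the even-dimension assumption; only the number of chroma writes changes.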
