
OpenMP offloading with Intel oneAPI DPC++ compiler to NVIDIA GPU

I'm trying to write a program that uses OpenMP offloading to a GPU. At the moment I compile my code with the Intel oneAPI DPC++ compiler icpx v2022.1.0 and aim to use an NVIDIA Tesla V100 as the back end. The relevant parts of my Makefile are below:

MKLROOT   = /lustre/system/local/apps/intel/oneapi/2022.2.0/mkl/latest

CXX       = icpx
INC       =-I"${MKLROOT}/include"
CXXFLAGS  =-qopenmp -fopenmp-targets=spir64 ${INC} --gcc-toolchain=/lustre/system/local/apps/gcc9/9.3.0
LDFLAGS   =-qopenmp -fopenmp-targets=spir64 -fsycl -L${MKLROOT}/lib/intel64
LDLIBS    =-lmkl_sycl -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lsycl -lOpenCL -lstdc++ -lpthread -lm -ldl

${EXE}: ${OBJ}
    ${CXX} ${CXXFLAGS} $^ ${LDFLAGS} ${LDLIBS} -o $@

The code compiles without errors and warnings, but I'm not entirely sure it does use the GPU when it runs.

  1. How can I verify that? Can I use an Intel or an NVIDIA profiler to check that?
  2. Is my assumption correct, that Intel compiler supports offloading to an NVIDIA GPU?
  3. Or should I better use an NVIDIA compiler to enable OpenMP offloading to NVIDIA graphics cards?

How can I verify that? Can I use an Intel or an NVIDIA profiler to check that?

On systems with Nvidia GPUs such as a V100, you can use nvidia-smi to check the state of the GPU while the program runs. You can also use profilers such as the Nsight suite (or the older, deprecated nvvp).
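A complementary check from inside the program is to ask the OpenMP runtime where a target region actually ran. The following is only a minimal sketch using the standard routines omp_get_num_devices() and omp_is_initial_device(); the names OffloadReport, probe_offload and print_offload_report are made up for illustration and are not part of any library. The #ifdef _OPENMP guard is an assumption so that the snippet still compiles when OpenMP is disabled.

```cpp
#include <cstdio>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper type: where did a probe target region execute?
struct OffloadReport {
    int num_devices;     // non-host devices visible to the OpenMP runtime
    bool ran_on_device;  // true only if the probe region ran on an accelerator
};

// Runs an empty-ish target region and records whether it fell back to the host.
OffloadReport probe_offload() {
    OffloadReport report{0, false};
#ifdef _OPENMP
    report.num_devices = omp_get_num_devices();
    int on_host = 1;
    // map(tofrom:) copies the flag back so the host can read the result.
    #pragma omp target map(tofrom: on_host)
    {
        on_host = omp_is_initial_device();
    }
    report.ran_on_device = (on_host == 0);
#endif
    return report;
}

// Prints the report in a form that is easy to grep in job logs.
void print_offload_report(const OffloadReport& report) {
    std::printf("OpenMP devices visible: %d\n", report.num_devices);
    std::printf("probe region ran on: %s\n",
                report.ran_on_device ? "device" : "host (fallback)");
}
```

Calling print_offload_report(probe_offload()) at the start of main should print "host (fallback)" if the runtime silently fell back to the CPU. Setting the standard environment variable OMP_TARGET_OFFLOAD=MANDATORY (OpenMP 5.0) should make the program fail instead of falling back, which is often the quickest test.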

Is my assumption correct, that Intel compiler supports offloading to an NVIDIA GPU?

According to Intel, it is supported:

The OpenMP* Offload to GPU feature of the Intel® oneAPI DPC++/C++ Compiler and the Intel® Fortran Compiler compiles OpenMP source files for a wide range of accelerators. Only the icx and ifx compilers support the OpenMP Offload feature.

As far as I understand, they generate either a Clang-based intermediate code for the GPU or a SPIR64 binary.

The former can certainly be used on Nvidia GPU according to Nvidia (despite the lack of information provided by Intel and Nvidia).

The latter is related to the SPIR standard. Indeed, AFAIK, DPC++ is an implementation of the open SYCL standard, which can produce code for the SPIR-V ecosystem. SPIR stands for Standard Portable Intermediate Representation. It is meant to let high-level languages produce one unified, portable code for many back-ends. Hardware vendors then only have to support SPIR for every high-level language or tool that targets it to work on their hardware, so the vendor does not have to support each language or tool directly.

While I did not find any information from Nvidia about supporting SPIR-V directly, SPIR code can be executed on devices supporting a recent version (>=1.2) of OpenCL or Vulkan. Fortunately, Nvidia recently claimed to support OpenCL 3.0.

In short, it should work on the target Nvidia GPU, though it might not be simple to set up yet.

Or should I better use an NVIDIA compiler to enable OpenMP offloading to NVIDIA graphics cards?

The mainstream Nvidia compiler wrapper nvcc is meant for CUDA code, which basically works only on Nvidia GPUs (with excellent support). LLVM should support Nvidia GPUs (using the CUDA ecosystem), but the setup can be a bit tricky, and you need a recent version of the toolchain to avoid many issues. GCC, when built with the right flags and dependencies, supports OpenACC offloading to Nvidia PTX since version 5 and OpenMP offloading to PTX since version 7. Besides, while Nvidia does not support OpenMP offloading in nvcc, it also distributes the nvc and nvc++ compilers (formerly known as the PGI HPC compilers), which support both OpenMP and OpenACC offloading.
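To make the comparison concrete, here is a small offloaded kernel together with the offload flags these compilers typically take. This is a sketch, not a tested recipe: the flags in the comments are the commonly documented ones, but whether they work depends on how the toolchain was built and installed, and the file name saxpy.cpp and the function name saxpy are just illustrative.

```cpp
// Typical invocations (assumptions; they depend on the local installation):
//   nvc++   -mp=gpu saxpy.cpp
//   clang++ -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda saxpy.cpp
//   g++     -fopenmp -foffload=nvptx-none saxpy.cpp
#include <cstddef>
#include <vector>

// Computes y = a * x + y on the default OpenMP device.
// Assumes y has at least as many elements as x.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    const std::size_t n = x.size();
    const float* xp = x.data();
    float* yp = y.data();
    // map(to:) copies x to the device; map(tofrom:) copies y there and back.
    #pragma omp target teams distribute parallel for \
        map(to: xp[0:n]) map(tofrom: yp[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

The same source should build with any of the three compilers, which is the main appeal of OpenMP offloading over CUDA here; if no device is available, the loop simply runs on the host.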

Note that OpenMP offloading is still quite new and rather experimental, though some vendors appear to provide good support so far.

As there is a lot of active development in this space, the answer to the question of which compiler is best for offloading to NVIDIA GPUs will probably vary over time and between versions (as well as with the application). So if you want to be sure you are getting the best performance, you will need to benchmark the most recent versions of the different compilers (see Jérôme Richard's answer) with your specific application, and keep doing so in the future.
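For such comparisons, a small timing harness around the offloaded part of the application is usually enough. The sketch below is one possible way to do it, under the assumption that the kernel is synchronous (a target region without nowait is); the name best_time_seconds is made up for illustration.

```cpp
#include <algorithm>
#include <chrono>
#include <limits>

// Returns the best wall-clock time in seconds over `repeats` timed runs.
// The kernel is called once beforehand as a warm-up, because the first
// target region usually also pays for device and runtime initialisation.
template <typename Kernel>
double best_time_seconds(Kernel&& kernel, int repeats) {
    kernel();  // warm-up run, not timed
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < repeats; ++r) {
        const auto start = std::chrono::steady_clock::now();
        kernel();
        const auto stop = std::chrono::steady_clock::now();
        const double elapsed =
            std::chrono::duration<double>(stop - start).count();
        best = std::min(best, elapsed);
    }
    return best;
}
```

Building the same source with each compiler and comparing the reported best times for realistic problem sizes gives a first impression; data transfers are included in the measurement as long as the map clauses are inside the timed kernel.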

Depending on the size and complexity of your application, one might argue that this time would be better spent implementing CUDA kernels. On the other hand, a bad CUDA implementation is potentially as slow as what the "worst compiler" generates from OpenMP.

There are papers benchmarking different OpenMP implementations, but at this point in time I have not found any that include the Intel compiler used by OP. The results in "Performance Assessment of OpenMP Compilers Targeting NVIDIA V100 GPUs" (2020) are probably not very meaningful anymore.

"Portability for GPU-accelerated molecular docking applications for cloud and HPC: can portable compiler directives provide performance across all platforms?" (2022) might be worth looking into for an overview of implementations, optimizations and portable alternatives to OpenMP.

All that being said, if you have no other reason for using the DPC++ compiler and do not want to do all that benchmarking, I would rather go for either one of the big, established free and open-source toolchains (GCC or Clang), because of their large user base, or for the NVIDIA HPC compilers, because of NVIDIA's interest in being fast on its own hardware. Until the Intel compiler is more established and more results are publicly available, I would only use it for offloading to Intel hardware.

Since new supercomputers with AMD (Frontier and LUMI) and Intel (Aurora) accelerators are already here or will be in the very near future, I expect a lot of comparisons between accelerators and portable programming models to be published, as many HPC libraries and applications will need to support accelerators from all vendors.

