The NVIDIA CUDA Compiler (NVCC) is the specialized driver used to transform CUDA C++ source code into executable programs. Because CUDA programs are "heterogeneous," meaning they contain code for both a Central Processing Unit (CPU) and a Graphics Processing Unit (GPU), the compilation process is more complex than standard C++ development.
CUDA Compilation Process
A CUDA source file (ending in .cu) contains two types of code: Host code (runs on the CPU) and Device code (runs on the GPU). NVCC acts as a compiler driver that coordinates the compilation of both parts within a single invocation.

How it Works:
- Code Splitting: NVCC reads the .cu file and separates the standard C++ code from the GPU kernels (marked with qualifiers such as __global__).
- Host Compilation: The CPU code is passed to a standard host compiler (like gcc on Linux or cl.exe on Windows) to create standard object code.
- Device Compilation: The GPU code is compiled by NVIDIA’s tools into an assembly-like format called PTX (Parallel Thread Execution) or a binary format called cubin.
- Linking: Finally, NVCC bundles both the CPU and GPU components into a single executable file. When the user runs this file, the CPU code starts first and "launches" the GPU code when needed; a minimal example of such a program follows this list.
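To make the split concrete, here is a minimal sketch of a heterogeneous CUDA program (the file name hello.cu and the kernel name are illustrative):

// hello.cu
#include <cstdio>

// Device code: compiled by NVIDIA's tools into PTX or cubin.
__global__ void helloKernel() {
    printf("Hello from GPU thread %d\n", threadIdx.x);
}

// Host code: forwarded to the host compiler (gcc, cl.exe, ...).
int main() {
    helloKernel<<<1, 4>>>();   // launch the kernel with 4 GPU threads
    cudaDeviceSynchronize();   // wait for the GPU (and its printf output) to finish
    return 0;
}

Compiling this file as shown below produces one executable that contains both halves.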
Compiling via Command Line
1. Basic Command: To compile a source file into a runnable program, use the following syntax:
nvcc program.cu -o program
Explanation:
- nvcc: Invokes the compiler driver.
- program.cu: The source file to be compiled.
- -o program: The "output" flag that defines the name of the final executable.
2. Targeting Specific Hardware (-arch): GPU architectures evolve with every generation (e.g., Pascal, Turing, Ampere). To get the best performance, the compiler needs to know which GPU generation you are targeting.
nvcc -arch=sm_75 program.cu -o program
Explanation: -arch=sm_XX selects a target compute capability (the "sm" prefix denotes the GPU's streaming multiprocessor architecture). For example, sm_75 targets compute capability 7.5, the Turing generation (like the RTX 20-series or Tesla T4). This ensures the compiler can use the specific instructions available on that hardware.
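If one binary must run on several GPU generations, nvcc can embed code for multiple targets in a single "fat binary" using the -gencode flag. A sketch with an illustrative pair of targets (compute_XX names the virtual PTX architecture, sm_XX the real one):

nvcc -gencode arch=compute_75,code=sm_75 -gencode arch=compute_86,code=sm_86 program.cu -o program

At runtime, the CUDA driver selects the best-matching embedded code for the installed GPU.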
3. Optimization and Debugging: One can pass flags to improve performance or help find errors in the code.
- -O3: Enables high-level optimization of the host (CPU) code to make it run faster; nvcc forwards this flag to the host compiler.
- -g -G: These flags embed debugging information for the CPU code (-g) and the GPU code (-G), letting you use tools like cuda-gdb to find bugs. Note that -G also disables most device-code optimizations, so debug builds run slower. An example debug build follows this list.
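For example, a debug build followed by a cuda-gdb session might look like this (the output name program_debug is illustrative):

nvcc -g -G program.cu -o program_debug
cuda-gdb ./program_debug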
4. Running the Program: Once compiled, the resulting binary is a standalone file. On Linux or macOS, run it with ./; on Windows, simply type the filename.
./program
Explanation: This starts the host code on the CPU. If the code includes a cudaDeviceSynchronize() call, the CPU will wait for all GPU kernels to finish and for their printf output to be flushed before the program exits, ensuring you see all output in the terminal.
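Putting it all together with the hello.cu sketch from earlier, a full session might look like this (the relative order of the per-thread lines is not guaranteed):

nvcc hello.cu -o hello
./hello
Hello from GPU thread 0
Hello from GPU thread 1
Hello from GPU thread 2
Hello from GPU thread 3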