CUDA Toolkit Changelog

What's new in CUDA Toolkit 7.0.29

Mar 18, 2015
  • CUDA Tools:
  • The CUDA-GDB debugger is deprecated on the Mac platform and will be removed from it in the next release of the CUDA Toolkit. Furthermore, the CUDA-GDB tool included in CUDA 7.0 has several known issues in single-GPU Mac Pro configurations and on the MacBook and iMac platforms.
  • Resolved Issues:
  • General CUDA - On POWER8 systems, users should add the /usr/lib/powerpc64le-linux-gnu/mesa path to the LD_LIBRARY_PATH environment variable in order to use the Mesa GL libraries.
  • This information replaces this General CUDA Known Issue: "The POWER8 driver installation incorrectly overrides the Mesa GL alternative. To access the Mesa GL libraries, manually select the Mesa GL alternative by running the following command: sudo update-alternatives --config powerpc64le-linux-gnu_gl_conf."
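The workaround above can be applied in the shell as follows (the path is the one given in the note; adjust it if your distribution installs Mesa elsewhere):

```shell
# Prepend the Mesa GL path so the Mesa libGL is found at run time.
export LD_LIBRARY_PATH=/usr/lib/powerpc64le-linux-gnu/mesa:$LD_LIBRARY_PATH
```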

New in CUDA Toolkit 6.5.14 (Aug 20, 2014)

  • 64-bit ARM-based systems
  • Microsoft Visual Studio 2013 (VC12)
  • BSR sparse matrix format in cuSPARSE routines
  • Using cuFFT callbacks for higher performance custom processing on input or output data
  • Improved debugging for CUDA FORTRAN applications
  • Application Replay mode in both the Visual Profiler and command line cudaprof
  • Updated CUDA Occupancy Calculator API provides optimal kernel launch configurations
  • New “nvprune” utility to remove portions of object files for specified GPU architectures
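The updated occupancy API mentioned above can be sketched as follows, using cudaOccupancyMaxPotentialBlockSize from the CUDA runtime; the kernel and problem size here are placeholders, and error checking is omitted:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void myKernel(float *d, int n) {   /* placeholder kernel */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    int minGridSize = 0, blockSize = 0;
    /* Ask the runtime for the block size that maximizes occupancy. */
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);

    int n = 1 << 20;
    int gridSize = (n + blockSize - 1) / blockSize;
    printf("suggested block size: %d, grid size: %d\n", blockSize, gridSize);
    return 0;
}
```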

New in CUDA Toolkit 5.5.20 (Aug 1, 2013)

  • General CUDA:
  • MPS (Multi-Process Service) is a runtime service designed to let multiple MPI (Message Passing Interface) processes using CUDA run concurrently on a single GPU in a way that's transparent to the MPI program. A CUDA program runs in MPS mode if the MPS control daemon is running on the system. When a CUDA program starts, it connects to the MPS control daemon (if possible), which then creates an MPS server for the connecting client if one does not already exist for the user (UID) that launched the client.
  • With the CUDA 5.5 Toolkit, there are some restrictions that are now enforced that may cause existing projects that were building on CUDA 5.0 to fail. For projects that use -Xlinker with nvcc, you need to ensure the arguments after -Xlinker are quoted. In CUDA 5.0, -Xlinker -rpath /usr/local/cuda/lib would succeed; in CUDA 5.5 -Xlinker "-rpath /usr/local/cuda/lib" is now necessary.
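The quoting change can be illustrated with the example from the note above (the source and output file names are placeholders):

```shell
# CUDA 5.0 accepted unquoted arguments after -Xlinker:
nvcc -Xlinker -rpath /usr/local/cuda/lib app.cu -o app

# CUDA 5.5 requires the arguments after -Xlinker to be quoted:
nvcc -Xlinker "-rpath /usr/local/cuda/lib" app.cu -o app
```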
  • The CUDA Sample projects have makefiles that are now more self-contained and robust.
  • The following documents are now available in the CUDA toolkit documentation portal:
  • Programming guides: CUDA Video Encoder, CUDA Video Decoder, Developer Guide to Optimus, Parallel Thread Execution (PTX) ISA, Using Inline PTX Assembly in CUDA, NPP Library Programming Guide
  • Tools manuals:
  • CUDA Binary Utilities
  • White papers:
  • Floating-Point and IEEE 754 Compliance, Incomplete-LU and Cholesky Preconditioned Iterative Methods
  • Compiler SDK:
  • libNVVM API, libdevice User's Guide, NVVM IR Specification
  • General:
  • CUDA Toolkit Release Notes, End-User License Agreements
  • CUDA Libraries:
  • CUBLAS:
  • The routines cublas{S,D,C,Z}getriBatched() and cublas{S,D,C,Z}matinvBatched() have been added to the CUBLAS Library.
  • Routine cublas{S,D,C,Z}getriBatched() must be called after the batched LU factorization routine, cublas{S,D,C,Z}getrfBatched(), to obtain the inverse matrices. The routine cublas{S,D,C,Z}matinvBatched() performs a direct inversion with pivoting based on the Gauss-Jordan algorithm, but is limited to matrices of small dimension.
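The call sequence described above can be sketched in single precision as follows; this is a sketch only, assuming the batch pointer arrays are already populated on the device, with error checking omitted:

```c
#include <cublas_v2.h>

/* Invert a batch of n-by-n matrices already resident on the device.
 * d_A:    device array of batchSize pointers to n*n input matrices
 *         (overwritten in place by the LU factors)
 * d_Ainv: device array of batchSize pointers to n*n output matrices */
void invert_batched(cublasHandle_t handle, int n, float **d_A,
                    float **d_Ainv, int batchSize) {
    int *d_pivots, *d_info;
    cudaMalloc((void **)&d_pivots, n * batchSize * sizeof(int));
    cudaMalloc((void **)&d_info, batchSize * sizeof(int));

    /* Step 1: batched LU factorization (in place). */
    cublasSgetrfBatched(handle, n, d_A, n, d_pivots, d_info, batchSize);

    /* Step 2: batched inversion from the LU factors. */
    cublasSgetriBatched(handle, n, (const float **)d_A, n, d_pivots,
                        d_Ainv, n, d_info, batchSize);

    cudaFree(d_pivots);
    cudaFree(d_info);
}
```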

New in CUDA Toolkit 5.5.12 RC (Jun 18, 2013)

  • Multi-process MPI debugging & profiling
  • Single GPU debugging for Linux
  • Step-by-step guided performance analysis
  • Static CUDA runtime library
  • RPM/DEB packaging & new NVIDIA repo

New in CUDA Toolkit 5.0.37 (Feb 20, 2013)

  • Support for OS X 10.8 Mountain Lion

New in CUDA Toolkit 5.0.24 RC (Aug 17, 2012)

  • Nsight Eclipse Edition for Linux and Mac OS is an all-in-one development environment that allows developing, debugging, and optimizing CUDA code in an integrated UI environment.
  • A new command-line profiler, nvprof, provides summary information about where applications spend the most time, so that optimization efforts can be properly focused.
  • This release contains the following:
  • NVIDIA CUDA Toolkit documentation
  • NVIDIA CUDA compiler (NVCC) and supporting tools
  • NVIDIA CUDA runtime libraries
  • NVIDIA CUDA-GDB debugger
  • NVIDIA CUDA-MEMCHECK
  • NVIDIA Visual Profiler, nvprof, and command-line profiler
  • NVIDIA Nsight Eclipse Edition
  • NVIDIA CUBLAS, CUFFT, CUSPARSE, CURAND, Thrust, and NVIDIA Performance Primitives (NPP) libraries

New in CUDA Toolkit 4.2.9 (Apr 19, 2012)

  • Enables development on GPUs based on the Kepler architecture, such as the GeForce GTX 680.

New in CUDA Toolkit 4.1.28 (Jan 27, 2012)

  • Features a new LLVM-based CUDA compiler, 1000+ new image processing functions, and a redesigned Visual Profiler with automated performance analysis and integrated expert guidance.

New in CUDA Toolkit 4.1.21 RC 2 (Dec 6, 2011)

  • Try the new compiler:
  • New LLVM-based compiler delivers up to 10% faster performance for many applications
  • New & Improved “drop-in” acceleration with GPU-Accelerated Libraries:
  • Over 1000 new image processing functions in the NPP library
  • New cuSPARSE tri-diagonal solver up to 10x faster than MKL on a 6 core CPU
  • New support in cuRAND for MRG32k3a and Mersenne Twister (MTGP11213) RNG algorithms
  • Bessel functions now supported in the CUDA standard Math library
  • Up to 2x faster sparse matrix vector multiply using ELL hybrid format
  • Learn more about all the great GPU-Accelerated Libraries
  • Enhanced & Redesigned Developer Tools:
  • Redesigned Visual Profiler with automated performance analysis and expert guidance
  • CUDA-GDB support for multi-context debugging and assert() in device code
  • CUDA-MEMCHECK now detects out of bounds access for memory allocated in device code
  • Parallel Nsight 2.1 CUDA warp watch visualizes variables and expressions across an entire CUDA warp
  • Parallel Nsight 2.1 CUDA profiler now analyzes kernel memory activities, execution stalls and instruction throughput
  • Learn more about debugging and performance analysis tools for GPU developers on our CUDA Tools and Ecosystem Summary Page
  • Advanced Programming Features:
  • Access to 3D surfaces and cube maps from device code
  • Enhanced no-copy pinning of system memory, cudaHostRegister() alignment and size restrictions removed
  • Peer-to-peer communication between processes
  • Support for resetting a GPU without rebooting the system in nvidia-smi
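The enhanced no-copy pinning noted above can be sketched with cudaHostRegister; a minimal sketch with error checking omitted:

```c
#include <cuda_runtime.h>
#include <stdlib.h>

int main(void) {
    size_t bytes = 1 << 20;
    float *h_buf = (float *)malloc(bytes);

    /* Pin the existing allocation in place so the GPU can DMA
     * to and from it directly (no staging copy). */
    cudaHostRegister(h_buf, bytes, cudaHostRegisterDefault);

    /* ... use h_buf with cudaMemcpyAsync, etc. ... */

    cudaHostUnregister(h_buf);
    free(h_buf);
    return 0;
}
```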
  • New & Improved SDK Code Samples:
  • simpleP2P sample now supports peer-to-peer communication with any Fermi GPU
  • New grabcutNPP sample demonstrates interactive foreground extraction using iterated graph cuts
  • New samples showing how to implement the Horn-Schunck Method for optical flow, perform volume filtering, and read cube map texture

New in CUDA Toolkit 4.0.19 (Jul 7, 2011)

  • Easier Application Porting:
  • Share GPUs across multiple threads
  • Use all GPUs in the system concurrently from a single host thread
  • No-copy pinning of system memory, a faster alternative to cudaMallocHost()
  • C++ new/delete and support for virtual functions
  • Support for inline PTX assembly
  • Thrust library of templated performance primitives such as sort, reduce, etc.
  • NVIDIA Performance Primitives (NPP) library for image/video processing
  • Layered Textures for working with same size/format textures at larger sizes and higher performance
  • Faster Multi-GPU Programming:
  • Unified Virtual Addressing
  • GPUDirect v2.0 support for Peer-to-Peer Communication
  • New & Improved Developer Tools:
  • Automated Performance Analysis in Visual Profiler
  • C++ debugging in CUDA-GDB for Linux and MacOS
  • GPU binary disassembler for Fermi architecture (cuobjdump)
  • Parallel Nsight 2.0 now available for Windows developers with new debugging and profiling features.

New in CUDA Toolkit 3.2.17 (Nov 23, 2010)

  • New and Improved CUDA Libraries:
  • CUBLAS performance improved 50% to 300% on Fermi architecture GPUs, for matrix multiplication of all datatypes and transpose variations
  • CUFFT performance tuned for radix-3, -5, and -7 transform sizes on Fermi architecture GPUs, now 2x to 10x faster than MKL
  • New CUSPARSE library of GPU-accelerated sparse matrix routines for sparse/sparse and dense/sparse operations delivers 5x to 30x faster performance than MKL
  • New CURAND library of GPU-accelerated random number generation (RNG) routines, supporting Sobol quasi-random and XORWOW pseudo-random routines at 10x to 20x faster than similar routines in MKL
  • H.264 encode/decode libraries now included in the CUDA Toolkit
  • CUDA Driver & CUDA C Runtime:
  • Support for new 6GB Quadro and Tesla products
  • New support for enabling high performance Tesla Compute Cluster (TCC) mode on Tesla GPUs in Windows desktop workstations
  • Development Tools:
  • Multi-GPU debugging support for both cuda-gdb and Parallel Nsight
  • Expanded cuda-memcheck support for all Fermi architecture GPUs
  • NVCC support for Intel C Compiler (ICC) v11.1 on 64-bit Linux distros
  • Support for debugging GPUs with more than 4GB device memory
  • Miscellaneous:
  • Support for memory management using malloc() and free() in CUDA C compute kernels
  • New NVIDIA System Management Interface (nvidia-smi) support for reporting % GPU busy, and several GPU performance counters
  • New GPU Computing SDK Code Samples:
  • Several code samples demonstrating how to use the new CURAND library, including MonteCarloCURAND, EstimatePiInlineP, EstimatePiInlineQ, EstimatePiP, EstimatePiQ, SingleAsianOptionP, and randomFog
  • Conjugate Gradient Solver, demonstrating the use of CUBLAS and CUSPARSE in the same application
  • Function Pointers, a sample that shows how to use function pointers to implement the Sobel Edge Detection filter for 8-bit monochrome images
  • Interval Computing, demonstrating the use of interval arithmetic operators using C++ templates and recursion
  • Simple Printf, demonstrating best practices for using both printf and cuprintf in compute kernels
  • Bilateral Filter, an edge-preserving non-linear smoothing filter for image recovery and denoising implemented in CUDA C with OpenGL rendering
  • SLI with Direct3D Texture, a simple example demonstrating the use of SLI and Direct3D interoperability with CUDA C
  • cudaEncode, showing how to use the NVIDIA H.264 Encoding Library using YUV frames as input
  • Vflocking Direct3D/CUDA, which simulates and visualizes the flocking behavior of birds in flight
  • simpleSurfaceWrite, demonstrating how CUDA kernels can write to 2D surfaces on Fermi GPUs

New in CUDA Toolkit 3.0.14 (Apr 30, 2010)

  • New Features:
  • Added support for Snow Leopard
  • Function Attributes added: PTX_VERSION, BINARY_VERSION
  • Device Attributes added: MAXIMUM_TEXTURE, SURFACE_ALIGNMENT, CONCURRENT_KERNELS
  • New API Features:
  • Float16 (half) textures are supported in the runtime:
  • the cudaCreateChannelDescHalf family of functions supports this in the C++-style API; in the C-style API, an appropriate channel descriptor can be created via cudaCreateChannelDesc
  • users should be aware that half values are promoted to float during computation, so only floats can be fetched by the texture fetch functions
  • users can use intrinsics in device code to convert between fp16 and fp32 data
  • Double3 and double4 vector types are supported in the runtime:
  • This may break existing code in which users had already defined these types themselves.
  • One dimensional device-device copies now support streams:
  • cudaMemcpyAsync now applies the stream parameter for cudaMemcpyDeviceToDevice as well
  • cuMemcpyDtoDAsync
  • Support for ELF binaries:
  • ELF is generated by default by nvcc. For ptxas or fatbin, the -elf option is required.
  • Cubins are now binary files. Do not assume that they are ASCII text.
  • Testing applications for Fermi-readiness:
  • Setting the environment variable CUDA_FORCE_PTX_JIT to 1 prevents all non-PTX user kernels from loading. If your application then fails to run, you are not compiling with PTX. Please see the programming guide for more information about compiling for different compute capabilities.
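For example (the application name is a placeholder):

```shell
# Force the driver to JIT-compile from embedded PTX only; precompiled
# binary (SASS) kernels are ignored. If the app then fails to launch
# kernels, no PTX was embedded in the binary.
CUDA_FORCE_PTX_JIT=1 ./my_app
```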
  • OpenGL texture interoperation:
  • Concurrent Kernels:
  • Kernels launched within different non-NULL streams may now overlap with each other if they are able to simultaneously fit on the device. The ability of a device to run multiple kernels concurrently can be queried via the CU_DEVICE_ATTRIBUTE_CONCURRENT_KERNELS device attribute. See the 3.0 programming guide for using this feature.
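Querying the attribute via the driver API, as described above, might look like the following sketch (error checking omitted):

```c
#include <cuda.h>
#include <stdio.h>

int main(void) {
    CUdevice dev;
    int concurrent = 0;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    /* 1 if the device can run kernels from different streams concurrently. */
    cuDeviceGetAttribute(&concurrent, CU_DEVICE_ATTRIBUTE_CONCURRENT_KERNELS, dev);
    printf("concurrent kernels: %s\n", concurrent ? "yes" : "no");
    return 0;
}
```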
  • Batched 2D & 3D transforms are now supported in CUFFT, using the new cufftPlanMany() API
  • New Toolkit Features:
  • nvcc:
  • The command-line option --host-compilation=C is deprecated: nvcc emits a warning and switches back to C++. The option will eventually be removed altogether
  • Windows DLL Naming Conventions:
  • Each DLL now specifies the machine type, the toolkit version number, and the build number in its filename.
  • For example, cudart32_30_4.dll would be the 32-bit build of 3.0 Cudart with a build number of 4.
  • The build number of the final release will always be greater than the build number of the beta release.
  • The corresponding .lib files do not have any extra naming decoration, so you can continue linking your applications the same way.
  • Separate Library for Runtime Device Emulation:
  • Cudart has now been split into two libraries. For device emulation, link with Cudartemu instead, similar to the way Cublasemu/Cufftemu were previously used.
  • CUBLAS Library Support:
  • On the Fermi architecture (e.g., sm_20), arithmetic is IEEE-754 compliant.
  • cublasStrmv and cublasDtrmv have been enhanced to remove the previous size limitation of the input vector.
  • On Tesla architecture, cublasZgemm performance has been improved to be similar to cublasDgemm.
  • Added the BLAS1 functions
  • Added the BLAS2 functions
  • Added the BLAS3 functions

New in CUDA Toolkit 2.3.1a (Feb 11, 2010)

  • The CUFFT Library now supports double-precision transforms and includes significant performance improvements for single-precision transforms as well. See the CUDA Toolkit release notes for details.
  • The cuda-gdb hardware debugger and CUDA Visual Profiler are now included in the CUDA Toolkit installer, and the CUDA-GDB debugger is now available for all supported Linux distros.
  • Each GPU in an SLI group is now enumerated individually, so compute applications can now take advantage of multi-GPU performance even when SLI is enabled for graphics.
  • The 64-bit versions of the CUDA Toolkit now support compiling 32-bit applications. Please note that the installation location of the libraries has changed, so developers on 64-bit Linux must update their LD_LIBRARY_PATH to contain either /usr/local/cuda/lib or /usr/local/cuda/lib64.
  • New support for fp16/fp32 conversion intrinsics allows storage of data in fp16 format with computation in fp32. Use of fp16 format is ideal for applications that require higher numerical range than 16-bit integer but less precision than fp32 and reduces memory space and bandwidth consumption.
  • The Visual Profiler includes several enhancements:
  • All memory transfer API calls are now reported
  • Support for profiling multiple contexts per GPU
  • Synchronized clocks for requested start time on the CPU and start/end times on the GPU for all kernel launches and memory transfers
  • Global memory load and store efficiency metrics for GPUs with compute capability 1.2 and higher
  • The CUDA Driver for Mac OS now has its own installer, and is available separately from the CUDA Toolkit.

New in CUDA Toolkit 2.3 (Jul 24, 2009)

  • New Features:
  • CUFFT Features: Performance enhancements: Double precision
  • CUFFT now supports double-precision transforms, with types and functions analogous to the existing single-precision versions. Similarly, the "cufftType" enumeration (used in calls like cufftPlan1d) has expanded to include double-precision identifiers.
  • The double-precision versions are invoked in the same manner as the single-precision ones, with arguments changed from the single- to the double-precision types. See "cufft.h" for exact definitions of the above.
  • Separate Packaging: CUDA Driver and CUDA Toolkit are now available via separate packages
  • Double Handling by the Compiler: when a PTX file with an SM version prior to sm_13 contains double-precision instructions, ptxas now emits a warning that the double-precision instructions are demoted to single precision. ptxas has a new option, --suppress-double-demote-warning, to suppress this warning.
  • Major Bug Fixes:
  • C++ Support for Device Emulation: Support is restored for using C++ code in device emulation mode

New in CUDA Toolkit 2.2 (May 11, 2009)

  • Visual Profiler for the GPU - The most common step in tuning application performance is profiling the application and then modifying the code. The CUDA Visual Profiler is a graphical tool that enables the profiling of C applications running on the GPU. This latest release of the CUDA Visual Profiler includes metrics for memory transactions, giving developers visibility into one of the most important areas they can tune to get better performance.
  • Improved OpenGL Interop - Delivers improved performance for Medical Imaging and other OpenGL applications running on Quadro GPUs when computing with CUDA and rendering OpenGL graphics functions are performed on different GPUs.
  • Texture from Pitch Linear Memory - Delivers up to 2x bandwidth savings for video processing applications.
  • Zero-copy - Enables streaming media, video transcoding, image processing and signal processing applications to realize significant performance improvements by allowing CUDA functions to read and write directly from pinned system memory. This reduces the frequency and amount of data copied back and forth between GPU and CPU memory. Supported on MCP7x and GT200 and later GPUs.
  • Pinned Shared Sysmem - Enables applications that use multiple GPUs to achieve better performance and use less total system memory by allowing multiple GPUs to access the same data in system memory. Typical multi-GPU systems include Tesla servers, Tesla Personal Supercomputers, workstations using QuadroPlex deskside units and consumer systems with multiple GPUs.
  • Asynchronous memcopy on Vista - Allows applications to realize significant performance improvements by copying memory asynchronously. This feature was already available on other supported platforms but is now available on Vista.
  • Hardware Debugger for the GPU - Developers can now use a hardware level debugger on CUDA-enabled GPUs that offers the simplicity of the popular open-source GDB debugger yet enables a developer to easily debug a program that is running 1000s of threads on the GPU. This CUDA GDB debugger for Linux has all the features required to debug directly on the GPU, including the ability to set breakpoints, watch variables, inspect state, etc.
  • Exclusive Device Mode - This system configuration option allows an application to get exclusive use of a GPU, guaranteeing that 100% of the processing power and memory of the GPU will be dedicated to that application. Multiple applications can still be run concurrently on the system, but only one application can make use of each GPU at a time. This configuration is particularly useful on Tesla cluster systems where large applications may require dedicated use of one or more GPUs on each node of a Linux cluster.
  • Pinned Memory Support:
  • These new memory management functions (cuMemHostAlloc() and cudaHostAlloc()) enable pinned memory to be made "portable" (available to all CUDA contexts), "mapped" (mapped into the CUDA address space), and/or "write combined" (not cached and faster for the GPU to access).
  • cuMemHostAlloc
  • cuMemHostGetDevicePointer
  • cudaHostAlloc
  • cudaHostGetDevicePointer
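The new pinned-memory flags can be combined as in the following sketch (runtime API; error checking omitted; not a complete program):

```c
#include <cuda_runtime.h>

int main(void) {
    float *h_buf, *d_ptr;
    size_t bytes = 1 << 20;

    /* Mapping host memory must be enabled before the context is created. */
    cudaSetDeviceFlags(cudaDeviceMapHost);

    /* Pinned allocation that is portable across CUDA contexts and
     * mapped into the device address space. */
    cudaHostAlloc((void **)&h_buf, bytes,
                  cudaHostAllocPortable | cudaHostAllocMapped);

    /* Device pointer aliasing the same pinned memory (zero-copy access). */
    cudaHostGetDevicePointer((void **)&d_ptr, h_buf, 0);

    /* ... launch kernels that read/write through d_ptr ... */

    cudaFreeHost(h_buf);
    return 0;
}
```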
  • Function attribute query:
  • This function allows applications to query various function properties.
  • cuFuncGetAttribute
  • 2D Texture reads from pitch linear memory:
  • You can bind linear memory that you get from cuMemAlloc() or cudaMalloc() directly to a 2D texture. In previous releases, you were only able to bind cuArrayCreate() or cudaMallocArray() arrays to 2D textures.
  • cuTexRefSetAddress2D
  • cudaBindTexture2D
  • Flags for event creation:
  • Applications can now create events that use blocking synchronization.
  • cudaEventCreateWithFlags
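A blocking-sync event can be created as in this sketch (error checking omitted):

```c
#include <cuda_runtime.h>

int main(void) {
    cudaEvent_t done;
    /* With cudaEventBlockingSync, cudaEventSynchronize() yields the
     * CPU thread instead of spin-waiting. */
    cudaEventCreateWithFlags(&done, cudaEventBlockingSync);

    /* ... record the event after asynchronous work, then wait: */
    cudaEventRecord(done, 0);
    cudaEventSynchronize(done);

    cudaEventDestroy(done);
    return 0;
}
```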
  • New device management and context creation flags:
  • The function cudaSetDeviceFlags() allows the application to specify attributes such as mapping host memory and support for blocking synchronization.
  • cudaSetDeviceFlags
  • Improved runtime device management:
  • The runtime now defaults to attempting context creation on other devices in the system before returning any failure messages. The new call cudaSetValidDevices() allows the application to specify a list of acceptable devices for use.
  • cudaSetValidDevices
  • Driver/runtime version query functions:
  • Applications can now directly query version information about the underlying driver/runtime.
  • cuDriverGetVersion
  • cudaDriverGetVersion
  • cudaRuntimeGetVersion
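A minimal runtime-API sketch of the version queries (requires the CUDA runtime; error checking omitted):

```c
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int driverVer = 0, runtimeVer = 0;
    cudaDriverGetVersion(&driverVer);    /* version of the installed driver */
    cudaRuntimeGetVersion(&runtimeVer);  /* version of the linked runtime  */
    printf("driver %d, runtime %d\n", driverVer, runtimeVer);
    return 0;
}
```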
  • New device attribute queries:
  • CU_DEVICE_ATTRIBUTE_INTEGRATED
  • CU_DEVICE_ATTRIBUTE_CAN_MAP_HOST_MEMORY
  • CU_DEVICE_ATTRIBUTE_COMPUTE_MODE
  • Documentation:
  • Doxygen-generated and cross-referenced html, pdf, and man pages.
  • Runtime API
  • Driver API