CUDA Toolkit Changelog

What's new in CUDA Toolkit 7.0.29

Mar 18, 2015
  • CUDA Tools:
  • The CUDA-GDB debugger is deprecated on the Mac platform and will be removed from it in the next release of the CUDA Toolkit. Furthermore, the CUDA-GDB tool included in CUDA 7.0 has several known issues in single-GPU Mac Pro configurations and on the MacBook and iMac platforms.
  • Resolved Issues:
  • General CUDA - On POWER8 systems, users should add the /usr/lib/powerpc64le-linux-gnu/mesa path to the LD_LIBRARY_PATH environment variable in order to use the Mesa GL libraries.
  • This information replaces this General CUDA Known Issue: "The POWER8 driver installation incorrectly overrides the Mesa GL alternative. To access the Mesa GL libraries, manually select the Mesa GL alternative by running the following command: sudo update-alternatives --config powerpc64le-linux-gnu_gl_conf."
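The workaround above can be applied in the shell as follows (the path is the one given in the note; adjust it if your distribution installs Mesa elsewhere):

```shell
# Prepend the Mesa GL path so the Mesa libGL is found at run time.
export LD_LIBRARY_PATH=/usr/lib/powerpc64le-linux-gnu/mesa:$LD_LIBRARY_PATH
```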

New in CUDA Toolkit 6.5.14 (Aug 20, 2014)

  • 64-bit ARM-based systems
  • Microsoft Visual Studio 2013 (VC12)
  • BSR sparse matrix format in cuSPARSE routines
  • Using cuFFT callbacks for higher performance custom processing on input or output data
  • Improved debugging for CUDA FORTRAN applications
  • Application Replay mode in both the Visual Profiler and command line cudaprof
  • Updated CUDA Occupancy Calculator API provides optimal kernel launch configurations
  • New “nvprune” utility to remove portions of object files for specified GPU architectures
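The updated occupancy API mentioned above can be sketched as follows, using cudaOccupancyMaxPotentialBlockSize from the CUDA runtime; the kernel and problem size here are placeholders, and error checking is omitted:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void myKernel(float *d, int n) {   /* placeholder kernel */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    int minGridSize = 0, blockSize = 0;
    /* Ask the runtime for the block size that maximizes occupancy. */
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);

    int n = 1 << 20;
    int gridSize = (n + blockSize - 1) / blockSize;
    printf("suggested block size: %d, grid size: %d\n", blockSize, gridSize);
    return 0;
}
```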

New in CUDA Toolkit 5.5.20 (Aug 1, 2013)

  • General CUDA:
  • MPS (Multi-Process Service) is a runtime service designed to let multiple MPI (Message Passing Interface) processes using CUDA run concurrently on a single GPU in a way that's transparent to the MPI program. A CUDA program runs in MPS mode if the MPS control daemon is running on the system. When a CUDA program starts, it connects to the MPS control daemon (if possible), which then creates an MPS server for the connecting client if one does not already exist for the user (UID) that launched the client.
  • With the CUDA 5.5 Toolkit, there are some restrictions that are now enforced that may cause existing projects that were building on CUDA 5.0 to fail. For projects that use -Xlinker with nvcc, you need to ensure the arguments after -Xlinker are quoted. In CUDA 5.0, -Xlinker -rpath /usr/local/cuda/lib would succeed; in CUDA 5.5 -Xlinker "-rpath /usr/local/cuda/lib" is now necessary.
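The quoting change can be illustrated with the example from the note above (the source and output file names are placeholders):

```shell
# CUDA 5.0 accepted unquoted arguments after -Xlinker:
nvcc -Xlinker -rpath /usr/local/cuda/lib app.cu -o app

# CUDA 5.5 requires the arguments after -Xlinker to be quoted:
nvcc -Xlinker "-rpath /usr/local/cuda/lib" app.cu -o app
```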
  • The CUDA Sample projects have makefiles that are now more self-contained and robust.
  • The following documents are now available in the CUDA toolkit documentation portal:
  • Programming guides: CUDA Video Encoder, CUDA Video Decoder, Developer Guide to Optimus, Parallel Thread Execution (PTX) ISA, Using Inline PTX Assembly in CUDA, NPP Library Programming Guide
  • Tools manuals:
  • CUDA Binary Utilities
  • White papers:
  • Floating-Point and IEEE 754 Compliance, Incomplete-LU and Cholesky Preconditioned Iterative Methods
  • Compiler SDK:
  • libNVVM API, libdevice User's Guide, NVVM IR Specification
  • General:
  • CUDA Toolkit Release Notes, End-User License Agreements
  • CUDA Libraries:
  • CUBLAS:
  • The routines cublas{S,D,C,Z}getriBatched() and cublas{S,D,C,Z}matinvBatched() have been added to the CUBLAS Library.
  • Routine cublas{S,D,C,Z}getriBatched() must be called after the batched LU factorization routine, cublas{S,D,C,Z}getrfBatched(), to obtain the inverse matrices. The routine cublas{S,D,C,Z}matinvBatched() performs a direct inversion with pivoting based on the Gauss-Jordan algorithm, but is limited to matrices of small dimension.
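The call sequence described above can be sketched in single precision as follows; this is a sketch only, assuming the batch pointer arrays are already populated on the device, with error checking omitted:

```c
#include <cublas_v2.h>

/* Invert a batch of n-by-n matrices already resident on the device.
 * d_A:    device array of batchSize pointers to n*n input matrices
 *         (overwritten in place by the LU factors)
 * d_Ainv: device array of batchSize pointers to n*n output matrices */
void invert_batched(cublasHandle_t handle, int n, float **d_A,
                    float **d_Ainv, int batchSize) {
    int *d_pivots, *d_info;
    cudaMalloc((void **)&d_pivots, n * batchSize * sizeof(int));
    cudaMalloc((void **)&d_info, batchSize * sizeof(int));

    /* Step 1: batched LU factorization (in place). */
    cublasSgetrfBatched(handle, n, d_A, n, d_pivots, d_info, batchSize);

    /* Step 2: batched inversion from the LU factors. */
    cublasSgetriBatched(handle, n, (const float **)d_A, n, d_pivots,
                        d_Ainv, n, d_info, batchSize);

    cudaFree(d_pivots);
    cudaFree(d_info);
}
```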

New in CUDA Toolkit 5.5.12 RC (Jun 18, 2013)

  • Multi-process MPI debugging & profiling
  • Single GPU debugging for Linux
  • Step-by-step guided performance analysis
  • Static CUDA runtime library
  • RPM/DEB packaging & new NVIDIA repo

New in CUDA Toolkit 5.0.37 (Feb 20, 2013)

  • Support for OS X 10.8 Mountain Lion

New in CUDA Toolkit 5.0.24 RC (Aug 17, 2012)

  • Nsight Eclipse Edition for Linux and Mac OS is an all-in-one development environment that allows developing, debugging, and optimizing CUDA code in an integrated UI environment.
  • A new command-line profiler, nvprof, provides summary information about where applications spend the most time, so that optimization efforts can be properly focused.
  • This release contains the following:
  • NVIDIA CUDA Toolkit documentation
  • NVIDIA CUDA compiler (NVCC) and supporting tools
  • NVIDIA CUDA runtime libraries
  • NVIDIA CUDA-GDB debugger
  • NVIDIA CUDA-MEMCHECK
  • NVIDIA Visual Profiler, nvprof, and command-line profiler
  • NVIDIA Nsight Eclipse Edition
  • NVIDIA CUBLAS, CUFFT, CUSPARSE, CURAND, Thrust, and NVIDIA Performance Primitives (NPP) libraries

New in CUDA Toolkit 4.2.9 (Apr 19, 2012)

  • Enables development on GPUs based on the Kepler architecture, such as the GeForce GTX 680.

New in CUDA Toolkit 4.1.28 (Jan 27, 2012)

  • Features a new LLVM-based CUDA compiler, 1000+ new image processing functions, and a redesigned Visual Profiler with automated performance analysis and integrated expert guidance.

New in CUDA Toolkit 4.1.21 RC 2 (Dec 6, 2011)

  • Try the new compiler:
  • New LLVM-based compiler delivers up to 10% faster performance for many applications
  • New & Improved “drop-in” acceleration with GPU-Accelerated Libraries:
  • Over 1000 new image processing functions in the NPP library
  • New cuSPARSE tri-diagonal solver up to 10x faster than MKL on a 6 core CPU
  • New support in cuRAND for MRG32k3a and Mersenne Twister (MTGP11213) RNG algorithms
  • Bessel functions now supported in the CUDA standard Math library
  • Up to 2x faster sparse matrix vector multiply using ELL hybrid format
  • Learn more about all the great GPU-Accelerated Libraries
  • Enhanced & Redesigned Developer Tools:
  • Redesigned Visual Profiler with automated performance analysis and expert guidance
  • CUDA-GDB support for multi-context debugging and assert() in device code
  • CUDA-MEMCHECK now detects out of bounds access for memory allocated in device code
  • Parallel Nsight 2.1 CUDA warp watch visualizes variables and expressions across an entire CUDA warp
  • Parallel Nsight 2.1 CUDA profiler now analyzes kernel memory activities, execution stalls and instruction throughput
  • Learn more about debugging and performance analysis tools for GPU developers on our CUDA Tools and Ecosystem Summary Page
  • Advanced Programming Features:
  • Access to 3D surfaces and cube maps from device code
  • Enhanced no-copy pinning of system memory, cudaHostRegister() alignment and size restrictions removed
  • Peer-to-peer communication between processes
  • Support for resetting a GPU without rebooting the system in nvidia-smi
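The enhanced no-copy pinning noted above can be sketched with cudaHostRegister; a minimal sketch with error checking omitted:

```c
#include <cuda_runtime.h>
#include <stdlib.h>

int main(void) {
    size_t bytes = 1 << 20;
    float *h_buf = (float *)malloc(bytes);

    /* Pin the existing allocation in place so the GPU can DMA
     * to and from it directly (no staging copy). */
    cudaHostRegister(h_buf, bytes, cudaHostRegisterDefault);

    /* ... use h_buf with cudaMemcpyAsync, etc. ... */

    cudaHostUnregister(h_buf);
    free(h_buf);
    return 0;
}
```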
  • New & Improved SDK Code Samples:
  • simpleP2P sample now supports peer-to-peer communication with any Fermi GPU
  • New grabcutNPP sample demonstrates interactive foreground extraction using iterated graph cuts
  • New samples showing how to implement the Horn-Schunck Method for optical flow, perform volume filtering, and read cube map texture

New in CUDA Toolkit 4.0.19 (Jul 7, 2011)

  • Easier Application Porting:
  • Share GPUs across multiple threads
  • Use all GPUs in the system concurrently from a single host thread
  • No-copy pinning of system memory, a faster alternative to cudaMallocHost()
  • C++ new/delete and support for virtual functions
  • Support for inline PTX assembly
  • Thrust library of templated performance primitives such as sort, reduce, etc.
  • NVIDIA Performance Primitives (NPP) library for image/video processing
  • Layered Textures for working with same size/format textures at larger sizes and higher performance
  • Faster Multi-GPU Programming:
  • Unified Virtual Addressing
  • GPUDirect v2.0 support for Peer-to-Peer Communication
  • New & Improved Developer Tools:
  • Automated Performance Analysis in Visual Profiler
  • C++ debugging in CUDA-GDB for Linux and MacOS
  • GPU binary disassembler for Fermi architecture (cuobjdump)
  • Parallel Nsight 2.0 now available for Windows developers with new debugging and profiling features.

New in CUDA Toolkit 3.2.17 (Nov 23, 2010)

  • New and Improved CUDA Libraries:
  • CUBLAS performance improved 50% to 300% on Fermi architecture GPUs, for matrix multiplication of all datatypes and transpose variations
  • CUFFT performance tuned for radix-3, -5, and -7 transform sizes on Fermi architecture GPUs, now 2x to 10x faster than MKL
  • New CUSPARSE library of GPU-accelerated sparse matrix routines for sparse/sparse and dense/sparse operations delivers 5x to 30x faster performance than MKL
  • New CURAND library of GPU-accelerated random number generation (RNG) routines, supporting Sobol quasi-random and XORWOW pseudo-random routines at 10x to 20x faster than similar routines in MKL
  • H.264 encode/decode libraries now included in the CUDA Toolkit
  • CUDA Driver & CUDA C Runtime:
  • Support for new 6GB Quadro and Tesla products
  • New support for enabling high performance Tesla Compute Cluster (TCC) mode on Tesla GPUs in Windows desktop workstations
  • Development Tools:
  • Multi-GPU debugging support for both cuda-gdb and Parallel Nsight
  • Expanded cuda-memcheck support for all Fermi architecture GPUs
  • NVCC support for Intel C Compiler (ICC) v11.1 on 64-bit Linux distros
  • Support for debugging GPUs with more than 4GB device memory
  • Miscellaneous:
  • Support for memory management using malloc() and free() in CUDA C compute kernels
  • New NVIDIA System Management Interface (nvidia-smi) support for reporting % GPU busy, and several GPU performance counters
  • New GPU Computing SDK Code Samples:
  • Several code samples demonstrating how to use the new CURAND library, including MonteCarloCURAND, EstimatePiInlineP, EstimatePiInlineQ, EstimatePiP, EstimatePiQ, SingleAsianOptionP, and randomFog
  • Conjugate Gradient Solver, demonstrating the use of CUBLAS and CUSPARSE in the same application
  • Function Pointers, a sample that shows how to use function pointers to implement the Sobel Edge Detection filter for 8-bit monochrome images
  • Interval Computing, demonstrating the use of interval arithmetic operators using C++ templates and recursion
  • Simple Printf, demonstrating best practices for using both printf and cuprintf in compute kernels
  • Bilateral Filter, an edge-preserving non-linear smoothing filter for image recovery and denoising implemented in CUDA C with OpenGL rendering
  • SLI with Direct3D Texture, a simple example demonstrating the use of SLI and Direct3D interoperability with CUDA C
  • cudaEncode, showing how to use the NVIDIA H.264 Encoding Library using YUV frames as input
  • Vflocking Direct3D/CUDA, which simulates and visualizes the flocking behavior of birds in flight
  • simpleSurfaceWrite, demonstrating how CUDA kernels can write to 2D surfaces on Fermi GPUs

New in CUDA Toolkit 3.0.14 (Apr 30, 2010)

  • New Features:
  • Added support for Snow Leopard
  • Function Attributes added: PTX_VERSION, BINARY_VERSION
  • Device Attributes added: MAXIMUM_TEXTURE, SURFACE_ALIGNMENT, CONCURRENT_KERNELS
  • New API Features:
  • Float16 (half) textures are supported in the runtime:
  • the cudaCreateChannelDescHalf family of functions supports this in the C++-style API; in the C-style API, an appropriate channel descriptor can be created via cudaCreateChannelDesc
  • users should be aware that half values are promoted to float during computation, so only floats can be fetched by the texture fetch functions
  • users can use intrinsics in device code to convert between fp16 and fp32 data
  • Double3 and double4 vector types are supported in the runtime:
  • This may break existing code in which users had already defined these types themselves.
  • One dimensional device-device copies now support streams:
  • cudaMemcpyAsync now applies the stream parameter for cudaMemcpyDeviceToDevice as well
  • cuMemcpyDtoDAsync
  • Support for ELF binaries:
  • ELF is generated by default by nvcc. For ptxas or fatbin, the -elf option is required.
  • Cubins are now binary files. Do not assume that they are ASCII text.
  • Testing applications for Fermi-readiness:
  • Setting the environment variable CUDA_FORCE_PTX_JIT to 1 prevents all non-PTX user kernels from loading. If your application then fails to run, you are not compiling with PTX. Please see the programming guide for more information about compiling for different compute capabilities.
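For example (the application name is a placeholder):

```shell
# Force the driver to JIT-compile from embedded PTX only; precompiled
# binary (SASS) kernels are ignored. If the app then fails to launch
# kernels, no PTX was embedded in the binary.
CUDA_FORCE_PTX_JIT=1 ./my_app
```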
  • OpenGL texture interoperation:
  • Concurrent Kernels:
  • Kernels launched within different non-NULL streams may now overlap with each other if they are able to simultaneously fit on the device. The ability of a device to run multiple kernels concurrently can be queried via the CU_DEVICE_ATTRIBUTE_CONCURRENT_KERNELS device attribute. See the 3.0 programming guide for using this feature.
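Querying the attribute via the driver API, as described above, might look like the following sketch (error checking omitted):

```c
#include <cuda.h>
#include <stdio.h>

int main(void) {
    CUdevice dev;
    int concurrent = 0;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    /* 1 if the device can run kernels from different streams concurrently. */
    cuDeviceGetAttribute(&concurrent, CU_DEVICE_ATTRIBUTE_CONCURRENT_KERNELS, dev);
    printf("concurrent kernels: %s\n", concurrent ? "yes" : "no");
    return 0;
}
```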
  • Batched 2D & 3D transforms are now supported in CUFFT, using the new cufftPlanMany() API
  • New Toolkit Features:
  • nvcc:
  • The command-line option --host-compilation=C is deprecated: nvcc emits a warning and switches back to C++. The option will eventually be removed altogether
  • Windows DLL Naming Conventions:
  • Each DLL now specifies the machine type, the toolkit version number, and the build number in its filename.
  • For example, cudart32_30_4.dll would be the 32-bit build of 3.0 Cudart with a build number of 4.
  • The build number of the final release will always be greater than the build number of the beta release.
  • The corresponding .lib files do not have any extra naming decoration, so you can continue linking your applications the same way.
  • Separate Library for Runtime Device Emulation:
  • Cudart has now been split into two libraries. For device emulation, link with Cudartemu instead, similar to the way Cublasemu/Cufftemu were previously used.
  • CUBLAS Library Support:
  • On the Fermi architecture (e.g., sm_20), arithmetic is IEEE-754 compliant.
  • cublasStrmv and cublasDtrmv have been enhanced to remove the previous size limitation of the input vector.
  • On Tesla architecture, cublasZgemm performance has been improved to be similar to cublasDgemm.
  • Added the BLAS1 functions
  • Added the BLAS2 functions
  • Added the BLAS3 functions

New in CUDA Toolkit 2.3.1a (Feb 11, 2010)

  • The CUFFT Library now supports double-precision transforms and includes significant performance improvements for single-precision transforms as well. See the CUDA Toolkit release notes for details.
  • The cuda-gdb hardware debugger and CUDA Visual Profiler are now included in the CUDA Toolkit installer, and the CUDA-GDB debugger is now available for all supported Linux distros.
  • Each GPU in an SLI group is now enumerated individually, so compute applications can now take advantage of multi-GPU performance even when SLI is enabled for graphics.
  • The 64-bit versions of the CUDA Toolkit now support compiling 32-bit applications. Please note that the installation location of the libraries has changed, so developers on 64-bit Linux must update their LD_LIBRARY_PATH to contain either /usr/local/cuda/lib or /usr/local/cuda/lib64.
  • New support for fp16/fp32 conversion intrinsics allows storage of data in fp16 format with computation in fp32. Use of fp16 format is ideal for applications that require higher numerical range than 16-bit integer but less precision than fp32 and reduces memory space and bandwidth consumption.
  • The Visual Profiler includes several enhancements:
  • All memory transfer API calls are now reported
  • Support for profiling multiple contexts per GPU
  • Synchronized clocks for requested start time on the CPU and start/end times on the GPU for all kernel launches and memory transfers
  • Global memory load and store efficiency metrics for GPUs with compute capability 1.2 and higher
  • The CUDA Driver for Mac OS now has its own installer, and is available separately from the CUDA Toolkit.

New in CUDA Toolkit 2.3 (Jul 24, 2009)

  • New Features:
  • CUFFT Features: Performance enhancements: Double precision
  • CUFFT now supports double-precision transforms, with types and functions analogous to the existing single-precision versions. Similarly, the "cufftType" enumeration (used in calls like cufftPlan1d) has expanded to include double-precision identifiers.
  • The double-precision versions are invoked in the same manner as the single-precision ones, with arguments changed from the single- to the double-precision types. See "cufft.h" for exact definitions of the above.
  • Separate Packaging: CUDA Driver and CUDA Toolkit are now available via separate packages
  • Double Handling by the Compiler: when a PTX file with an SM version prior to sm_13 contains double-precision instructions, ptxas now emits a warning that the double-precision instructions are demoted to single precision. ptxas has a new option, --suppress-double-demote-warning, to suppress this warning.
  • Major Bug Fixes:
  • C++ Support for Device Emulation: Support is restored for using C++ code in device emulation mode

New in CUDA Toolkit 2.2 (May 11, 2009)

  • Visual Profiler for the GPU - The most common step in tuning application performance is profiling the application and then modifying the code. The CUDA Visual Profiler is a graphical tool that enables the profiling of C applications running on the GPU. This latest release of the CUDA Visual Profiler includes metrics for memory transactions, giving developers visibility into one of the most important areas they can tune to get better performance.
  • Improved OpenGL Interop - Delivers improved performance for Medical Imaging and other OpenGL applications running on Quadro GPUs when computing with CUDA and rendering OpenGL graphics functions are performed on different GPUs.
  • Texture from Pitch Linear Memory - Delivers up to 2x bandwidth savings for video processing applications.
  • Zero-copy - Enables streaming media, video transcoding, image processing and signal processing applications to realize significant performance improvements by allowing CUDA functions to read and write directly from pinned system memory. This reduces the frequency and amount of data copied back and forth between GPU and CPU memory. Supported on MCP7x and GT200 and later GPUs.
  • Pinned Shared Sysmem - Enables applications that use multiple GPUs to achieve better performance and use less total system memory by allowing multiple GPUs to access the same data in system memory. Typical multi-GPU systems include Tesla servers, Tesla Personal Supercomputers, workstations using QuadroPlex deskside units and consumer systems with multiple GPUs.
  • Asynchronous memcopy on Vista - Allows applications to realize significant performance improvements by copying memory asynchronously. This feature was already available on other supported platforms but is now available on Vista.
  • Hardware Debugger for the GPU - Developers can now use a hardware level debugger on CUDA-enabled GPUs that offers the simplicity of the popular open-source GDB debugger yet enables a developer to easily debug a program that is running 1000s of threads on the GPU. This CUDA GDB debugger for Linux has all the features required to debug directly on the GPU, including the ability to set breakpoints, watch variables, inspect state, etc.
  • Exclusive Device Mode - This system configuration option allows an application to get exclusive use of a GPU, guaranteeing that 100% of the processing power and memory of the GPU will be dedicated to that application. Multiple applications can still be run concurrently on the system, but only one application can make use of each GPU at a time. This configuration is particularly useful on Tesla cluster systems where large applications may require dedicated use of one or more GPUs on each node of a Linux cluster.
  • Pinned Memory Support:
  • These new memory management functions (cuMemHostAlloc() and cudaHostAlloc()) enable pinned memory to be made "portable" (available to all CUDA contexts), "mapped" (mapped into the CUDA address space), and/or "write combined" (not cached and faster for the GPU to access).
  • cuMemHostAlloc
  • cuMemHostGetDevicePointer
  • cudaHostAlloc
  • cudaHostGetDevicePointer
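The new pinned-memory flags can be combined as in the following sketch (runtime API; error checking omitted; not a complete program):

```c
#include <cuda_runtime.h>

int main(void) {
    float *h_buf, *d_ptr;
    size_t bytes = 1 << 20;

    /* Mapping host memory must be enabled before the context is created. */
    cudaSetDeviceFlags(cudaDeviceMapHost);

    /* Pinned allocation that is portable across CUDA contexts and
     * mapped into the device address space. */
    cudaHostAlloc((void **)&h_buf, bytes,
                  cudaHostAllocPortable | cudaHostAllocMapped);

    /* Device pointer aliasing the same pinned memory (zero-copy access). */
    cudaHostGetDevicePointer((void **)&d_ptr, h_buf, 0);

    /* ... launch kernels that read/write through d_ptr ... */

    cudaFreeHost(h_buf);
    return 0;
}
```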
  • Function attribute query:
  • This function allows applications to query various function properties.
  • cuFuncGetAttribute
  • 2D Texture reads from pitch linear memory:
  • You can bind linear memory that you get from cuMemAlloc() or cudaMalloc() directly to a 2D texture. In previous releases, you were only able to bind cuArrayCreate() or cudaMallocArray() arrays to 2D textures.
  • cuTexRefSetAddress2D
  • cudaBindTexture2D
  • Flags for event creation:
  • Applications can now create events that use blocking synchronization.
  • cudaEventCreateWithFlags
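A blocking-sync event can be created as in this sketch (error checking omitted):

```c
#include <cuda_runtime.h>

int main(void) {
    cudaEvent_t done;
    /* With cudaEventBlockingSync, cudaEventSynchronize() yields the
     * CPU thread instead of spin-waiting. */
    cudaEventCreateWithFlags(&done, cudaEventBlockingSync);

    /* ... record the event after asynchronous work, then wait: */
    cudaEventRecord(done, 0);
    cudaEventSynchronize(done);

    cudaEventDestroy(done);
    return 0;
}
```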
  • New device management and context creation flags:
  • The function cudaSetDeviceFlags() allows the application to specify attributes such as mapping host memory and support for blocking synchronization.
  • cudaSetDeviceFlags
  • Improved runtime device management:
  • The runtime now defaults to attempting context creation on other devices in the system before returning any failure messages. The new call cudaSetValidDevices() allows the application to specify a list of acceptable devices for use.
  • cudaSetValidDevices
  • Driver/runtime version query functions:
  • Applications can now directly query version information about the underlying driver/runtime.
  • cuDriverGetVersion
  • cudaDriverGetVersion
  • cudaRuntimeGetVersion
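A minimal runtime-API sketch of the version queries (requires the CUDA runtime; error checking omitted):

```c
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int driverVer = 0, runtimeVer = 0;
    cudaDriverGetVersion(&driverVer);    /* version of the installed driver */
    cudaRuntimeGetVersion(&runtimeVer);  /* version of the linked runtime  */
    printf("driver %d, runtime %d\n", driverVer, runtimeVer);
    return 0;
}
```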
  • New device attribute queries:
  • CU_DEVICE_ATTRIBUTE_INTEGRATED
  • CU_DEVICE_ATTRIBUTE_CAN_MAP_HOST_MEMORY
  • CU_DEVICE_ATTRIBUTE_COMPUTE_MODE
  • Documentation:
  • Doxygen-generated and cross-referenced html, pdf, and man pages.
  • Runtime API
  • Driver API