HBeongpgpu

Parallel Programming for Everyone

CUDA (Compute Unified Device Architecture)

The graphics card that we use in a PC for gaming and visual enhancement has a Graphics Processing Unit (GPU) and some dedicated off-chip DRAM. GPUs in general have a highly parallel architecture; some of NVIDIA’s GPUs have 240 cores per processor (compare this with modern CPUs, which have 2, 4 or 8 cores). With such a parallel architecture, GPUs provide an excellent computational platform, not only for graphical applications but for any application with significant data parallelism. For example, one can accelerate virus scanning by offloading the virus-matching task to the GPU. The GPU is thus not limited to its use as a graphics engine; it is a parallel computing architecture capable of performing floating-point operations at rates on the order of a teraflop (10^12 floating-point operations per second). People realized the potential of GPUs for computationally heavy tasks and have been working on general-purpose computation on GPUs (GPGPU) for a long time. However, life before NVIDIA’s Compute Unified Device Architecture (CUDA) was extremely difficult for the programmer, who had to express every computation through a graphics API such as OpenGL. That approach was also slow to learn. CUDA solved these problems by providing a hardware abstraction that hides the inner details of the GPU and frees the programmer from the burden of learning graphics programming.

CUDA is the C language with some extensions for processing on GPUs. The user writes C code, and the compiler splits it into two portions. One portion, the largely sequential control code, runs on the CPU (because the CPU is best for such tasks), while the other portion, involving extensive calculations, is delivered to the GPU(s), which execute it in parallel. Because C is a familiar programming language, CUDA has a gentle learning curve, and hence it is becoming a favorite tool for accelerating various applications. NVIDIA's CUDA SDK is being employed in a plethora of fields, from computational finance to neural networks and fuzzy logic to simulations for nanotechnology.
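
To make the split between CPU and GPU code concrete, here is a minimal sketch. The names scale and run_scale are made up for illustration. The sketch also assumes unified (managed) memory through cudaMallocManaged, which needs CUDA 6 or later and a GPU that supports it; that is newer than the toolkit this article was written against.

```cuda
#include <cuda_runtime.h>

// Device code: the __global__ qualifier is one of CUDA's extensions to C.
// nvcc compiles this function for the GPU; each thread scales one element.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the last, partly filled block
        data[i] *= factor;
}

// Host code: ordinary C/C++ that runs on the CPU.
// Hypothetical helper: fills n floats with 1.0f, scales them on the GPU and
// returns the last element, or -1.0f if a CUDA call fails.
float run_scale(int n, float factor)
{
    float *data = nullptr;

    // Assumption: managed memory, visible to both CPU and GPU (CUDA 6+).
    if (cudaMallocManaged(&data, n * sizeof(float)) != cudaSuccess)
        return -1.0f;

    for (int i = 0; i < n; ++i)
        data[i] = 1.0f;

    int threads = 256;                              // threads per block
    int blocks  = (n + threads - 1) / threads;      // enough blocks to cover n
    scale<<<blocks, threads>>>(data, factor, n);    // another extension: launch syntax

    // Wait for the GPU to finish before the CPU reads the data again.
    cudaError_t err = cudaDeviceSynchronize();
    float last = (err == cudaSuccess) ? data[n - 1] : -1.0f;

    cudaFree(data);
    return last;
}
```

Built with nvcc, a call such as run_scale(1000, 3.0f) runs the loop and the bookkeeping on the CPU, while the multiplication itself is carried out by GPU threads.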

CUDA has several advantages over traditional general-purpose computation on GPUs (GPGPU) using graphics APIs.
  • Scattered reads – code can read from arbitrary addresses in memory.
  • It is high level, basically an extension to the C language, so it is much easier to learn than traditional GPGPU.
  • Shared memory – CUDA exposes a fast shared memory region (16 KB in size) that can be shared amongst threads. This can be used as a user-managed cache, enabling higher bandwidth than is possible using texture lookups.
  • Faster downloads and readbacks to and from the GPU
  • Full support for integer and bitwise operations
In short, CUDA lets you exploit the tiny supercomputers, i.e. the GPUs, that ship with your graphics cards, and lets you accelerate your applications significantly, sometimes by a factor of 100 or even more, depending on how smartly you exploit the resources of the GPU.
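
The shared-memory point in the list above can be sketched with a block-wise sum. This is only an illustration: block_sum, gpu_sum and BLOCK_SIZE are made-up names, error checking is left out for brevity, and n is assumed to be greater than zero.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BLOCK_SIZE 256  // threads per block; a power of two, as the loop below requires

// Each block sums BLOCK_SIZE consecutive inputs and writes one partial sum.
__global__ void block_sum(const float *in, float *partial, int n)
{
    __shared__ float tile[BLOCK_SIZE];       // on-chip memory shared by one block (1 KB here)

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;
    tile[tid] = (i < n) ? in[i] : 0.0f;      // one global-memory read per thread
    __syncthreads();                         // wait until the whole tile is loaded

    // Tree reduction carried out entirely in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];
        __syncthreads();                     // every thread reaches this barrier
    }

    if (tid == 0)
        partial[blockIdx.x] = tile[0];       // one global-memory write per block
}

// Hypothetical host helper: sums n floats (n > 0) with the kernel above.
// Error checking is omitted to keep the sketch short.
float gpu_sum(const float *host_in, int n)
{
    int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    float *d_in = nullptr, *d_partial = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMemcpy(d_in, host_in, n * sizeof(float), cudaMemcpyHostToDevice);

    block_sum<<<blocks, BLOCK_SIZE>>>(d_in, d_partial, n);

    // Copy the per-block results back and finish the sum on the CPU.
    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), d_partial, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);

    float total = 0.0f;
    for (float p : partial)
        total += p;
    return total;
}
```

Each input element is read from off-chip memory once; all the additions inside a block then work on the shared tile, which is the "user-managed cache" idea mentioned in the list.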

The following figure shows the processing flow of CUDA.


Example of CUDA processing flow
1. Copy the data from main memory to GPU memory
2. The CPU instructs the GPU to start processing
3. The GPU executes the work in parallel on each core
4. Copy the result from GPU memory back to main memory
 

The picture above is taken from Wikipedia, http://en.wikipedia.org/wiki/File:CUDA_processing_flow_(En).PNG, authored by Tosaka.
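
The four steps of the processing flow map onto CUDA runtime calls roughly as follows. This is a sketch rather than a reference implementation: the kernel add and the helper vector_add are illustrative names, and the block size of 256 threads is an arbitrary choice.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Step 3 happens here: every GPU thread adds one pair of elements.
__global__ void add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

// Hypothetical helper: computes c = a + b on the GPU.
// Returns true only if every CUDA call succeeded.
bool vector_add(const std::vector<float> &a, const std::vector<float> &b,
                std::vector<float> &c)
{
    if (a.empty() || a.size() != b.size())
        return false;

    int n = static_cast<int>(a.size());
    size_t bytes = a.size() * sizeof(float);
    c.resize(a.size());

    float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    bool ok = cudaMalloc(&d_a, bytes) == cudaSuccess &&
              cudaMalloc(&d_b, bytes) == cudaSuccess &&
              cudaMalloc(&d_c, bytes) == cudaSuccess;

    // Step 1: copy the input data from main memory to GPU memory.
    ok = ok &&
         cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
         cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        // Step 2: the CPU instructs the GPU by launching the kernel.
        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        add<<<blocks, threads>>>(d_a, d_b, d_c, n);
        ok = cudaGetLastError() == cudaSuccess;
    }

    // Step 4: copy the result from GPU memory back to main memory.
    // This copy also waits for the kernel to finish.
    ok = ok &&
         cudaMemcpy(c.data(), d_c, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFree(d_a);  // cudaFree accepts a null pointer, so this is safe on failure paths
    cudaFree(d_b);
    cudaFree(d_c);
    return ok;
}
```

The two copies in steps 1 and 4 cross the bus between main memory and the graphics card, so for small inputs they can cost more than the computation itself; the approach pays off when the GPU does a lot of work per byte transferred.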

  

Why Learn CUDA?

The era of parallel programming has already arrived. With the advent of multicore processors, such as the Core 2 Duo and quad-core chips, software should now be written to exploit these resources (namely the cores) as much as possible. Customers are demanding more and more exciting applications on their PCs, laptops and portable gadgets. Users want a better GUI (Graphical User Interface), HD-quality video, faster virus scanners, real-time network security systems, better realism in video games and faster access to databases. Moreover, the engineering and scientific community is, for example, looking for deeper insight into biological cells at the molecular level. At that level microscopes are of no use, and only simulations involving hundreds of GFLOPS (giga floating-point operations per second) can give valuable information. Thus there is great pressure on application designers to develop applications (graphical or non-graphical, i.e. general) that run many times faster than present ones. The trend is such that today’s supercomputing applications will be tomorrow’s everyday applications, demanding more and more computational power. This is why engineers, scientists and software developers across the globe are switching to parallel programming: writing code that executes simultaneously on the multiple cores of a multicore processor.

Thanks to the hundreds of cores in NVIDIA's modern GPUs and to CUDA, the software architecture that brought the world's first C compiler for GPUs, one can exploit these valuable resources (i.e. hundreds of processor cores) and develop applications (graphical or non-graphical) that run 100 times faster or more. This puts a smile on the customer’s face by accelerating applications such as virus scanners, video games, image processing tools, network security systems, video editing tools and scientific simulations.

Above all, compatibility with the C programming language keeps the learning curve gentle, and the hardware abstraction provided by CUDA makes the programmer’s life easier than ever before. The programmer need not be aware of graphics APIs (e.g. OpenGL) and can use the C programming language to launch thousands of threads running in parallel on hundreds of cores.

The speed of the GPU is increasing at a much higher rate than that of the CPU (see below), which makes the GPU an attractive co-processor for handling the large number of calculations per second demanded by customers.

That is why NVIDIA says that CUDA would be the jazziest thing to possess in 2010.