Part C - OpenCL Programming Model

OpenCL Preliminaries

Introduce Khronos' OpenCL standard
Compare OpenCL syntax with CUDA syntax
Describe the details of device management in OpenCL

Introduction | Platform | Execution | Kernel | Example | Exercises


OpenCL is an open, royalty-free programming standard for heterogeneous computing published by the Khronos group.  The group manages numerous open standards for authoring and acceleration of parallel computing, graphics, dynamic media, computer vision and sensor processing on a wide variety of platforms.  Khronos defines and manages these standards, but does not implement them.  Manufacturers are free to implement the standards in their own ways and those who comply can identify their products as Khronos standard compliant. 

OpenCL defines the language conventions for compute-bound, concurrent programming on platforms with single and multiple devices.  The platforms may be either heterogeneous or homogeneous.  OpenCL, like CUDA, provides a host-side layer for accessing devices and a set of language extensions for coding device-side instructions.  Unlike CUDA, OpenCL is platform independent and does not require CUDA-enabled hardware from Nvidia. 

This chapter describes the abstract models that OpenCL defines to identify and manage the supporting hardware, compares the OpenCL terminology with that of CUDA and includes a first example of a complete OpenCL program.


Introduction

In 2008 Apple transferred OpenCL to the Khronos group for joint GPGPU development and platform independence.  OpenCL, like CUDA, evolved from concerted efforts to expose the compute capabilities present in computer graphics hardware.  OpenCL achieves its platform independence by implementing an abstract hardware model of the installed devices.  As with CUDA, a single host coordinates the execution of GPGPU programs on a heterogeneous foundation.

OpenCL is currently at version 2.0.  It is available from

  • AMD OpenCL 2.0 Driver for Windows and Linux
  • Intel as part of its SDK for Windows 7 and 8
  • Nvidia in version 1.1 as part of the CUDA Toolkit

Since CUDA and OpenCL are both solutions that expose existing compute capabilities of graphics hardware, many OpenCL terms have direct counterparts in CUDA.

The Framework

The OpenCL framework consists of four models:

  1. the platform model
  2. the execution model
  3. the memory model
  4. the programming model

We cover the first two here and the last two in the following chapter.

Conventions

OpenCL's naming convention for its symbolic identifiers helps minimize naming conflicts with non-OpenCL code. 

Identifiers

OpenCL attaches the prefix cl_ to its data type identifiers.  Examples of data type identifiers are listed below:

 OpenCL Type  Equivalent C Built-In Type  Description
 cl_bool      bool                        boolean
 cl_int       int                         signed 32-bit integer
 cl_uint      unsigned int                unsigned 32-bit integer
 cl_long      long                        signed 64-bit integer
 cl_ulong     unsigned long               unsigned 64-bit integer

OpenCL attaches the prefix cl to its API function identifiers.

Macros and Enumeration Constants

OpenCL attaches the prefix CL_ to its macros and enumeration constants.  For example,

 Symbol  Description
 CL_TRUE  true
 CL_SUCCESS   success - no errors encountered
 CL_DEVICE_TYPE_GPU   devices of GPU type
 CL_DEVICE_TYPE_CPU   devices of CPU type
 CL_DEVICE_TYPE_ALL   devices of any type
 CL_MEM_READ_ONLY   read only memory buffer
 CL_MEM_WRITE_ONLY   write only memory buffer

Error Handling

OpenCL API functions communicate error codes through either a return value or a returning argument.  The error code's type is cl_int and a value equal to CL_SUCCESS identifies success.  Other values, which identify errors, are standard enumeration constants listed and described in the documentation. 

 cl_int status = CL_SUCCESS;

 // error code as a return value
 status = cl...(...);

 // error code as a returning argument 
 cl...(..., &status);

Platform Model

The platform model identifies the processor that coordinates the execution (the host) and the processor(s) that execute(s) the OpenCL code (the device(s)).  Each device consists of an array of compute units (CUs), which are independent of one another, as illustrated in the figure below.  Each compute unit consists of its own set of processing elements (PEs). 

OpenCL Platform Model

The CUDA terms that correspond to the elements of the platform model are listed below.

 OpenCL  CUDA
 Host  Host
 Device  Device
 Compute Units (CUs)  Streaming Multi-Processors (SMs) 
 Processing Elements (PEs)   Streaming Processors (SPs) 

Interrogation Phase

The interrogation phase of an OpenCL application determines the supporting platform and the installed devices through a series of OpenCL API calls. 

Supporting Platform

The clGetPlatformIDs() API function returns the number of platforms and information on each platform through its parameter list:

 cl_int                                      // error status
 clGetPlatformIDs(
     cl_uint num_entries,       // specify number of platforms 
     cl_platform_id* platforms, // array of platform data
     cl_uint* num_platforms)    // return number of platforms

We invoke this function in at least two separate calls.  The first call determines the number of platforms.  Each subsequent call retrieves the information for a specific platform:

 cl_int  status           = CL_SUCCESS;
 cl_uint numPlatforms     = 0;    // initialize for the first call
 cl_platform_id* platform = NULL; // initialize for the first call

 // retrieve the number of platforms
 status = clGetPlatformIDs(0, platform, &numPlatforms);

 // allocate memory for platform data
 platform = (cl_platform_id*)malloc(numPlatforms * sizeof(cl_platform_id)); 

 // populate platform with platform data
 status =  clGetPlatformIDs(numPlatforms, platform, NULL); 

Installed Devices

The clGetDeviceIDs() API function returns the number of devices and information on each device:

 cl_int                        // error status
 clGetDeviceIDs(
     cl_platform_id platform,   // platform
     cl_device_type deviceType, // device type
     cl_uint numEntries,        // number of devices
     cl_device_id* device,      // device information 
     cl_uint* numDevices)       // number found

The cl_device_type argument specifies the type of devices sought:

  • CL_DEVICE_TYPE_GPU GPUs only
  • CL_DEVICE_TYPE_CPU CPUs only
  • CL_DEVICE_TYPE_ALL both CPUs and GPUs

We invoke this function in at least two separate calls.  The first determines the number of devices.  Each subsequent call retrieves the information for a specific device:

 cl_int status        = CL_SUCCESS; // holds error code
 cl_uint numDevices   = 0;          // initialize for the first call
 cl_device_id* device = NULL;       // initialize for the first call

 // retrieve the number of devices
 status = clGetDeviceIDs(platform[ip], CL_DEVICE_TYPE_ALL, 0, device,
                         &numDevices);

 // allocate memory for device information
 device = (cl_device_id*)malloc(numDevices * sizeof(cl_device_id));

 // populate device with device information
 status = clGetDeviceIDs(platform[ip], CL_DEVICE_TYPE_ALL, numDevices, 
                         device, NULL);

Summary

The platform model (through the two API functions described above) identifies the hardware on which an OpenCL application is running.  The output from the API functions describes the properties of that hardware.


The Execution Model

The execution model defines the configuration environment on the host and the devices.  This definition includes setting up the context on the host, managing the interaction between the host and the device(s), and specifying the execution configuration for instructions that run on the device(s). 

The execution model includes three abstractions:

  1. context
  2. command queues
  3. buffer objects

The relation of these components to the platform model is illustrated in the figure below.

OpenCL context Model

Context

The context is the abstract container that the OpenCL API functions use to manage memory and the instructions that execute on the devices.  The context contains all of the installed devices.

CreateContext

The clCreateContext() API function creates an OpenCL context for the set of installed devices:

 cl_context
 clCreateContext(
     const cl_context_properties* properties, // special properties
     cl_uint numDevices,                      // number of devices
     const cl_device_id* device,              // array of device IDs
     void(CL_CALLBACK* pfn_notify)(           // callback function
         const char* errinfo,
         const void* privateInfo,
         size_t cb,
         void* userData),
     void* userData,                          // used in callback
     cl_int* errcodeRet)                      // returning error code 

The first argument specifies any special properties of the context, the second argument specifies the number of devices, the third argument is the address of the array that holds the device properties, the fourth argument is the address of a programmer-defined callback function for error handling, the fifth argument is the address of a variable used in the callback function, and the sixth argument is the address of the status variable that collects the error code.  The callback function reports error information generated throughout the lifetime of the context.

A call to clCreateContext() returns a handle to the context for the specified installed devices.

 cl_int status = CL_SUCCESS; // holds error code
 cl_context context = NULL;  // handle to the context

 // the context encompasses numDevice devices
 context = clCreateContext(NULL, numDevices, device, NULL, NULL, &status); 

CreateContextFromType

OpenCL also defines a composite function for creating a context without providing a list of installed devices:

 cl_context
 clCreateContextFromType(
     const cl_context_properties* properties, // special properties
     cl_device_type deviceType,               // type of device
     void(CL_CALLBACK* pfn_notify)(           // callback function
         const char* errinfo,
         const void* privateInfo,
         size_t cb,
         void* userData),
     void* userData,                          // used in callback
     cl_int* errcodeRet)                      // returning error code 

The first argument specifies any special properties of the context, the second argument specifies the device type, the third argument is the address of a programmer-defined callback function for error handling, the fourth argument is the address of a variable used in the callback function, and the fifth argument is the address of the status variable that receives the error code.  The callback function reports error information generated throughout the lifetime of the context.

A call to clCreateContextFromType() returns a handle to the context for the devices of the specified type.  Two or more calls to clGetContextInfo() retrieve the size of the installed device list and the information for each device.

 cl_int status = CL_SUCCESS; // holds error code
 cl_context context; // handle to the context

 // the context encompasses all devices available in the system
 context = clCreateContextFromType(0, CL_DEVICE_TYPE_ALL, NULL, NULL, 
                                   &status);

 size_t numBytes      = 0;    // size of the device list in bytes
 cl_device_id* device = NULL; // initialize for first call

 // retrieve the size in bytes of the device list
 status = clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &numBytes); 

 // allocate memory for the buffer for the set of devices
 device = (cl_device_id*)malloc(numBytes);

 // number of installed devices
 cl_uint numDevices = numBytes / sizeof(cl_device_id);

 // retrieve the information for each device in the system
 status = clGetContextInfo(context, CL_CONTEXT_DEVICES, numBytes, device,
                           NULL);

Command Queue

The host communicates with the installed devices through their command queues.  The host passes instructions to any device through that device's command queue.  One command queue exists for each installed device.  We create a command queue for each selected device and use the queue to perform memory transfers and program operations on the device. 

The clCreateCommandQueue() API function creates a command queue for a device:

 cl_command_queue
 clCreateCommandQueue(
     cl_context context,                     // context
     cl_device_id device,                    // device ID
     cl_command_queue_properties properties, // special properties
     cl_int* errcode_ret)                    // returning error code 

The first argument identifies the command queue's context, the second argument identifies the device, the third argument specifies any special properties of the queue, and the fourth argument is the address of the variable that receives the error code.

A call to clCreateCommandQueue() returns a handle to the command queue:

 cl_int status = CL_SUCCESS; // holds error code
 cl_command_queue cmdQueue;

 cmdQueue = clCreateCommandQueue(context, device[0], 0, &status);

Buffer Objects

OpenCL accesses device memory through buffer objects.  Elements of a buffer object can be scalars, vectors or user-defined structures.  Elements are stored sequentially and are accessed using pointers. 

The clCreateBuffer() API function creates a buffer object:

 cl_mem
 clCreateBuffer(
     cl_context context,  // context for the buffer
     cl_mem_flags flags,  // usage flags
     size_t size,         // size in bytes
     void* host_ptr,      // pointer to buffer data already allocated 
     cl_int* errcode_ret) // returning error status

The first argument identifies the buffer object's context, the second argument specifies usage, the third argument defines the object's size in bytes, the fourth argument is the address of data already allocated, and the fifth argument is the address of the variable that receives the error code.

A call to this API function returns a handle to the buffer object:

 cl_int status = CL_SUCCESS; // holds error code
 cl_mem bufferIn;
 cl_mem bufferOut;

 bufferIn  = clCreateBuffer(context, CL_MEM_READ_ONLY, datasize, NULL,
                            &status);
 bufferOut = clCreateBuffer(context, CL_MEM_WRITE_ONLY, datasize, NULL, 
                            &status);

Transfer From Host to Device

OpenCL transfers data from host to device memory through the device's command queue.  Elements are stored on the device in the same order as on the host. 

The clEnqueueWriteBuffer() API function copies host data to the specified buffer object:

 cl_int
 clEnqueueWriteBuffer(
     cl_command_queue cmdQueue,     // device's command queue
     cl_mem buffer,                 // buffer object
     cl_bool blockingWrite,         // blocking or not
     size_t offset,                 // offset in the buffer
     size_t cb,                     // bytes to be copied
     const void* ptr,               // pointer to host memory 
     cl_uint numEventsInWaitList,   // outstanding events
     const cl_event* eventWaitList, //
     cl_event* event)               //

A call to this API function returns the error status identifying the success or failure of the copy operation:

 cl_int status = CL_SUCCESS; // holds error code

 status = clEnqueueWriteBuffer(cmdQueue, bufferIn, CL_FALSE, 0, datasize, 
                               h_a, 0, NULL, NULL);

Transfer From Device to Host

OpenCL transfers data from device to host memory through the device's command queue.  Elements are stored on the host in the same order as on the device. 

The clEnqueueReadBuffer() API function copies data from the specified buffer object to host memory:

 cl_int
 clEnqueueReadBuffer(
     cl_command_queue cmdQueue,      // device's command queue
     cl_mem buffer,                  // buffer object
     cl_bool blockingRead,           // blocking or not
     size_t offset,                  // offset in the buffer
     size_t cb,                      // bytes to be copied
     const void* ptr,                // pointer to host memory 
     cl_uint numEventsInWaitList,    // outstanding events
     const cl_event* eventWaitList,  //
     cl_event* event)                //

A call to this API function returns the error status that identifies the success or failure of the copy:

 cl_int status = CL_SUCCESS; // holds error code

 status = clEnqueueReadBuffer(cmdQueue, bufferOut, CL_TRUE, 0, datasize, 
                              h_c, 0, NULL, NULL);

Program

OpenCL defines the full set of instructions that execute on the installed devices as the program object.  A program object consists of one or more kernels.  Each kernel is a unit of execution scheduled to run independently on a device. 

We create an executable program object in three steps:

  1. define a character string containing the source code for all of the kernels
  2. translate the character string into a program object
  3. compile the program object into a binary

The relation of the program object and its kernels to the context and the devices' command queues is illustrated in the Figure below.

OpenCL Program Model

Create the Program Object

The clCreateProgramWithSource() API function creates a program object from a character string.  The string contains the source code for all of the kernels in the program object:

 cl_program
 clCreateProgramWithSource(
     cl_context context,   // context
     cl_uint count,        // number of pointers to source code strings 
     const char** string,  // array of string pointers
     const size_t* length, // array of string sizes in bytes
     cl_int* errcode_ret)  // returning error code

A NULL address for the array of string sizes identifies the strings as null-terminated strings.

A call to this API function returns the handle to the program object:

 cl_int status = CL_SUCCESS; // holds error code
 cl_program program = NULL;
 const int NS = 1;
 const char* str = "...source code...";
 const char* source[NS];
 source[0] = str;

 program = clCreateProgramWithSource(context, NS, source, NULL, &status);

Compile the Program Object

The clBuildProgram() API function compiles all of the kernels in the program object to the representation chosen by the hardware vendor (see the next sub-section):

 cl_int
 clBuildProgram(
     cl_program program,         // handle to program
     cl_uint numDevices,         // number of devices associated with program 
     const cl_device_id* device, // pointers to devices
     const char* option,         // null-terminated string of build options
     void (*pfn_notify)(         // callback when build is complete
         cl_program program,
         void* userData),
     void* userData)             // data passed to the callback

A call to this API function returns the error status that identifies the success of the build:

 cl_int status = CL_SUCCESS; // holds error code
 cl_program program;

 status = clBuildProgram(program, numDevices, device, NULL, NULL, NULL);

Implementer Options

Compilation depends on the hardware vendor.  OpenCL vendors can compile to an intermediate representation or to a device-specific binary.  The AMD runtime recognizes two classes of devices: x86 CPUs and GPUs.  For x86 CPUs, its compiler generates x86 instructions.  For GPUs, its compiler generates a high-level intermediate language (IL) that represents the program code.  The Just-In-Time (JIT) compiler compiles this IL into the device's instruction set architecture (ISA) at runtime.  The NVIDIA runtime generates a high-level intermediate representation in Parallel Thread Execution (PTX) code.  The JIT compiler compiles this PTX code for the current architecture at runtime.  Generating IL code accommodates the different GPU ISAs associated with different hardware.

Summary

The figure below illustrates the relations between the elements of the OpenCL models.  The magenta (upward) arrows show the direction of object creation, while the blue (downward) arrows show the direction of control flow during execution.

Flow through OpenCL Elements


OpenCL Kernel

An OpenCL kernel, like a CUDA kernel, contains all of the instructions that the basic unit on the device executes.  OpenCL defines that basic unit as a work-item.  OpenCL defines the execution configuration for a kernel in terms of an NDRange of work-items.  Equal numbers of work-items form work-groups in the execution configuration.  (NDRange stands for n-dimensional range.)

The table below lists the CUDA terms that correspond to these units in OpenCL's execution model.

 OpenCL  CUDA
 Kernel  Kernel
 Host Program  Host Program 
 NDRange (index space)   grid
 work-items  threads
 work-groups  thread blocks 

OpenCL defines the built-in primitives for kernel execution in terms of API functions:

  • get_global_id(i) - global index of the work-item
  • get_local_id(i) - local index of the work-item within its work-group
  • get_global_size(i) - total number of work-items in the NDRange
  • get_local_size(i) - number of work-items in each work-group

The table below compares these functions with the corresponding CUDA built-in variables.

 Description                                          OpenCL              CUDA
 Global index of the work-item in the x dimension     get_global_id(0)    blockIdx.x * blockDim.x + threadIdx.x
 Local index of the work-item within the work-group   get_local_id(0)     threadIdx.x
 Size of the NDRange in the x dimension               get_global_size(0)  gridDim.x * blockDim.x
 Size of each work-group in the x dimension           get_local_size(0)   blockDim.x
 Kernel identifier                                    __kernel            __global__
 Global memory identifier                             __global

Unlike CUDA, OpenCL does not append a trailing __ to these identifiers.

The arguments (0), (1) and (2) correspond to CUDA's member identifiers .x, .y and .z respectively. 

Example

CUDA

The following CUDA kernel populates a vector b with the sum of scalar value v and the value in each element of vector a.  The CUDA-specific syntax is highlighted:

 __global__ void add(const float* a, float* b, float v) { 
     int id = blockIdx.x * blockDim.x +
              threadIdx.x;
     b[id] = a[id] + v;
 }

OpenCL

The OpenCL kernel performs the same task as defined in the CUDA kernel above.  The OpenCL-specific syntax is highlighted:

 __kernel void add(__global const float* a, __global float* b, float v) { 
     int id = get_global_id(0);
     b[id] = a[id] + v;
 }

Note the differing usage of the keyword global by CUDA and OpenCL.

Create the Kernel

The clCreateKernel() API function extracts a kernel object from a program object. 

 cl_kernel
 clCreateKernel(
     cl_program program,  // handle to program
     const char* fnName,  // kernel identifier
     cl_int* errcode_ret) // returning error code 

A call to this API function returns the handle to the kernel object:

 cl_int status = CL_SUCCESS; // holds error code
 cl_kernel kernel = NULL;

 kernel = clCreateKernel(program, "add", &status); 

Kernel Arguments

The clSetKernelArg() API function defines the arguments to pass during the launch of the kernel object: 

 cl_int
 clSetKernelArg(
     cl_kernel kernel,     // handle to kernel object
     cl_uint argIndex,     // argument index
     size_t argSize,       // size of argument in bytes 
     const void* argValue) // value of the argument

A call to this API function returns the error status, which identifies the success or failure of the definition:

 cl_int status  = CL_SUCCESS; // holds error code
 cl_float value = 3.0f;
 status = clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufferIn);
 status = clSetKernelArg(kernel, 1, sizeof(cl_mem), &bufferOut);
 status = clSetKernelArg(kernel, 2, sizeof(cl_float), &value);

Kernel Launch

Execution Configuration

We configure the number of work-items as an n-dimensional range, where n is 1, 2, or 3, using an array of type size_t.  We group work-items using a separate array, also of type size_t.

For example, to define a one-dimensional range of 1024 work-items, we define the following array:

 size_t numGlobalWorkItems[3] = {1024, 1, 1};

To define one-dimensional work-groups of 64 work-items each, we define the following array:

 size_t workGroupSize[3] = {64, 1, 1};

These two statements define 16 (= 1024/64) work-groups of 64 work-items each.  The number of work-items in each dimension must be exactly divisible by the work-group size in that dimension.

The Launch

The clEnqueueNDRangeKernel() API function launches a kernel for a specified execution configuration. 

 cl_int
 clEnqueueNDRangeKernel(
     cl_command_queue cmdQueue,      // handle to command queue
     cl_kernel kernel,               // handle to kernel object
     cl_uint workDim,                // number of dimensions
     const size_t* globalWorkOffset, // offsets into global work-item array
     const size_t* globalWorkSize,   // array with # of work-items globally
     const size_t* localWorkSize,    // array with # of work-items in a workgroup 
     cl_uint numEventsInWaitList,    // outstanding events
     const cl_event* eventWaitList,  //
     cl_event* event)                //

A call to this API function returns the error status that identifies the success or failure of the launch:

 cl_int status = CL_SUCCESS; // holds error code
 size_t numGlobalWorkItems[3] = {1024, 1, 1};
 size_t workGroupSize[3]      = {  64, 1, 1};
 status = clEnqueueNDRangeKernel(cmdQueue, kernel, 1, NULL, numGlobalWorkItems, 
                                 workGroupSize, 0, NULL, NULL);

Release the Resources

The final step in an OpenCL application releases all of the allocated resources.  OpenCL defines the following API functions for clean up:

 cl_int
 clReleaseKernel(cl_kernel kernel);
 cl_int
 clReleaseProgram(cl_program program);
 cl_int
 clReleaseCommandQueue(cl_command_queue cmdQueue);
 cl_int
 clReleaseMemObject(cl_mem buffer);
 cl_int
 clReleaseContext(cl_context context);

We release the resources retrieved in this chapter using the following calls:

 status = clReleaseKernel(kernel);
 status = clReleaseProgram(program);
 status = clReleaseCommandQueue(cmdQueue);
 status = clReleaseMemObject(bufferIn);
 status = clReleaseMemObject(bufferOut);
 status = clReleaseContext(context);

 // host resources
 free(platform);
 free(device);

Example

The complete source code for populating a vector with the sum of a scalar value and another vector is listed below.

Kernel

The kernel is stored in a separate file named vec_add_kernel.cl:

 __kernel void add(__global const float* a, __global float* b, float v) { 
    int i = get_global_id(0);
    b[i] = a[i] + v;
 }

Host Program

The host program is stored in a file named opencl.cpp:

 // Vector Addition
 // opencl.cpp

 #include <iostream>
 #include <fstream>
 #include <iomanip>
 #include <cstdlib>
 #include <cmath>
 #include <CL/cl.h>

 const float tolerance = 1.0e-10f;

 #define KERNEL "vec_add_kernel.cl"

 int main(int argc, char* argv[]) {
     if (argc != 3) {
         std::cerr << "***Incorrect number of arguments***\n";
         return 1;
     }
     int   n = std::atoi(argv[1]);
     int  nb = n * sizeof(float);
     float v = std::atof(argv[2]);
     // allocate host memory
     float* a = new float[n];
     float* b = new float[n];
     float r  = 1.0f / RAND_MAX;
     for (int i = 0; i < n; i++)
         a[i] = r * rand();

     // Load Device Program from File KERNEL
      std::ifstream f(KERNEL);
      char c;
      size_t size = 0;
      while (f.get(c)) // count the characters in the file
          size++;
      f.clear();
      f.seekg(0);
      char* str = new char[size];
      size = 0;
      while (f.get(c)) // copy the characters into str
          str[size++] = c;
      f.close();

     // Platform Model
     //===============
     cl_int  status = CL_SUCCESS;
     cl_uint numPlatforms = 0;
     cl_platform_id* platform = NULL;

     // retrieve the number of platforms
     status = clGetPlatformIDs(0, NULL, &numPlatforms);

     // allocate memory for platform data
     platform = (cl_platform_id*)malloc(numPlatforms *
      sizeof(cl_platform_id));

     // populate platform with platform data
     status = clGetPlatformIDs(numPlatforms, platform, NULL);

     cl_uint numDevices = 0;
     cl_device_id* device = NULL;

     // retrieve the number of devices
     status = clGetDeviceIDs(platform[0], CL_DEVICE_TYPE_ALL, 0, NULL,
                             &numDevices);

     // allocate memory for device information
     device = (cl_device_id*)malloc(numDevices * sizeof(cl_device_id));

     // populate device with device information
     status = clGetDeviceIDs(platform[0], CL_DEVICE_TYPE_ALL, numDevices, 
                             device, NULL);

     // Execution Model
     //================
     cl_context context; // handle to an OpenCL context

     // context that includes numDevice devices
     context = clCreateContext(NULL, numDevices, device, NULL, NULL,
                               &status);

     cl_command_queue cmdQueue;

     // create command queue for device[0]
     cmdQueue = clCreateCommandQueue(context, device[0], 0, &status);

     cl_mem bufferIn;
     cl_mem bufferOut;
     bufferIn  = clCreateBuffer(context, CL_MEM_READ_ONLY, nb,
                                NULL, &status);
     bufferOut = clCreateBuffer(context, CL_MEM_WRITE_ONLY, nb,
                                NULL, &status);

     // Program Model
     //==============
     cl_program program;
     program = clCreateProgramWithSource(context, 1, (const char**)&str,
                                         (const size_t*)&size, &status);
     status = clBuildProgram(program, numDevices, device, NULL, NULL, NULL); 

     cl_kernel kernel = NULL;
     kernel = clCreateKernel(program, "add", &status);
     status = clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufferIn);
     status = clSetKernelArg(kernel, 1, sizeof(cl_mem), &bufferOut);
     status = clSetKernelArg(kernel, 2, sizeof(float), &v);

     // Execution
     //==========
     status = clEnqueueWriteBuffer(cmdQueue, bufferIn, CL_FALSE,
                                   0, nb, a, 0, NULL, NULL);

     size_t globalWorkSize[1];
     globalWorkSize[0] = n;
     status = clEnqueueNDRangeKernel(cmdQueue, kernel, 1, NULL,
                                     globalWorkSize, NULL, 0, NULL, NULL);

     status = clEnqueueReadBuffer(cmdQueue, bufferOut, CL_TRUE,
                                  0, nb, b, 0, NULL, NULL);

     // Release Resources
     //==================
     status = clReleaseKernel(kernel);
     status = clReleaseProgram(program);
     status = clReleaseCommandQueue(cmdQueue);
     status = clReleaseMemObject(bufferIn);
     status = clReleaseMemObject(bufferOut);
     status = clReleaseContext(context);
      free(device);
      free(platform);

     // output errors only
     int ne = 0;
     std::cout << std::fixed << std::setprecision(6);
     for (int i = 0; i < n; i++)
         if (fabs(a[i] + v - b[i]) > tolerance)
             std::cerr << std::setw(3) << ++ne << ' ' <<
              a[i] + v << ' ' << b[i] << std::endl;
     if (ne)
         std::cerr << "Errors encountered" << std::endl;
     else
         std::cout << "No Errors encountered" << std::endl;

     // deallocate host memory
     delete [] a;
     delete [] b;
     delete [] str;
 }

To compile and link this source code enter the following command:

 nvcc opencl.cpp -lOpenCL 

Run the executable, passing the number of elements and the scalar value as command-line arguments:

 a 50 3 

Exercises




Designed by Chris Szalwinski