Part C - OpenCL Programming Model

OpenCL Memory Model

Describe the memory spaces of the OpenCL memory model
Show how to capture compiler error messages at runtime
Show how to profile kernel execution at runtime

Memory Spaces | Synchronization | Compilation | Profiling | Exercises


OpenCL defines an abstract memory model for programmers to target memory architectures and for vendors to map to specific hardware.  Like CUDA, OpenCL includes a mechanism for synchronizing the execution of work-items in a workgroup.  The runtime includes API functions that collect compiler error messages and profiling statistics.

This chapter describes the OpenCL memory model, along with the syntax for synchronizing work-item execution within a work-group, for retrieving compiler error messages generated at runtime, and for profiling kernel execution.


Memory Spaces

The OpenCL memory model defines four disjoint address spaces:

  1. Global
  2. Constant
  3. Local
  4. Private

The table below compares these memory spaces to those in the CUDA Memory Model.

 OpenCL    OpenCL keyword  Scope            CUDA      CUDA keyword
 Global    __global        Kernel-wide      Global    (none)
 Constant  __constant      Kernel-wide      Constant  __constant__
 Local     __local         Work-group-wide  Shared    __shared__
 Private   __private       Work-item-wide   Local     (none)

Global memory is visible to all of the compute units on a device.  All transfers between the host and the device are transfers to and from the device's global memory.  The keyword __global added to a parameter pointer declaration identifies the address of data stored in global memory.

Constant memory is also visible to all of the compute units on the device.  Any element of constant memory is simultaneously accessible by all work-items.  Constant memory is part of global memory.  The keyword __constant added to a pointer parameter declaration identifies the address of data stored in constant memory. 

Local memory is memory that belongs to a compute unit and is presumably implemented on-chip.  Local memory is shared by all of the work-items within a work-group.  The keyword __local added to a declaration identifies the data as stored in local memory. 

Private memory is memory that belongs to a work-item and is presumably implemented on-chip in registers.  The keyword __private added to a data declaration identifies the data as stored in private memory.  Note however that both private arrays and any spilled registers reside in global memory. 

[Figure: OpenCL Memory Model]

All function parameters in an OpenCL kernel are in the __private address space.  Function parameters of pointer type can only point to __global, __local or __constant address spaces.  The default space for locally declared variables is __private.
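As a summary of the qualifiers, the following kernel sketch (hypothetical kernel and parameter names, not one of this chapter's worked examples) annotates where each address space appears:

```c
// hypothetical kernel illustrating the four address-space qualifiers
__kernel void scale(__global float* data,       // data in global memory
                    __constant float* factor,   // data in constant memory
                    __local float* scratch) {   // scratch in local memory
    __private int i = get_global_id(0);         // explicitly __private
    int j = get_local_id(0);                    // implicitly __private
    scratch[j] = data[i] * factor[0];
    barrier(CLK_LOCAL_MEM_FENCE);               // see Synchronization below
    data[i] = scratch[j];
}
```

The pointer parameters data, factor, and scratch are themselves __private; only the memory that they address lies in the named spaces.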


Synchronization

The OpenCL intrinsic function for synchronizing the execution of kernel code on work-items within a work-group is

 barrier(CLK_LOCAL_MEM_FENCE);

A call to this function has the same effect as a call to __syncthreads() in a CUDA kernel.

Example

The following kernel computes the product of two square matrices using tiling and local memory:

 #define TILE_WIDTH 16  // tile width

 // assumes that width is a multiple of TILE_WIDTH and that the global
 // work size is width x width, so that every work-item in a work-group
 // reaches both barriers
 __kernel void matMul(__global const float* a, __global const float* b,
  __global float* c, int width) {
     int tx  = get_local_id(0);
     int ty  = get_local_id(1);
     int row = get_global_id(0);
     int col = get_global_id(1);
     __local float s_a[TILE_WIDTH][TILE_WIDTH];
     __local float s_b[TILE_WIDTH][TILE_WIDTH];
     if (row < width && col < width) {
         float sum = 0.0f;
         for (int m = 0; m < width / TILE_WIDTH; m++) {
             s_a[tx][ty] = a[row * width + m * TILE_WIDTH + ty];
             s_b[tx][ty] = b[(m * TILE_WIDTH + tx) * width + col]; 
             barrier(CLK_LOCAL_MEM_FENCE);
             for (int k = 0; k < TILE_WIDTH; k++)
                 sum += s_a[tx][k] * s_b[k][ty];
             barrier(CLK_LOCAL_MEM_FENCE);
         }
         c[row * width + col] = sum;
     }
 }

The locally declared variables tx, ty, row, col, and sum are __private variables, as are the kernel's parameters, including width (the pointer parameters themselves are __private; the data that they address resides in __global memory). 
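Since the index arithmetic is the error-prone part of tiling, a host-side C sketch can replay it sequentially and compare the result against a naive multiplication.  The helper names matMulNaive and matMulTiled are hypothetical, and TILE is kept small for the check:

```c
#define TILE 4  // small tile for a host-side check

// naive reference: c = a * b for width x width matrices
void matMulNaive(const float* a, const float* b, float* c, int width) {
    for (int row = 0; row < width; row++)
        for (int col = 0; col < width; col++) {
            float sum = 0.0f;
            for (int k = 0; k < width; k++)
                sum += a[row * width + k] * b[k * width + col];
            c[row * width + col] = sum;
        }
}

// the kernel's index arithmetic with the work-item loops made explicit;
// assumes width is a multiple of TILE, as the kernel does
void matMulTiled(const float* a, const float* b, float* c, int width) {
    float s_a[TILE][TILE], s_b[TILE][TILE];
    for (int rowT = 0; rowT < width; rowT += TILE)
        for (int colT = 0; colT < width; colT += TILE) {
            float sum[TILE][TILE] = {{0.0f}};  // one output tile
            for (int m = 0; m < width / TILE; m++) {
                // each (tx, ty) pair plays the role of one work-item
                for (int tx = 0; tx < TILE; tx++)
                    for (int ty = 0; ty < TILE; ty++) {
                        int row = rowT + tx, col = colT + ty;
                        s_a[tx][ty] = a[row * width + m * TILE + ty];
                        s_b[tx][ty] = b[(m * TILE + tx) * width + col];
                    }
                for (int tx = 0; tx < TILE; tx++)
                    for (int ty = 0; ty < TILE; ty++)
                        for (int k = 0; k < TILE; k++)
                            sum[tx][ty] += s_a[tx][k] * s_b[k][ty];
            }
            for (int tx = 0; tx < TILE; tx++)
                for (int ty = 0; ty < TILE; ty++)
                    c[(rowT + tx) * width + (colT + ty)] = sum[tx][ty];
        }
}
```

Both versions add the partial products in the same k order, so for small integer-valued inputs the results match exactly.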


Compilation Errors

OpenCL supports a relatively straightforward mechanism for capturing compiler-generated error messages during the building of kernel executable code.

The clGetProgramBuildInfo() API function returns the compiler information messages generated during the building of the kernels for the installed device(s). 

The prototype for clGetProgramBuildInfo() is:

 cl_int                                      // error status
 clGetProgramBuildInfo(
     cl_program program,         // handle to the program
     cl_device_id device,        // specifies the device
     cl_program_build_info info, // specifies the information to query
     size_t size,                // size in bytes of memory stream
     void* ptr,                  // address where stream is to be copied 
     size_t* actual_size)        // actual size in bytes of data copied

We call this function twice: once to determine the size of the message stream and the second time to retrieve the message stream. 

 cl_int  status = CL_SUCCESS;
 cl_uint numDevices;   // set in platform model interrogation
 cl_device_id* device; // set in platform model interrogation
 cl_program program;   // set in execution model creation

 // build the program for all installed devices
 status = clBuildProgram(program, numDevices, device, NULL, NULL, NULL);

 if (status == CL_BUILD_PROGRAM_FAILURE) {
     // Determine the size of the message stream
     size_t log_size;
     char* log = NULL;
     clGetProgramBuildInfo(program, device[0], CL_PROGRAM_BUILD_LOG, 0,
      NULL, &log_size);

     // Allocate memory for the message stream
     log = (char*)malloc(log_size);

     // Copy the message stream
     clGetProgramBuildInfo(program, device[0], CL_PROGRAM_BUILD_LOG,
      log_size, log, NULL);

     // Print the message stream
     std::cerr << log << std::endl;

     // Deallocate the memory for the message stream
     free(log);
 }

Profiling

The OpenCL mechanism for profiling kernel execution relies on event processing (using cl_events).

Command Queue

We enable profiling of a particular command queue through the third argument in a call to the clCreateCommandQueue() API function, which defines the properties of the queue.  Passing CL_QUEUE_PROFILING_ENABLE as that argument turns on profiling. 

 cl_int status = CL_SUCCESS; // holds error code
 cl_command_queue cmdQueue;

 cmdQueue = clCreateCommandQueue(context, device[0],
  CL_QUEUE_PROFILING_ENABLE, &status);
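Note that OpenCL 2.0 deprecated clCreateCommandQueue().  On 2.0 and later platforms the equivalent call, sketched below under the same variable names, is clCreateCommandQueueWithProperties(), which takes a zero-terminated properties list:

```c
 // OpenCL 2.0+ replacement for clCreateCommandQueue()
 cl_queue_properties props[] = {
     CL_QUEUE_PROPERTIES, CL_QUEUE_PROFILING_ENABLE,
     0                              // terminates the properties list
 };
 cmdQueue = clCreateCommandQueueWithProperties(context, device[0], props,
                                               &status);
```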

Kernel Launch

We gather profiling information through the last argument in a call to the clEnqueueNDRangeKernel() API function.  As this argument we pass the address of a cl_event.  If this address is not NULL, the API function populates the event object, which the application can later query for profiling data. 

The clGetEventProfilingInfo() API function returns the profiling information requested by the second argument in the call to this function. 

The prototype for clGetEventProfilingInfo() is:

 cl_int                                      // error status
 clGetEventProfilingInfo(
     cl_event event,             // the event object
     cl_profiling_info info,     // specifies the profiling data to query
     size_t size,                // size in bytes of memory stream
     void* ptr,                  // address where stream is to be copied 
     size_t* actual_size)        // actual size in bytes of data copied

We call clGetEventProfilingInfo() once for the start and once for the end of the profile and determine the time consumed during the execution of a kernel by the difference in the values returned as shown below. 

 cl_int status = CL_SUCCESS;
 cl_command_queue cmdQueue; // set in creating execution model
 cl_kernel kernel;          // set in creating program
 const int ntpb = 16;       // number of work units per workgroup
 int n;                     // set from command line and ntpb

 size_t globalWorkSize[3] = {(size_t)n, (size_t)n, 1};
 size_t localWorkSize[3]  = {ntpb, ntpb, 1};

 cl_event prof_event;
 status = clEnqueueNDRangeKernel(cmdQueue, kernel, 2, NULL,
                                 globalWorkSize, localWorkSize, 0,
                                 NULL, &prof_event);
 checkError(status, "Enqueue NDRange Kernel"); // error handling function 

 cl_ulong ev_start = (cl_ulong)0;
 cl_ulong ev_end   = (cl_ulong)0;
 clFinish(cmdQueue); // ensures that kernel execution has finished
 status = clWaitForEvents(1, &prof_event); // synchronizes host with prof_event

 status |= clGetEventProfilingInfo(prof_event, CL_PROFILING_COMMAND_START, 
  sizeof(cl_ulong), &ev_start, NULL);
 status |= clGetEventProfilingInfo(prof_event, CL_PROFILING_COMMAND_END,
  sizeof(cl_ulong), &ev_end, NULL);
 checkError(status, "Profiling");  // error handling function

 std::cout << " kernel took " << (float)(ev_end - ev_start) / 1000
  << " microseconds" << std::endl; // timestamps are in nanoseconds

Exercises






Designed by Chris Szalwinski
Creative Commons License