OpenCL defines an abstract memory model for programmers to target
memory architectures and for vendors to map to specific hardware.
Like CUDA, OpenCL includes a mechanism for synchronizing the execution
of work-items in a workgroup. The runtime includes API functions
that collect compiler error messages and profiling statistics.
This chapter describes the OpenCL memory model, along with the syntax
for synchronizing work-item execution within a work-group, for
retrieving compiler error messages generated at runtime, and for
profiling kernel execution.
Memory Spaces
The OpenCL memory model defines four disjoint address spaces:
- Global
- Constant
- Local
- Private
The table below compares these memory spaces to those in the CUDA
Memory Model.
OpenCL     OpenCL keyword   Scope             CUDA       CUDA keyword
Global     __global         Kernel-wide       Global
Constant   __constant       Kernel-wide       Constant   __constant__
Local      __local          Work-group-wide   Shared     __shared__
Private    __private        Work-item-wide    Local
Global memory is visible to all of the compute units on a device.
All transfers between the host and the device are transfers to and from
the device's global memory. The keyword __global
added to a parameter pointer declaration identifies the address of data stored
in global memory.
Constant memory is also visible to all of the compute units on the
device. Any element of constant memory is simultaneously accessible
by all work-items. Constant memory is part of global memory.
The keyword __constant added to a pointer parameter
declaration identifies the address of data stored in constant memory.
Local memory is memory that belongs to a compute unit and is presumably
implemented on-chip. Local memory is sharable by all of the work-items
within a workgroup. The keyword __local added to
a declaration identifies the data as stored in local memory.
Private memory is memory that belongs to a work-item and is presumably
implemented on-chip in registers. The keyword __private
added to a data declaration identifies the data as stored in private memory.
Note however that both private arrays and any spilled registers reside in global
memory.

All function parameters in an OpenCL kernel are in the
__private address space.
Function parameters of pointer type can only point to
__global, __local
or __constant address spaces. The
default space for locally declared variables is
__private.
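As an illustration, the following device-code sketch uses all four address-space qualifiers in one kernel (illustrative only; the kernel name scale and its arguments are placeholders, not taken from the examples below):

```
// illustrative OpenCL C fragment; kernel and variable names are placeholders
__kernel void scale(__global float* out,        // data in global memory
                    __constant float* factors,  // data in constant memory
                    __local float* scratch) {   // scratch space in local memory
    int i = get_global_id(0);
    float f = factors[0];          // f is __private by default
    scratch[get_local_id(0)] = out[i];
    barrier(CLK_LOCAL_MEM_FENCE);
    out[i] = scratch[get_local_id(0)] * f;
}
```

Note that the __local pointer argument receives its size from the host through clSetKernelArg with a NULL argument value.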
Synchronization
The OpenCL intrinsic function for synchronizing the execution of kernel
code on work-items within a work-group is
barrier(CLK_LOCAL_MEM_FENCE);
A call to this function has the same effect as a call
to __syncthreads() in a CUDA kernel.
Example
The following kernel calculates the coefficients of a square matrix as
the product of two square matrices, using tiling and local memory:
#define TILE_WIDTH 16 // tile width

__kernel void matMul(__global const float* a, __global const float* b,
                     __global float* c, int width) {
    int tx = get_local_id(0);
    int ty = get_local_id(1);
    int row = get_global_id(0);
    int col = get_global_id(1);
    __local float s_a[TILE_WIDTH][TILE_WIDTH];
    __local float s_b[TILE_WIDTH][TILE_WIDTH];
    // width is assumed to be a multiple of TILE_WIDTH, so every
    // work-item in the work-group reaches both barriers below
    if (row < width && col < width) {
        float sum = 0.0f;
        for (int m = 0; m < width / TILE_WIDTH; m++) {
            // each work-item loads one element of each tile
            s_a[tx][ty] = a[row * width + m * TILE_WIDTH + ty];
            s_b[tx][ty] = b[(m * TILE_WIDTH + tx) * width + col];
            barrier(CLK_LOCAL_MEM_FENCE); // wait until both tiles are loaded
            for (int k = 0; k < TILE_WIDTH; k++)
                sum += s_a[tx][k] * s_b[k][ty];
            barrier(CLK_LOCAL_MEM_FENCE); // wait before overwriting the tiles
        }
        c[row * width + col] = sum;
    }
}
The kernel parameter width and the locally declared
variables tx, ty,
row, col,
and sum
are all __private variables.
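The tiling arithmetic in the kernel can be checked on the host with a plain C version that mirrors the same loop structure (a sketch with the device-specific details removed; matMulHost and the small tile size are illustrative, not part of the kernel above):

```c
#include <assert.h>

#define TILE_WIDTH 2 // small tile for a host-side check

/* Host-side mirror of the kernel's tiling loop: for each output
   element, accumulate the product one TILE_WIDTH-wide slice at a
   time, exactly as the kernel does between its two barriers.
   width is assumed to be a multiple of TILE_WIDTH. */
void matMulHost(const float* a, const float* b, float* c, int width) {
    for (int row = 0; row < width; row++) {
        for (int col = 0; col < width; col++) {
            float sum = 0.0f;
            for (int m = 0; m < width / TILE_WIDTH; m++)
                for (int k = 0; k < TILE_WIDTH; k++)
                    sum += a[row * width + m * TILE_WIDTH + k] *
                           b[(m * TILE_WIDTH + k) * width + col];
            c[row * width + col] = sum;
        }
    }
}
```

Because the tiled loop visits the inner index k in the same order as a straightforward triple loop, the two versions produce identical results, which makes the host check a convenient sanity test for the indexing.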
Compilation Errors
OpenCL supports a relatively straightforward mechanism for capturing compiler-generated
error messages during the building of kernel executable code.
The clGetProgramBuildInfo() API function
returns the compiler information messages generated during
the building of the kernels for the installed device(s).
The prototype for clGetProgramBuildInfo() is:
cl_int                                // error status
clGetProgramBuildInfo(
    cl_program program,               // handle to the program
    cl_device_id device,              // specifies the device
    cl_program_build_info param_name, // specifies the information to query
    size_t size,                      // size in bytes of memory stream
    void* ptr,                        // address where stream is to be copied
    size_t* actual_size)              // actual size in bytes of data copied
We call this function twice: once to determine the size of the
message stream and a second time to retrieve it.
cl_int status = CL_SUCCESS;
cl_uint numDevices;     // set in platform model interrogation
cl_device_id* device;   // set in platform model interrogation
cl_program program;     // set in execution model creation

// build the program for all installed devices
status = clBuildProgram(program, numDevices, device, NULL, NULL, NULL);
if (status == CL_BUILD_PROGRAM_FAILURE) {
    // determine the size of the message stream
    size_t log_size;
    clGetProgramBuildInfo(program, device[0], CL_PROGRAM_BUILD_LOG, 0,
        NULL, &log_size);
    // allocate memory for the message stream
    char* log = (char*)malloc(log_size);
    // copy the message stream
    clGetProgramBuildInfo(program, device[0], CL_PROGRAM_BUILD_LOG,
        log_size, log, NULL);
    // print the message stream
    std::cerr << log << std::endl;
    // deallocate the memory for the message stream
    free(log);
}
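This query-the-size-first, then-fetch idiom is common across the OpenCL ...Info functions. The same two-step pattern can be exercised without an OpenCL device using C's snprintf, which likewise reports the required size when given a zero-length buffer (a standalone illustration; formatMessage is a hypothetical helper, not an OpenCL API):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Two-step pattern: the first call reports the required size,
   the second call fills the allocated buffer. */
char* formatMessage(int code) {
    // first call: NULL buffer, size 0 -> returns the needed length
    int needed = snprintf(NULL, 0, "build failed with code %d", code);
    char* buf = (char*)malloc((size_t)needed + 1);
    // second call: the buffer is now exactly large enough
    snprintf(buf, (size_t)needed + 1, "build failed with code %d", code);
    return buf; // caller frees
}
```

The design avoids guessing a buffer size: the producer of the data, not its consumer, decides how much memory is required.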
Profiling
The OpenCL mechanism for profiling kernel execution relies on event processing
(using cl_events).
Command Queue
We enable profiling of a particular command queue through the third argument in a
call to the clCreateCommandQueue() API function, which
defines the properties of the queue. Passing
CL_QUEUE_PROFILING_ENABLE as that argument turns on
profiling.
cl_int status = CL_SUCCESS; // holds error code
cl_command_queue cmdQueue;
cmdQueue = clCreateCommandQueue(context, device[0],
    CL_QUEUE_PROFILING_ENABLE, &status);
Kernel Launch
We gather profiling information through the last argument to a call to the
clEnqueueNDRangeKernel() API function. As the
argument we pass the address of a cl_event.
If this address is not NULL, this API function populates the event object,
which the application can later query for profiling information.
The clGetEventProfilingInfo() API function returns
the profiling information requested by the second argument in the call to
this function.
The prototype for clGetEventProfilingInfo() is:
cl_int                            // error status
clGetEventProfilingInfo(
    cl_event event,               // the event object
    cl_profiling_info param_name, // specifies the information to query
    size_t size,                  // size in bytes of memory stream
    void* ptr,                    // address where stream is to be copied
    size_t* actual_size)          // actual size in bytes of data copied
We call clGetEventProfilingInfo()
once for the start and once for the end of the profile and
determine the time consumed during the execution of a kernel
by the difference in the values returned as shown below.
cl_int status = CL_SUCCESS;
cl_command_queue cmdQueue;  // set in creating execution model
cl_kernel kernel;           // set in creating program
const int ntpb = 16;        // number of work-items per work-group
int n;                      // set from command line and ntpb
size_t globalWorkSize[3] = {(size_t)n, (size_t)n, 1};
size_t localWorkSize[3] = {ntpb, ntpb, 1};

cl_event prof_event;
status = clEnqueueNDRangeKernel(cmdQueue, kernel, 2, NULL,
    globalWorkSize, localWorkSize, 0, NULL, &prof_event);
checkError(status, "Enqueue NDRange Kernel"); // error handling function

cl_ulong ev_start = (cl_ulong)0;
cl_ulong ev_end = (cl_ulong)0;
clFinish(cmdQueue); // ensures that kernel execution has finished
status = clWaitForEvents(1, &prof_event); // synchronizes host with prof_event
status |= clGetEventProfilingInfo(prof_event, CL_PROFILING_COMMAND_START,
    sizeof(cl_ulong), &ev_start, NULL);
status |= clGetEventProfilingInfo(prof_event, CL_PROFILING_COMMAND_END,
    sizeof(cl_ulong), &ev_end, NULL);
checkError(status, "Profiling"); // error handling function

// profiling timestamps are reported in nanoseconds
std::cout << " kernel took " << (float)(ev_end - ev_start) / 1000
    << " microseconds" << std::endl;
Exercises