OpenCL and NVIDIA


OpenCL is a framework for writing programs that can execute on multiple devices in parallel. The standard was initially developed by Apple and has since been opened up under the governance of the Khronos Group. It describes what the API of a device driver must look like in order to support OpenCL. On paper the standard makes a lot seem possible, but when using OpenCL on a specific platform you have to be very careful about how you address it. Many vendors supply and support an OpenCL framework, each with their own implementation: AMD, NVIDIA, Apple, Intel, and IBM all support OpenCL. This is where the challenge starts.

The project we are working on aims to use multiple platforms and multiple devices. In OpenCL you first have to create a context that contains all the platforms and devices you want to use. Querying for all available devices is not as straightforward as the OpenCL specification implies. The specification states that when no properties are passed to the context-creation functions, the platform should decide for itself what the result should be. This is odd, because when querying across multiple platforms you really want the calls to behave the same everywhere. We solved this issue by first querying all platforms, then querying all devices per platform, and constructing a single device list. That device list is then used to create the context, so the context knows all platforms and devices before executing any programs.
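The steps above can be sketched in host code as follows. This is a minimal illustration, not our actual implementation: it assumes an OpenCL SDK (`CL/cl.h`) is available, uses an arbitrary fixed device array size, and abbreviates error handling. Note that whether a single context may span devices from different platforms is implementation-defined when no context properties are given; in practice you may need one context per platform.

```c
/* Sketch: enumerate every device on every platform into one list,
 * then hand that list to clCreateContext. Requires an OpenCL SDK. */
#include <CL/cl.h>
#include <stdlib.h>

cl_context create_context_all_devices(cl_uint *num_devices_out)
{
    cl_uint num_platforms = 0;
    clGetPlatformIDs(0, NULL, &num_platforms);
    cl_platform_id *platforms = malloc(num_platforms * sizeof(cl_platform_id));
    clGetPlatformIDs(num_platforms, platforms, NULL);

    /* Collect all devices of all platforms into a single list. */
    cl_uint total = 0;
    cl_device_id devices[64];  /* assumed upper bound for this sketch */
    for (cl_uint p = 0; p < num_platforms; ++p) {
        cl_uint n = 0;
        clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 0, NULL, &n);
        clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, n,
                       devices + total, NULL);
        total += n;
    }

    /* Create one context that knows every device up front. */
    cl_int err;
    cl_context ctx = clCreateContext(NULL, total, devices, NULL, NULL, &err);
    free(platforms);
    *num_devices_out = total;
    return (err == CL_SUCCESS) ? ctx : NULL;
}
```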

Having a standard API does not mean that every vendor will implement it accordingly, especially when the standard leaves room for interpretation. During the compositor redesign we wanted to use asynchronous calls to keep the command queue as busy as possible: first load all data onto the device, schedule the kernel with an event, then schedule a native kernel to read the result back. While the device is busy executing one kernel, the next data and kernel can already be scheduled, making sure the command queue always has some work. The problems started when testing on NVIDIA's platform: native kernels are not supported there. Since NVIDIA's platform is built on top of its CUDA platform, and CUDA does not have native kernels, that might be the reason, but we don't know for sure. The standard does state that native kernels are optional, so you will have to find out for yourself who supports them and who does not.
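Because native kernels are optional, it is worth probing for them at startup rather than discovering the gap at enqueue time. A minimal check, assuming an OpenCL SDK is available, looks like this:

```c
/* Sketch: ask a device whether clEnqueueNativeKernel will work on it,
 * via the CL_DEVICE_EXECUTION_CAPABILITIES device-info query. */
#include <CL/cl.h>

int supports_native_kernels(cl_device_id dev)
{
    cl_device_exec_capabilities caps = 0;
    clGetDeviceInfo(dev, CL_DEVICE_EXECUTION_CAPABILITIES,
                    sizeof(caps), &caps, NULL);
    return (caps & CL_EXEC_NATIVE_KERNEL) != 0;
}
```

On devices where this returns 0 (NVIDIA's, in our experience), one alternative is to enqueue a non-blocking `clEnqueueReadBuffer` after the kernel and react to its completion event instead of using a native kernel for the read-back.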

In general NVIDIA's OpenCL platform is built on top of the CUDA platform, and this is also visible when compiling and loading OpenCL kernels onto devices. To optimize kernel execution we wanted to compile some variables inline and implement a cache of already built kernels. OpenCL offers the ability to download the compiled code for a specific device. But when you do this on NVIDIA's platform you do not get the actual binary that is loaded on the device; you get a (CUDA) intermediate binary called PTX. NVIDIA uses this to share the same binary across different device families, but loading it back onto the device requires additional time to compile the intermediate binary into the actual device binary. This takes time and reduces performance.
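Downloading that "binary" is done through `clGetProgramInfo`. The sketch below, assuming a program built for a single device and omitting error handling, shows the two-step query; the returned buffer is what you would cache and later feed to `clCreateProgramWithBinary`. On NVIDIA it typically contains PTX text rather than final machine code, which is why reloading it still incurs a compilation step.

```c
/* Sketch: fetch the per-device program binary after clBuildProgram so it
 * can be cached. Assumes one device; error handling omitted. */
#include <CL/cl.h>
#include <stdlib.h>

unsigned char *get_program_binary(cl_program prog, size_t *size_out)
{
    size_t size = 0;
    clGetProgramInfo(prog, CL_PROGRAM_BINARY_SIZES,
                     sizeof(size), &size, NULL);

    unsigned char *binary = malloc(size);
    /* CL_PROGRAM_BINARIES expects an array of pointers, one per device. */
    unsigned char *binaries[1] = { binary };
    clGetProgramInfo(prog, CL_PROGRAM_BINARIES,
                     sizeof(binaries), binaries, NULL);

    *size_out = size;
    return binary;
}
```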

I would not say that NVIDIA's platform is a bad one; they just made some decisions that limit OpenCL's usage. If you use OpenCL and want to support multiple platforms, we advise testing on all of those platforms during development. The outcome of those tests may lead you to redesign your application.
