The Blender compositor uses nodes: independent building blocks that can be connected into a network. The input of the compositor is a rendered image and the output is the final image. Blender can show and track intermediate results, which can be displayed by the Viewer node during compositing. These and other workflow features make it hard to optimize the network, especially when using GPU computing.
The current workflow makes it complicated to optimize a compositing network using GPU computing. The best way to get results with standard home equipment is to optimize per node. The transition between nodes is always handled by the CPU, so when using the GPU there are many host-to-device transfers: for every node, data is uploaded to the device, the device performs its calculation (executing the node), and finally the result is downloaded from the device. The overhead of uploading and downloading can exceed the gain in computation speed, making the current CPU implementation faster than the GPU implementation.
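The effect of this per-node transfer overhead can be illustrated with a simple cost model. This is a sketch with made-up numbers, not measurements from Blender: it only shows how per-pixel transfer cost multiplied over every node in a chain can outweigh a faster per-pixel compute time.

```python
# Illustrative cost model (hypothetical numbers, not Blender code):
# per-node GPU execution pays an upload and a download for every node.

def gpu_chain_time(nodes, pixels, upload_s_px, compute_s_px, download_s_px):
    """Total time when every node uploads, computes, and downloads."""
    per_node = pixels * (upload_s_px + compute_s_px + download_s_px)
    return nodes * per_node

def cpu_chain_time(nodes, pixels, cpu_s_px):
    """Total time when the CPU executes each node in place, no transfers."""
    return nodes * pixels * cpu_s_px

# Assume the GPU computes 10x faster per pixel, but each node also pays
# two transfers; with these assumed costs the CPU chain wins overall.
px = 1920 * 1080
gpu = gpu_chain_time(10, px, upload_s_px=2e-9, compute_s_px=1e-10,
                     download_s_px=2e-9)
cpu = cpu_chain_time(10, px, cpu_s_px=1e-9)
```

With these assumed costs, `gpu` comes out larger than `cpu` even though the GPU's pure compute term is an order of magnitude smaller.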
Another option is to make use of multi-core CPUs. The overhead of a multi-core implementation lies in the task scheduler and the task merger. Because this overhead is much smaller than that of GPU transfers, a multi-core implementation is likely to be faster for most nodes; there are only a few nodes where GPU computation will be much faster.
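The scheduler/merger pattern can be sketched as follows. This is an assumed design for illustration, not Blender's actual scheduler: the output image is split into horizontal bands (the scheduler), each band is executed by a worker, and the results are concatenated back together (the merger).

```python
# Sketch of a per-node multi-core scheduler and merger (assumed design,
# not Blender's implementation).
from concurrent.futures import ThreadPoolExecutor

def run_node_multicore(node_fn, image, workers=4):
    rows = len(image)
    band = (rows + workers - 1) // workers
    # Scheduler: split the image into horizontal bands, one task each.
    bands = [image[i:i + band] for i in range(0, rows, band)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(node_fn, bands))  # execute the node
    # Merger: concatenate the processed bands back into one image.
    return [row for part in results for row in part]

# Hypothetical "invert" node operating on a band of grayscale rows.
invert = lambda band: [[1.0 - px for px in row] for row in band]
img = [[0.0, 0.25], [0.5, 1.0], [0.75, 0.125]]
out = run_node_multicore(invert, img, workers=2)
```

The only extra work compared with a single-threaded run is the band split and the final concatenation, which is why this overhead stays small relative to the node's own computation.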
The next table shows the expected best option for each node on a standard home PC (a dual-core or quad-core CPU with a mid-range NVIDIA GeForce).
The next step in our research is to implement a multi-core version of some nodes to allow better comparisons. Most nodes use a pixel_processor: an implementation where Blender iterates over every output pixel and uses a callback to determine the pixel color. With some simple adjustments these pixel_processors can be made multi-core enabled. During tests we have found that this implementation pays off on 4-core machines, or for node executions that take longer than 3 seconds.
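A multi-core-enabled pixel_processor can be sketched like this. The names and the scanline-based split are illustrative assumptions, not Blender's actual API: the point is that the per-pixel callback stays unchanged while the outer loop over scanlines is distributed across a worker pool.

```python
# Sketch of a multi-core pixel_processor (illustrative names, not
# Blender's API): a callback computes one output pixel from its
# coordinates, and scanlines are distributed over the worker pool.
from concurrent.futures import ThreadPoolExecutor

def pixel_processor(width, height, callback, workers=4):
    def do_scanline(y):
        # One task per scanline; the callback decides each pixel color.
        return [callback(x, y) for x in range(width)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(do_scanline, range(height)))

# Hypothetical callback: a horizontal gradient "node".
gradient = lambda x, y: x / 4.0
img = pixel_processor(4, 2, gradient)
```

Because `pool.map` preserves order, the scanlines come back in the right sequence without any extra merging step, which keeps the adjustment to existing pixel_processors small.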