Abstract
State-of-the-art automatic optimization of OpenCL applications focuses on improving the performance of individual compute kernels. Programmers address opportunities for inter-kernel optimization in specific applications by ad hoc hand tuning: manually fusing kernels together. However, the complexity of interactions between host and kernel code makes this approach fragile or even unviable for applications involving more than a small number of kernel invocations or highly dynamic control flow, leaving substantial optimization opportunities unexplored. It also leads to an overly complex, hard-to-maintain code base.
We present Helium, a transparent OpenCL overlay which discovers, manipulates and exploits opportunities for inter- and intra-kernel optimization. Helium is implemented as a preloaded library and uses a delay-optimize-replay mechanism in which kernel calls are intercepted, collectively optimized, and then executed according to an improved execution plan. This allows us to benefit from composite optimizations on large, dynamically complex applications, with no impact on the code base. Our results show that Helium obtains at least the same, and frequently better, performance than carefully hand-tuned code. Helium outperforms hand-optimized code where the exact dynamic composition of compute kernels cannot be known statically. In these cases, we demonstrate speedups of up to 3x over unoptimized code and an average speedup of 1.4x over hand-optimized code.
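The delay-optimize-replay idea described in the abstract can be sketched as follows. This is an illustrative toy in Python, not Helium's actual implementation (which intercepts real OpenCL API calls from a preloaded library): kernel invocations are delayed into a pending list, adjacent element-wise "kernels" are fused into a single pass by an assumed `_fuse` step, and the optimized plan is then replayed over the data.

```python
# Illustrative sketch of a delay-optimize-replay overlay (assumption:
# kernels modeled as Python callables; Helium itself works on OpenCL calls).

class DelayOptimizeReplay:
    """Delays kernel invocations, fuses adjacent element-wise kernels,
    then replays the improved execution plan."""

    def __init__(self):
        self.pending = []  # delayed (name, fn) kernel invocations

    def enqueue(self, name, fn):
        # Delay: record the call instead of executing it immediately.
        self.pending.append((name, fn))

    def _fuse(self, calls):
        # Optimize: fuse consecutive element-wise kernels into one pass,
        # avoiding intermediate buffer round-trips.
        if not calls:
            return []
        fused_name = "+".join(name for name, _ in calls)
        fns = [fn for _, fn in calls]

        def fused(x):
            for fn in fns:
                x = fn(x)
            return x

        return [(fused_name, fused)]

    def replay(self, data):
        # Replay: execute the optimized plan on the input data.
        plan = self._fuse(self.pending)
        self.pending = []
        for _, kernel in plan:
            data = [kernel(v) for v in data]
        return data


overlay = DelayOptimizeReplay()
overlay.enqueue("scale", lambda v: v * 2)
overlay.enqueue("offset", lambda v: v + 1)
result = overlay.replay([1, 2, 3])
print(result)  # [3, 5, 7]
```

In the real system the "delay" step corresponds to intercepting `clEnqueueNDRangeKernel` and related calls, and the fusion step operates on kernel code rather than Python closures; the sketch only shows the control-flow shape of the mechanism.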
Original language | English |
---|---|
Title of host publication | Proceedings of the 8th Workshop on General Purpose Processing Using GPUs |
Place of Publication | New York |
Publisher | Association for Computing Machinery |
Pages | 70-80 |
Number of pages | 11 |
ISBN (Electronic) | 978-1-4503-3407-5 |
DOIs | |
Publication status | Published - 7 Feb 2015 |
Event | 8th Workshop on General Purpose Processing Using GPUs - San Francisco, United States Duration: 7 Feb 2015 → 8 Feb 2015 |
Conference
Conference | 8th Workshop on General Purpose Processing Using GPUs |
---|---|
Abbreviated title | GPGPU 2015 |
Country/Territory | United States |
City | San Francisco |
Period | 7/02/15 → 8/02/15 |
Keywords
- GPGPU
- JIT compilation
- OpenCL
- inter-kernel optimization
- profiling
- staging