Abstract
This paper describes our experience in implementing the classical N-body algorithm in SaC and analysing the runtime performance achieved on three different machines: a dual-processor 8-core Dell PowerEdge 2950 (a Beowulf cluster node, the reference machine), a quad-core hyper-threaded Intel Core-i7 based system equipped with an NVidia GTX-480 graphics accelerator and an Oracle Sparc T4-4 server with a total of 256 hardware threads. We contrast our findings with those resulting from the reference C code and a few variants of it that employ OpenMP pragmas as well as explicit vectorisation. Our experiments demonstrate that the SaC implementation successfully combines a high level of abstraction, very close to the mathematical specification, with very competitive runtimes. In fact, SaC matches or outperforms the hand-vectorised and hand-parallelised C codes on all three systems under investigation without the need for any source code modification. Furthermore, only SaC is able to effectively harness the advanced compute power of the graphics accelerator, again by mere recompilation of the same source code. Our results illustrate the benefits that SaC provides to application programmers in terms of coding productivity, source code, and performance portability among different machine architectures, as well as long-term maintainability in evolving hardware environments. Copyright © 2013 John Wiley & Sons, Ltd.
Original language | English |
---|---|
Pages (from-to) | 952-971 |
Number of pages | 20 |
Journal | Concurrency and Computation: Practice and Experience |
Volume | 26 |
Issue number | 4 |
DOIs | |
Publication status | Published - 25 Mar 2014 |
Keywords
- auto-parallelisation
- data parallelism
- functional programming
- graphics cards
- high performance
- high productivity
- single assignment C
- vectorisation