FHPNC 2019- Proceedings of the 8th ACM SIGPLAN International Workshop on Functional High-Performance and Numerical Computing

Full Citation in the ACM Digital Library


Generating efficient FFT GPU code with Lift

The Fast Fourier Transform is a well-known algorithm used in many high-performance applications, ranging from signal processing to convolutional neural networks.

In this paper, we encode FFTs by building high-level abstractions based on a set of functional parallel patterns in the Lift language. Abstractions are derived from and closely resemble mathematical definitions for FFTs. We leverage the Lift performance-portable code generator to generate high performing GPU code for FFTs. No FFT-specific patterns are required for this, showing the expressive power of the generic parallel patterns used in Lift.

Our experimental results show that our approach achieves performance better than AMD's OpenCL implementation clFFT on an Nvidia GPU. Nvidia's highly optimized cuFFT implementation still performs better on their GPUs.

Position-dependent arrays and their application for high performance code generation

Modern parallel hardware promises unprecedented performance, for the gifted few experts who can program it correctly. Code generators from high-level languages provide an attractive alternative, promising to deliver high performance automatically.

Existing projects such as Accelerate, Futhark, Halide, or Lift show that this approach is feasible. Unfortunately, existing efforts focus on computations over tensors: regularly shaped higher dimensional arrays. This limits the expressiveness of these approaches and excludes many interesting data structures that are commonly encoded manually in memory, such as trees or triangular matrices.

This paper presents an extended array type that lifts this restriction. For multidimensional arrays, the size of a nested array might depend on its position in the surrounding arrays, enabling the expression of computations over less regularly shaped data structures. However, position-dependent arrays bring new challenges for high-performance code generation, as indexing elements in memory becomes more challenging.

This paper shows how these challenges are addressed by extending the existing Lift type system and compiler. The experimental results show that this approach enables the efficient code generation of triangular matrix-vector multiplication, with performance improvements over cuBLAS on an Nvidia GPU by up to 2×. Furthermore, we show a use case for a low-level optimization for avoiding unnecessary out-of-bound checks in stencils, leading to up to 3× improvements over already optimized generated stencil codes.

Lazy evaluation in infinite-dimensional function spaces with wavelet basis

Vectors in numerical computation, i.e., arrays of numbers, often represent continuous functions. We would like to reflect this with types. One apparent obstacle is that spaces of functions are typically infinite-dimensional, while the code must run in finite time and memory.

We argue that this can be overcome: even in an infinite-dimensional space, the vectors can in practice be stored in finite memory. However, dual vectors (corresponding essentially to distributions) require infinite data structure. The distinction is usually lost in the finite dimensional case, since dual vectors are often simply represented as vectors (by implicitly choosing a scalar product establishing the correspondence). However, we shall see that an explicit type-level distinction between functions and distributions makes sense and allows directly expressing useful concepts such as the Dirac distribution, which are problematic in the standard finite-resolution picture.

The example implementation uses a very simple local basis that corresponds to a Haar Wavelet transform.

Safety at speed: in-place array algorithms from pure functional programs by safely re-using storage

We present a purely functional array programming language that offers safe, purely functional and crash-free in-place array transformations. The language supports high-level abstractions for pure and efficient array computations that fully support equational reasoning. We show how to execute selected parts of these computations safely in-place, with the compiler guaranteeing that in-place execution does not change the computation’s result. Correctness is ensured by using an off-the-shelf-theorem prover to discharge safety conditions. Our main contribution is the idea of virtual copies for expressing re-use of arrays, and techniques for verifying their safety, which allow a pure language to include in-place transformations without weakening its transparency or reasoning power.

Compositional deep learning in Futhark

We present a design pattern for composing deep learning networks in a typed, higher-order fashion. The exposed library functions are generically typed and the composition structure allows for networks to be trained (using back-propagation) and for trained networks to be used for predicting new results (using forward-propagation). Individual layers in a network can take different forms ranging over dense sigmoid layers to convolutional layers. The paper discusses different typing techniques aimed at enforcing proper use and composition of networks. The approach is implemented in Futhark, a data-parallel functional language and compiler targeting GPU architectures, and we demonstrate that Futhark's elimination of higher-order functions and modules leads to efficient generated code.