# Fully automatic adjoints of large-scale numerical simulations

Charoenwanit, Ekkapot; Naumann, Uwe (Thesis advisor); Slusanschi, Emil-Ioan (Thesis advisor)

*Aachen (2019)*

Dissertation / PhD Thesis

Dissertation, RWTH Aachen University, 2019

Abstract

This dissertation is concerned with algorithmic differentiation (AD), which is a method for algorithmically computing the derivatives of mathematical functions implemented as computer programs. Traditionally, there are a number of numerical methods for computing derivatives that scientists and engineers commonly use. One of these methods that are widely used is the method of finite differences. The method of finite differences is very straightforward and easy to implement as a computer program as it is a non-intrusive approach in the sense that it does not require any change to the function of interest. Nevertheless, since it is only an approximating method, it requires that a sufficiently small step size must be chosen so that a reasonably good approximation to the exact derivatives can be obtained. Such numerical inaccuracy due to the wrong choice of step size can potentially and adversely affect the convergence behaviour of numerical simulations. Therefore, we need a method for computing derivatives that is both efficient and exact in the sense that it does not suffer from such numerical inaccuracy. Algorithmic differentiation can address these problems because AD can be used to compute exact derivatives accurately up to machine precision. AD provides a much more computationally efficient method for computing derivatives for certain kinds of mathematical functions than the method of finite differences. There are several approaches in AD to computing derivatives for different kinds of mathematical functions. For the kind of functions where the number of outputs m is far greater than than the number of inputs n, the tangent mode of AD is often used. On the contrary, the adjoint mode of AD is often preferably used for the kind of functions where the number of inputs n is far greater than the number of outputs m. Althoughthe tangent mode is relatively cheap in terms of memory storage, it is computationally expensive for functions with m << n. Similarly, although the adjoint mode is relatively cheap in terms of runtime, it requires a relatively large amount of storage, especially for large problems. In the adjoint mode, an adjoint program records derivative information on a randomly accessed data structure called the tape, which is essentially the linearised directed acyclic graph (linearised DAG), during the so called forward sweep, and interprets the tape according to the derivative information recorded on the tape during the forward sweep to accumulate the gradient during the so called reverse sweep. During there verse sweep, adjoints are propagated in the reverse order of the control flow of the original program. In this way, adjoints are propagated from the outputs to the inputs of the adjoint program. In the case where the number of inputs n and the number of outputs m are roughly the same and much greater than one, the tangent mode of AD is preferable because it has a lower computational overhead for computing a column than the adjoint mode for computing a row of the Jacobian. Moreover, the tangent mode requires a smaller memory requirement than the adjoint mode. This dissertation aims to develop algorithms for automatically accumulating the Jacobian of any arbitrarily large linearised DAG in order to solve the problem of running out of memory when the linearised DAG is too large to fit in main memory of the machine being used. In this dissertation, we focus on adjoint problems where m << n. With an insufficient amount of memory storage, an adjoint program will eventually run out of memory at a certain point and crash when the problem size is too large to fit in main memory. To this end, check pointing can be employed in order to prevent the adjoint program from running out of memory by 1) saving the results of intermediate computations at certain points, 2) restoring these values from the checkpoints and 3) resuming execution from these checkpoints onwards. By doing this, we can avoid the need to redundantly recompute everything from scratch at the expense of a certain amount of storage for storing the values from the intermediate computations that were previously check pointed. Nevertheless, the use of check pointing schemes requires a fairly good knowledge of the adjoint mode of AD, which can be rather complicated for a beginner and sometimes for even an intermediate user. Therefore, it is highly desirable that we have a software tool for AD that can transparently solve the memory problem in an automatic fashion without the need for users to manually checkpoint and parallelise the original AD code. To deal with the a for emention edissues, we developed a fully black-box AD software tool based on operator overloading that does not require any user intervention at all and can automatically run any arbitrary numerical problemof arbitrarily large size written in the C++ programming language. The AD software tool providesusers with a new overloaded active data type together with a set of user-friendly APIs (ApplicationProgramming Interfaces). The resulting derivative code stays virtually the same as the original sourcecode except for a couple of trivial changes that need to be made to the original source code.In Chapter 3, we propose and discuss several serial implementations, namely, Serial Global VertexElimination (SGVE), Iterative Vertex Elimination (IVE) and Serial Vertex Elimination using GraphPartitioning (SVEGP).SGVE is essentially the conventional adjoint mode of AD, which generates the entire linearisedcomputational graph of a function of interest in main memory and performs vertex elimination on thegraph in the reverse order in one go. Therefore, SGVE requires a huge amount of memory for largeproblems.IVE was developed in order to attempt to reduce memory consumption. IVE manages to decreasememory consumption to a certain extent. However, it suffers from poor runtime behaviour due to thefact that it locally performs either forward or reverse vertex elimination on subgraphs once the memoryupper bound is reached. Hence, this effectively renders the resulting vertex elimination sequence across-country ordering, and causes its runtime performance to become much worse compared to thatof SGVE.Therefore, SVEGP was developed in such a way that it generates the subgraphs of the linearisedcomputational graph in the reverse order. SVEGP starts by 1) partitioning the entire linearised graphinto several subgraphs, each of which does not exceed a given memory upper bound, and 2) generatingthese subgraphs in the reverse order. After the generation of each subgraph, SVEGP will eliminatethe dead intermediate vertices from the subgraph using reverse vertex elimination. This is effectivelyequivalent to eliminating all of the intermediate vertices from the linearised computational graph usingreverse vertex elimination in one go as in the case of SGVE. By doing so, SVEGP can preserve thenumber of floating point multiplications, which remains the same as that in SGVE using reverse vertexelimination. However, its computational overhead also depends on the number of subgraphs as everytime before each subgraph can be generated, SVEGP is required to recompute from scratch by restoringthe saved inputs. Therefore, an adjoint program using SVEGP incurs a computational overhead thatgrows quadratically in the number of subgraphs.To mitigate the additional computational overhead in SVEGP that grows quadratically in the numberof subgraphs, we developed a parallel implementation called Parallelised Adjoint Accumulation(PAA) discussed in detail in Chapter 4, where the linearised DAG is partitioned into multiple subgraphs,each of which can fit in main memory. PAA uses MPI (Message Passing Interface) to distributethese subgraphs across the available processors. These processors can also perform preaccumulationon their own subgraphs in parallel, which can help further improve the runtime performance of theaccumulation process. Experimental results from applying PAA to several case studies show that PAAcan successfully run to completion without exceeding the given memory upper bounds and can solvethe runtime problem SGVE is faced with. Moreover, our experimental results show that PAA canachieve significant speedups when compared with the original serial version SVEGP, which suffersfrom a time complexity that grows quadratically in the number of subgraphs. Furthermore, the resultsalso demonstrate that PAA outperforms tangent-mode AD in terms of runtime, which performs poorlywhen the number of inputs n is sufficiently large compared with the number of outputs m.

### Identifier

- DOI: 10.18154/RWTH-2019-05051
- RWTH PUBLICATIONS: RWTH-2019-05051