🛠️

Accelerating PINN Experiments with n-TangentProp - January 2025

Kyle R. Chickering - krchickering@ucdavis.edu


✌🏻

This blog post accompanies this paper, these slides, and this GitHub repository. This blog post includes links to various sources, but the citation quality of this post is not commensurate with scientific standards. Please read the references in the paper for more formal citations.

In January of this year the world’s most famous mathematician, Terence Tao, cited PINNs (Physics Informed Neural Networks) as a promising numerical method for assisting mathematicians in the search for a solution to the Navier-Stokes Millennium Prize problem, which would garner the solver one million dollars!

The paper he cited (which can be found here) was also the motivation for the work we are discussing in this blog post. The authors use PINNs to find solutions to the Boussinesq equation, a complicated set of equations in self-similar coordinates. Their work is an important step in the quest to understand how the solutions to fluid equations break down in finite time, and these types of solutions (nonlinear and nonlocal) are extremely difficult to compute using standard numerical methods (see for example this paper, which solves a similar problem using traditional numerics).

PINNs are neural networks (the “NN” in the name), which entails its own set of peculiarities when solving problems. The deep-learning revolution of the last decade has undeniably demonstrated the effectiveness of neural networks, but their use in numerical computation presents a tradeoff: we can compute interesting solutions at the cost of an enormous amount of computational power and time. The “neural network” portion of PINNs means that these methods are well suited to parallelization on GPUs, but the “physics informed” portion means these methods take an exponential amount of compute for higher-order equations (which we will discuss below).

There are, in my view, three main impediments to mobilizing PINN techniques on an increasing number of problems:

  1. Training PINNs is highly computationally inefficient. We discuss some of the reasons for this below.
  2. There is no rigorous error analysis for PINNs in general, which makes assessing their accuracy difficult.
  3. PINN training is somewhat unique in the neural network literature in that it is highly unstable and we don’t have a clear sense of why (I am planning to write a blog post on this topic soon). Two recent results have begun the study of these instabilities, but much work remains to be done.

In this blog post (which is based on this paper, which I wrote in December) we will present a technique for addressing the first point above: the computational inefficiency of training PINNs. In particular we propose an alternative algorithm to autograd which can compute the derivatives of a neural network (the “physics informed” part of PINN) in linear rather than exponential time. Because most PINN applications take at least three derivatives, this reduction in asymptotic runtime yields significant speedups in end-to-end PINN training times.


A Quick Primer on PINN Training

Deep Feed-Forward Neural Nets

The audience for this blog post may not be familiar with neural networks, so we give a quick overview of some basic concepts in neural network theory.

One of the reasons that we care so much about neural networks is that they are universal function approximators, meaning that for a given function and a desired precision there exists a sufficiently large neural network with weights such that the neural network approximates the function within that error tolerance.

Finding these weights is always theoretically possible, but can be practically difficult. In fact, many of the advances in neural networks and deep learning over the last few decades have been advances in training models.

We typically train neural networks using an optimization method called gradient descent. Gradient descent requires a supervised learning target, which means to train a network we must know the answer for a given set of inputs. Suppose that we have a set of pairs (x_i, y_i) and a network u which makes predictions. The learning target for such a neural network could be something like

L(u)=\sum_{i}|y_i-u(x_i)|^2,

which is the mean-squared error and L is our loss function. Gradient descent updates the parameters of the neural network via the update rule

\theta \leftarrow \theta -\eta\nabla_\theta L(u),

\eta being a parameter (the learning rate) which controls how large the weight updates are. This is an extremely simplified picture of training neural networks which we hope suffices for the purposes of this blog post.
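To make the picture concrete, here is a minimal sketch of the loop described above, assuming PyTorch; the toy data, architecture, and learning rate are illustrative choices rather than anything from the paper.

```python
# Minimal gradient descent on the squared-error loss from above.
import torch

torch.manual_seed(0)
x = torch.linspace(-1.0, 1.0, 128).unsqueeze(1)   # inputs x_i
y = torch.sin(3.0 * x)                            # targets y_i (toy data)

u = torch.nn.Sequential(                          # the network u
    torch.nn.Linear(1, 24), torch.nn.Tanh(),
    torch.nn.Linear(24, 24), torch.nn.Tanh(),
    torch.nn.Linear(24, 1),
)

eta = 1e-2                                        # learning rate eta
for step in range(2000):
    loss = ((y - u(x)) ** 2).mean()               # L(u), averaged over the batch
    loss.backward()                               # populates grad_theta L(u)
    with torch.no_grad():
        for theta in u.parameters():
            theta -= eta * theta.grad             # theta <- theta - eta * grad
            theta.grad.zero_()
```

In practice one would reach for torch.optim.SGD or Adam; the explicit parameter update is written out only to mirror the update rule above.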

PINNs

PINNs are neural networks which are trained to approximate the solution to a given ODE or PDE. Because neural networks with smooth activation functions are C^\infty functional approximators, and C^\infty functions are dense in the function spaces which we generally care about (think L^2 and Sobolev spaces), we can train a neural network using gradient descent to approximate the solution of the given differential equation.

Introductory literature abounds on the topic of PINNs and is easy to find online, and we concern ourselves here only with the highest-level details.

Let u_\theta(\bm{x}) be a feed-forward, densely connected, neural network with parameters \theta. Our goal is to optimize \theta so that u_\theta is a sufficiently good C^\infty approximation to the solution u of the differential equation F(\partial^\alpha u; \bm{x})=0. We train the neural network over the discrete domain \Omega=\{\,\bm{x}_1, \bm{x}_2, \cdots, \bm{x}_N\,\} using the loss target

L(u_\theta) = \frac{1}{N}\sum_{k=1}^N|F(\partial^\alpha u_\theta; \bm{x}_k)|^2+\text{BC},\tag{1}

which is the mean-squared error (MSE) of the differential equation residual F, and where \text{BC} is an additional term which encodes the boundary conditions in the problem. Since u_\theta is C^\infty with a known derivative structure, we can exactly compute all of the derivatives \partial^\alpha u_\theta appearing above in the loss function.

In theory, the universal approximation theorem ensures that we will converge to a good approximation of the (or a) true solution to F given a neural network with sufficient capacity. In practice, optimizing this loss function proves difficult due to a poorly conditioned Hessian, and there is no general theoretical error analysis available yet.

Training PINNs is further complicated by the need to enforce boundary conditions, which introduces problems typical of multi-target optimization. There is a slew of other practical problems that pop up, and solutions to these problems are well studied in the literature (see our paper for further discussion).

We mention one more aspect of training PINNs which is relevant to the present discussion. Training a PINN based solely on the first-order loss target (1) yields slow convergence in practice, often painfully slow. While the reasons for this are varied and only partially understood, it has been shown empirically that training convergence is greatly improved by replacing the first-order loss function (1) with the higher-order Sobolev loss function

L^{(m)}_{\text{Sobolev}}(u_\theta) = \frac{1}{N}\sum_{k=1}^N\sum_{j=0}^m Q_j|\nabla_{\bm{x}}^jF(\partial^\alpha u; \bm{x}_k)|^2 + \text{BC}, \tag{2}

where the Q_j are relative weights and are treated as additional hyperparameters. What is important in the context of our present discussion is that the Sobolev loss function requires taking additional spatial derivatives during training.
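To make the role of these derivatives concrete, here is a hedged sketch in PyTorch of the targets (1) and (2); the residual F = u'' + u is a made-up stand-in for an actual equation, and boundary terms are omitted.

```python
# Sketch of the loss targets (1) and (2) for a toy residual F = u'' + u.
import torch

net = torch.nn.Sequential(
    torch.nn.Linear(1, 24), torch.nn.Tanh(),
    torch.nn.Linear(24, 24), torch.nn.Tanh(),
    torch.nn.Linear(24, 1),
)

x = torch.linspace(0.0, 1.0, 64).unsqueeze(1).requires_grad_(True)

def d(f, x):
    # One spatial derivative via autograd; create_graph keeps it differentiable.
    return torch.autograd.grad(f, x, grad_outputs=torch.ones_like(f), create_graph=True)[0]

u = net(x)
u_x = d(u, x)
u_xx = d(u_x, x)

F = u_xx + u                       # residual F(partial^alpha u; x), illustrative
loss_1 = (F ** 2).mean()           # first-order target (1), BC terms omitted

# Sobolev target (2) with m = 1: also penalize the spatial gradient of F.
F_x = d(F, x)
Q0, Q1 = 1.0, 0.1                  # relative weights, extra hyperparameters
loss_2 = Q0 * (F ** 2).mean() + Q1 * (F_x ** 2).mean()
```

Every extra derivative, whether it comes from the operator \partial^\alpha or from the Sobolev index j, means another pass through autograd, which is exactly where the exponential cost discussed below comes from.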

PINNs: The Good and the Bad

PINNs occupy a somewhat strange place in the world of numerics. On the one hand, they allow for the computation of seemingly otherwise intractable numerical problems. On the other, they are slow to train, prone to poor convergence, and lack rigorous error analysis. These issues have raised concerns over the utility of PINNs as a numerical method, and I would venture to argue that for problems where traditional numerics already do a good job, PINNs are not worth pursuing except as an academic exercise.

However, a use case for which I feel PINNs are particularly suited is finding self-similar blowup profiles for challenging problems in fluids. Breakdown phenomena like shock formation can be understood as arising from competing scalings between spatial and temporal features according to a prescribed power law. By converting to the coordinate system associated with this power law we can transform the shock formation process from a complicated nonlinear PDE to a much simpler nonlinear ODE.

This is the sort of problem that we mentioned in the introduction. These problems tend to be posed as boundary value problems with singular update rules. For example, the self-similar Burgers equation, written in its update form, reads

W'(X)=-\frac{\gamma W(X)}{(1+\gamma)X+W(X)},

which is naïvely singular at the origin. This in itself is not insurmountable; however, we also require that we solve for W and \gamma simultaneously. The constraint that we impose to find \gamma is a smoothness constraint on W: W is smooth only for a countably infinite sequence of values of \gamma. I am not arguing that this task cannot be done using standard numerical methods, only that it is difficult to do so using current methodologies. PINNs allow us to have an exact form for an approximation of W and thus solve for both W and \gamma simultaneously using gradient descent on the target (1) or the Sobolev target (2).

In summary, my belief is that PINNs are a niche numerical method which are impractical for many problems, and well-suited to a small number of problems.

Autodiff and nn-TangentProp

We assume that the reader has little to no familiarity with deep learning and neural networks, so we take a moment here to introduce the bare minimum on autodiff as well as give a brief historical perspective before diving into the technical details which motivate our research.

Autodiff

I would argue that the modern deep learning revolution would have been impossible without the introduction, implementation, and contemporary understanding of the autodifferentiation (autodiff) algorithm. Given a neural network f(\bm{x};\bm{\theta}) with parameters \bm{\theta} which can be expressed as a directed acyclic graph (DAG), autodiff computes \nabla_{\bm{x},\bm{\theta}}f efficiently by running a forward pass through the DAG, storing the forward activation at each node, and then running a backwards pass through the same DAG and accumulating the gradients at each node. Since the optimization step for a neural network looks like

\bm{\theta} \leftarrow \bm{\theta} - \eta \nabla_{\bm{\theta}}L(f),

it becomes clear why the efficiency of autodiff fueled the deep learning revolution in the 2010s: being able to compute these updates at scale allowed for the training of previously unthinkably large models. Since the last two decades have taught us that bigger models are usually better, autodiff can be seen as the hero of the era.

We begin with a high-level overview of autodiff, and then explain why this algorithm is poorly suited for the PINN use case. Fundamentally, autodiff works by representing every operation as a node in a directed acyclic computational graph (DAG) where each node has a forward operation and a derivative operation:

The figure above is the computational graph for the evaluation of the function \sigma. Thus, given the input \psi_{ij}, this computational graph represents computing the value \sigma(\psi_{ij}). We compute this value by following the black arrows forward through the graph. At this node we also have a derivative operation, which is itself a node on a computational graph. Namely, it is the operation \sigma'(\psi_{ij})\odot D_{ij}, where \psi_{ij} and D_{ij} are the node’s inputs and \odot is the Hadamard (element-wise) product. For our simple example, D_{ij}=1, but for more complicated examples D_{ij} will be the value of the derivative of the upstream graph with respect to the current node. Thus, to compute the derivative for our forward operation, we follow the orange arrows backwards through the same graph.

This example is, of course, too simple to be useful. We stress that the power of computational graphs extends well beyond their usage in neural networks. For the purposes of this exposition we will use a simple two-layer neural network with an output projection layer to illustrate our ideas, however they will apply to neural networks of any size. Let x\in \bm{R}, W^1 \in \mathcal{M}_{d\times 1}, W^2 \in \mathcal{M}_{d\times d}, M\in \mathcal{M}_{1\times d}, and \sigma an activation function. Let u(x) be the corresponding neural network, that is

u(x)=M\cdot(\sigma(W^2\cdot\sigma(W^1\cdot x))).\tag{$*$}

We can alternatively understand the network through the recursive formulas

\begin{align*}a^1_j&=W^1_jx, \\ s_j^1&=\sigma(a_j^1),\\a_m^2&=\sum_{j}W^2_{mj}s_j^1,\\s_m^2&=\sigma(a_m^2),\\u&=\sum_{m}M_ms_m^2.\end{align*}

This latter formulation makes drawing the computational graph quite easy!

On the left is the forward computation graph corresponding to the recursive definition for our neural network, and on the right is the same graph with the corresponding differentiation operations.
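The recursion is just as direct in code. Here is a small sketch of the forward pass (*) with random weights and a tanh activation; everything here is illustrative.

```python
# Direct transcription of the recursion: W1 is d x 1, W2 is d x d, M is 1 x d.
import torch

torch.manual_seed(0)
d = 4
W1 = torch.randn(d, 1)
W2 = torch.randn(d, d)
M  = torch.randn(1, d)
sigma = torch.tanh

def u(x):
    a1 = W1 * x            # a^1_j = W^1_j x
    s1 = sigma(a1)         # s^1_j = sigma(a^1_j)
    a2 = W2 @ s1           # a^2_m = sum_j W^2_{mj} s^1_j
    s2 = sigma(a2)         # s^2_m = sigma(a^2_m)
    return M @ s2          # u = sum_m M_m s^2_m

print(u(torch.tensor(0.5)))
```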

Autodiff then allows us to take this original graph and build a new computational graph for the derivative u'(x). We do this by tracing the derivative through the graph on the right! By the chain rule applied to (*) we have

u'(x)=\left(\left(\left(\frac{\partial u}{\partial s_m^2}\odot\frac{\partial s_m^2}{\partial a_m^2}\right)\cdot \frac{\partial a_m^2}{\partial s_j^1}\right)\odot \frac{\partial s_j^1}{\partial a_j^1}\right)\cdot \frac{\partial a_j^1}{\partial x},

where \odot is the Hadamard (element-wise) product and \cdot is the standard dot-product. This same sequence of operations can be computed by following the orange lines up through the computational graph on the right. And here is the key: doing this generates a new computational graph for the derivative! If we draw this new graph, it is immediately apparent that the functional composition introduces a new dependency which makes it so that we can no longer draw the graph as a simple “line” of nodes:

On the left is the computational graph for the first derivative, and to the right is the computational graph with the derivatives included. We note an important property: when we go to create the computational graph corresponding to the second derivative, the edges in blue will be computed twice, since they were already computed when we computed the forward pass for the zeroth derivative (our original graph). Note that this forward pass is required due to the nonlinear functional composition with \sigma: we need to compute the activations a_j^1 and a_m^2 to compute the derivative u'.

We present the computational graph for the second derivative and then cease our efforts in drawing these as they get increasingly complicated. Remember that you must resist the urge to draw the “correct” graph without the repeated computation, since the computer does not know that it already computed these values! We draw the re-computed values in red in the graph below.

Exponential Growth of Computational Graph of Derivatives:
Consider a neural network of depth d > 1. We notice from the above that we can count the number of nodes by doubling all the nodes between the top order and bottom order derivatives. Thus we get the following sequence of node counts:

First:

\underbrace{2 * d}_{\text{original}}+\underbrace{2*(d+1)}_{\text{new nodes}}=4d+2

Second:

\underbrace{2*d}_{\text{original}} + \underbrace{2*2*d}_{\text{Repeated First Deriv}}+\underbrace{2*(d+1)}_{\text{new nodes}} = 8d+2

Third:

\underbrace{2*d}_{\text{original}} + \underbrace{4*2*d}_{\text{Repeated First Deriv}}+\underbrace{2*2*d}_{\text{Repeated Second Deriv}}+\underbrace{2*(d+1)}_{\text{new nodes}} = 16d+2
\cdots

That is, \text{\# nodes} = 2^{n+1}d+2, where n is the number of derivatives, and hence the computational graph will grow like \mathcal{O}(2^{n+1}).


WARNING I: The actual computational graph computed by an autograd library will be much, much bigger, but this only changes the pre-factor. There are also graph optimizations that bring this runtime down a bit. I estimate that a good asymptotic bound is \mathcal{O}(\sqrt{e}^{n+1}) (see the figure below), which means the graph doesn’t actually fully double at each step.


WARNING II: The computational runtime to evaluate the graph will grow even faster, but still exponentially. This is because in our formulation the number of nodes does not necessarily scale linearly with FLOPs (see the note below on precise runtimes).

We can easily see this by simply plotting the number of nodes in a computational graph using a popular autodiff framework like PyTorch or Jax (see below).

This figure shows empirically the exponential blowup in the size of the computational graph for a simple depth 3 neural network as a function of the number of derivatives we take.
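If you want to reproduce this trend yourself, a rough way to do it is to walk PyTorch's backward graph through grad_fn and count the unique nodes. This peeks at autograd internals, so the exact counts depend on the PyTorch version; the exponential trend is what matters.

```python
# Count backward-graph nodes reachable from the n-th derivative of a small net.
import torch

def count_nodes(out):
    # Depth-first count of unique nodes in the autograd graph of `out`.
    seen, stack = set(), [out.grad_fn]
    while stack:
        fn = stack.pop()
        if fn is None or fn in seen:
            continue
        seen.add(fn)
        stack.extend(next_fn for next_fn, _ in fn.next_functions)
    return len(seen)

net = torch.nn.Sequential(
    torch.nn.Linear(1, 24), torch.nn.Tanh(),
    torch.nn.Linear(24, 24), torch.nn.Tanh(),
    torch.nn.Linear(24, 24), torch.nn.Tanh(),
    torch.nn.Linear(24, 1),
)
x = torch.zeros(1, 1, requires_grad=True)

deriv = net(x)
for n in range(1, 6):
    deriv = torch.autograd.grad(deriv, x, grad_outputs=torch.ones_like(deriv),
                                create_graph=True)[0]
    print(f"derivative order {n}: {count_nodes(deriv)} graph nodes")
```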

And here is the main takeaway of why autodiff is the wrong tool for computing these derivatives (quickly):

Functional evaluation introduces extra dependencies in the computational graph of the derivative which result in additional computation needing to be done (recomputing derivatives). This leads to an exponential blowup in both memory and runtime when using autodiff to take high-order derivatives.

The TangentProp Formalism

For my generation of ML practitioners it is easy to forget that neural networks preceded autodiff, and while the current era of deep learning is predicated on autodiff, there were almost 30 years of impressive neural network research prior to the introduction of autodiff.

It is easy to ignore the water you swim in, and it is hard to imagine that there are alternative methods to autodiff for computing derivatives! We will present a differentiation method, developed prior to the widespread adoption of autodiff, which avoids repeated computation by unrolling the autograd graph.

We base our work on the TangentProp formalism introduced by Simard et al. in 1991. Written before the widespread adoption of autodiff in training neural networks, the paper proposes a formalism for computing both f and \nabla_{\bm{x}} f in a single forward pass. Rather than compute and evaluate the computation graphs using autograd, we can explicitly form a single computational graph since we know the explicit dependencies between derivatives. In particular, the activation at layer \ell is given by

a^\ell_i = \sum_{j}w_{ij}\sigma(a^{\ell - 1}_j),\tag{3}

and we can directly differentiate this with respect to the network inputs to obtain

\partial_{x_m}a_i^\ell = \sum_jw_{ij}\sigma'(a_j^{\ell-1})\partial_{x_m}a_{j}^{\ell - 1}. \tag{4}

Note that as we compute nodes in the DAG, the activations for the neural network and its physical derivative at layer \ell depend only on the activation (and its derivative) from the previous layer. Thus we can compute the physical derivatives in a single forward pass.
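Here is a minimal sketch of that single forward pass for a dense tanh network: each layer updates the activation and its input-derivative together, exactly as in (3) and (4). Weights and sizes are illustrative, and the result is checked against autograd at the end.

```python
# TangentProp-style forward pass: propagate a and da/dx layer by layer.
import torch

torch.manual_seed(0)
widths = [1, 24, 24, 1]
weights = [torch.randn(fan_out, fan_in) / fan_in ** 0.5
           for fan_in, fan_out in zip(widths[:-1], widths[1:])]

def forward_with_derivative(x):
    a  = x.clone()             # activation at the input layer
    da = torch.ones_like(x)    # its derivative with respect to x
    for ell, W in enumerate(weights):
        s, ds = a, da
        if ell > 0:            # apply the activation except in front of layer 1
            s  = torch.tanh(a)
            ds = (1 - torch.tanh(a) ** 2) * da   # sigma'(a) * da/dx, as in (4)
        a  = s @ W.T           # eq. (3): a^ell = W sigma(a^{ell-1})
        da = ds @ W.T
    return a, da

x = torch.linspace(-1.0, 1.0, 5).unsqueeze(1)
u, u_x = forward_with_derivative(x)

# Sanity check against autograd on the same weights.
xr = x.clone().requires_grad_(True)
a = xr
for ell, W in enumerate(weights):
    a = (torch.tanh(a) if ell > 0 else a) @ W.T
assert torch.allclose(torch.autograd.grad(a.sum(), xr)[0], u_x, atol=1e-5)
```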

Our Approach: n-TangentProp

The TangentProp paper computes a single derivative because for their use-case there was no reason to compute higher-order derivatives. However, the principle that physical derivatives can be computed recursively, in a single forward pass, is not restricted to taking a single derivative. Thus, we propose n-TangentProp, which extends TangentProp to multiple derivatives by computing \nabla_{\bm{x}}^n f in a single forward pass. Note that if we can accomplish this task we would obtain a runtime of \mathcal{O}(nM), where M is (roughly) the cost of a single forward pass, which beats the autodiff runtime of \mathcal{O}(2^nM).

Most readers likely learned Leibniz's formula for taking higher-order derivatives of the product of two C^n functions

\frac{d^n}{dx^n}(fg) = \sum_{j=0}^n C_{j,n}f^{(j)}g^{(n-j)},

where the constants C_{j, n} are readily computable. Leibniz's formula is the natural generalization of the product rule to higher-order derivatives, and Faà di Bruno's formula is the lesser-known generalization of the chain rule. In particular we have that

\frac{d^n}{dx^n}(f\circ g) = \sum_{\bm{p}\in \mathcal{P}(n)}C_{\bm{p}}(f^{(|\bm{p}|)}\circ g)\prod_{j=1}^{n}(g^{(j)})^{p_j},

where the summation is taken over the set \mathcal{P}(n) of partitions of n, and \bm{p} is an n-tuple satisfying \sum_k k p_k = n. We denote |\bm{p}| = \sum_k p_k, and note that C_{\bm{p}} is readily computable, albeit more complicated than the constants appearing in Leibniz's formula.
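For concreteness, here is a small sketch (my own helper names, not the paper's) that enumerates these partitions as multiplicity tuples and computes the standard coefficients C_p = n!/\prod_k (k!)^{p_k} p_k!; the partition counts 1, 2, 3, 5, 7, 11, ... reappear in the runtime discussion below.

```python
# Integer partitions of n as multiplicity tuples, plus Faa di Bruno coefficients.
from math import factorial

def partitions(n):
    # Yield tuples (p_1, ..., p_n) with sum_k k * p_k = n.
    def rec(remaining, max_part):
        if remaining == 0:
            yield {}
            return
        for k in range(min(remaining, max_part), 0, -1):
            for rest in rec(remaining - k, k):
                counts = dict(rest)
                counts[k] = counts.get(k, 0) + 1
                yield counts
    for counts in rec(n, n):
        yield tuple(counts.get(k, 0) for k in range(1, n + 1))

def coefficient(p):
    # C_p = n! / prod_k ( (k!)^{p_k} * p_k! )
    n, denom = sum(k * pk for k, pk in enumerate(p, 1)), 1
    for k, pk in enumerate(p, 1):
        denom *= factorial(k) ** pk * factorial(pk)
    return factorial(n) // denom

# Partition counts p(n) for n = 1..6: expected 1, 2, 3, 5, 7, 11.
print([len(list(partitions(n))) for n in range(1, 7)])

# Worked example for n = 3:
#   d^3/dx^3 f(g(x)) = f'(g) g''' + 3 f''(g) g' g'' + f'''(g) (g')^3
print([(p, coefficient(p)) for p in partitions(3)])
```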

Using Faà di Bruno's formula it becomes clear how to extend (4) to higher-order derivatives! Applying Faà di Bruno's formula to (3) gives

\partial_{x_m}^n a^\ell_i = \sum_jw_{ij}^\ell(\xi^{(n)})_j^{\ell - 1}, \qquad (\xi^{(n)})^{\ell}_j = \sum_{\bm{p}}C_{\bm{p}}\sigma^{(|\bm{p}|)}(a_j^{\ell-1})\prod_{k=1}^n\left((\xi^{(k)})_j^{\ell-1}\right)^{p_k}.

Thus, once implemented in code, we will be able to compute n derivatives in (quasi)linear time. This implementation is not particularly difficult, and makes for a good exercise in ML programming. We of course made several more advanced implementation choices, but at its core the method is quite simple. We have our complete code posted on GitHub.
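To give a flavor of what such an implementation looks like, here is a self-contained sketch of an n-TangentProp forward pass for a dense network. It is not the code from our repository: to keep the activation derivatives in closed form it uses a sin activation (whose k-th derivative is sin(a + k\pi/2)), it repeats the partition helpers from the previous sketch so it runs on its own, and it checks the first two derivatives against autograd.

```python
# Sketch of an n-TangentProp forward pass: all derivative orders in one sweep.
import math
import torch

def partitions(n):
    # Multiplicity tuples (p_1, ..., p_n) with sum_k k * p_k = n.
    def rec(remaining, max_part):
        if remaining == 0:
            yield {}
            return
        for k in range(min(remaining, max_part), 0, -1):
            for rest in rec(remaining - k, k):
                counts = dict(rest)
                counts[k] = counts.get(k, 0) + 1
                yield counts
    for counts in rec(n, n):
        yield tuple(counts.get(k, 0) for k in range(1, n + 1))

def coefficient(p):
    # Faa di Bruno coefficient C_p = n! / prod_k ( (k!)^{p_k} * p_k! ).
    n, denom = sum(k * pk for k, pk in enumerate(p, 1)), 1
    for k, pk in enumerate(p, 1):
        denom *= math.factorial(k) ** pk * math.factorial(pk)
    return math.factorial(n) // denom

def sigma_deriv(a, k):
    # k-th derivative of the sin activation: d^k/da^k sin(a) = sin(a + k*pi/2).
    return torch.sin(a + k * math.pi / 2)

def n_tangentprop(weights, x, n):
    # Returns [u, u', ..., u^(n)] with respect to the scalar input x.
    # Derivatives of the input layer: a = x, da/dx = 1, higher orders vanish.
    A = [x, torch.ones_like(x)] + [torch.zeros_like(x) for _ in range(n - 1)]
    for ell, W in enumerate(weights):
        if ell == 0:
            S = A                        # no activation in front of the first layer
        else:
            # Faa di Bruno: d^k sigma(a)/dx^k from the derivatives of a.
            S = [sigma_deriv(A[0], 0)]
            for k in range(1, n + 1):
                acc = torch.zeros_like(A[0])
                for p in partitions(k):
                    term = coefficient(p) * sigma_deriv(A[0], sum(p))
                    for j, pj in enumerate(p, 1):
                        if pj:
                            term = term * A[j] ** pj
                    acc = acc + term
                S.append(acc)
        A = [s @ W.T for s in S]         # a linear layer acts on every order alike
    return A

torch.manual_seed(0)
widths = [1, 16, 16, 1]
weights = [torch.randn(fan_out, fan_in) / fan_in
           for fan_in, fan_out in zip(widths[:-1], widths[1:])]
x = torch.linspace(-1.0, 1.0, 8).unsqueeze(1)
derivs = n_tangentprop(weights, x, n=4)  # u and its first four derivatives

# Check the first two derivatives against repeated autograd.
xr = x.clone().requires_grad_(True)
a = xr
for ell, W in enumerate(weights):
    a = (torch.sin(a) if ell > 0 else a) @ W.T
g1 = torch.autograd.grad(a.sum(), xr, create_graph=True)[0]
g2 = torch.autograd.grad(g1.sum(), xr)[0]
assert torch.allclose(g1, derivs[1], atol=1e-4)
assert torch.allclose(g2, derivs[2], atol=1e-4)
```

Because everything above is ordinary tensor arithmetic, the computation stays inside autograd: with requires_grad enabled on the weights, the derivatives can be dropped into a PINN loss and trained with backprop as usual.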

Finally, we leave the reader with the computational graphs for the second derivative computed via n-TangentProp and ask that you compare this with the graph we drew for the second derivative above.

As you can see, nothing is repeated. We have managed to construct the “correct” computational graph instead of the complicated graph generated by autodiff.

While we did not implement these optimizations, we can see from the graph above that the upper portion of this computational graph (shown in red) can be pruned, since the inputs will always be zero. Similarly, one can prune the columns of the lower portion (shown in blue) which correspond to unnecessary derivatives. However, this pruning does not materially change the computation being done.

A Note on the Precise Asymptotic Runtimes of Autograd and n-TangentProp

Our algorithm isn’t quite linear, and autograd isn’t quite exponential. nn-TangentProp is slightly super-linear (or sub-exponential), and autograd is slightly super-exponential. The reason is that the summation appearing above in Faà di Bruno’s formula introduces a non-trivial dependence on nn. Since the summation occurs over integer partitions of size nn, we must count the number of possible partitions. Luckily for us, this problem has interested combinatorialists for over a century, and they have introduced and studied the partion function p(n)p(n) defined via

p(n):=\text{\# of integer partitions of size $n$}.

The classical result of Hardy and Ramanujan provides a tight asymptotic bound for p(n) of the form

p(n)=\mathcal{O}\left(\frac{e^{\sqrt{n}}}{n}\right).

This root-exponential dependence on n cannot be avoided by any differentiation algorithm; it is an unavoidable consequence of differentiating through function composition, which necessarily requires Faà di Bruno's formula. Note that the more exact runtimes for autodiff and n-TangentProp are

\frac{e^{\sqrt{n}}}{n}M^n, \quad \text{and} \quad e^{\sqrt{n}}M

respectively. Clearly for large n the runtime of n-TangentProp is faster. In particular, autodiff is super-exponential and n-TangentProp is sub-exponential (see the figure below).

Demonstration of the differences in runtime between the functions \frac{e^{\sqrt{n}}}{n}2^n in red, 2^n+1 in green, and 2e^{\sqrt{n}} in blue. The red curve is analogous to the runtime of autodiff, the green curve is analogous to a pure exponential runtime, and the blue curve is analogous to n-TangentProp. Plotted in log-linear coordinates. Note that 2e^{\sqrt{n}}\geqslant 2n for all n \in \mathbb{N}.

Results

How much faster is n-TangentProp than autodiff in practice? We ran several numerical experiments to quantify the performance increase, with a special focus on computing the self-similar profiles for Burgers equation. The overarching finding is that, perhaps unsurprisingly, as the number of derivatives grows, the exponential vs. linear scalings of these two methods begin to play a larger and larger role in the overall runtimes.

Results for General Neural Networks

The following plots show the speed comparisons for simple forward and backward passes through a network (i.e. not training yet).

Average runtime for a combined forward and backwards pass using autodifferentiation (red) and n-TangentProp (blue). The top and bottom frames show the same data, however the bottom frame is plotted with a logarithmic y-axis. Each model is run 100 times and the average for each trial is plotted. The network has 3 hidden layers of 24 neurons each, a common PINN architecture. The batch size is 2^8 = 256 samples.
Forward pass times for the experiment shown above.
Backwards pass times for the experiment shown above. Observe that at low numbers of derivatives PyTorch optimizations render the autodiff backwards pass faster than a backwards pass through n-TangentProp.
The ratio of combined forward-backward pass run times between autodifferentiation and n-TangentProp for a variety of network architectures, input batch sizes, and numbers of derivatives. A ratio greater than 1 indicates that n-TangentProp was faster than autodifferentiation. The baseline ratio of 1 is plotted as a horizontal dashed line. All plotted data points represent the average of 100 trials.
The forward pass times for the model from above.

Results for a Model Problem

To illustrate our method in a “real-world” application we consider the problem of finding smooth self-similar solutions to Burgers equation. Burgers equation is the canonical model for 1D shock formation and is given by the PDE

\partial_t u + u\partial_x u = 0. \tag{6}

One way to analyze shock solutions is to convert this equation into self-similar coordinates via the transformation

u(x, t)=(1-t)^\lambda U\left(X\right), \quad X=\frac{x}{(1-t)^{1+\lambda}}.

Under this change of coordinates the PDE (6) becomes an ODE

-\lambda U + ((1+\lambda)X+U)U' = 0.\tag{7}

with the scalar-valued parameter \lambda \in \bm{R}^{>0}.

This problem is not difficult to solve. It is an “advanced” exercise in an undergraduate course on differential equations and is solved by the implicit formula

X = -U-CU^{1+\frac{1}{\lambda}}.

However the numerical techniques developed to solve this problem can be applied to much more complicated and challenging problems for which implicit formulas for the solution do not exist.

To this end, we follow Wang et al. and deploy a PINN to solve this problem. A PINN is preferable here to a line search over \lambda since gradient descent will find both U and \lambda simultaneously. Furthermore, \lambda is determined by a smoothness constraint at the origin, and since PINNs are automatically C^\infty, evaluating high-order derivatives at the origin becomes easy to implement.
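As a rough illustration of this setup (not the configuration from the paper), the sketch below trains a PINN on the residual of (7) with \lambda as a trainable parameter clamped to a bracket. The derivatives are taken with autograd for brevity, where in our experiments they would be computed with n-TangentProp, and the conditions at the origin are a toy stand-in for the actual normalization and smoothness constraints.

```python
# Simplified PINN for the self-similar Burgers ODE (7) with trainable lambda.
import torch

net = torch.nn.Sequential(
    torch.nn.Linear(1, 24), torch.nn.Tanh(),
    torch.nn.Linear(24, 24), torch.nn.Tanh(),
    torch.nn.Linear(24, 1),
)
lam = torch.nn.Parameter(torch.tensor(0.3))        # the exponent lambda
opt = torch.optim.Adam(list(net.parameters()) + [lam], lr=1e-3)

X = torch.linspace(-10.0, 10.0, 256).unsqueeze(1).requires_grad_(True)

def grad(f, z):
    return torch.autograd.grad(f, z, grad_outputs=torch.ones_like(f), create_graph=True)[0]

for step in range(5000):
    opt.zero_grad()
    U = net(X)
    U_X = grad(U, X)
    # Residual of (7): -lambda U + ((1 + lambda) X + U) U' = 0.
    residual = -lam * U + ((1 + lam) * X + U) * U_X
    loss = (residual ** 2).mean()
    # Toy stand-in for the conditions at the origin that pin down the profile
    # (here U(0) = 0 and U'(0) = -1); the paper's constraints differ.
    origin = torch.zeros(1, 1, requires_grad=True)
    U0 = net(origin)
    loss = loss + U0.pow(2).mean() + (grad(U0, origin) + 1.0).pow(2).mean()
    loss.backward()
    opt.step()
    with torch.no_grad():
        lam.clamp_(1.0 / 5.0, 1.0 / 3.0)           # keep lambda in the bracket [1/5, 1/3]
```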

Results from training a PINN to solve (7) with \lambda constrained to the range [1/5, 1/3]. The only smooth solution to (7) contained in this parameter range corresponds to \lambda=\frac{1}{4}. The first four rows show our learned solution (dashed red) and its derivatives compared to the true solution (solid blue). The second to last row shows the PINN training loss as a function of epochs. The model was trained for 15k epochs using the Adam optimizer and 30k epochs using the L-BFGS optimizer. The bottom row shows the inferred value for the parameter \lambda as a function of epochs. The bottom two rows are plotted with a logarithmic y-axis.
Results from training a PINN to solve (7) with \lambda constrained to the range [1/7, 1/5]. The only smooth solution to (7) contained in this parameter range corresponds to \lambda=\frac{1}{6}. The first four rows show our learned solution (dashed red) and its derivatives compared to the true solution (solid blue). The second to last row shows the PINN training loss as a function of epochs. The model was trained for 15k epochs using the Adam optimizer and 30k epochs using the L-BFGS optimizer. The bottom row shows the inferred value for the parameter \lambda as a function of epochs. The bottom two rows are plotted with a logarithmic y-axis.
Results from training a PINN to find the first smooth profile for Equation (7). The model is trained for 15k epochs using the Adam optimizer and 30k epochs using L-BFGS. The top panel reports our training losses, the middle panel reports \lambda as a function of epochs, and the bottom panel shows the ratio in runtime between autodifferentiation and n-TangentProp as a function of the number of epochs. The bottom panel shows that n-TangentProp is 2.5x faster for end-to-end training than autodifferentiation. The horizontal line in the bottom panel indicates a runtime ratio of 1.

We observe that the widespread use of L-BFGS amplifies the benefits afforded by n-TangentProp because the L-BFGS algorithm runs many forward passes but only a single backwards pass. As we saw in the previous section of our results, the gains afforded by n-TangentProp are more significant for forward passes than for backwards passes. Thus we see from above that during L-BFGS cycles n-TangentProp begins vastly outperforming autodiff.

We only computed the first two profiles using autodiff. The reason is that computing the third profile using autodiff already required more than 24 hours of compute time on an A6000 GPU, which went beyond our allowable compute budget. With n-TangentProp this profile took a little under 1 hour, meaning that we had a speed-up of over 24x. For the fourth profile we estimate that autodiff would take about 100 hours (around four days), and we were able to compute it on our computer in under an hour and a half, giving more than a 65x speed-up! Keep in mind that we only implemented minor optimizations in our code.

Conclusions

Based on our findings we recommend that any PINN use case requiring three or more derivatives implement n-TangentProp to avoid needlessly wasting compute.

Consequences of this line of research include:

  1. Democratization of PINN research by allowing researchers without access to large amounts of compute to study PINN problems.
  2. Enabling new avenues of study which require taking high-order derivatives of PINNs. In particular, we can easily enforce smoothness and thus compute previously unknown unstable self-similar solutions.
  3. Providing more accurate solutions with less compute. By enabling more rapid computation of high-order derivatives we can now enforce successively higher Sobolev losses, which will lead to more accurate PINN solvers.

Future Directions

There are several key directions that we could take this work in the future. I am not sure if I will take up the challenge, but if anyone is interested in these as projects please reach out and I am happy to help guide your work.

Of particular interest to me are the following two extensions:


👋🏻

About the author: Kyle R. Chickering is a PhD student in the department of mathematics at UC Davis. He holds a BS and MS in applied mathematics from UC Davis, and has worked in analytic shock formation, numerical methods using machine learning, and more recently, on large vision-language models. Outside of academia he has worked on teams building generative AI for video and has focused on efficient algorithms for machine learning. He lives with his wife, two dogs, and their cat in Davis, CA.