
Introduction

Overview of Project

This ReCoDE project on Neural Ordinary Differential Equations will walk you through the theoretical basics of Ordinary Differential Equations (ODEs), specifically in the context of numerical solvers, all the way to Neural ODEs. The topics we'll cover include what ODEs are, how to implement their numerical solution in research code, how to use an autograd-equipped library to propagate gradients through an integration routine, and, most importantly, how these tools can then be used to tackle problems with Neural Networks. While this looks like a lot of ground to cover, much of the mathematical machinery should already be familiar to you. Through this project you will hopefully learn how to structure a distributable Python research module, how to take theoretical ideas like backpropagation through an integrator and apply them to Neural Networks, and, most importantly, you will have fun.

Neural Ordinary Differential Equations

Neural ODEs combine Neural Networks with ODEs. Broadly, NNs are a class of models that learn a (usually statistical) mapping between an input space and an output space. Differential Equations (DEs) relate a function and its derivative(s), and ODEs are a subset of DEs where there is only one independent variable.

In less vague terms, Neural Networks (NNs) model functions using a variety of nonlinear transformations: a simple feed-forward model uses affine projections combined with nonlinear activations, whereas a network that processes images may use convolutions. An example of this is a "Digit Classifier" that learns the mapping from images of digits (the input space) to the numeric value of the digit (the output space).

The input and output space could also relate to a sequence of some kind. For example, the input space could be a text fragment and the output space could be the distribution of probable letters that can succeed the input. Given an incomplete sentence "The dog jumps ove", a neural network could learn the mapping and predict the letter "r" as the output, giving "The dog jumps over".

A more dynamical example is a falling ball. A ball dropped from some height will accelerate towards the ground until impact. Perhaps what we can measure is just the height of the ball at each point in time, and we drop many balls from different heights with different masses, maybe even different sizes, to create a "Ball-Drop" dataset. This dataset would have input labels (the height from which the ball is dropped, the mass of the ball, the size of the ball) and output labels (the time of impact). We could train a neural network to predict the time of impact directly, but what if we also want the exact state of the ball at any given time between the drop and the impact? We'd need a huge dataset to get a dense sampling, and we would have to hope that the neural network gives reasonable predictions for times for which we don't have data.

Instead, what if we learned the dynamics itself: the mapping between the current state and some future state. We could iteratively estimate the state of the ball by first putting in the initial state, getting the next state (prescribed by the data sampling), and then feeding this next state back into the network until the network says the ball has hit the ground. But we're still restricted to our data sampling: if we have sampled at 0.1s increments, then our network won't know how to predict at 0.05s increments because there is no data. Ideally we'd be able to sample continuously, but in this scheme each finer increment requires increasingly more data, to the point where simply storing this data becomes impractical.

This is where DEs come in. DEs can be used to model the dynamics of many kinds of systems, chemical, physical, and so on, but ultimately the DEs we're interested in describe the time-evolution of a time-dependent dynamical system in a continuous fashion. Hence, the goal with Neural ODEs is to learn this dynamical mapping from data and perhaps even control it. We will focus on ODEs since they are the simplest DEs that can be dealt with in a time-continuous fashion and lack the complexity associated with other DEs such as Partial DEs, Stochastic DEs and Differential-Algebraic Equations (DEs with algebraic constraints).

What are ODEs?

ODEs are a class of DEs where the flow is dependent on only a single independent variable1. Formally, an ODE is a differential equation of the following form2:

$$F\left(t, y, \frac{dy}{dt}, \dots, \frac{d^n y}{dt^n}\right) = 0$$

3 This expression looks somewhat daunting given the number of terms, so let's simplify the notation slightly. First, let $\dot{y} = \frac{dy}{dt}$, where we implicitly assume that $t$ is the independent variable. Let's also restrict our attention to explicit ODEs, given that many systems commonly take this form, and we'll also restrict our attention to first-order systems, as higher-order explicit ODEs can be rewritten as first-order systems:

$$\dot{y} = f(t, y)$$

In this form, we can more clearly see what an ODE tells us: if we consider $y$ as a point in $\mathbb{R}^n$, then the ODE gives the flow map $f : (t, y) \mapsto \dot{y}$. Were we to follow this map, we would trace out a trajectory in the space, and it is this trajectory that we aim to compute using the methods described in this section.

Example System of a Falling Ball

For example, the equation for a falling ball takes the form

$$\frac{d^2 h}{dt^2} = -g$$

where $h$ is the height of the ball and $g$ is standard gravitational acceleration. We can rewrite this as a first-order system by introducing the velocity $v$:

$$\frac{dh}{dt} = v, \qquad \frac{dv}{dt} = -g$$

We can write this more compactly as:

$$\dot{y} = f(t, y)$$

And if we write $y = (h, v)^T$ and $f(t, y) = (v, -g)^T$ we can see how this notation is equivalent.
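To make this concrete, the first-order system above maps directly onto research code. A minimal sketch in Python (the function name `falling_ball` and the use of NumPy arrays are our own illustrative choices, not part of the project's module):

```python
import numpy as np

G = 9.81  # standard gravitational acceleration, m/s^2

def falling_ball(t, y):
    """Right-hand side f(t, y) for the falling ball.

    y[0] is the height h, y[1] is the velocity v, and the returned
    array is (dh/dt, dv/dt) = (v, -g).
    """
    h, v = y
    return np.array([v, -G])

# The flow at the initial state of a ball released from rest at 10 m:
y0 = np.array([10.0, 0.0])
flow = falling_ball(0.0, y0)  # (0.0, -9.81): no velocity yet, accelerating down
```

Note that `t` is unused here because the falling ball is autonomous (its dynamics do not depend explicitly on time), but keeping it in the signature matches the general $f(t, y)$ form.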

You'll note that this information alone is not enough to deduce the values of $h(t)$ and $v(t)$, because we lack boundary conditions, and, more specifically in this case, initial values. The height of the ball will differ depending on whether the ball was thrown at the ground and what height it was let go at, namely the values $v(0)$ and $h(0)$.

Initial Value Problems (IVPs)

Initial Value Problems are a class of problems where the values of the differential equation are defined at $t = 0$, at the "initiation" of the system. There is a separate class of problems called Boundary Value Problems (BVPs) where the values of the system are defined at both $t = 0$ and $t = T$ for some $T$ (if $t$ is the independent variable), but these are not the systems of interest to us as they are solved using methods different to the ones found here5. In this work, we are seeking to solve IVPs of the form

$$\dot{y} = f(t, y), \qquad y(t_0) = y_0$$

Numerical Integration Techniques

Numerical integration can refer to many different techniques ranging from the Trapezoid Rule (a quadrature method) to Monte-Carlo methods to Runge-Kutta methods, but broadly, these are all methods to solve integrals of different kinds. Quadrature techniques compute definite integrals where the boundary conditions are defined exactly (i.e. the values of the function and/or derivative on the integration boundaries are known) by subdividing the domain of integration; unlike ODEs, these integrals are not restricted to one continuous dimension. Monte-Carlo methods compute multi-dimensional integrals using a probabilistic approach and avoid the poor scaling experienced by quadrature methods, which subdivide each of $d$ dimensions into $n$ parts, leading to $n^d$ points of evaluation. Here we focus solely on ODEs and IVPs.

Many methods exist for numerically integrating DEs; depending on the type, some tools are more suitable than others, but we will restrict ourselves entirely to IVPs. First, let's define some more notation. Let $[t_0, t_N]$ be the interval of integration, such that we restrict our interest to only a subset of the values of $t$4, where our initial values are defined as $y(t_0) = y_0$. With this machinery set up, we can look at the simplest integration method: the (Explicit/Forward) Euler Method.

The (Explicit/Forward) Euler Method

If we recall that $\dot{y} = \frac{dy}{dt}$, then we could consider rewriting the differentials as:

$$dy = f(t, y)\, dt$$

Now $dt$ represents an infinitesimal increment of time, and from calculus we know that $dy$ represents the corresponding infinitesimal change in $y$, so if we could somehow start with $y_0$ and sum each infinitesimal change, we would be able to estimate the trajectory $y(t)$. But this unfortunately doesn't work, because computers do not handle infinities and infinitesimals in a way that would allow us to compute such a sum. So maybe if we relax the equation a little bit6, and suppose that the change in time is no longer infinitesimal but some measurable delta, making the transformation $dt \to \Delta t$, then perhaps we could sum the measurable changes $\Delta y = f(t, y)\, \Delta t$ to obtain the trajectory. To simplify the notation, we'll assume that $\Delta t$ is constant; this splits our interval into $N$ evenly sized chunks, in which we can obtain the discrete mapping of $t_n = t_0 + n \Delta t$, where $\hat{y}_n$ is our discrete approximation to $y(t_n)$7,15. So now we have defined a very simple integrator: given some initial values and an interval of integration, we can, in principle, compute the trajectory of a system. Writing this method out formally, we have that

$$\hat{y}_{n+1} = \hat{y}_n + \Delta t\, f(t_n, \hat{y}_n)$$

You will note that the next timestep depends only on the previous timestep in this method, and that is why this is called the Explicit or Forward Euler Method. While this method works, an error analysis using Taylor series shows that the error in this approximation is a linear function of $\Delta t$ (i.e. an algorithm where the error is $\mathcal{O}(\Delta t)$). A computer has finite precision, and so if a step size of $\Delta t$ gives the solution to a precision of $10^{-1}$, then obtaining a solution with a precision that is machine precision ($\sim 10^{-7}$ for 32-bit floats and $\sim 10^{-15}$ for 64-bit floats) would require a step size that is 6 orders of magnitude smaller (or 14 orders for 64-bit doubles). This translates to correspondingly many more steps and many times more computation.
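As a sketch of how this scheme might look in code, assuming the right-hand side `f(t, y)` is supplied as a Python callable returning the flow (the function name `euler_integrate` is our own illustrative choice):

```python
import numpy as np

def euler_integrate(f, y0, t0, t1, n_steps):
    """Forward Euler: y_{n+1} = y_n + dt * f(t_n, y_n) on [t0, t1]."""
    dt = (t1 - t0) / n_steps
    t, y = t0, np.asarray(y0, dtype=float)
    trajectory = [y.copy()]
    for _ in range(n_steps):
        y = y + dt * f(t, y)  # one explicit Euler step
        t += dt
        trajectory.append(y.copy())
    return np.array(trajectory)

# Exponential decay dy/dt = -y, whose exact solution is y(t) = e^{-t}:
traj = euler_integrate(lambda t, y: -y, [1.0], 0.0, 1.0, 1000)
# traj[-1] should be close to e^{-1} ≈ 0.3679, with O(dt) error
```

Halving `n_steps` roughly doubles the error at `t1`, which is the linear scaling in $\Delta t$ described above.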
So while Euler's Method is easy to understand from a variety of perspectives, it scales terribly with the step size, and while it may handle some systems well, it is not a good general-purpose integrator for almost any system. Aside from errors, these methods also have stability considerations, and many systems can show instability when the approximation is too inaccurate. More formally, instability here refers to a tendency for the approximation and the ideal trajectory to diverge8. Euler's Method is not the most stable method, especially for systems which can be described as "stiff". While there is no formal mathematical definition that accounts for all stiff systems, the general behaviour of a stiff system is that a method with a finite region of stability takes excessively small steps relative to the smoothness of the underlying solution17.

Explicit vs Implicit Methods

You'll recall that the Euler Method above was labelled as being Explicit/Forward, and this was to highlight that there is a variation of Euler's Method where

$$\hat{y}_{n+1} = \hat{y}_n + \Delta t\, f(t_{n+1}, \hat{y}_{n+1})$$

This defines an algebraic equation for $\hat{y}_{n+1}$ which, for a general $f$, must be solved iteratively. This is commonly referred to as the Implicit/Backward Euler Method, as the solution is implicitly defined. This method, while having the same error properties, generally exhibits better stability16, especially for stiff systems, but requires significantly more computation per step.
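A minimal sketch of one Backward Euler step, here solving the implicit equation by naive fixed-point iteration (an assumption for illustration: this iteration only converges when $\Delta t$ times the Lipschitz constant of `f` is below one, which is exactly why production solvers use Newton's method for stiff systems):

```python
import numpy as np

def implicit_euler_step(f, t, y, dt, n_iter=100, tol=1e-12):
    """One Backward Euler step: solve y_new = y + dt * f(t + dt, y_new)
    by fixed-point iteration, starting from a Forward Euler predictor."""
    y_new = y + dt * f(t, y)  # Forward Euler as the initial guess
    for _ in range(n_iter):
        y_prev = y_new
        y_new = y + dt * f(t + dt, y_new)  # re-evaluate at the *new* state
        if np.max(np.abs(y_new - y_prev)) < tol:
            break
    return y_new

# Linear decay dy/dt = -50 y: the implicit equation is linear here,
# so the exact implicit solution is y_new = y0 / (1 + 50 * dt) = 2/3.
y1 = implicit_euler_step(lambda t, y: -50.0 * y, 0.0, np.array([1.0]), 0.01)
```

For this linear problem the fixed point can be checked by hand, which makes it a useful unit test when writing your own integrator.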

Runge-Kutta Methods and Butcher Tableaus

While there exist methods and approaches other than the Forward and Backward Euler Methods, such as linear multi-step methods, we will restrict our attention to Runge-Kutta methods due to their ease of implementation and conceptually straightforward description. Let $p$ be the order of a given Runge-Kutta method and $s$ be the number of steps; the aforementioned Forward Euler method would be described as an order $p = 1$, $s = 1$ (as there is only one step in the approximation) Explicit Runge-Kutta method. Runge-Kutta methods assume a form where the approximation at each timestep9 is given by

$$\hat{y}_{n+1} = \hat{y}_n + \Delta t \sum_{i=1}^{s} b_i k_i$$

where $\hat{y}$ is henceforth assumed to be in reference to the approximated trajectory (unless stated explicitly), and

$$k_i = f\left(t_n + c_i \Delta t,\; \hat{y}_n + \Delta t \sum_{j=1}^{s} a_{ij} k_j\right)$$

For explicit methods $a_{ij} = 0$ for $j \ge i$, while for implicit methods there is no restriction. Our Explicit Euler Method, in this notation, has only one set of coefficients, namely $a_{11} = 0$, $b_1 = 1$ and $c_1 = 0$. For the Implicit Euler Method, the coefficients are almost the same except now $a_{11} = 1$ and $c_1 = 1$. As you can see, this is a powerful notation for writing out different integration schemes, as it will help appropriately track coefficients when we're writing our own integrators. If we denote the matrix of coefficients $A = (a_{ij})$, and the vectors of coefficients $b = (b_i)$ and $c = (c_i)$, then we can write a general Runge-Kutta method as follows

$$\begin{array}{c|c} c & A \\ \hline & b^T \end{array}$$

referred to as a Butcher Tableau.
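The Butcher Tableau maps directly onto code: a single stepping routine can be driven by any explicit tableau `(A, b, c)`. A sketch (the name `rk_step` is our own; the tableaus shown are the standard Forward Euler and classic 4th-order "RK4" coefficients):

```python
import numpy as np

def rk_step(f, t, y, dt, A, b, c):
    """One explicit Runge-Kutta step driven by a Butcher tableau (A, b, c).

    Assumes A is strictly lower triangular (a_ij = 0 for j >= i), so each
    stage k_i depends only on previously computed stages.
    """
    s = len(b)
    k = np.zeros((s, len(y)))
    for i in range(s):
        y_stage = y + dt * sum(A[i][j] * k[j] for j in range(i))
        k[i] = f(t + c[i] * dt, y_stage)
    return y + dt * sum(b[i] * k[i] for i in range(s))

# Forward Euler as a tableau: A = [[0]], b = [1], c = [0]
euler = ([[0.0]], [1.0], [0.0])
# The classic 4th-order RK4 tableau
rk4 = (
    [[0, 0, 0, 0], [0.5, 0, 0, 0], [0, 0.5, 0, 0], [0, 0, 1, 0]],
    [1 / 6, 1 / 3, 1 / 3, 1 / 6],
    [0, 0.5, 0.5, 1],
)

# One step of dy/dt = -y from y(0) = 1: RK4 matches e^{-0.1} to ~1e-7
y_rk4 = rk_step(lambda t, y: -y, 0.0, np.array([1.0]), 0.1, *rk4)
```

Keeping the tableau as data rather than hard-coding each scheme is what lets a single implementation cover every explicit method in this family.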

Summary

In summary, we have seen that there are methods of integrating ODEs called Explicit Runge-Kutta methods that approximate the trajectory using local estimates of the state, and we'll be implementing them using their Butcher Tableaus. The goal is to code a pipeline where, starting from data relating to a dynamical system, we can train a neural network to learn the dynamics of a system.

Appendix 1 - Adaptive vs Fixed-Step

Up to this point we've discussed numerical integrators that use a fixed timestep in order to estimate a given trajectory, but let's consider a 1d function where the solution has two timescales, one long and one short, such as

$$y(t) = e^{-\lambda t} \cos(\omega t)$$

where we'll assume $\lambda, \omega > 0$ and $\lambda \ll \omega$. The ODE for this system would be

$$\dot{y} = -\lambda e^{-\lambda t} \cos(\omega t) - \omega e^{-\lambda t} \sin(\omega t)$$

We can see that the solution timescale for the exponential is $1/\lambda$, whereas the timescale for the oscillation is $1/\omega$. If we were to numerically integrate this using a timestep chosen relative to the slow timescale $1/\lambda$, we would introduce spurious artefacts into the solution as we'd overestimate the decay in the value of $y$ and fail to resolve the cosine/sine terms that dominate the short-time behaviour of the solution, and so we would need to choose a timestep in accordance with their oscillation, $\Delta t \ll 1/\omega$.
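We can see this numerically with Forward Euler on the two-timescale example $y(t) = e^{-\lambda t} \cos(\omega t)$, here assuming illustrative values $\lambda = 0.1$ and $\omega = 10$, so the slow timescale is $1/\lambda = 10$ and the fast one is $1/\omega = 0.1$:

```python
import math

def rhs(t, y, lam=0.1, omega=10.0):
    """dy/dt for the two-timescale solution y(t) = exp(-lam*t) * cos(omega*t)."""
    return (-lam * math.exp(-lam * t) * math.cos(omega * t)
            - omega * math.exp(-lam * t) * math.sin(omega * t))

def euler_error(dt, t_end=1.0, lam=0.1, omega=10.0):
    """Absolute error of Forward Euler at t_end for a given fixed timestep."""
    n = round(t_end / dt)
    t, y = 0.0, 1.0  # y(0) = cos(0) = 1
    for _ in range(n):
        y += dt * rhs(t, y)
        t += dt
    exact = math.exp(-lam * t_end) * math.cos(omega * t_end)
    return abs(y - exact)

coarse = euler_error(0.5)    # dt >> 1/omega: oscillation not resolved at all
fine = euler_error(0.001)    # dt << 1/omega: both timescales resolved
```

With the coarse step the error is larger than the solution itself, while the fine step recovers the trajectory; the fine step is, however, 500 times more expensive, which is the waste an adaptive method tries to avoid.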

Multiple timescales exist in many systems and require careful consideration depending on the system being observed. Ideally, we would resolve all timescales by choosing a timestep that is "small enough" where the solution changes rapidly and "large enough" where it changes slowly. What counts as "small enough" and "large enough" depends strongly on the system in question, and so we must carefully consider how to measure this while integrating a given system.

One idea, commonly used in Runge-Kutta methods, is the introduction of an auxiliary estimate of the trajectory, ideally one with a higher order error than the baseline method used for estimation. Then, assuming that the system is well-resolved (i.e. our timestep hits the "small/large enough" criterion), the difference between the two estimates of the trajectory should be similarly "small". Thus, one can consider a timestep adaptation strategy that "controls" the discrepancy between the two estimates.

Consider the following Runge-Kutta method that has both a 4th order and a 5th order estimate of the trajectory:

We see that there are now two rows of $b$ coefficients where before we had one: one row gives the lower-order ($p = 4$) estimate and the other the higher-order ($p = 5$) estimate.10 18

Since the estimate is given by $\hat{y}_{n+1} = \hat{y}_n + \Delta t \sum_i b_i k_i$, we can see that the estimated error is given by

$$E = \Delta t \sum_{i} (b_i - b_i^*)\, k_i$$

where $b_i^*$ are the coefficients for the higher-order estimate.
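To make the embedded-estimate mechanics concrete without the full (4,5) pair, here is a sketch using a simpler embedded pair: Forward Euler (1st order, $b = [1, 0]$) with Heun's method (2nd order, $b^* = [1/2, 1/2]$) sharing the same two stages. The weights are the standard coefficients for this pair; everything else (names, the test problem) is illustrative:

```python
import numpy as np

def euler_heun_step(f, t, y, dt):
    """One step of a (1,2) embedded pair: Forward Euler with an embedded
    2nd-order Heun estimate computed from the same stages.

    Returns the lower-order solution and the estimated local error
    E = dt * sum_i (b_i - b*_i) k_i.
    """
    k1 = f(t, y)
    k2 = f(t + dt, y + dt * k1)
    b = np.array([1.0, 0.0])       # Forward Euler weights
    b_star = np.array([0.5, 0.5])  # Heun (higher-order) weights
    y_low = y + dt * (b[0] * k1 + b[1] * k2)
    error = dt * ((b[0] - b_star[0]) * k1 + (b[1] - b_star[1]) * k2)
    return y_low, error

# dy/dt = -y from y(0) = 1 with dt = 0.1: the true local error of the
# Euler step is e^{-0.1} - 0.9 ≈ -0.00484; the estimate is close.
y_low, err = euler_heun_step(lambda t, y: -y, 0.0, np.array([1.0]), 0.1)
```

The key point is that the error estimate costs no extra evaluations of `f`: both estimates reuse the same stages, exactly as in the (4,5) pair.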

A detailed derivation is in the Appendix in Deriving the Error Adaptation Equations, but suffice it to say that we can change the timestep based on the error using the equation:

$$\Delta t_{new} = \Delta t \left( \frac{\| \sigma_a + \sigma_r \hat{y} \|}{\| E \|} \right)^{1/(p+1)}$$

where $\sigma_a$ is the absolute error tolerance and $\sigma_r$ is the relative error tolerance.

Appendix 2 - Deriving the Error Adaptation Equations

The error estimate $E$ has one entry for each component of $\hat{y}$, with a sign that indicates over-/under-estimation relative to the higher-order estimate. Generally, we are concerned with keeping the magnitude of the error small and are not particularly concerned about over-/under-estimation since, if the magnitude is sufficiently small, the sign becomes irrelevant. Noting that the timestep is a scalar quantity, we must derive a scalar measure of the error from this vector estimate. There are several ways of calculating the magnitude of a vector, such as the $L^1$ norm, the $L^2$ norm and the $L^\infty$ norm. By writing these as $p$-norms, we can leave the choice of norm until later and first write down how we can adjust the timestep.

Consider a trajectory where a scalar $y \sim 1$; an error of $10^{-2}$ is a 1% deviation from the ground truth. If this was an estimate of distance in meters, the magnitude of the error would be on the order of centimeters, a measurable and noteworthy quantity. Now let's consider a change of units from meters to nanometers; our trajectory now becomes $y \sim 10^9$. An error of $10^{-2}$ would now be on the order of one hundredth of a nanometer; for a system where the changes occur on the order of meters, this would be an insignificant change. If our units were kilometers, then $y \sim 10^{-3}$ and our error is larger than the quantities we're estimating. If we consider the problem in reverse, targeting an error of $10^{-2}$ would be unnecessarily precise (except in perhaps rare circumstances) or critical depending on the scale of the problem.

This motivates the introduction of two concepts: the absolute tolerance $\sigma_a$ and the relative tolerance $\sigma_r$. The absolute tolerance is the maximum deviation we would allow irrespective of the scale of the problem, whereas the relative tolerance is the maximum deviation we would allow normalised to the scale of the problem.11

Going back to our error estimate, the ideal condition we would like to fulfil is the following (loosely following the derivation in 19):

$$|E_i| \le \sigma_a + \sigma_r |\hat{y}_i|$$

where the subscript $i$ denotes the vector components, and where we've introduced the scale of the problem by multiplying $\sigma_r$ by $|\hat{y}_i|$. We can rearrange this equation to obtain the following:

$$\frac{|E_i|}{\sigma_a + \sigma_r |\hat{y}_i|} \le 1$$

And if we consider that the error is actually an estimate of the leading higher-order error term (i.e. the $\mathcal{O}(\Delta t^{p+1})$ term in the Taylor expansion of the solution), then we can write the ratio as follows

$$\frac{|E_i|}{\sigma_a + \sigma_r |\hat{y}_i|} \approx \frac{C_i\, \Delta t^{p+1}}{\sigma_a + \sigma_r |\hat{y}_i|}$$

where we know that the leading error term is proportional to $\Delta t^{p+1}$ and a constant $C_i$ (which in turn depends on the $(p+1)$-th derivative of the solution). From this we see the relationship between the tolerances we'd like to achieve and the timestep, but there are still unknown terms, and an inequality is not easy to solve.

Consider now that we change the step by a factor $\alpha$ to $\alpha \Delta t$; then the error estimate of the step will be:

$$|E_i^{new}| \approx C_i (\alpha \Delta t)^{p+1}$$

Requiring that this new error satisfy our tolerance condition and solving for $\alpha$ (to the leading order term)12 we get:

$$\alpha \le \left( \frac{\sigma_a + \sigma_r |\hat{y}_i|}{C_i\, \Delta t^{p+1}} \right)^{1/(p+1)}$$

If we assume that the higher-order terms $C_i$ are similar to those of the unaltered timestep, then we can substitute back our estimate of the error $|E_i| \approx C_i \Delta t^{p+1}$ to obtain:

$$\alpha \le \left( \frac{\sigma_a + \sigma_r |\hat{y}_i|}{|E_i|} \right)^{1/(p+1)}$$

Which gives us a way to compute an adjustment to the timestep that obeys our tolerance criteria for each component of our trajectory. In common usage, a safety factor slightly below one (a common choice is $0.9$) is incorporated to get the stricter inequality:

$$\alpha \le 0.9 \left( \frac{\sigma_a + \sigma_r |\hat{y}_i|}{|E_i|} \right)^{1/(p+1)}$$

This accounts for any spurious changes in the solution (i.e. mild violations of the assumed equality of the higher-order terms between the current step and the new step).

Whenever our original error does not satisfy our inequality, we can reject the step and recalculate it more accurately using the adjusted timestep $\alpha \Delta t$. Conversely, whenever the step is accepted, we can use $\alpha$ to increase the step and avoid extra computation in solving the system.13

Up to here we have ignored the fact that $E$ is a vector quantity and our inequality holds on a per-component basis. Since we cannot take separate steps for each component14, we need to make an adjustment that accounts for all our components simultaneously. One way would be to take some kind of average over the different $\alpha$'s for each component, such as the arithmetic or geometric mean. Another might be to consider the maximum or minimum, but a simple solution is to convert the error into a scalar over all the components:

$$\alpha \le \left( \frac{\| \sigma_a + \sigma_r \hat{y} \|}{\| E \|} \right)^{1/(p+1)}$$

where we now consider the magnitude of our errors in a vector sense. The choice of $p$-norm, discussed before, is left up to the developer and can even be exposed as part of the integration interface. A suitable choice would be the $L^\infty$-norm for the error in the denominator and the $L^2$-norm for the tolerance scale in the numerator, as this adjusts the error proportional to the maximum error incurred while considering the overall scale of the problem.
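A sketch of this scalarisation, assuming the max-norm for the error and the 2-norm for the tolerance scale (one possible choice among the norms discussed; the function name, default tolerances and safety factor are illustrative):

```python
import numpy as np

def step_factor(error, y, p, atol=1e-6, rtol=1e-3, safety=0.9):
    """Scalar step-size factor alpha from a vector error estimate.

    error : per-component local error estimate E
    y     : current state, used to set the tolerance scale
    p     : order of the lower-order method in the embedded pair
    """
    scale = atol + rtol * np.abs(y)
    err_norm = np.max(np.abs(error))   # L-infinity norm of the error
    tol_norm = np.linalg.norm(scale)   # L2 norm of the tolerance scale
    if err_norm == 0.0:
        return 10.0  # error negligible: allow a large (capped) increase
    return safety * (tol_norm / err_norm) ** (1.0 / (p + 1))

# alpha > 1: accept the step and grow dt; alpha < 1: reject and retry smaller
alpha_grow = step_factor(np.array([1e-8]), np.array([1.0]), p=1)
alpha_shrink = step_factor(np.array([1e-1]), np.array([1.0]), p=1)
```

Guarding the zero-error case and capping the growth factor are small practical details that keep the controller from taking a wildly large step after an unusually accurate one.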


  1. In many systems in physics and chemistry, this is time. 

  2. I've used $t$ here since many systems we will look at use time as the independent variable, but this could also be $x$, which usually denotes a spatial dimension, meaning that $y(x)$ is some spatially varying function. Mathematically they are equivalent, but I will follow the physics conventions in this project. 

  3. This expression encompasses both explicit and implicit ODEs, where it may be the case that the highest derivative $\frac{d^n y}{dt^n}$ cannot be isolated and thus one cannot find an explicit closed-form expression for it. 

  4. Suppose we want to know what a system does in the limit $t \to \infty$; a computer cannot numerically represent this interval in a useful fashion. In that case, we can use a suitable change of variables $s(t)$ such that $s \in [0, 1)$ as $t \to \infty$ and integrate as a function of $s$. 

  5. In fact, I've omitted an entire class of problems with mixed boundary conditions, where both initial values and boundary values are specified depending on the components of $y$. These are more common in ODE systems derived from Partial Differential Equations (PDEs) where, for example, temperature diffusion through a metal plate may occur while one end of the plate is held at a fixed temperature. 

  6. And here you should be getting echoes of Riemannian Sums. 

  7. If we could somehow take the limit $\Delta t \to 0$ then we would have an exact solution to our system. 

  8. You may ask how one would measure this if these systems cannot be solved analytically. One method involves linearising some ideal system of equations and looking at how that linear mapping behaves when composed many times while accounting for the error terms introduced by the approximation. In the ideal case, these error terms increase as a linear or sub-linear function of the timestep, but in the unstable case, they may increase in a super-linear or even exponential fashion, leading to divergence from the true trajectory. 

  9. Timestep will denote steps in $t$, whereas step will denote the internal stages taken to approximate the timestep. 

  10. The coefficients are normally given as fractions, but for brevity they have been rounded to 4 decimal places or the repeating portion of the decimal has been indicated with a bar. 

  11. While we've discussed these tolerances as scalar quantities, one could consider specifying them on a per component basis and we will come back to this at a future point. 

  12. This is valid as all quantities are strictly positive and exponentiation is a monotonic function for positive quantities. 

  13. It should be noted that there are other methods of choosing the step size adjustment during a numerical integration procedure including estimation of the global error (here we have only considered the local error incurred in taking one step, but not the cumulative error) and the error per unit step. 

  14. There are methods that take fractional steps for some quantities, but these are beyond the scope of this work and require significant work/bookkeeping to correctly implement. 

  15. There are alternative ways of deriving the same result; this is a loose derivation based on intuition. A more formal approach would be to first integrate both sides with respect to $t$ from $t_n$ to $t_{n+1}$ to obtain $y(t_{n+1}) - y(t_n) = \int_{t_n}^{t_{n+1}} f(t, y)\, dt$ and then approximate the integral on the RHS using known quadrature rules such as the Rectangle Rule. 

  16. Butcher, J. C. (2016). Numerical methods for ordinary differential equations (Third edition). John Wiley & Sons, Ltd. 

  17. Lambert, J. D. (1991). Numerical methods for ordinary differential systems : the initial value problem. Wiley. 

  18. Bogacki, P., & Shampine, L. F. (1996). An efficient Runge-Kutta (4,5) pair. In Computers & Mathematics with Applications (Vol. 32, Issue 6, pp. 15–28). Elsevier BV. https://doi.org/10.1016/0898-1221(96)00141-1 

  19. Shampine, L. F. (2005). Error estimation and control for ODEs. In Journal of Scientific Computing (Vol. 25, Issues 1–2, pp. 3–16). Springer Science and Business Media LLC. https://doi.org/10.1007/bf02728979