Dealing with constant parameters

In many applications, there may be factors that are relevant to the function evaluations, but are fixed throughout the optimization. An obvious example is using data in a likelihood function, but it could also be parameters we wish to hold constant.

Consider a squared error loss function that depends on some data x and y, and parameters betas. As far as the solver is concerned, there should only be one input argument to the function we want to minimize, call it sqerror.

The problem is that we want to optimize a function sqerror that really depends on three inputs, and two of them are constant throught the optimization procedure. To do this, we need to define the variables x and y

x = [1.0, 2.0, 3.0]
y = 1.0 + 2.0 * x + [-0.3, 0.3, -0.1]

We then simply define a function in three variables

function sqerror(betas, X, Y)
    err = 0.0
    for i in 1:length(X)
        pred_i = betas[1] + betas[2] * X[i]
        err += (Y[i] - pred_i)^2
    end
    return err
end

and then optimize the following anonymous function

res = optimize(b -> sqerror(b, x, y), [0.0, 0.0])

Alternatively, we can define a closure sqerror(betas) that is aware of the variables we just defined

function sqerror(betas)
    err = 0.0
    for i in 1:length(x)
        pred_i = betas[1] + betas[2] * x[i]
        err += (y[i] - pred_i)^2
    end
    return err
end

We can then optimize the sqerror function just like any other function

res = optimize(sqerror, [0.0, 0.0])

Avoid repeating computations

Say you are optimizing a function

f(x) = x[1]^2+x[2]^2
g!(storage, x) = copyto!(storage, [2x[1], 2x[2]])

In this situation, no calculations from f could be reused in g!. However, sometimes there is a substantial similarity between the objective function, and gradient, and some calculations can be reused.

To avoid repeating calculations, define functions fg! or fgh! that compute the objective function, the gradient and the Hessian (if needed) simultaneously. These functions internally can be written to avoid repeating common calculations.

For example, here we define a function fg! to compute the objective function and the gradient, as required:

function fg!(F,G,x)
  # do common computations here
  # ...
  if G != nothing
    # code to compute gradient here
    # writing the result to the vector G
  end
  if F != nothing
    # value = ... code to compute objective function
    return value
  end
end

Optim will only call this function with an argument G that is nothing (if the gradient is not required) or a Vector that should be filled (in-place) with the gradient. This flexibility is convenient for algorithms that only use the gradient in some iterations but not in others.

Now we call optimize with the following syntax:

Optim.optimize(Optim.only_fg!(fg!), [0., 0.], Optim.LBFGS())

Similarly, for a computation that requires the Hessian, we can write:

function fgh!(F,G,H,x)
  G == nothing || # compute gradient and store in G
  H == nothing || # compute Hessian and store in H
  F == nothing || return f(x)
  nothing
end

Optim.optimize(Optim.only_fgh!(fgh!), [0., 0.], Optim.Newton())

Provide gradients

As mentioned in the general introduction, passing analytical gradients can have an impact on performance. To show an example of this, consider the separable extension of the Rosenbrock function in dimension 5000, see SROSENBR in CUTEst.

Below, we use the gradients and objective functions from mastsif through CUTEst.jl. We only show the first five iterations of an attempt to minimize the function using Gradient Descent.

julia> @time optimize(f, initial_x, GradientDescent(),
                      Optim.Options(show_trace=true, iterations = 5))
Iter     Function value   Gradient norm
     0     4.850000e+04     2.116000e+02
     1     1.018734e+03     2.704951e+01
     2     3.468449e+00     5.721261e-01
     3     2.966899e+00     2.638790e-02
     4     2.511859e+00     5.237768e-01
     5     2.107853e+00     1.020287e-01
 21.731129 seconds (1.61 M allocations: 63.434 MB, 0.03% gc time)
Results of Optimization Algorithm
 * Algorithm: Gradient Descent
 * Starting Point: [1.2,1.0, ...]
 * Minimizer: [1.0287767703731154,1.058769439356144, ...]
 * Minimum: 2.107853e+00
 * Iterations: 5
 * Convergence: false
   * |x - x'| < 0.0: false
   * |f(x) - f(x')| / |f(x)| < 0.0: false
   * |g(x)| < 1.0e-08: false
   * Reached Maximum Number of Iterations: true
 * Objective Function Calls: 23
 * Gradient Calls: 23

julia> @time optimize(f, g!, initial_x, GradientDescent(),
                      Optim.Options(show_trace=true, iterations = 5))
Iter     Function value   Gradient norm
     0     4.850000e+04     2.116000e+02
     1     1.018769e+03     2.704998e+01
     2     3.468488e+00     5.721481e-01
     3     2.966900e+00     2.638792e-02
     4     2.511828e+00     5.237919e-01
     5     2.107802e+00     1.020415e-01
  0.009889 seconds (915 allocations: 270.266 KB)
Results of Optimization Algorithm
 * Algorithm: Gradient Descent
 * Starting Point: [1.2,1.0, ...]
 * Minimizer: [1.0287763814102757,1.05876866832087, ...]
 * Minimum: 2.107802e+00
 * Iterations: 5
 * Convergence: false
   * |x - x'| < 0.0: false
   * |f(x) - f(x')| / |f(x)| < 0.0: false
   * |g(x)| < 1.0e-08: false
   * Reached Maximum Number of Iterations: true
 * Objective Function Calls: 23
 * Gradient Calls: 23

The objective has obtained a value that is very similar between the two runs, but the run with the analytical gradient is way faster. It is possible that the finite differences code can be improved, but generally the optimization will be slowed down by all the function evaluations required to do the central finite differences calculations.

Separating time spent in Optim's code and user provided functions

Consider the Rosenbrock problem.

using Optim
prob = Optim.UnconstrainedProblems.examples["Rosenbrock"];

Say we optimize this function, and look at the total run time of optimize using the Newton Trust Region method, and we are surprised that it takes a long time to run. We then wonder if time is spent in Optim's own code (solving the sub-problem for example) or in evaluating the objective, gradient or hessian that we provided. Then it can be very useful to use the TimerOutputs.jl package. This package allows us to run an over-all timer for optimize, and add individual timers for f, g!, and h!. Consider the example below, that is due to the author of the package (Kristoffer Carlsson).

using TimerOutputs
const to = TimerOutput()

f(x    ) =  @timeit to "f"  prob.f(x)
g!(x, g) =  @timeit to "g!" prob.g!(x, g)
h!(x, h) =  @timeit to "h!" prob.h!(x, h)

begin
reset_timer!(to)
@timeit to "Trust Region" begin
    res = Optim.optimize(f, g!, h!, prob.initial_x, NewtonTrustRegion())
end
show(to; allocations = false)
end

We see that the time is actually not spent in our provided functions, but most of the time is spent in the code for the trust region method.

Early stopping

Sometimes it might be of interest to stop the optimizer early. The simplest way to do this is to set the iterations keyword in Optim.Options to some number. This will prevent the iteration counter exceeding some limit, with the standard value being 1000. Alternatively, it is possible to put a soft limit on the run time of the optimization procedure by setting the time_limit keyword in the Optim.Options constructor.

using Optim
problem = Optim.UnconstrainedProblems.examples["Rosenbrock"]

f = problem.f
initial_x = problem.initial_x

function slow(x)
    sleep(0.1)
    f(x)
end

start_time = time()

optimize(slow, zeros(2), NelderMead(), Optim.Options(time_limit = 3.0))

This will stop after about three seconds. If it is more important that we stop before the limit is reached, it is possible to use a callback with a simple model for predicting how much time will have passed when the next iteration is over. Consider the following code

using Optim
problem = Optim.UnconstrainedProblems.examples["Rosenbrock"]

f = problem.f
initial_x = problem.initial_x

function very_slow(x)
    sleep(.5)
    f(x)
end

start_time = time()
time_to_setup = zeros(1)
function advanced_time_control(x)
    println(" * Iteration:       ", x.iteration)
    so_far =  time()-start_time
    println(" * Time so far:     ", so_far)
    if x.iteration == 0
        time_to_setup[:] = time()-start_time
    else
        expected_next_time = so_far + (time()-start_time-time_to_setup[1])/(x.iteration)
        println(" * Next iteration ≈ ", expected_next_time)
        println()
        return expected_next_time < 13 ? false : true
    end
    println()
    false
end
optimize(very_slow, zeros(2), NelderMead(), Optim.Options(callback = advanced_time_control))

It will try to predict the elapsed time after the next iteration is over, and stop now if it is expected to exceed the limit of 13 seconds. Running it, we get something like the following output

julia> optimize(very_slow, zeros(2), NelderMead(), Optim.Options(callback = advanced_time_control))
 * Iteration:       0
 * Time so far:     2.219298839569092

 * Iteration:       1
 * Time so far:     3.4006409645080566
 * Next iteration ≈ 4.5429909229278564

 * Iteration:       2
 * Time so far:     4.403923988342285
 * Next iteration ≈ 5.476739525794983

 * Iteration:       3
 * Time so far:     5.407265901565552
 * Next iteration ≈ 6.4569235642751055

 * Iteration:       4
 * Time so far:     5.909044027328491
 * Next iteration ≈ 6.821732044219971

 * Iteration:       5
 * Time so far:     6.912338972091675
 * Next iteration ≈ 7.843148183822632

 * Iteration:       6
 * Time so far:     7.9156060218811035
 * Next iteration ≈ 8.85849153995514

 * Iteration:       7
 * Time so far:     8.918903827667236
 * Next iteration ≈ 9.870419979095459

 * Iteration:       8
 * Time so far:     9.922197818756104
 * Next iteration ≈ 10.880185931921005

 * Iteration:       9
 * Time so far:     10.925468921661377
 * Next iteration ≈ 11.888488478130764

 * Iteration:       10
 * Time so far:     11.92870283126831
 * Next iteration ≈ 12.895747828483582

 * Iteration:       11
 * Time so far:     12.932114839553833
 * Next iteration ≈ 13.902462200684981

Results of Optimization Algorithm
 * Algorithm: Nelder-Mead
 * Starting Point: [0.0,0.0]
 * Minimizer: [0.23359374999999996,0.042187499999999996, ...]
 * Minimum: 6.291677e-01
 * Iterations: 11
 * Convergence: false
   *  √(Σ(yᵢ-ȳ)²)/n < 1.0e-08: false
   * Reached Maximum Number of Iterations: false
 * Objective Function Calls: 24