Dealing with constant parameters
In many applications, there may be factors that are relevant to the function evaluations, but are fixed throughout the optimization. An obvious example is using data in a likelihood function, but it could also be parameters we wish to hold constant.
Consider a squared error loss function that depends on some data x
and y
, and parameters betas
. As far as the solver is concerned, there should only be one input argument to the function we want to minimize, call it sqerror
.
The problem is that we want to optimize a function sqerror
that really depends on three inputs, and two of them are constant throughout the optimization procedure. To do this, we need to define the variables x
and y
x = [1.0, 2.0, 3.0]
y = 1.0 .+ 2.0 .* x .+ [-0.3, 0.3, -0.1]
We then simply define a function in three variables
function sqerror(betas, X, Y)
err = 0.0
for i in 1:length(X)
pred_i = betas[1] + betas[2] * X[i]
err += (Y[i] - pred_i)^2
end
return err
end
and then optimize the following anonymous function
res = optimize(b -> sqerror(b, x, y), [0.0, 0.0])
Alternatively, we can define a closure sqerror(betas)
that is aware of the variables we just defined
function sqerror(betas)
err = 0.0
for i in 1:length(x)
pred_i = betas[1] + betas[2] * x[i]
err += (y[i] - pred_i)^2
end
return err
end
We can then optimize the sqerror
function just like any other function
res = optimize(sqerror, [0.0, 0.0])
Avoid repeating computations
Say you are optimizing a function
f(x) = x[1]^2+x[2]^2
g!(storage, x) = copyto!(storage, [2x[1], 2x[2]])
In this situation, no calculations from f
could be reused in g!
. However, sometimes there is a substantial similarity between the objective function, and gradient, and some calculations can be reused.
To avoid repeating calculations, define functions fg!
or fgh!
that compute the objective function, the gradient and the Hessian (if needed) simultaneously. These functions internally can be written to avoid repeating common calculations.
For example, here we define a function fg!
to compute the objective function and the gradient, as required:
function fg!(F, G, x)
# do common computations here
# ...
if G !== nothing
# code to compute gradient here
# writing the result to the vector G
# G .= ...
end
if F !== nothing
# value = ... code to compute objective function
return value
end
end
Optim
will only call this function with an argument G
that is nothing
(if the gradient is not required) or a Vector
that should be filled (in-place) with the gradient. This flexibility is convenient for algorithms that only use the gradient in some iterations but not in others.
Now we call optimize
with the following syntax:
Optim.optimize(Optim.only_fg!(fg!), [0., 0.], Optim.LBFGS())
Similarly, for a computation that requires the Hessian, we can write:
function fgh!(F, G, H, x)
G === nothing || # compute gradient and store in G
H === nothing || # compute Hessian and store in H
F === nothing || return f(x)
nothing
end
Optim.optimize(Optim.only_fgh!(fgh!), [0., 0.], Optim.Newton())
Provide gradients
As mentioned in the general introduction, passing analytical gradients can have an impact on performance. To show an example of this, consider the separable extension of the Rosenbrock function in dimension 5000, see SROSENBR in CUTEst.
Below, we use the gradients and objective functions from mastsif through CUTEst.jl. We only show the first five iterations of an attempt to minimize the function using Gradient Descent.
julia> @time optimize(f, initial_x, GradientDescent(),
Optim.Options(show_trace=true, iterations = 5))
Iter Function value Gradient norm
0 4.850000e+04 2.116000e+02
1 1.018734e+03 2.704951e+01
2 3.468449e+00 5.721261e-01
3 2.966899e+00 2.638790e-02
4 2.511859e+00 5.237768e-01
5 2.107853e+00 1.020287e-01
21.731129 seconds (1.61 M allocations: 63.434 MB, 0.03% gc time)
Results of Optimization Algorithm
* Algorithm: Gradient Descent
* Starting Point: [1.2,1.0, ...]
* Minimizer: [1.0287767703731154,1.058769439356144, ...]
* Minimum: 2.107853e+00
* Iterations: 5
* Convergence: false
* |x - x'| < 0.0: false
* |f(x) - f(x')| / |f(x)| < 0.0: false
* |g(x)| < 1.0e-08: false
* Reached Maximum Number of Iterations: true
* Objective Function Calls: 23
* Gradient Calls: 23
julia> @time optimize(f, g!, initial_x, GradientDescent(),
Optim.Options(show_trace=true, iterations = 5))
Iter Function value Gradient norm
0 4.850000e+04 2.116000e+02
1 1.018769e+03 2.704998e+01
2 3.468488e+00 5.721481e-01
3 2.966900e+00 2.638792e-02
4 2.511828e+00 5.237919e-01
5 2.107802e+00 1.020415e-01
0.009889 seconds (915 allocations: 270.266 KB)
Results of Optimization Algorithm
* Algorithm: Gradient Descent
* Starting Point: [1.2,1.0, ...]
* Minimizer: [1.0287763814102757,1.05876866832087, ...]
* Minimum: 2.107802e+00
* Iterations: 5
* Convergence: false
* |x - x'| < 0.0: false
* |f(x) - f(x')| / |f(x)| < 0.0: false
* |g(x)| < 1.0e-08: false
* Reached Maximum Number of Iterations: true
* Objective Function Calls: 23
* Gradient Calls: 23
The objective has obtained a value that is very similar between the two runs, but the run with the analytical gradient is way faster. It is possible that the finite differences code can be improved, but generally the optimization will be slowed down by all the function evaluations required to do the central finite differences calculations.
Separating time spent in Optim's code and user provided functions
Consider the Rosenbrock problem.
using Optim, OptimTestProblems
prob = UnconstrainedProblems.examples["Rosenbrock"];
Say we optimize this function, and look at the total run time of optimize
using the Newton Trust Region method, and we are surprised that it takes a long time to run. We then wonder if time is spent in Optim's own code (solving the sub-problem for example) or in evaluating the objective, gradient or hessian that we provided. Then it can be very useful to use the TimerOutputs.jl package. This package allows us to run an over-all timer for optimize
, and add individual timers for f
, g!
, and h!
. Consider the example below, that is due to the author of the package (Kristoffer Carlsson).
using TimerOutputs
const to = TimerOutput()
f(x ) = @timeit to "f" prob.f(x)
g!(x, g) = @timeit to "g!" prob.g!(x, g)
h!(x, h) = @timeit to "h!" prob.h!(x, h)
begin
reset_timer!(to)
@timeit to "Trust Region" begin
res = Optim.optimize(f, g!, h!, prob.initial_x, NewtonTrustRegion())
end
show(to; allocations = false)
end
We see that the time is actually not spent in our provided functions, but most of the time is spent in the code for the trust region method.
Early stopping
Sometimes it might be of interest to stop the optimizer early. The simplest way to do this is to set the iterations
keyword in Optim.Options
to some number. This will prevent the iteration counter exceeding some limit, with the standard value being 1000. Alternatively, it is possible to put a soft limit on the run time of the optimization procedure by setting the time_limit
keyword in the Optim.Options
constructor.
using Optim, OptimTestProblems
problem = UnconstrainedProblems.examples["Rosenbrock"]
f = problem.f
initial_x = problem.initial_x
function slow(x)
sleep(0.1)
f(x)
end
start_time = time()
optimize(slow, zeros(2), NelderMead(), Optim.Options(time_limit = 3.0))
This will stop after about three seconds. If it is more important that we stop before the limit is reached, it is possible to use a callback with a simple model for predicting how much time will have passed when the next iteration is over. Consider the following code
using Optim, OptimTestProblems
problem = UnconstrainedProblems.examples["Rosenbrock"]
f = problem.f
initial_x = problem.initial_x
function very_slow(x)
sleep(.5)
f(x)
end
start_time = time()
time_to_setup = zeros(1)
function advanced_time_control(x)
println(" * Iteration: ", x.iteration)
so_far = time()-start_time
println(" * Time so far: ", so_far)
if x.iteration == 0
time_to_setup .= time()-start_time
else
expected_next_time = so_far + (time()-start_time-time_to_setup[1])/(x.iteration)
println(" * Next iteration ≈ ", expected_next_time)
println()
return expected_next_time < 13 ? false : true
end
println()
false
end
optimize(very_slow, zeros(2), NelderMead(), Optim.Options(callback = advanced_time_control))
It will try to predict the elapsed time after the next iteration is over, and stop now if it is expected to exceed the limit of 13 seconds. Running it, we get something like the following output
julia> optimize(very_slow, zeros(2), NelderMead(), Optim.Options(callback = advanced_time_control))
* Iteration: 0
* Time so far: 2.219298839569092
* Iteration: 1
* Time so far: 3.4006409645080566
* Next iteration ≈ 4.5429909229278564
* Iteration: 2
* Time so far: 4.403923988342285
* Next iteration ≈ 5.476739525794983
* Iteration: 3
* Time so far: 5.407265901565552
* Next iteration ≈ 6.4569235642751055
* Iteration: 4
* Time so far: 5.909044027328491
* Next iteration ≈ 6.821732044219971
* Iteration: 5
* Time so far: 6.912338972091675
* Next iteration ≈ 7.843148183822632
* Iteration: 6
* Time so far: 7.9156060218811035
* Next iteration ≈ 8.85849153995514
* Iteration: 7
* Time so far: 8.918903827667236
* Next iteration ≈ 9.870419979095459
* Iteration: 8
* Time so far: 9.922197818756104
* Next iteration ≈ 10.880185931921005
* Iteration: 9
* Time so far: 10.925468921661377
* Next iteration ≈ 11.888488478130764
* Iteration: 10
* Time so far: 11.92870283126831
* Next iteration ≈ 12.895747828483582
* Iteration: 11
* Time so far: 12.932114839553833
* Next iteration ≈ 13.902462200684981
Results of Optimization Algorithm
* Algorithm: Nelder-Mead
* Starting Point: [0.0,0.0]
* Minimizer: [0.23359374999999996,0.042187499999999996, ...]
* Minimum: 6.291677e-01
* Iterations: 11
* Convergence: false
* √(Σ(yᵢ-ȳ)²)/n < 1.0e-08: false
* Reached Maximum Number of Iterations: false
* Objective Function Calls: 24