Augmented Lagrangian methods as modified penalty methods
Augmented Lagrangian methods can be reformulated as penalty methods with a moving constraint boundary, which allows for an intuitive interpretation of Lagrange multipliers.
Introduction
Recently I’ve been looking into constrained numerical optimization and came across an interesting paper by Stepanovic et al. (2024 Stepanovic, K., Böhmer, W. & Weerdt, M. (2024). A penalty-based guardrail algorithm for non-decreasing optimization with inequality constraints. arXiv preprint arXiv:2405.01984. Retrieved from https://arxiv.org/abs/2405.01984 ). In this paper they present a guardrailed penalty method that extends the basic quadratic penalty method for constrained optimization. They introduce a guardrail variable $\epsilon$ that acts to tighten the constraint when violations occur, which makes it easier to converge to a solution where the constraint is sufficiently satisfied. They prove convergence for the case of an increasing objective function and non-decreasing constraint functions, but I wondered whether the method might be more generally applicable. Specifically, whether it would also be usable when not all constraints can be simultaneously active They argue that the properties of the objective and constraint functions imply that it is always preferable for constraints to be active. However, it is possible to consider a situation where not all constraints can be active at the same time, for example if one constraint function is always strictly smaller than the other. .
After having a closer look at the equations, I noticed that their method is actually equivalent to a kind of augmented Lagrangian method (ALM), which combines the basic penalty method and the method of Lagrange multipliers for satisfying equality constraints To summarize, their method is essentially a reformulation of the basic AL method for equality constraints (Nocedal & Wright, 1999 Nocedal, J. & Wright, S. (1999). Numerical optimization. Springer. ) with the modification that the Lagrange multiplier is constrained to be non-negative, and the Lagrange multiplier is given a decreasing "learning rate". . This equivalence should be a nice way to obtain some theoretical results, and additionally it provides an interesting reformulation of the AL method as a kind of shifted or guardrailed penalty method. Since Lagrangian methods are not always the most intuitive, this can provide a useful interpretation of Lagrange multipliers, and it also suggests a slightly different numerical calculation which might sometimes be advantageous. In this article I will present this reformulation for optimization with equality and inequality constraints.
The problem
When doing constrained optimization (with equality constraints), we’d like to solve a problem of the form:
\begin{aligned} \underset{u}{\min} &\, J(u) \\ \text{subject to } &\, \gamma(u) = 0 \end{aligned}
where $u \in \mathbb{R}^N$ is a vector of $N$ decision variables, $J : \mathbb{R}^N \rightarrow \mathbb{R}$ is the objective function we want to minimize and $\gamma : \mathbb{R}^N \rightarrow \mathbb{R}$ is the function for a single equality constraint I'm using the same notation as in Stepanovic et al. (2024 Stepanovic, K., Böhmer, W. & Weerdt, M. (2024). A penalty-based guardrail algorithm for non-decreasing optimization with inequality constraints. arXiv preprint arXiv:2405.01984. Retrieved from https://arxiv.org/abs/2405.01984 ) for ease of comparison. I'll also stick with a single constraint for clarity of presentation, but the whole analysis can be quite easily repeated to add more constraints. This would involve adding more terms to the Lagrangian functions and optionally changing sums and products into vector operations. .
One general approach to tackling these constrained problems is to try and find an unconstrained problem with the same solutions. There are two common methods of doing this:
- Penalty method: add a term to the objective function that penalizes constraint violations. The usual choice is: $$ \underset{u}{\min} \widetilde{J}(u) = \underset{u}{\min} J(u) + C \gamma(u)^2 $$ where $C \in (0, \infty)$ is a parameter specifying the strength of the penalty. Solutions to the original problem are then local minima of this function.
- Lagrangian method: construct the Lagrangian of the problem by adding a term to the objective function involving the constraint function and a Lagrange multiplier $\lambda \in \mathbb{R}$: $$ L(u, \lambda) = J(u) - \lambda \gamma(u) $$ The solutions to the original problem are then saddle points of this function (minima in $u$, maxima in $\lambda$).
In both cases these functions have a set of solutions that is bigger than just the solutions to the original problem. For example, the penalty function will have a zero gradient when $\nabla_u J = -2C\gamma \nabla_u \gamma$, which does not imply that $\gamma(u) = 0$. The Lagrangian on the other hand can blow up if the maximization in $\lambda$ is performed incorrectly. It is important to avoid these spurious solutions, and this can be difficult.
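To see the first issue concretely, here is a one-dimensional toy example (my own illustration, not taken from any of the cited sources): take $J(u) = u$ and $\gamma(u) = u - 1$. The penalty function and its stationary point are
$$ \widetilde{J}(u) = u + C (u - 1)^2, \qquad \frac{\mathrm{d}\widetilde{J}}{\mathrm{d}u} = 1 + 2C(u - 1) = 0 \;\Rightarrow\; u = 1 - \frac{1}{2C} $$
so the minimizer of the penalty function misses the constraint by $\frac{1}{2C}$ and only satisfies it exactly in the limit $C \to \infty$.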
The augmented Lagrangian method
The two approaches from the previous section can be combined into an “augmented Lagrangian” function:
\begin{aligned} \widetilde{L}(u, \lambda) &\equiv J(u) - \lambda \gamma(u) + C \gamma(u)^2 \\ &= L(u, \lambda) + C \gamma(u)^2 \\ &= \widetilde{J}(u) - \lambda \gamma(u) \end{aligned}
which can be viewed as either the Lagrangian augmented with an extra penalty on constraint violations, or the penalty function turned into a Lagrangian by enforcing the constraints with Lagrange multipliers. The advantage of this combination is that the Lagrange multiplier term allows the constraint to be satisfied exactly at the solution (without sending $C$ to infinity), while the quadratic penalty term encourages movement towards the constraint when far away from the boundary.
The following algorithm is then used to find a solution $u$:
Algorithm 1
- Pick some starting values $u_0$ and $\lambda_0$.
- Set a loop variable $k = 0$.
- Minimize $\widetilde{L}(u, \lambda_k)$ in $u$ starting at $u_k$ using any method. Call the solution $u_{k+1}$.
- Update $\lambda$ by setting $\lambda_{k+1} = \lambda_k - 2C\gamma(u_{k+1})$.
- Update $k \leftarrow k + 1$.
- Repeat until the solution $u_k$ is stationary or sufficiently optimal by some condition.
This method is useful because the core of it is unconstrained minimization that can be performed using many different kinds of methods, and one can be selected based on context. It has been shown in practice that this method converges to a solution that satisfies the constraint much more quickly than the methods in the previous section, and there are no parameters or variables that have to take an unbounded value.
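As a concrete illustration, here is a minimal Python sketch of algorithm 1 (my own translation, not code from any of the references), using scipy's general-purpose unconstrained minimizer for the inner step; the toy problem at the bottom and all parameter values are assumptions chosen for illustration:

```python
import numpy as np
from scipy.optimize import minimize

def alm_equality(J, gamma, u0, C=10.0, lam0=0.0, iters=20, tol=1e-8):
    """Augmented Lagrangian method (algorithm 1) for one equality constraint gamma(u) = 0."""
    u, lam = np.asarray(u0, dtype=float), lam0
    for _ in range(iters):
        # Inner unconstrained minimization of the augmented Lagrangian in u.
        L_tilde = lambda v: J(v) - lam * gamma(v) + C * gamma(v) ** 2
        u = minimize(L_tilde, u).x          # warm-started at the previous solution
        lam = lam - 2 * C * gamma(u)        # multiplier update (step 4)
        if abs(gamma(u)) < tol:             # simple stopping rule: constraint met well enough
            break
    return u, lam

# Toy problem: minimize u1^2 + u2^2 subject to u1 + u2 - 1 = 0 (solution: [0.5, 0.5]).
J = lambda u: u[0] ** 2 + u[1] ** 2
gamma = lambda u: u[0] + u[1] - 1.0
u_opt, lam_opt = alm_equality(J, gamma, u0=[0.0, 0.0])
print(u_opt, lam_opt)   # approaches [0.5, 0.5] and a multiplier value of 1
```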
The equality-constrained ALM as shifted penalty method
The AL method is quite simple algorithmically, but it can be difficult to articulate the function of the multiplier $\lambda$ in an intuitive way. From looking at the functions and the algorithm it is not readily apparent what kind of values $\lambda$ should take during the optimization process, and why this will help satisfy the constraint. These are questions that can be answered more clearly by reformulating as a modified penalty function.
First, let’s define the variable $\epsilon = \frac{\lambda}{2C}$. We can rewrite the AL function in terms of $\epsilon$ instead of $\lambda$ by applying the substitution $\lambda = 2 C \epsilon$:
\begin{aligned} \widetilde{L}(u, \epsilon) &= J(u) - 2 C \epsilon \gamma(u) + C \gamma(u)^2 \\ &= J(u) + C \left( \gamma(u)^2 - 2 \epsilon \gamma(u) \right) \end{aligned}
The last term looks like part of an expanded squared difference $(a - b)^2 = a^2 - 2ab + b^2$, so let’s try completing the square:
\begin{aligned} \hat{J}(u, \epsilon) &\equiv \widetilde{L}(u, \epsilon) + C \epsilon^2 \\ &= J(u) + C \left( \gamma(u)^2 - 2 \epsilon \gamma(u) + \epsilon^2 \right) \\ &= J(u) + C \left( \gamma(u) - \epsilon \right)^2 \end{aligned}
Now, since we only added a term that is constant in $u$, the minimization problem over $u$ inside the ALM algorithm might as well use $\hat{J}$ instead of $\widetilde{L}$. We can also rewrite the update rule for $\lambda$ in terms of $\epsilon$, and we end up with the following procedure equivalent to algorithm 1:
Algorithm 2
- Pick some starting values $u_0$ and $\epsilon_0$.
- Set a loop variable $k = 0$.
- Minimize $\hat{J}(u, \epsilon_k)$ in $u$ starting at $u_k$ using any method. Call the solution $u_{k+1}$.
- Update $\epsilon$ by setting $\epsilon_{k+1} = \epsilon_k - \gamma(u_{k+1})$.
- Update $k \leftarrow k + 1$.
- Repeat until the solution $u_k$ is stationary or sufficiently optimal by some condition.
Compared to the role of $\lambda$ before, the role of $\epsilon$ is a bit easier to interpret. In $\hat{J}$, $\epsilon$ moves the target value of $\gamma(u)$ around, in other words the constraint curve is shifted up or down a bit. And by how much is it shifted? The update rule in step 4 says that $\epsilon$ is always moved in the opposite direction of $\gamma(u)$, by exactly the size of the current constraint violation.
This makes sense: suppose we perform one minimization of $\hat{J}$ and end up with $\gamma(u)$ bigger than 0. Then apparently we are stuck in a local minimum where bringing $u$ closer to the constraint curve is not worth the extra cost in $J(u)$. We could mitigate this by increasing the penalty strength $C$, but alternatively we can also move the constraint curve to be slightly lower, which of course also increases the size of the penalty. If we then overshoot and end up on the other side of the curve, we shift the curve up again, and so on until eventually we (hopefully) stabilize on the constraint curve.
As such $\epsilon$ now has a straightforward interpretation as a shift variable. And since $\epsilon$ is just a rescaled version of $\lambda$ and the algorithm is otherwise the same, this same explanation can also be applied to the Lagrange multiplier $\lambda$.
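In code the change from the earlier sketch is minimal: only the inner objective and the update rule differ (again a hedged illustration of my own, reusing the same assumptions):

```python
import numpy as np
from scipy.optimize import minimize

def shifted_penalty_equality(J, gamma, u0, C=10.0, eps0=0.0, iters=20, tol=1e-8):
    """Algorithm 2: the same outer loop as before, written in terms of eps = lam / (2 C)."""
    u, eps = np.asarray(u0, dtype=float), eps0
    for _ in range(iters):
        # Quadratic penalty towards the shifted constraint gamma(u) = eps.
        J_hat = lambda v: J(v) + C * (gamma(v) - eps) ** 2
        u = minimize(J_hat, u).x
        eps = eps - gamma(u)                # shift the constraint opposite to the violation (step 4)
        if abs(gamma(u)) < tol:
            break
    return u, eps
```

Running this on the same toy problem as before produces the same iterates $u_k$ (up to floating-point differences in the inner solver), with $\epsilon_k = \lambda_k / (2C)$ at every step.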
The inequality-constrained ALM as guardrailed penalty method
Often we are not dealing with equality constraints but inequality constraints. Those give problems of the form:
\begin{aligned} \underset{u}{\min} &\, J(u) \\ \text{subject to } &\, \gamma(u) \ge 0 \end{aligned}
It is possible to derive a modified version of the AL method that allows for inequality constraints, see for example section 17.4 of Nocedal & Wright (1999 Nocedal, J. & Wright, S. (1999). Numerical optimization. Springer. ) or section 3.1 of Bertsekas (1982 Bertsekas, D. (1982). Constrained optimization and Lagrange multiplier methods. Academic Press. Retrieved from https://www.mit.edu/~dimitrib/lagr_mult.html ) for a more detailed look In summary, the derivation proceeds by adding a slack variable $s$ to transform the inequality $\gamma(u) \ge 0$ into the equality $\gamma(u) = s$ (with $s \ge 0$), and writing out the AL method for equality constraints. Then, the minimization of $\widetilde{L}$ in terms of the slack variable $s$ can be performed explicitly, since $\widetilde{L}$ is quadratic in $s$. This allows the slack variable to be dropped in favor of modifying the algorithm slightly. . Two modifications are necessary: the Lagrangian is slightly modified and the update rule for $\lambda$ is changed.
The Lagrangian is redefined to be:
$$ \widetilde{L}^*(u, \lambda) = J(u) + \begin{cases} - \lambda \gamma(u) + C \gamma(u)^2 & 2 C \gamma(u) \lt \lambda \\ - \frac{1}{4C} \lambda^2 & 2 C \gamma(u) \ge \lambda \end{cases} $$
and the modified algorithm goes as follows:
Algorithm 3
- Pick some starting values $u_0$ and $\lambda_0$.
- Set a loop variable $k = 0$.
- Minimize $\widetilde{L}^*(u, \lambda_k)$ in $u$ starting at $u_k$ using any method. Call the solution $u_{k+1}$.
- Update $\lambda$ by setting $\lambda_{k+1} = \max\left(0, \lambda_k - 2C\gamma(u_{k+1})\right)$.
- Update $k \leftarrow k + 1$.
- Repeat until the solution $u_k$ is stationary or sufficiently optimal by some condition.
Here the only change to the update rule compared to algorithm 1 is that $\lambda$ is constrained to be non-negative This is easily motivated, since a negative $\lambda$ means that the term $-\lambda \gamma(u)$ can grow without bound if the constraint is inactive, i.e. $\gamma(u) \gt 0$, even though that is a valid solution. If $\lambda \ge 0$ is enforced, the maximizing value of $\lambda$ for an inactive constraint is 0. .
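To make the piecewise definition concrete, $\widetilde{L}^*$ can be evaluated with a small helper like the following (a sketch of my own, matching the formula above):

```python
def L_star(u, lam, J, gamma, C):
    """Piecewise augmented Lagrangian for a single inequality constraint gamma(u) >= 0."""
    g = gamma(u)
    if 2 * C * g < lam:
        return J(u) - lam * g + C * g ** 2   # penalized branch: constraint violated or nearly so
    return J(u) - lam ** 2 / (4 * C)         # branch where the constraint is comfortably satisfied
```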
However, the reason for the modified Lagrangian $\widetilde{L}^*$ is less clear. We can again perform a reformulation in terms of a modified penalty function which will clarify the change.
Let’s apply the substitution $\lambda = 2 C \epsilon$. We already saw how the first case changes. The second case simplifies nicely:
$$- \frac{1}{4C} \lambda^2 = - \frac{1}{4C} 4 C^2 \epsilon^2 = - C \epsilon^2 $$
and the piecewise condition as well:
$$ 2 C \gamma(u) \lt \lambda \rightarrow 2 C \gamma(u) \lt 2 C \epsilon \rightarrow \gamma(u) \lt \epsilon $$
which leaves us with the function:
$$ \widetilde{L}^*(u, \epsilon) = J(u) + \begin{cases} C \left( \gamma(u)^2 - 2 \epsilon \gamma(u) \right) & \gamma(u) \lt \epsilon \\ - C \epsilon^2 & \gamma(u) \ge \epsilon \end{cases} $$
Now, let’s define the penalty function in the same way as well:
$$ \hat{J}^*(u, \epsilon) \equiv \widetilde{L}^*(u, \epsilon) + C \epsilon^2 = J(u) + \begin{cases} C \left( \gamma(u) - \epsilon \right)^2 & \gamma(u) \lt \epsilon \\ 0 & \gamma(u) \ge \epsilon \end{cases} $$
Analogously to algorithm 2, we can define the optimization procedure as a variation on algorithm 3:
Algorithm 4
- Pick some starting values $u_0$ and $\epsilon_0$.
- Set a loop variable $k = 0$.
- Minimize $\hat{J}^*(u, \epsilon_k)$ in $u$ starting at $u_k$ using any method. Call the solution $u_{k+1}$.
- Update $\epsilon$ by setting $\epsilon_{k+1} = \max\left(0, \epsilon_k - \gamma(u_{k+1})\right)$.
- Update $k \leftarrow k + 1$.
- Repeat until the solution $u_k$ is stationary or sufficiently optimal by some condition.
While equivalent, this formulation is a lot easier to interpret than the AL one. $\epsilon$ still acts as a shift in the constraint, though now it can only be positive. This means that it can only choose to tighten the constraint in response to constraint violations. In this sense it is acting as a guardrail: if we notice the constraint is being violated we tighten the constraint a bit to make sure we remain within bounds, and when the violations disappear we loosen the constraint again.
The interpretation of $\hat{J}^*$ is also straightforward. In the interior region $\gamma(u) \ge \epsilon$ (where the constraint is satisfied) we don’t need any kind of penalty, which corresponds to the second case. In the exterior region $\gamma(u) \lt \epsilon$ we do need a penalty to encourage movement towards the boundary $\gamma(u) = \epsilon$, for which we use the same shifted quadratic penalty as in the equality constraint case, since the boundary is an equality.
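Putting it together, a minimal Python sketch of algorithm 4 (again my own translation, under the same assumptions as the earlier sketches, with a new toy problem chosen for illustration) could look like this:

```python
import numpy as np
from scipy.optimize import minimize

def guardrailed_penalty(J, gamma, u0, C=10.0, eps0=0.0, iters=20, tol=1e-8):
    """Algorithm 4: guardrailed penalty method for one inequality constraint gamma(u) >= 0."""
    u, eps = np.asarray(u0, dtype=float), eps0
    for _ in range(iters):
        # One-sided shifted quadratic penalty: only active where gamma(u) < eps.
        J_hat = lambda v: J(v) + C * min(gamma(v) - eps, 0.0) ** 2
        u = minimize(J_hat, u).x
        eps_new = max(0.0, eps - gamma(u))   # guardrail update (step 4): tighten on violation, never negative
        converged = abs(eps_new - eps) < tol and gamma(u) >= -tol
        eps = eps_new
        if converged:                        # guardrail settled and constraint (nearly) satisfied
            break
    return u, eps

# Toy problem: minimize (u1 - 1)^2 + (u2 - 1)^2 subject to 1 - u1 - u2 >= 0 (solution: [0.5, 0.5]).
J = lambda u: (u[0] - 1.0) ** 2 + (u[1] - 1.0) ** 2
gamma = lambda u: 1.0 - u[0] - u[1]
u_opt, eps_opt = guardrailed_penalty(J, gamma, u0=[0.0, 0.0])
print(u_opt, eps_opt)   # approaches [0.5, 0.5] with a small positive guardrail eps
```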
Numerical implications
It is interesting to note that we can remove the piecewise definition in $\hat{J}^*(u, \epsilon)$. Since the condition $\gamma(u) \lt \epsilon$ is equivalent to $\gamma(u) - \epsilon \lt 0$, we can instead clip the term inside the square:
$$ \hat{J}^*(u, \epsilon) = J(u) + C \min(\gamma(u) - \epsilon, 0)^2 $$
If we satisfy the constraint and $\gamma(u) \ge \epsilon$, then the term inside the brackets will be zero and the penalty will disappear. Since it is inside a square, we can alternatively change the sign and use a $\max$ function or the beloved ReLU instead:
$$ \hat{J}^*(u, \epsilon) = J(u) + C \max(0, \epsilon - \gamma(u))^2 = J(u) + C \operatorname{relu}(\epsilon - \gamma(u))^2 $$
One disadvantage of the AL method with inequality constraints as presented is that $\hat{J}^*$ does not have a second derivative everywhere. Specifically, at the constraint boundary $\gamma(u) = \epsilon$ the second derivative is undefined. Therefore second-order minimization algorithms like Newton’s method may not behave so well on it when certain constraints are exactly satisfied. This last expression shows that it may be sensible to replace the squared ReLU function by a smooth variant that is (more than once) differentiable, as is common in machine learning.
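For example (a sketch of my own; the softplus smoothing and the $\beta$ parameter are my assumptions, not something proposed in the cited works), the penalty term could be smoothed like this:

```python
import numpy as np

def relu_sq_penalty(g, eps, C=10.0):
    """Exact (non-smooth) penalty term C * relu(eps - g)^2 from the formula above."""
    return C * np.maximum(0.0, eps - g) ** 2

def smooth_penalty(g, eps, C=10.0, beta=20.0):
    """Smooth stand-in for the squared-ReLU penalty, using a squared softplus.
    It is infinitely differentiable and approaches the exact term as beta grows."""
    softplus = np.logaddexp(0.0, beta * (eps - g)) / beta   # smooth approximation of max(0, eps - g)
    return C * softplus ** 2
```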
References
- Bertsekas, D. (1982). Constrained optimization and Lagrange multiplier methods. Academic Press. Retrieved from https://www.mit.edu/~dimitrib/lagr_mult.html
- Nocedal, J. & Wright, S. (1999). Numerical optimization. Springer.
- Stepanovic, K., Böhmer, W. & Weerdt, M. (2024). A penalty-based guardrail algorithm for non-decreasing optimization with inequality constraints. arXiv preprint arXiv:2405.01984. Retrieved from https://arxiv.org/abs/2405.01984