by me, Harshat Kumar, George J. Pappas, and Alejandro Ribeiro
published on August 7, 2020
This is my first (and hopefully not my last) blog-style technical article, and our attempt (with Harshat, George, and Alejandro) to take a closer look at the intrinsic connections between Policy Gradient (PG) methods (both stochastic and deterministic) and zeroth-order methods in non-convex stochastic optimization. To the best of our knowledge, these connections, although very natural (as we will see below), have not been explored before, which is why I have decided to discuss them in this post.
Put simply, what this article shows is that, under basic and natural assumptions, policy gradients are zeroth-order quasigradients of the original stochastic program at hand, i.e., approximations of its true gradient, which of course implies that policy gradient methods are themselves instances of standard, general-purpose zeroth-order methods, specialized to the Reinforcement Learning (RL) setting.
An additional byproduct of our treatment is to show explicitly that classical stochastic policy gradient (even with a baseline) in fact results in quasigradients that are too inexact (relative to the true gradient of the original objective), and that stochastic policy gradient injects too much exploration noise during policy learning. This eventually results in both poor average performance and performance margins of unnecessarily high variance, even when randomness in the policy is discarded during policy evaluation; the latter scheme is a very popular heuristic in practice.
Then, a rather plausible way to mitigate those issues is to fully depart from adopting stochastic policies for policy learning, which are integral to (stochastic) policy gradients, and instead explore the possibility of developing a primitive zeroth-order policy gradient method based on the very well-known Deterministic Policy Gradient (DPG) theorem of [Silver et al., 2014], which is originally fully gradient-based and thus, in a sense, model-based, because it requires the full gradient of the associated $Q$-function in order to be invoked. This fact motivates Zeroth-order Deterministic Policy Gradient (ZDPG), which we have recently developed in [Kumar et al., 2020] (download here) and which essentially provides a model-free version of the framework set forth by Silver et al. in their seminal paper. The following table provides a basic classification of the possible actor-only PG approaches, and where the various resulting algorithms stand.
| Policy \ Method | Model-Based ($Q$-Gradient-Based) | Model-Free |
| --- | --- | --- |
| Stochastic | ? | (Baseline) PG [2000] |
| Deterministic | DPG [2014] | ZDPG [2020] |
In particular, in ZDPG, parameter exploration is efficiently achieved by using a purely deterministic policy together with a minimal amount of low-dimensional action-space exploration noise, which is intuitively and rigorously justified from the perspective of stochastic zeroth-order optimization. As we later show explicitly in this article, the zeroth-order quasigradients associated with ZDPG induce much smaller and better-controlled approximation errors relative to the true deterministic policy gradient of the initial dynamic program, as compared not only to stochastic PG, but also to Baseline PG (PG-B), for a number of standard baseline choices. Our discussion also justifies the experimental results presented in [Kumar et al., 2020].
Remark. The question mark in the table above corresponds to the class of (actor-only) model-based methods involving stochastic policies, which do not seem to be very useful at least in regard to the problem setting considered in this article, since we are interested in model-free RL.
The details are organized under the section titles below.
–– A Standardized Problem Setting

We will consider a standardized RL problem, in which an agent moves through a continuous and compact state space $\mathcal{S}$, and takes actions in a finite-dimensional action space $\mathcal{A} \subseteq \mathbb{R}^{q}$. The agent's task is to accumulate as much reward as possible in the long term, where the reward is revealed to the agent by the environment at each step. This generic problem may be abstracted by a Markov Decision Process (MDP) as a tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ denotes the reward function, and $P$ determines the probability of moving to state $s'$ starting from state $s$ and taking action $a$, satisfying the Markov property
$$P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = P(s_{t+1} \mid s_t, a_t).$$
The value $\gamma \in (0,1)$ is the discount factor, which determines how much future rewards matter to the behavior of the agent. Lastly, the policy defines a mapping from a state to an action, deterministic or randomized, depending on the approach taken (this will be made clear below). Then, the problem is to choose a policy that maximizes the expected discounted sum of rewards over all starting states; this means finding a policy that maximizes the expected value function
$$V^{\pi}(s) \triangleq \mathbb{E}\Bigg[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \,\Bigg|\, s_0 = s \Bigg]$$
starting from a state $s_0$ or, formally,
$$\max_{\pi \in \Pi}\ \mathbb{E}_{s_0 \sim \rho_0}\big[ V^{\pi}(s_0) \big], \qquad (1)$$
where $\rho_0$ is the initial state distribution and $\Pi$ is the class of policies that we are interested in; this can be either the class of all deterministic admissible policies for the particular problem, in which case the random measure $\pi(\cdot \mid s)$ trivializes to a function $\mu : \mathcal{S} \to \mathcal{A}$, or an appropriate class of randomized policies. By conditioning the value function on an action, we can define the state-action ($Q$-) function by
$$Q^{\pi}(s, a) \triangleq \mathbb{E}\Bigg[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \,\Bigg|\, s_0 = s,\ a_0 = a \Bigg].$$
The $Q$-function determines the quality of taking an action $a$ while being at state $s$ and then following policy $\pi$ for the remainder of time. By selecting the first action according to the policy and integrating, the (integrated) $Q$-function becomes equivalent to the value function, and problem (1) is equivalently written as
$$\max_{\pi \in \Pi}\ \mathbb{E}_{s_0 \sim \rho_0}\, \mathbb{E}_{a_0 \sim \pi(\cdot \mid s_0)}\big[ Q^{\pi}(s_0, a_0) \big]. \qquad (2)$$
Apparently, solving (2) requires searching over an infinite-dimensional space (a space of probability measures in the case of randomized policies and a space of functions in the case of deterministic policies), which is generally an intractable task. To circumvent this complexity, the policy is parameterized by some $\theta \in \mathbb{R}^{d}$, so that the search is over a Euclidean space of finite dimension. The problem of interest consequently becomes
$$\max_{\theta \in \mathbb{R}^{d}}\ \mathbb{E}_{s_0 \sim \rho_0}\, \mathbb{E}_{a_0 \sim \pi_{\theta}(\cdot \mid s_0)}\big[ Q^{\pi_{\theta}}(s_0, a_0) \big]. \qquad (3)$$
When the search is over deterministic policy parameterizations $\mu_{\theta}$, in particular, the above problem takes the simpler form
$$\max_{\theta \in \mathbb{R}^{d}}\ \mathbb{E}_{s_0 \sim \rho_0}\big[ Q^{\mu_{\theta}}\big(s_0, \mu_{\theta}(s_0)\big) \big]. \qquad (4)$$
Although it might seem that searching within a class of randomized policies could be advantageous over searching within a class of deterministic policies, this is in fact not the case, at least in unconstrained RL: for every randomized policy (and under standard regularity conditions), it is possible to find a deterministic policy that achieves at least the same (average) performance (this is really a standard result). Additionally, a deterministic optimal policy is much more desirable, because it induces the smallest amount of higher-order statistical variability possible in the resulting random cumulative reward (still resulting, of course, in optimal mean performance). Therefore, in the following, our base problem (i.e., the one that we are primarily interested in) will be (4). Even if we may sometimes focus on (3) (i.e., randomized policies), the latter should be implicitly considered as a proxy for solving (4). This is perfectly in line with the common practice of policy derandomization during policy evaluation, that is, when evaluating performance after a (near-)optimal (stochastic) policy has been learned.
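To make the setup concrete, here is a minimal sketch (not from the paper) of how the objective in problem (4) can be estimated by Monte Carlo for a parameterized deterministic policy. The toy linear-quadratic environment, the linear policy, and all constants below are purely illustrative assumptions, chosen only so that the snippet is self-contained and runnable.

```python
# Minimal sketch: Monte Carlo estimation of the objective in problem (4)
# for a deterministic policy mu_theta. Everything here is a toy stand-in.
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.95                      # discount factor
state_dim, act_dim = 3, 2

def env_reset():
    return rng.normal(size=state_dim)            # s_0 ~ rho_0

def env_step(s, a):
    s_next = 0.9 * s + 0.1 * np.tanh(a).sum()    # toy dynamics
    reward = -np.sum(s**2) - 0.01 * np.sum(a**2) # toy reward r(s, a)
    return s_next, reward

def mu(theta, s):
    return theta.reshape(act_dim, state_dim) @ s # linear deterministic policy

def discounted_return(theta, horizon=200):
    """Single-rollout estimate of V^{mu_theta}(s_0), s_0 ~ rho_0."""
    s, ret, disc = env_reset(), 0.0, 1.0
    for _ in range(horizon):
        s, r = env_step(s, mu(theta, s))
        ret += disc * r
        disc *= gamma
    return ret

def objective(theta, n_rollouts=100):
    """Monte Carlo estimate of E_{s_0 ~ rho_0}[V^{mu_theta}(s_0)], i.e., (4)."""
    return np.mean([discounted_return(theta) for _ in range(n_rollouts)])

theta0 = rng.normal(scale=0.1, size=act_dim * state_dim)
print("estimated objective:", objective(theta0))
```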
–– Gaussian Policies in Stochastic Policy Gradient (PG)

Consider problem (3), and recall the classical (stochastic) Policy Gradient Theorem, which states that the gradient of the objective $V^{\pi_{\theta}}(\rho_0) \triangleq \mathbb{E}_{s_0 \sim \rho_0}[V^{\pi_{\theta}}(s_0)]$ satisfies
$$\nabla_{\theta} V^{\pi_{\theta}}(\rho_0) \propto \mathbb{E}_{s \sim \rho^{\pi_{\theta}}}\, \mathbb{E}_{a \sim \pi_{\theta}(\cdot \mid s)}\big[ Q^{\pi_{\theta}}(s, a)\, \nabla_{\theta} \log \pi_{\theta}(a \mid s) \big],$$
where $\rho^{\pi_{\theta}}$ constitutes the (proper) discounted state distribution associated with the stochastic policy $\pi_{\theta}$. In somewhat more precise mathematical language, this expression may be rewritten, for every $\theta \in \mathbb{R}^{d}$, as
$$\nabla_{\theta} V^{\pi_{\theta}}(\rho_0) \propto \int_{\mathcal{S}} \int_{\mathcal{A}} Q^{\pi_{\theta}}(s, a)\, \nabla_{\theta} \log \pi_{\theta}(a \mid s)\, \pi_{\theta}(\mathrm{d}a \mid s)\, \rho^{\pi_{\theta}}(\mathrm{d}s).$$
Note that the product $\pi_{\theta}(\mathrm{d}a \mid s)\, \rho^{\pi_{\theta}}(\mathrm{d}s)$ is also well-known as the discounted action-space distribution associated with $\pi_{\theta}$. As is most often done both in theoretical developments and in practice, let us assume that, for every $s \in \mathcal{S}$, $\pi_{\theta}(\cdot \mid s)$ is chosen as a Gaussian random element with mean $\mu_{\theta}(s)$ (a known deterministic policy parameterization) and constant covariance $\sigma^{2} I$, for some $\sigma > 0$. In other words, highlighting the dependence on $\theta$, an action $a \sim \pi_{\theta}(\cdot \mid s)$ may be represented as $a \equiv \mu_{\theta}(s) + \sigma \xi$, with $\xi \sim \mathcal{N}(0, I)$.
Then, it follows that, for every state-action pair $(s, a)$,
$$\log \pi_{\theta}(a \mid s) = -\frac{\| a - \mu_{\theta}(s) \|_{2}^{2}}{2 \sigma^{2}} + \mathrm{const.}$$
Therefore, it is true that, for every $(s, a)$,
$$\nabla_{\theta} \log \pi_{\theta}(a \mid s) = \frac{1}{\sigma^{2}}\, \nabla_{\theta} \mu_{\theta}(s)\, \big( a - \mu_{\theta}(s) \big),$$
where $\nabla_{\theta} \mu_{\theta}(s)$ denotes the Jacobian of the policy mean with respect to $\theta$.
Plugging into the expression for the (stochastic) policy gradient, we have
$$\nabla_{\theta} V^{\pi_{\theta}}(\rho_0) \propto \mathbb{E}_{s \sim \rho^{\pi_{\theta}}}\bigg[ \nabla_{\theta} \mu_{\theta}(s)\ \mathbb{E}_{\xi}\Big[ \frac{\xi}{\sigma}\, Q^{\pi_{\theta}}\big(s, \mu_{\theta}(s) + \sigma \xi\big) \Big] \bigg].$$
Regarding the inner expectation (given $s$), we may now write
$$\mathbb{E}_{\xi}\Big[ \frac{\xi}{\sigma}\, Q^{\pi_{\theta}}\big(s, \mu_{\theta}(s) + \sigma \xi\big) \Big] = \mathbb{E}_{\xi}\bigg[ \frac{Q^{\pi_{\theta}}\big(s, \mu_{\theta}(s) + \sigma \xi\big) - Q^{\pi_{\theta}}\big(s, \mu_{\theta}(s)\big)}{\sigma}\, \xi \bigg],$$
simply because $\mathbb{E}[\xi] \equiv 0$.
Also note that, since $\xi$ is symmetrically distributed around zero,
$$\mathbb{E}_{\xi}\Big[ \frac{\xi}{\sigma}\, Q^{\pi_{\theta}}\big(s, \mu_{\theta}(s) + \sigma \xi\big) \Big]
= \mathbb{E}_{\xi}\bigg[ \frac{Q^{\pi_{\theta}}\big(s, \mu_{\theta}(s) + \sigma \xi\big) - Q^{\pi_{\theta}}\big(s, \mu_{\theta}(s)\big)}{\sigma}\, \xi \bigg]
= \mathbb{E}_{\xi}\bigg[ \frac{Q^{\pi_{\theta}}\big(s, \mu_{\theta}(s) + \sigma \xi\big) - Q^{\pi_{\theta}}\big(s, \mu_{\theta}(s) - \sigma \xi\big)}{2\sigma}\, \xi \bigg]. \qquad (5)$$
What we have shown here is that the part of the expression of the policy gradient which is not related to the gradient of the deterministic part of the involved Gaussian policy (i.e., its mean) may be exactly represented as an average of stochastic finite differences of the associated $Q$-function. This fact has important consequences, in terms of not only interpretability, but also algorithm design and performance analysis, as we will see below.
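As a quick sanity check of (5), here is a minimal Monte Carlo comparison under purely illustrative assumptions: the function `Q` below is a hypothetical smooth quadratic standing in for $Q^{\pi_{\theta}}(s, \cdot)$ at a fixed state $s$, and all dimensions, constants, and sample sizes are arbitrary. All three expressions in (5) agree in expectation, while the finite-difference forms exhibit visibly smaller variance.

```python
# Monte Carlo check of the identities in (5), with a toy quadratic Q(s, .).
import numpy as np

rng = np.random.default_rng(1)
q_dim, sigma, n = 4, 0.3, 100_000
mu_s = np.array([0.5, -1.0, 0.2, 0.0])            # mu_theta(s) at the fixed state s
A = rng.normal(size=(q_dim, q_dim)); A = A @ A.T  # random PSD matrix

def Q(a):
    """Hypothetical smooth stand-in for Q^{pi_theta}(s, .); a has shape (..., q_dim)."""
    return -0.5 * np.einsum("...i,ij,...j->...", a, A, a) + 3.0 * a.sum(axis=-1) + 10.0

xi = rng.normal(size=(n, q_dim))                  # xi ~ N(0, I)
Q_plus, Q_minus, Q_base = Q(mu_s + sigma * xi), Q(mu_s - sigma * xi), Q(mu_s)

est_naive = (Q_plus / sigma)[:, None] * xi                    # first expression in (5)
est_fd    = ((Q_plus - Q_base) / sigma)[:, None] * xi         # second (one-sided FD)
est_sym   = ((Q_plus - Q_minus) / (2 * sigma))[:, None] * xi  # third (symmetric FD)

for name, e in [("naive", est_naive), ("one-sided FD", est_fd), ("symmetric FD", est_sym)]:
    print(f"{name:12s} mean = {np.round(e.mean(axis=0), 2)}  "
          f"avg per-coordinate variance = {e.var(axis=0).mean():.1f}")
```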
–– Baselines: Stochastic, Deterministic and In-Between

Indeed, note that it is possible to also write
$$\mathbb{E}_{\xi}\Big[ \frac{\xi}{\sigma}\, Q^{\pi_{\theta}}\big(s, \mu_{\theta}(s) + \sigma \xi\big) \Big] = \mathbb{E}_{\xi, \xi'}\bigg[ \frac{Q^{\pi_{\theta}}\big(s, \mu_{\theta}(s) + \sigma \xi\big) - Q^{\pi_{\theta}}\big(s, \mu_{\theta}(s) + \sigma \xi'\big)}{\sigma}\, \xi \bigg],$$
where $\xi$ and $\xi'$ are iid and, trivially,
$$\mathbb{E}_{\xi, \xi'}\Big[ \frac{\xi}{\sigma}\, Q^{\pi_{\theta}}\big(s, \mu_{\theta}(s) + \sigma \xi'\big) \Big] = \mathbb{E}_{\xi}\Big[ \frac{\xi}{\sigma} \Big]\, \mathbb{E}_{\xi'}\big[ Q^{\pi_{\theta}}\big(s, \mu_{\theta}(s) + \sigma \xi'\big) \big] \equiv 0.$$
This gives exactly policy gradient with a stochastic baseline (neat, right?)! Of course, we may also write
$$\mathbb{E}_{\xi}\Big[ \frac{\xi}{\sigma}\, Q^{\pi_{\theta}}\big(s, \mu_{\theta}(s) + \sigma \xi\big) \Big] = \mathbb{E}_{\xi}\bigg[ \frac{Q^{\pi_{\theta}}\big(s, \mu_{\theta}(s) + \sigma \xi\big) - V^{\pi_{\theta}}(s)}{\sigma}\, \xi \bigg],$$
where $V^{\pi_{\theta}}(s) \equiv \mathbb{E}_{\xi'}\big[ Q^{\pi_{\theta}}\big(s, \mu_{\theta}(s) + \sigma \xi'\big) \big]$ is the corresponding value function at state $s$, giving exactly the classical (i.e., deterministic) baseline policy gradient. We could even look at variances. To do this, we will also assume that the $Q$-function is $L$-Lipschitz in its action argument, uniformly over policies, in the sense that, for every $s \in \mathcal{S}$ and every pair of actions $(a, a')$,
$$\sup_{\pi \in \Pi}\, \big| Q^{\pi}(s, a) - Q^{\pi}(s, a') \big| \le L\, \| a - a' \|_{2},$$
where $\Pi$ denotes the whole class of admissible stochastic policies considered in our formulation. Let us take baseline policy gradient first. We have
$$\big| Q^{\pi_{\theta}}\big(s, \mu_{\theta}(s) + \sigma \xi\big) - V^{\pi_{\theta}}(s) \big| \le \mathbb{E}_{\xi'}\Big[ \big| Q^{\pi_{\theta}}\big(s, \mu_{\theta}(s) + \sigma \xi\big) - Q^{\pi_{\theta}}\big(s, \mu_{\theta}(s) + \sigma \xi'\big) \big| \Big] \le L\, \sigma\, \mathbb{E}_{\xi'}\big[ \| \xi - \xi' \|_{2} \big].$$
As a result,
$$\bigg\| \frac{Q^{\pi_{\theta}}\big(s, \mu_{\theta}(s) + \sigma \xi\big) - V^{\pi_{\theta}}(s)}{\sigma}\, \xi \bigg\|_{2} \le L\, \mathbb{E}_{\xi'}\big[ \| \xi - \xi' \|_{2} \big]\, \| \xi \|_{2}.$$
This yields
$$\mathbb{E}_{\xi}\Bigg[ \bigg\| \frac{Q^{\pi_{\theta}}\big(s, \mu_{\theta}(s) + \sigma \xi\big) - V^{\pi_{\theta}}(s)}{\sigma}\, \xi \bigg\|_{2}^{2} \Bigg] \le L^{2}\, \mathbb{E}_{\xi}\Big[ \big( \mathbb{E}_{\xi'} \| \xi - \xi' \|_{2} \big)^{2}\, \| \xi \|_{2}^{2} \Big] \le L^{2}\, \mathbb{E}_{\xi}\Big[ \big( \| \xi \|_{2}^{2} + q \big)\, \| \xi \|_{2}^{2} \Big] = 2 L^{2} q (q + 1),$$
where $q$ denotes the dimension of the action space; in other words, the resulting second moment is of the order of $L^{2} q^{2}$.
Similarly, for the stochastic baseline we have
$$\big| Q^{\pi_{\theta}}\big(s, \mu_{\theta}(s) + \sigma \xi\big) - Q^{\pi_{\theta}}\big(s, \mu_{\theta}(s) + \sigma \xi'\big) \big| \le L\, \sigma\, \| \xi - \xi' \|_{2}.$$
This also yields
$$\mathbb{E}_{\xi, \xi'}\Bigg[ \bigg\| \frac{Q^{\pi_{\theta}}\big(s, \mu_{\theta}(s) + \sigma \xi\big) - Q^{\pi_{\theta}}\big(s, \mu_{\theta}(s) + \sigma \xi'\big)}{\sigma}\, \xi \bigg\|_{2}^{2} \Bigg] \le L^{2}\, \mathbb{E}_{\xi, \xi'}\big[ \| \xi - \xi' \|_{2}^{2}\, \| \xi \|_{2}^{2} \big] = 2 L^{2} q (q + 1),$$
which is of exactly the same order.
At this point, it would be interesting to note that, even if we attempted variance reduction via defining the averaged stochastic baseline
$$\frac{1}{M} \sum_{m=1}^{M} Q^{\pi_{\theta}}\big(s, \mu_{\theta}(s) + \sigma \xi'_{m}\big), \qquad \text{with } \xi'_{1}, \ldots, \xi'_{M} \text{ iid copies of } \xi,$$
the result would be precisely the same, since, by Jensen's inequality,
$$\mathbb{E}\Bigg[ \bigg( \frac{1}{M} \sum_{m=1}^{M} \| \xi - \xi'_{m} \|_{2} \bigg)^{2} \| \xi \|_{2}^{2} \Bigg] \le \frac{1}{M} \sum_{m=1}^{M} \mathbb{E}\big[ \| \xi - \xi'_{m} \|_{2}^{2}\, \| \xi \|_{2}^{2} \big] = \mathbb{E}\big[ \| \xi - \xi' \|_{2}^{2}\, \| \xi \|_{2}^{2} \big] = 2 q (q + 1).$$
We note that the bounds here might be somewhat loose, but they all exploit the same known facts as in the derivations above. What remains would be to evaluate the resulting policy gradient baselines empirically, to confirm whether they indeed exhibit similar empirical performance. We have partially done that in our recent preprint Zeroth-order Deterministic Policy Gradient (ZDPG), which is available for download on this website. Put simply, our results showed that, at least on the problem setting considered in our experiments, baseline policy gradient exhibits similar performance whether one or more sample rollouts (i.e., $M \ge 1$) are considered in the baseline evaluation (more rollouts imply that the resulting stochastic baseline is closer to the true, i.e., deterministic, baseline, that is, the value function).
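To complement this discussion, here is a minimal Monte Carlo sketch, under purely illustrative assumptions (a hypothetical quadratic stand-in for $Q^{\pi_{\theta}}(s, \cdot)$ at a fixed state, and arbitrary dimensions, constants, and sample sizes), comparing the empirical second moments of the score-function term under the deterministic baseline (the value function), a single-sample stochastic baseline, and an $M$-sample averaged stochastic baseline. All three come out of the same order, with the averaged baseline essentially matching the deterministic one, in line with the bounds above.

```python
# Toy comparison of second moments under different baselines.
import numpy as np

rng = np.random.default_rng(2)
q_dim, sigma, n, M = 4, 0.3, 100_000, 10
mu_s = np.array([0.5, -1.0, 0.2, 0.0])            # mu_theta(s) at the fixed state s
A = rng.normal(size=(q_dim, q_dim)); A = A @ A.T  # random PSD matrix

def Q(a):
    """Hypothetical smooth stand-in for Q^{pi_theta}(s, .); a has shape (..., q_dim)."""
    return -0.5 * np.einsum("...i,ij,...j->...", a, A, a) + 3.0 * a.sum(axis=-1) + 10.0

xi = rng.normal(size=(n, q_dim))
Q_plus = Q(mu_s + sigma * xi)                                         # Q at the perturbed action
V_det  = Q(mu_s + sigma * rng.normal(size=(500_000, q_dim))).mean()   # "true" baseline V(s)
Q_stoc = Q(mu_s + sigma * rng.normal(size=(n, q_dim)))                # 1-sample stochastic baseline
Q_avg  = Q(mu_s + sigma * rng.normal(size=(n, M, q_dim))).mean(axis=1)  # M-sample averaged baseline

for name, base in [("deterministic V(s)", V_det),
                   ("stochastic, M = 1 ", Q_stoc),
                   ("stochastic, M = 10", Q_avg)]:
    g = ((Q_plus - base) / sigma)[:, None] * xi   # baseline-corrected score-function term
    print(f"{name}: empirical second moment = {np.mean(np.sum(g**2, axis=1)):.1f}")
```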
–– Stochastic PG in Deterministic PG Form, and ZDPG

Combining the finite-difference representation (5) with the expression for the stochastic policy gradient, we end up with the equivalent representation (again, neat, right?)
$$\nabla_{\theta} V^{\pi_{\theta}}(\rho_0) \propto \mathbb{E}_{s \sim \rho^{\pi_{\theta}}}\bigg[ \nabla_{\theta} \mu_{\theta}(s)\ \mathbb{E}_{\xi}\bigg[ \frac{Q^{\pi_{\theta}}\big(s, \mu_{\theta}(s) + \sigma \xi\big) - Q^{\pi_{\theta}}\big(s, \mu_{\theta}(s)\big)}{\sigma}\, \xi \bigg] \bigg] \triangleq \widehat{\nabla}_{\theta}^{\,\sigma} V(\rho_0).$$
This of course coincides with the deterministic policy gradient representation of [Silver et al., 2014], but for the stochastic policy gradient setup with Gaussian parameterization and exploration parameter $\sigma > 0$. Additionally, an analogous relation holds identically for $\sigma \equiv 0$ (from [Silver et al., 2014]), giving, very carefully,
$$\nabla_{\theta} V^{\mu_{\theta}}(\rho_0) \propto \mathbb{E}_{s \sim \rho^{\mu_{\theta}}}\Big[ \nabla_{\theta} \mu_{\theta}(s)\ \nabla_{a} Q^{\mu_{\theta}}(s, a) \big|_{a = \mu_{\theta}(s)} \Big].$$
Observe the difference in the superscripts of the $Q$-function, as well as the different discounted state distributions. It should now be kind of obvious that $\widehat{\nabla}_{\theta}^{\,\sigma} V(\rho_0)$ may be thought of as a quasigradient of $\nabla_{\theta} V^{\mu_{\theta}}(\rho_0)$, which is the exact gradient of our infinite horizon dynamic program, where the search is over all admissible deterministic policies parameterized by $\theta$. At this point, it is worth mentioning that, in our work on ZDPG, we instead define a quasigradient of the form
$$\mathbb{E}_{s \sim \rho^{\mu_{\theta}}}\bigg[ \nabla_{\theta} \mu_{\theta}(s)\ \mathbb{E}_{u}\bigg[ \frac{Q^{\mu_{\theta}}\big(s, \mu_{\theta}(s) + \delta u\big) - Q^{\mu_{\theta}}\big(s, \mu_{\theta}(s)\big)}{\delta}\, u \bigg] \bigg],$$
where $u$ is a random low-dimensional exploration direction in the action space and $\delta > 0$ is a smoothing parameter.
This quasigradient, as we have shown, is provably a uniform approximation to $\nabla_{\theta} V^{\mu_{\theta}}(\rho_0)$ (in fact, under weaker Lipschitz assumptions on the associated $Q$-function).
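For concreteness, here is a schematic, simplified sketch of this kind of two-point, action-space zeroth-order update. It is not the exact algorithm of [Kumar et al., 2020]: the toy environment, the linear policy, the Gaussian exploration direction, the smoothing parameter, the learning rate, and the truncated-rollout $Q$-estimates are all illustrative assumptions.

```python
# Schematic ZDPG-style update: a two-point, action-space finite difference of the
# Q-function of the *deterministic* policy, pushed through the Jacobian of mu_theta.
import numpy as np

rng = np.random.default_rng(3)
gamma, delta, lr = 0.95, 0.1, 1e-3
state_dim, act_dim = 3, 2

def env_reset():
    return rng.normal(size=state_dim)

def env_step(s, a):
    s_next = 0.9 * s + 0.1 * np.tanh(a).sum()
    return s_next, -np.sum(s**2) - 0.01 * np.sum(a**2)

def mu(theta, s):                       # deterministic linear policy, theta: (act_dim, state_dim)
    return theta @ s

def q_rollout(theta, s, a, horizon=200):
    """Truncated-rollout estimate of Q^{mu_theta}(s, a): play a first, then follow mu_theta."""
    ret, disc = 0.0, 1.0
    for _ in range(horizon):
        s, r = env_step(s, a)
        ret += disc * r
        disc *= gamma
        a = mu(theta, s)                # after the first step, follow mu_theta
    return ret

def zdpg_quasigradient(theta, s):
    u = rng.normal(size=act_dim)        # random exploration direction in the action space
    a0 = mu(theta, s)
    fd = (q_rollout(theta, s, a0 + delta * u) - q_rollout(theta, s, a0)) / delta
    # For mu_theta(s) = theta @ s, the chain rule gives the update matrix outer(fd * u, s).
    return np.outer(fd * u, s)

theta = rng.normal(scale=0.1, size=(act_dim, state_dim))
for it in range(1000):                  # plain stochastic quasigradient ascent
    # s is drawn from the initial distribution here, purely for simplicity of the sketch.
    theta += lr * zdpg_quasigradient(theta, env_reset())
```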
But what about $\widehat{\nabla}_{\theta}^{\,\sigma} V(\rho_0)$? Well, let's see: simply adding and subtracting terms in its expression (swapping $Q^{\pi_{\theta}}$ for $Q^{\mu_{\theta}}$, and then $\rho^{\pi_{\theta}}$ for $\rho^{\mu_{\theta}}$), we can identify three sources of error in the quasigradient $\widehat{\nabla}_{\theta}^{\,\sigma} V(\rho_0)$, relative to the exact gradient $\nabla_{\theta} V^{\mu_{\theta}}(\rho_0)$: (i) the stochastic finite differences with exploration parameter $\sigma$, used in place of the exact action gradient of the $Q$-function; (ii) the mismatch between the $Q$-function of the stochastic policy, $Q^{\pi_{\theta}}$, and that of the deterministic policy, $Q^{\mu_{\theta}}$; and (iii) the mismatch between the discounted state distributions $\rho^{\pi_{\theta}}$ and $\rho^{\mu_{\theta}}$. Again, it is noteworthy that, in our work on ZDPG, gradient approximation errors are solely due to perturbations; the two mismatch error sources above do not appear in the analysis at all.

It turns out that this difference in the approximation quality of the two quasigradients just discussed ($\widehat{\nabla}_{\theta}^{\,\sigma} V(\rho_0)$ and that of ZDPG) translates into rather significant effects with regard to empirical performance. We would like to very briefly demonstrate this through a numerical comparison between stochastic PG and ZDPG on the inverted pendulum problem, which, although deterministic, is a standard benchmark for testing RL algorithms. In the plots, "ZDPG" and "ZDPG-S" are ZDPG variants that employ non-symmetric and symmetric stochastic finite differences, respectively, very similarly to the second and third expressions in (5), but in the zeroth-order setting. Also, PG-B denotes the standard stochastic PG method with the stochastic baseline, as defined in our discussion about baselines above. For all experiments, the forgetting factor is fixed to the same value, and algorithm hyperparameters such as stepsizes (chosen constant), exploration variances, smoothing parameters, etc. (where applicable) have been "optimally" tuned via trial-and-error. Further, in the left plot we see results without any kind of minibatching, whereas in the right plot we employ full gradient minibatches of fixed size. For policy evaluation we have used deterministic policies (the "mean" learned policy for PG-B), and cumulative rewards are collected over a fixed number of steps (problem stages), whose mean and variance are shown, computed over a number of independent trials for each method. For more details on the algorithms, see our paper on ZDPG [Kumar et al., 2020].

We observe that ZDPG-S substantially outperforms PG-B in both plots, not only in terms of average performance, but also in terms of statistical variability. This shows that, really, injection of unnecessary exploration noise into the policy learning procedure can actually make performance worse, both on average and in terms of margins. On a side note, also observe that minibatching actually hampered the performance of PG-B; this is probably due to the method getting stuck at local minima or flat regions, together with the use of a constant stepsize. Of course, ZDPG-S is also prone to "sticking" (the problem is non-convex anyway), but much less so.

–– Main Conclusions

To summarize: (stochastic) policy gradients are, at their core, zeroth-order quasigradients of the original dynamic program, but rather inexact ones. Beyond the unavoidable perturbation (finite-difference) error, they also carry errors due to the mismatch between the stochastic and deterministic policies, both in the $Q$-function and in the induced discounted state distribution, while injecting unnecessary exploration noise into policy learning. ZDPG removes these extra error sources by working directly with a deterministic policy and a minimal amount of action-space exploration, which is reflected in both better average performance and tighter performance margins in our experiments.

Thanks for reading; comments are most welcome!

References
[Kumar et al., 2020] H. Kumar, D. S. Kalogerias, G. J. Pappas, and A. Ribeiro, "Zeroth-order Deterministic Policy Gradient," 2020.
[Silver et al., 2014] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, "Deterministic Policy Gradient Algorithms," Proceedings of the 31st International Conference on Machine Learning (ICML), 2014.
[Sutton et al., 2000] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, "Policy Gradient Methods for Reinforcement Learning with Function Approximation," Advances in Neural Information Processing Systems 12, 2000.