1 Introduction
Model-free reinforcement learning has achieved state-of-the-art results in many challenging domains. However, these methods learn black-box control policies and typically suffer from poor sample complexity and generalization. Alternatively, model-based approaches seek to model the environment the agent is interacting in. Many model-based approaches utilize Model Predictive Control (MPC) to perform complex control tasks (González et al., 2011; Lenz et al., 2015; Liniger et al., 2014; Kamel et al., 2015; Erez et al., 2012; Alexis et al., 2011; Bouffard et al., 2012; Neunert et al., 2016). MPC leverages a predictive model of the controlled system and solves an optimization problem online in a receding-horizon fashion to produce a sequence of control actions. Usually the first control action is applied to the system, after which the optimization problem is solved again for the next time step.
Formally, MPC requires that at each time step we solve the optimization problem:
(1)
\[
\begin{aligned}
x^\star_{1:T}, u^\star_{1:T} ={} & \operatorname*{argmin}_{x_{1:T},\, u_{1:T}} \; \sum_{t=1}^{T} C_t(x_t, u_t) \\
\text{subject to}\;& x_1 = x_{\text{init}}, \quad x_{t+1} = f(x_t, u_t), \quad x_t \in \mathcal{X},\; u_t \in \mathcal{U},
\end{aligned}
\]
where $x_t$ and $u_t$ are the state and control at time $t$, $\mathcal{X}$ and $\mathcal{U}$ are constraints on valid states and controls, $C_t$ is a (potentially time-varying) cost function, $f$ is a dynamics model, and $x_{\text{init}}$ is the initial state of the system. The optimization problem in (1) can be efficiently solved in many ways, for example with the finite-horizon iterative Linear Quadratic Regulator (iLQR) algorithm (Li and Todorov, 2004). Although these techniques are widely used in control domains, much work in deep reinforcement learning or imitation learning opts instead to use a much simpler policy class such as a linear function or neural network. The advantage of these policy classes is that they are differentiable and the loss can be directly optimized with respect to them, while it is typically not possible to do full end-to-end learning with model-based approaches.
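As a concrete, simplified instance of problem (1) and the receding-horizon loop described above, the following sketch controls a double integrator with an unconstrained quadratic objective, solving each finite-horizon problem by least squares rather than iLQR; the dynamics, cost weights, and horizon are all made up for the example:

```python
import numpy as np

# Double integrator: position/velocity state, force control (an illustrative
# stand-in for the dynamics model f in problem (1)).
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])

def solve_finite_horizon(x0, T=10, q=1.0, r=0.1):
    """Solve the unconstrained finite-horizon problem by least squares.

    Stacks x_{k+1} = A^(k+1) x0 + sum_j A^(k-j) B u_j and minimizes
    sum_k q*||x_k||^2 + r*||u_k||^2 over the control sequence.
    """
    n, m = 2, 1
    M = np.zeros((T * n, T * m))   # maps stacked controls to stacked states
    c = np.zeros(T * n)            # effect of the initial state
    Ak = np.eye(n)
    for k in range(T):             # row block k represents x_{k+1}
        Ak = A @ Ak                # A^(k+1)
        c[k*n:(k+1)*n] = Ak @ x0
        Aj = np.eye(n)
        for j in range(k, -1, -1):  # u_j enters x_{k+1} through A^(k-j) B
            M[k*n:(k+1)*n, j*m:(j+1)*m] = Aj @ B
            Aj = A @ Aj
    # Weighted least squares: [sqrt(q) M; sqrt(r) I] u ~= [-sqrt(q) c; 0]
    lhs = np.vstack([np.sqrt(q) * M, np.sqrt(r) * np.eye(T * m)])
    rhs = np.concatenate([-np.sqrt(q) * c, np.zeros(T * m)])
    u = np.linalg.lstsq(lhs, rhs, rcond=None)[0]
    return u.reshape(T, m)

# Receding-horizon loop: solve, apply only the first control, re-solve.
x = np.array([1.0, 0.0])
for _ in range(30):
    u_seq = solve_finite_horizon(x)
    x = A @ x + (B @ u_seq[0])
print(np.linalg.norm(x))   # the state is driven toward the origin
```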
In this paper, we consider the task of learning MPC-based policies in an end-to-end fashion, illustrated in Figure 1. That is, we treat MPC as a generic policy class parameterized by some representations of the cost $C_\theta$ and dynamics model $f_\theta$. By differentiating through the optimization problem, we can learn the costs and dynamics model to perform a desired task. This is in contrast to regressing on collected dynamics or trajectory rollout data and learning each component in isolation, and comes with the typical advantages of end-to-end learning (the ability to train directly based upon the task loss of interest, the ability to "specialize" parameters for a given task, etc.).
Still, efficiently differentiating through a complex policy class like MPC is challenging. Previous work with similar aims has either simply unrolled and differentiated through a simple optimization procedure (Tamar et al., 2017) or has considered generic optimization solvers that do not scale to the size of MPC problems (Amos and Kolter, 2017). This paper makes the following two contributions to this space. First, we provide an efficient method for analytically differentiating through an iterative non-convex optimization procedure based upon a box-constrained iterative LQR solver (Tassa et al., 2014); in particular, we show that the analytical derivative can be computed using one additional backward pass of a modified iterative LQR solver. Second, we empirically show that in imitation learning scenarios we can recover the cost and dynamics from an MPC expert with a loss based only on the actions (and not the states). In one notable experiment, we show that directly optimizing the imitation loss results in better performance than vanilla system identification.
2 Background and Related Work
Pure model-free techniques for policy search have demonstrated promising results in many domains by learning reactive policies that directly map observations to actions (Mnih et al., 2013; Oh et al., 2016; Gu et al., 2016b; Lillicrap et al., 2015; Schulman et al., 2015, 2016; Gu et al., 2016a). Despite their success, model-free methods have many drawbacks and limitations, including a lack of interpretability, poor generalization, and high sample complexity. Model-based methods are known to be more sample-efficient than their model-free counterparts. These methods generally rely on learning a dynamics model directly from interactions with the real system and then integrating the learned model into the control policy (Schneider, 1997; Abbeel et al., 2006; Deisenroth and Rasmussen, 2011; Heess et al., 2015; Boedecker et al., 2014). More recent approaches use a deep network to learn low-dimensional latent state representations and associated dynamics models in this learned representation. They then apply standard trajectory optimization methods on these learned embeddings (Lenz et al., 2015; Watter et al., 2015; Levine et al., 2016). However, these methods still require a manually specified and hand-tuned cost function, which can become even more difficult to design in a latent representation. Moreover, there is no guarantee that the learned dynamics model accurately captures the portions of the state space relevant for the task at hand.
To leverage the benefits of both approaches, there has been significant interest in combining the model-based and model-free paradigms. In particular, much attention has been dedicated to utilizing model-based priors to accelerate the model-free learning process. For instance, synthetic training data can be generated by model-based control algorithms to guide the policy search or prime a model-free policy (Sutton, 1990; Theodorou et al., 2010; Levine and Abbeel, 2014; Gu et al., 2016b; Venkatraman et al., 2016; Levine et al., 2016; Chebotar et al., 2017; Nagabandi et al., 2017; Sun et al., 2017). Bansal et al. (2017) learn a controller and then distill it into a neural network policy that is then fine-tuned with model-free policy learning. However, this line of work usually keeps the model separate from the learned policy.
Alternatively, the policy can include an explicit planning module that leverages learned models of the system or environment, both of which are learned through model-free techniques. For example, the classic Dyna-Q algorithm (Sutton, 1990) simultaneously learns a model of the environment and uses it to plan. More recent work has explored incorporating such structure into deep networks and learning the policies in an end-to-end fashion. Tamar et al. (2016) use a recurrent network to predict the value function by approximating the value iteration algorithm with convolutional layers. Karkus et al. (2017) connect a dynamics model to a planning algorithm and formulate the policy as a structured recurrent network. Silver et al. (2016) and Oh et al. (2017) perform multiple rollouts using an abstract dynamics model to predict the value function. A similar approach is taken by Weber et al. (2017), but they directly predict the next action and reward from rollouts of an explicit environment model. Farquhar et al. (2017) extend model-free approaches, such as DQN (Mnih et al., 2015) and A3C (Mnih et al., 2016), by planning with a tree-structured neural network to predict the cost-to-go. While these approaches have demonstrated impressive results in discrete state and action spaces, they are not applicable to continuous control problems.
To tackle continuous state and action spaces, Pascanu et al. (2017) propose a neural architecture that uses an abstract environment model to plan and is trained directly from an external task loss. Pong et al. (2018) learn goal-conditioned value functions and use them to plan single or multiple steps of actions in an MPC fashion. Similarly, Pathak et al. (2018) train a goal-conditioned policy to perform rollouts in an abstract feature space but ground the policy with a loss term that corresponds to true dynamics data. The aforementioned approaches can be interpreted as a distilled optimal controller that does not separate components for the cost and dynamics. Taking this analogy further, another strategy is to differentiate through an optimal control algorithm itself. Okada et al. (2017) and Pereira et al. (2018) present a means to differentiate through path integral optimal control (Williams et al., 2016, 2017) and learn a planning policy end-to-end. Srinivas et al. (2018) show how to embed differentiable planning (unrolled gradient descent over actions) within a goal-directed policy. In a similar vein, Tamar et al. (2017) differentiate through an iterative LQR (iLQR) solver (Li and Todorov, 2004; Xie et al., 2017; Tassa et al., 2014) to learn a cost-shaping term offline. This shaping term enables a shorter-horizon controller to approximate the behavior of a solver with a longer horizon, to save computation at runtime.
Contributions of our paper. All of these methods require differentiating through planning procedures by explicitly "unrolling" the optimization algorithm itself. While this is a reasonable strategy, it is both memory- and computation-expensive and is challenging when unrolling through many iterations, because the time and space complexity of the backward pass grows linearly with that of the forward pass. In contrast, we address this issue by showing how to analytically differentiate through the fixed point of a nonlinear MPC solver. Specifically, we compute the derivatives of an iLQR solver with a single LQR step in the backward pass. This makes the learning process more computationally tractable while still allowing us to plan in continuous state and action spaces. Unlike model-free approaches, explicit cost and dynamics components can be extracted and analyzed on their own. Moreover, in contrast to pure model-based approaches, the dynamics model and cost function can be learned entirely end-to-end.
3 Differentiable LQR
Discrete-time finite-horizon LQR is a well-studied control method that optimizes a convex quadratic objective function with respect to affine state-transition dynamics from an initial system state $x_{\text{init}}$. Specifically, LQR finds the optimal nominal trajectory $\tau^\star_{1:T} = \{x^\star_t, u^\star_t\}_{1:T}$ by solving the optimization problem
(2)
\[
\tau^\star_{1:T} = \operatorname*{argmin}_{\tau_{1:T}} \; \sum_{t=1}^{T} \tfrac{1}{2}\, \tau_t^\top C_t \tau_t + c_t^\top \tau_t
\quad \text{subject to}\; x_1 = x_{\text{init}},\; x_{t+1} = F_t \tau_t + f_t,
\]
where $\tau_t = (x_t, u_t)$ denotes the concatenated state and control.
From a policy learning perspective, this can be interpreted as a module with unknown parameters $\theta = \{C_t, c_t, F_t, f_t, x_{\text{init}}\}$, which can be integrated into a larger end-to-end learning system. The learning process involves taking derivatives of some loss function $\ell(\tau^\star)$, which are then used to update the parameters. Instead of directly computing each of the individual gradients, we present an efficient way of computing the derivatives of the loss function with respect to the parameters

(3)
\[
\nabla_{C_t}\, \ell, \quad \nabla_{c_t}\, \ell, \quad \nabla_{F_t}\, \ell, \quad \nabla_{f_t}\, \ell, \quad \nabla_{x_{\text{init}}}\, \ell.
\]
By interpreting LQR from an optimization perspective (Boyd, 2008), we associate dual variables $\lambda_t$ with the state constraints. The Lagrangian of the optimization problem is then given by
(4)
\[
L(\tau, \lambda) = \sum_{t=1}^{T} \left[ \tfrac{1}{2}\, \tau_t^\top C_t \tau_t + c_t^\top \tau_t \right]
+ \sum_{t=0}^{T-1} \lambda_t^\top \left( F_t \tau_t + f_t - x_{t+1} \right),
\]
where the initial constraint is represented by setting $F_0 = 0$ and $f_0 = x_{\text{init}}$. Differentiating (4) with respect to $\tau_t$ yields
(5)
\[
\nabla_{\tau_t} L = C_t \tau_t + c_t + F_t^\top \lambda_t - \begin{bmatrix} \lambda_{t-1} \\ 0 \end{bmatrix}
\quad (t < T), \qquad
\nabla_{\tau_T} L = C_T \tau_T + c_T - \begin{bmatrix} \lambda_{T-1} \\ 0 \end{bmatrix}.
\]
Thus, the normal approach to solving LQR problems with dynamic Riccati recursion can be viewed as an efficient way of solving the following KKT system
(6)
\[
K \begin{bmatrix} \tau^\star \\ \lambda^\star \end{bmatrix} = \begin{bmatrix} -c \\ -f \end{bmatrix},
\]
where $K$ is the symmetric, block-banded KKT matrix formed from the stacked cost terms $C_t$ and the dynamics constraint Jacobian, and $c$ and $f$ stack the $c_t$ and $f_t$.
Given an optimal nominal trajectory $\tau^\star_{1:T}$, (5) shows how to compute the optimal dual variables $\lambda^\star$ with the backward recursion
(7)
\[
\lambda^\star_{T-1} = C_{T,x}\, \tau^\star_T + c_{T,x}, \qquad
\lambda^\star_{t-1} = F_{t,x}^\top \lambda^\star_t + C_{t,x}\, \tau^\star_t + c_{t,x},
\]
where $C_{t,x}$, $c_{t,x}$, and $F_{t,x}$ denote the blocks of $C_t$, $c_t$, and $F_t$ associated with the state component of $\tau_t$. Now that we have the optimal trajectory and dual variables, we can compute the gradients of the loss with respect to the parameters. Since LQR is a constrained convex quadratic program, the derivatives of the loss with respect to the LQR parameters can be obtained by implicitly differentiating the KKT conditions. Applying the approach from Section 3 of Amos and Kolter (2017), the derivatives are
(8)
\[
\nabla_{C_t}\, \ell = \tfrac{1}{2}\left( d_{\tau_t} \otimes \tau^\star_t + \tau^\star_t \otimes d_{\tau_t} \right), \quad
\nabla_{c_t}\, \ell = d_{\tau_t}, \quad
\nabla_{F_t}\, \ell = d_{\lambda_t} \otimes \tau^\star_t + \lambda^\star_t \otimes d_{\tau_t}, \quad
\nabla_{f_t}\, \ell = d_{\lambda_t}, \quad
\nabla_{x_{\text{init}}}\, \ell = d_{\lambda_0},
\]
where $\otimes$ is the outer product operator, and $d_\tau$ and $d_\lambda$ are obtained by solving the linear system
(9)
\[
K \begin{bmatrix} d_\tau \\ d_\lambda \end{bmatrix} = \begin{bmatrix} -\nabla_{\tau^\star}\, \ell \\ 0 \end{bmatrix}.
\]
We observe that (9) is of the same form as the linear system in (6) for the LQR problem. Therefore, we can leverage this insight and solve (9) efficiently by solving another LQR problem that replaces $c$ with $\nabla_{\tau^\star}\ell$ and $f$ with 0. Moreover, this approach enables us to reuse the factorization of $K$ from the forward pass instead of recomputing it. Algorithm 1 summarizes the forward and backward passes for a differentiable LQR module.
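For small problems, the forward and backward passes of Algorithm 1 can be mimicked by forming the KKT system explicitly instead of running the Riccati recursion. The sketch below, with made-up problem data and a toy loss, solves the forward KKT system directly, reuses the same matrix for the backward solve, and checks the resulting gradient with respect to the initial state against finite differences. The sign convention of the right-hand side is derived in the comments rather than taken from the text, so it differs from the negated form above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, T = 2, 1, 3                       # state, control, horizon
d = n + m                               # per-step variable tau_t = (x_t, u_t)

F = rng.standard_normal((n, d)) * 0.3   # time-invariant dynamics x' = F tau + f
f = rng.standard_normal(n) * 0.1
c = rng.standard_normal(T * d)          # linear cost terms; quadratic cost = I
x_init = np.array([1.0, -0.5])

def kkt_solve(x_init):
    """Forward pass: solve the LQR KKT system K z = [-c; b] directly,
    where the constraints are written as G tau = b."""
    H = np.eye(T * d)                            # block-diagonal cost Hessian
    G = np.zeros((T * n, T * d))
    b = np.zeros(T * n)
    G[0:n, 0:n] = np.eye(n); b[0:n] = x_init     # x_1 = x_init
    for t in range(1, T):                        # x_{t+1} = F tau_t + f
        G[t*n:(t+1)*n, (t-1)*d:t*d] = -F
        G[t*n:(t+1)*n, t*d:t*d+n] = np.eye(n)
        b[t*n:(t+1)*n] = f
    K = np.block([[H, G.T], [G, np.zeros((T*n, T*n))]])
    z = np.linalg.solve(K, np.concatenate([-c, b]))
    return z[:T*d], K                            # tau*, KKT matrix

tau, K = kkt_solve(x_init)
loss = tau.sum()                                 # example loss ell(tau*)

# Backward pass: one more solve with the *same* KKT matrix. Since K is
# symmetric and K z = [-c; b], the lower block of K^{-1} [grad_tau; 0]
# is exactly d ell / d b, and x_init occupies the first block of b.
grad_tau = np.ones(T * d)                        # d ell / d tau*
dz = np.linalg.solve(K, np.concatenate([grad_tau, np.zeros(T * n)]))
grad_x_init = dz[T*d:T*d+n]

# Finite-difference check of the analytic gradient.
eps = 1e-6
fd = np.zeros(n)
for i in range(n):
    e = np.zeros(n); e[i] = eps
    fd[i] = (kkt_solve(x_init + e)[0].sum() - loss) / eps
print(grad_x_init, fd)   # the two gradients agree
```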
4 Differentiable MPC
While LQR is a powerful tool, it does not cover realistic control problems with nonlinear dynamics and cost. Furthermore, most control problems have natural bounds on the control space that can often be expressed as box constraints. These highly non-convex problems, which we will refer to as model predictive control (MPC), are well-studied in the control literature and can be expressed in the general form
(10)
\[
\begin{aligned}
x^\star_{1:T}, u^\star_{1:T} ={} & \operatorname*{argmin}_{x_{1:T},\, u_{1:T}} \; \sum_{t=1}^{T} C_\theta(x_t, u_t) \\
\text{subject to}\;& x_1 = x_{\text{init}}, \quad x_{t+1} = f_\theta(x_t, u_t), \quad \underline{u} \le u_t \le \overline{u},
\end{aligned}
\]
where the non-convex cost function $C_\theta$ and non-convex dynamics function $f_\theta$ are (potentially) parameterized by some $\theta$. We note that more generic constraints on the control and state space can be represented as penalties and barriers in the cost function. The standard way of solving the control problem (10) is by iteratively forming and optimizing a convex approximation
(11)
\[
\begin{aligned}
x^{i+1}_{1:T}, u^{i+1}_{1:T} ={} & \operatorname*{argmin}_{x_{1:T},\, u_{1:T}} \; \sum_{t=1}^{T} \tilde{C}^i_\theta(x_t, u_t) \\
\text{subject to}\;& x_1 = x_{\text{init}}, \quad x_{t+1} = \tilde{f}^i_\theta(x_t, u_t), \quad \underline{u} \le u_t \le \overline{u},
\end{aligned}
\]
where we have defined the second-order Taylor approximation of the cost around the current iterate $\tau^i_t = (x^i_t, u^i_t)$ as
(12)
\[
\tilde{C}^i_\theta(\tau_t) = \tfrac{1}{2}\, (\tau_t - \tau^i_t)^\top C^i_t\, (\tau_t - \tau^i_t) + (c^i_t)^\top (\tau_t - \tau^i_t) + C_\theta(\tau^i_t),
\]
with $C^i_t = \nabla^2_{\tau_t} C_\theta(\tau^i_t)$ and $c^i_t = \nabla_{\tau_t} C_\theta(\tau^i_t)$. We also have a first-order Taylor approximation of the dynamics around $\tau^i_t$ as
(13)
\[
\tilde{f}^i_\theta(\tau_t) = f_\theta(\tau^i_t) + F^i_t\, (\tau_t - \tau^i_t),
\]
with $F^i_t = \nabla_{\tau_t} f_\theta(\tau^i_t)$. In practice, a fixed point of (11) is often reached, especially when the dynamics are smooth. As such, differentiating the non-convex problem (10) can be done exactly by differentiating the final convex approximation. Without the box constraints, the fixed point of (11) could be differentiated with LQR as we showed in Section 3. In the next section, we show how to extend this to the case where we have box constraints on the controls as well.
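The fixed-point argument above is easiest to see in one dimension. The sketch below (a generic contraction, not an MPC problem) reaches a fixed point by iteration and then recovers the derivative of the fixed point with respect to a parameter from the implicit function theorem, without differentiating through the iterations:

```python
import numpy as np

def g(z, theta):
    # A contraction in z (|dg/dz| <= 0.5), so iteration converges to a
    # unique fixed point z* = g(z*, theta).
    return 0.5 * np.cos(z) + theta

def fixed_point(theta, iters=100):
    z = 0.0
    for _ in range(iters):
        z = g(z, theta)
    return z

theta = 0.3
z_star = fixed_point(theta)

# Implicit differentiation at the fixed point z* = g(z*, theta):
#   dz*/dtheta = (1 - dg/dz)^(-1) * dg/dtheta,  with dg/dtheta = 1 here.
dg_dz = -0.5 * np.sin(z_star)
dz_dtheta = 1.0 / (1.0 - dg_dz)

# This agrees with differentiating through the unrolled iteration (finite
# differences stand in for unrolling), but needs only the final iterate.
eps = 1e-6
fd = (fixed_point(theta + eps) - fixed_point(theta)) / eps
print(dz_dtheta, fd)
```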
4.1 Differentiating Box-Constrained QPs
First, we consider how to differentiate a more generic box-constrained convex QP of the form
(14)
\[
z^\star = \operatorname*{argmin}_{\underline{z} \le z \le \overline{z}} \; \tfrac{1}{2}\, z^\top Q z + q^\top z.
\]
Given the active inequality constraints at the solution, written in the form $\tilde{G} z^\star = \tilde{b}$ (each row of $\tilde{G}$ selects a coordinate of $z$ at its bound), this problem turns into an equality-constrained optimization problem with the solution given by the linear system
(15)
\[
\begin{bmatrix} Q & \tilde{G}^\top \\ \tilde{G} & 0 \end{bmatrix}
\begin{bmatrix} z^\star \\ \lambda^\star \end{bmatrix} =
\begin{bmatrix} -q \\ \tilde{b} \end{bmatrix}.
\]
With some loss function $\ell$ that depends on $z^\star$, we can use the approach in Amos and Kolter (2017) to obtain the derivatives of $\ell$ with respect to $Q$, $q$, $\underline{z}$, and $\overline{z}$ as
(16)
\[
\nabla_Q\, \ell = \tfrac{1}{2}\left( d_z \otimes z^\star + z^\star \otimes d_z \right), \qquad
\nabla_q\, \ell = d_z, \qquad
\nabla_{\tilde{b}}\, \ell = -d_\lambda,
\]
where the derivatives with respect to the active entries of $\underline{z}$ and $\overline{z}$ are the corresponding entries of $\nabla_{\tilde{b}}\,\ell$, and are zero for inactive entries,
and where $d_z$ and $d_\lambda$ are obtained by solving the linear system
(17)
\[
\begin{bmatrix} Q & \tilde{G}^\top \\ \tilde{G} & 0 \end{bmatrix}
\begin{bmatrix} d_z \\ d_\lambda \end{bmatrix} =
\begin{bmatrix} -\nabla_{z^\star}\, \ell \\ 0 \end{bmatrix}.
\]
The constraint $\tilde{G} d_z = 0$ is equivalent to the constraint $d_{z,i} = 0$ if $z^\star_i \in \{\underline{z}_i, \overline{z}_i\}$, i.e., if the $i$th inequality constraint is active. Thus solving the system in (17) is equivalent to solving the optimization problem
(18)
\[
d_z = \operatorname*{argmin}_{d} \; \tfrac{1}{2}\, d^\top Q d + (\nabla_{z^\star}\, \ell)^\top d
\quad \text{subject to}\; d_i = 0 \;\text{if}\; z^\star_i \in \{\underline{z}_i, \overline{z}_i\}.
\]
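A minimal numeric sketch of this active-set reasoning, with a made-up two-dimensional box QP: the active coordinate is frozen and the loss gradient is obtained from the reduced free-variable system, matching finite differences:

```python
import numpy as np

Q = np.array([[1.0, 0.2], [0.2, 1.0]])   # made-up positive-definite cost
p = np.array([-2.0, 0.5])
lo, hi = -1.0, 1.0

def solve_box_qp(p, iters=2000, step=0.1):
    """Projected gradient on (1/2) z'Qz + p'z subject to lo <= z <= hi."""
    z = np.zeros(2)
    for _ in range(iters):
        z = np.clip(z - step * (Q @ z + p), lo, hi)
    return z

z = solve_box_qp(p)
active = (np.abs(z - lo) < 1e-6) | (np.abs(z - hi) < 1e-6)
free = ~active

# Differentiate ell(z*) = sum(z*) w.r.t. p via the reduced (free) system:
# on the active set dz = 0, and Q_FF z_F + Q_FA z_A + p_F = 0 on the rest,
# so dz_F/dp_F = -Q_FF^{-1} and the active components get zero gradient.
grad_p = np.zeros(2)
Qff = Q[np.ix_(free, free)]
grad_p[free] = -np.linalg.solve(Qff.T, np.ones(free.sum()))

# Finite-difference check.
eps = 1e-6
fd = np.array([(solve_box_qp(p + eps * e).sum() - z.sum()) / eps
               for e in np.eye(2)])
print(grad_p, fd)
```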
4.2 Differentiating MPC with Box Constraints
At a fixed point, we can use (16) to compute the derivatives of the MPC problem, where $d_\tau$ and $d_\lambda$ are found by solving the linear system in (9) with the additional constraint that $d_{u_{t},i} = 0$ if $u^\star_{t,i} \in \{\underline{u}_i, \overline{u}_i\}$. Solving this system can be equivalently written as a zero-constrained LQR problem of the form
(19)
\[
\begin{aligned}
d_\tau = \operatorname*{argmin}_{\tau_{1:T}} \;& \sum_{t=1}^{T} \tfrac{1}{2}\, \tau_t^\top C^n_t \tau_t + (\nabla_{\tau^\star_t}\, \ell)^\top \tau_t \\
\text{subject to}\;& x_1 = 0, \quad x_{t+1} = F^n_t \tau_t, \quad u_{t,i} = 0 \;\text{if}\; u^\star_{t,i} \in \{\underline{u}_i, \overline{u}_i\},
\end{aligned}
\]
where $n$ is the iteration at which (11) reaches a fixed point, and $C^n_t$ and $F^n_t$ are the corresponding approximations to the objective and dynamics defined earlier. Algorithm 2 summarizes the proposed differentiable MPC module. To solve the MPC problem in (10) and reach the fixed point in (11), we use the box-DDP heuristic (Tassa et al., 2014). For the zero-constrained LQR problem in (19) to compute the derivatives, we use an LQR solver that zeros the appropriate controls.
4.3 Drawbacks of Our Approach
Sometimes the controller does not run long enough to reach a fixed point of (11), or a fixed point does not exist, which often happens when using neural networks to approximate the dynamics. When this happens, (19) cannot be used to differentiate through the controller, because it assumes a fixed point. Differentiating through the final iLQR iterate that is not a fixed point will usually give the wrong gradients. Treating the iLQR procedure as a compute graph and differentiating through the unrolled operations is a reasonable alternative in this scenario that obtains surrogate gradients for the control problem. However, as we empirically show in Section 5.1, the backward pass of this method scales linearly with the number of iLQR iterations used in the forward pass. Instead, fixed-point differentiation takes constant time and requires only a single additional LQR solve.
5 Experimental Results
In this section, we present several results that highlight the performance and capabilities of differentiable MPC in comparison to neural network policies and vanilla system identification (SysId). We show 1) superior runtime performance compared to an unrolled solver, 2) the ability of our method to recover the cost and dynamics of a controller with imitation, and 3) the benefit of directly optimizing the task loss over vanilla SysId.
We have released our differentiable MPC solver as a standalone open-source package that is available at https://github.com/locuslab/mpc.pytorch, and our experimental code for this paper is also openly available at https://github.com/locuslab/differentiable-mpc. Our experiments are implemented with PyTorch (Paszke et al., 2017).
5.1 MPC Solver Performance
Figure 2 highlights the performance of our differentiable MPC solver. We compare to an alternative version where each box-constrained iLQR iteration is individually unrolled and gradients are computed by differentiating through the entire unrolled chain. As illustrated in the figure, these unrolled operations incur a substantial extra cost. Our differentiable MPC solver 1) is slightly more computationally efficient even in the forward pass, as it does not need to create and maintain the backward-pass variables; 2) is more memory-efficient in the forward pass for this same reason (by a factor of the number of iLQR iterations); and 3) is significantly more efficient in the backward pass, especially when a large number of iLQR iterations are needed. The backward pass is essentially free, as it can reuse all of the factorizations from the forward pass and does not require multiple iterations.
5.2 Imitation Learning: Linear-Dynamics Quadratic-Cost (LQR)
In this section, we show results to validate the MPC solver and gradient-based learning approach for an imitation learning problem. The expert and learner are LQR controllers that share all information except for the linear system dynamics. The controllers have the same quadratic cost (the identity), control bounds, horizon (5 timesteps), and 3-dimensional state and control spaces. Though the dynamics can also be recovered by fitting next-state transitions, we show that we can alternatively use imitation learning to recover the dynamics using only controls.
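A self-contained sketch of this experiment, simplified to a 2-dimensional unconstrained system: finite differences stand in for the analytic LQR derivatives of Section 3, and plain normalized gradient descent for RMSprop, but the structure, fitting the dynamics by matching only the expert's control sequences, is the same:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, T = 2, 1, 5

def lqr_controls(A, B, x0, q=1.0, r=0.1):
    """Control sequence of an unconstrained finite-horizon problem,
    found by stacking the dynamics into one least-squares solve."""
    M = np.zeros((T * n, T * m))
    c = np.zeros(T * n)
    Ak = np.eye(n)
    for k in range(T):
        Ak = A @ Ak
        c[k*n:(k+1)*n] = Ak @ x0
        Aj = np.eye(n)
        for j in range(k, -1, -1):
            M[k*n:(k+1)*n, j*m:(j+1)*m] = Aj @ B
            Aj = A @ Aj
    lhs = np.vstack([np.sqrt(q) * M, np.sqrt(r) * np.eye(T * m)])
    rhs = np.concatenate([-np.sqrt(q) * c, np.zeros(T * m)])
    return np.linalg.lstsq(lhs, rhs, rcond=None)[0]

A_true = np.array([[1.0, 0.1], [0.0, 1.0]])   # expert dynamics (hidden)
B = np.array([[0.0], [0.1]])
X0 = rng.standard_normal((16, n))             # batch of initial states
U_exp = [lqr_controls(A_true, B, x0) for x0 in X0]

def imitation_loss(A_hat):
    # Mean squared error between learner and expert control sequences;
    # no state observations are used anywhere.
    return np.mean([np.sum((lqr_controls(A_hat, B, x0) - ue)**2)
                    for x0, ue in zip(X0, U_exp)])

A_hat = A_true + 0.3 * rng.standard_normal((n, n))   # perturbed initialization
loss0 = imitation_loss(A_hat)

# Normalized gradient descent on the imitation loss; finite differences
# stand in for the analytic LQR derivatives.
eps, lr = 1e-5, 0.02
for _ in range(60):
    base = imitation_loss(A_hat)
    g = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            E = np.zeros((n, n)); E[i, j] = eps
            g[i, j] = (imitation_loss(A_hat + E) - base) / eps
    A_hat -= lr * g / (np.linalg.norm(g) + 1e-12)
print(loss0, imitation_loss(A_hat))   # the imitation loss decreases
```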
Given an initial state $x_{\text{init}}$, we can obtain nominal actions from the controllers as $u_{1:T} = \pi_\theta(x_{\text{init}})$, where $\theta$ denotes the dynamics parameters. We randomly initialize the learner's dynamics $\hat\theta$ and minimize the imitation loss
\[
\mathcal{L}(\hat\theta) = \mathbb{E}_{x_{\text{init}}} \left\| \pi_\theta(x_{\text{init}}) - \pi_{\hat\theta}(x_{\text{init}}) \right\|_2^2,
\]
which we can uniquely do using only observed controls and no state observations. We do learning by differentiating $\mathcal{L}$ with respect to $\hat\theta$ (using minibatches of 32 examples) and taking gradient steps with RMSprop
(Tieleman and Hinton, 2012). Figure 3 shows that the learner's trajectories match the expert's by minimizing the imitation loss. Furthermore, the learner also recovers the expert's parameters, as shown by the decreasing model loss. We note that, despite the LQR problem being convex, the optimization of some loss function with respect to the LQR's parameters is in general a (potentially difficult) non-convex optimization problem.
5.3 Imitation Learning: Non-Convex Continuous Control
We next demonstrate the ability of our method to do imitation learning in the pendulum and cartpole benchmark domains. Despite being simple tasks, they are relatively challenging for a generic policy to learn quickly in the imitation learning setting. In our experiments we use MPC experts and learners that produce a nominal action sequence $u_{1:T} = \pi_\theta(x_{\text{init}})$, where $\theta$ parameterizes the model that is being optimized. The goal of these experiments is to optimize the imitation loss, which again we can do using only observed controls and no state observations. We consider the following methods:
Baselines: nn is an LSTM that takes the state as input and predicts the nominal action sequence; in this setting, we optimize the imitation loss directly. sysid assumes the cost of the controller is known and approximates the parameters of the dynamics by optimizing the next-state transitions.
Our Methods: mpc.dx assumes the cost of the controller is known and approximates the parameters of the dynamics by directly optimizing the imitation loss. mpc.cost assumes the dynamics of the controller are known and approximates the cost by directly optimizing the imitation loss. mpc.cost.dx approximates both the cost and parameters of the dynamics of the controller by directly optimizing the imitation loss.
In all settings that involve learning the dynamics (sysid, mpc.dx, and mpc.cost.dx), we use a parameterized version of the true dynamics. In the pendulum domain, the parameters are the mass, length, and gravity; in the cartpole domain, the parameters are the cart's mass, the pole's mass, gravity, and length. For cost learning in mpc.cost and mpc.cost.dx, we parameterize the cost of the controller as the weighted distance to a goal state. We have found that simultaneously learning the weights and goal state is unstable, so in our experiments we alternate learning them independently every 10 epochs. We collected a dataset of trajectories from an expert controller and vary the number of trajectories our models are trained on. A single trial of our experiments takes 1-2 hours on a modern CPU. We optimize the nn setting with Adam (Kingma and Ba, 2014) and all other settings with RMSprop (Tieleman and Hinton, 2012).

Figure 4 shows that in nearly every case we are able to directly optimize the imitation loss with respect to the controller, and we significantly outperform a general neural network policy trained on the same information. In many cases we are able to recover the true cost function and dynamics of the expert. More information about the training and validation losses is in Appendix B. The comparison between our approach mpc.dx and SysId is notable, as we recover performance equivalent to SysId using only the control information and without using state information.
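For concreteness, a parameterized pendulum dynamics model of the kind used by sysid, mpc.dx, and mpc.cost.dx might look as follows; the Euler discretization, step size, and exact parameterization here are illustrative assumptions, not the precise model from our experiments:

```python
import numpy as np

def pendulum_step(state, u, params, dt=0.05):
    """One Euler step of torque-driven pendulum dynamics with learnable
    physical parameters (mass, length, gravity). The parameterization and
    discretization are illustrative, not the exact experimental model."""
    m, l, g = params
    th, thdot = state
    thacc = -(g / l) * np.sin(th) + u / (m * l ** 2)
    return np.array([th + dt * thdot, thdot + dt * thacc])

s = np.array([np.pi / 4, 0.0])           # released from rest at 45 degrees
true_params = (1.0, 1.0, 9.81)           # mass, length, gravity
s_true = pendulum_step(s, 0.0, true_params)
s_short = pendulum_step(s, 0.0, (1.0, 0.5, 9.81))   # wrong (shorter) length
print(s_true, s_short)   # the shorter pole predicts a faster swing
```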
Again, while we emphasize that these are simple tasks, there are stark differences between the approaches. Unlike the generic network-based imitation learning, the MPC policy can exploit its inherent structure. Specifically, because the network contains a well-defined notion of the dynamics and cost, it is able to learn with much lower sample complexity than a typical network. But unlike pure system identification (which would be reasonable only for the case where the physical parameters are unknown but all other costs are known), the differentiable MPC policy can naturally be adapted to objectives besides simple state prediction, such as incorporating the additional cost learning portion.
5.4 Imitation Learning: SysId with a non-realizable expert
All of our previous experiments that involve SysId and learning the dynamics are in the unrealistic case where the expert's dynamics are in the model class being learned. In this experiment we study a case where the expert's dynamics are outside of the model class being learned. In this setting, we do imitation learning for the parameters of a dynamics function with vanilla SysId and by directly optimizing the imitation loss (sysid and mpc.dx from the previous section, respectively).
SysId often fits observations from a noisy environment to a simpler model. In our setting, we collect optimal trajectories from an expert in the pendulum environment that has an additional damping term and another force acting on the point mass at the end (which can be interpreted as a "wind" force). We do learning with dynamics models that do not have these additional terms, and therefore we cannot recover the expert's parameters. Figure 5 shows that even though vanilla SysId is slightly better at optimizing the next-state transitions, it finds an inferior model for imitation compared to our approach that directly optimizes the imitation loss.
We argue that SysId is rarely a goal in isolation; it nearly always serves a more sophisticated task such as imitation or policy learning. Typically SysId is merely a surrogate for optimizing the task, and we claim that the task's loss signal provides useful information to guide the dynamics learning. Our method provides one way of doing this by allowing the task's loss to be directly differentiated with respect to the dynamics function being learned.
6 Conclusion
This paper lays the foundations for differentiating and learning MPC-based controllers within reinforcement learning and imitation learning. Our approach, in contrast to the more traditional strategy of "unrolling" a policy, has the benefit that it is much less computation- and memory-intensive, with a backward pass that is essentially free given the number of iterations required for the iLQR optimizer to converge to a fixed point. We have demonstrated our approach in the context of imitation learning, and have highlighted the potential advantages that this approach brings over generic imitation learning and system identification.
We also emphasize that one of the primary contributions of this paper is to define and set up the framework for differentiating through MPC in general. Given the recent prominence of attempting to incorporate planning and control methods into the loop of deep network architectures, the techniques here offer a method for efficiently integrating MPC policies into such situations, allowing these architectures to make use of a very powerful function class that has proven extremely effective in practice. This has numerous additional applications, including tuning model parameters to taskspecific goals, incorporating joint modelbased and policybased loss functions, and extensions into stochastic settings.
Acknowledgments
BA is supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE-1252522.
References

 Abbeel et al. [2006] Pieter Abbeel, Morgan Quigley, and Andrew Y Ng. Using inaccurate models in reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 1–8. ACM, 2006.
 Alexis et al. [2011] Kostas Alexis, Christos Papachristos, George Nikolakopoulos, and Anthony Tzes. Model predictive quadrotor indoor position control. In Control & Automation (MED), 2011 19th Mediterranean Conference on, pages 1247–1252. IEEE, 2011.
 Amos and Kolter [2017] Brandon Amos and J Zico Kolter. OptNet: Differentiable optimization as a layer in neural networks. In Proceedings of the International Conference on Machine Learning, 2017.
 Bansal et al. [2017] Somil Bansal, Roberto Calandra, Sergey Levine, and Claire Tomlin. MBMF: Model-based priors for model-free reinforcement learning. arXiv preprint arXiv:1709.03153, 2017.
 Boedecker et al. [2014] Joschka Boedecker, Jost Tobias Springenberg, Jan Wülfing, and Martin Riedmiller. Approximate real-time optimal control based on sparse Gaussian process models. In IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), 2014.
 Bouffard et al. [2012] P. Bouffard, A. Aswani, and C. Tomlin. Learning-based model predictive control on a quadrotor: Onboard implementation and experimental results. In IEEE International Conference on Robotics and Automation, 2012.
 Boyd [2008] Stephen Boyd. LQR via Lagrange multipliers. Stanford EE 363: Linear Dynamical Systems, 2008. URL http://stanford.edu/class/ee363/lectures/lqrlagrange.pdf.
 Chebotar et al. [2017] Yevgen Chebotar, Karol Hausman, Marvin Zhang, Gaurav Sukhatme, Stefan Schaal, and Sergey Levine. Combining model-based and model-free updates for trajectory-centric reinforcement learning. arXiv preprint arXiv:1703.03078, 2017.
 Deisenroth and Rasmussen [2011] Marc Deisenroth and Carl E Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 465–472, 2011.
 Erez et al. [2012] T. Erez, Y. Tassa, and E. Todorov. Synthesis and stabilization of complex behaviors through online trajectory optimization. In International Conference on Intelligent Robots and Systems, 2012.
 Farquhar et al. [2017] Gregory Farquhar, Tim Rocktäschel, Maximilian Igl, and Shimon Whiteson. TreeQN and ATreeC: Differentiable tree planning for deep reinforcement learning. arXiv preprint arXiv:1710.11417, 2017.
 González et al. [2011] Ramón González, Mirko Fiacchini, José Luis Guzmán, Teodoro Álamo, and Francisco Rodríguez. Robust tube-based predictive control for mobile robots in off-road conditions. Robotics and Autonomous Systems, 59(10):711–726, 2011.
 Gu et al. [2016a] Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E Turner, and Sergey Levine. Q-Prop: Sample-efficient policy gradient with an off-policy critic. arXiv preprint arXiv:1611.02247, 2016a.
 Gu et al. [2016b] Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, and Sergey Levine. Continuous deep Q-learning with model-based acceleration. In Proceedings of the International Conference on Machine Learning, 2016b.
 Heess et al. [2015] Nicolas Heess, Gregory Wayne, David Silver, Tim Lillicrap, Tom Erez, and Yuval Tassa. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, pages 2944–2952, 2015.
 Kamel et al. [2015] Mina Kamel, Kostas Alexis, Markus Achtelik, and Roland Siegwart. Fast nonlinear model predictive control for multicopter attitude tracking on SO(3). In Control Applications (CCA), 2015 IEEE Conference on, pages 1160–1166. IEEE, 2015.

 Karkus et al. [2017] Peter Karkus, David Hsu, and Wee Sun Lee. QMDP-net: Deep learning for planning under partial observability. In Advances in Neural Information Processing Systems, pages 4697–4707, 2017.
 Kingma and Ba [2014] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Lenz et al. [2015] Ian Lenz, Ross A Knepper, and Ashutosh Saxena. DeepMPC: Learning deep latent features for model predictive control. In Robotics: Science and Systems, 2015.
 Levine [2017] Sergey Levine. Optimal control and planning. Berkeley CS 294-112: Deep Reinforcement Learning, 2017. URL http://rll.berkeley.edu/deeprlcourse/f17docs/lecture_8_model_based_planning.pdf.
 Levine and Abbeel [2014] Sergey Levine and Pieter Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems, pages 1071–1079, 2014.
 Levine et al. [2016] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
 Li and Todorov [2004] Weiwei Li and Emanuel Todorov. Iterative linear quadratic regulator design for nonlinear biological movement systems. 2004.
 Lillicrap et al. [2015] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 Liniger et al. [2014] Alexander Liniger, Alexander Domahidi, and Manfred Morari. Optimization-based autonomous racing of 1:43 scale RC cars. In Optimal Control Applications and Methods, pages 628–647, 2014.
 Mnih et al. [2013] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
 Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 Mnih et al. [2016] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
 Nagabandi et al. [2017] Anusha Nagabandi, Gregory Kahn, Ronald S. Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. arXiv preprint arXiv:1708.02596, 2017.
 Neunert et al. [2016] Michael Neunert, Cedric de Crousaz, Fadri Furrer, Mina Kamel, Farbod Farshidian, Roland Siegwart, and Jonas Buchli. Fast nonlinear model predictive control for unified trajectory optimization and tracking. In ICRA, 2016.
 Oh et al. [2016] Junhyuk Oh, Valliappa Chockalingam, Satinder Singh, and Honglak Lee. Control of memory, active perception, and action in Minecraft. Proceedings of the 33rd International Conference on Machine Learning (ICML), 2016.
 Oh et al. [2017] Junhyuk Oh, Satinder Singh, and Honglak Lee. Value prediction network. In Advances in Neural Information Processing Systems, pages 6120–6130, 2017.
 Okada et al. [2017] Masashi Okada, Luca Rigazio, and Takenobu Aoshima. Path integral networks: End-to-end differentiable optimal control. arXiv preprint arXiv:1706.09597, 2017.
 Pascanu et al. [2017] Razvan Pascanu, Yujia Li, Oriol Vinyals, Nicolas Heess, Lars Buesing, Sébastien Racanière, David Reichert, Théophane Weber, Daan Wierstra, and Peter Battaglia. Learning model-based planning from scratch. arXiv preprint arXiv:1707.06170, 2017.
 Paszke et al. [2017] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
 Pathak et al. [2018] Deepak Pathak, Parsa Mahmoudieh, Guanghao Luo, Pulkit Agrawal, Dian Chen, Yide Shentu, Evan Shelhamer, Jitendra Malik, Alexei A Efros, and Trevor Darrell. Zero-shot visual imitation. arXiv preprint arXiv:1804.08606, 2018.
 Pereira et al. [2018] Marcus Pereira, David D. Fan, Gabriel Nakajima An, and Evangelos Theodorou. MPC-inspired neural network policies for sequential decision making. arXiv preprint arXiv:1802.05803, 2018.
 Pong et al. [2018] Vitchyr Pong, Shixiang Gu, Murtaza Dalal, and Sergey Levine. Temporal difference models: Model-free deep RL for model-based control. arXiv preprint arXiv:1802.09081, 2018.
 Schneider [1997] Jeff G Schneider. Exploiting model uncertainty estimates for safe dynamic control learning. In Advances in Neural Information Processing Systems, pages 1047–1053, 1997.
 Schulman et al. [2015] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1889–1897, 2015.
 Schulman et al. [2016] John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. International Conference on Learning Representations, 2016.
 Silver et al. [2016] David Silver, Hado van Hasselt, Matteo Hessel, Tom Schaul, Arthur Guez, Tim Harley, Gabriel Dulac-Arnold, David Reichert, Neil Rabinowitz, Andre Barreto, et al. The predictron: End-to-end learning and planning. arXiv preprint arXiv:1612.08810, 2016.
 Srinivas et al. [2018] Aravind Srinivas, Allan Jabri, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Universal planning networks. arXiv preprint arXiv:1804.00645, 2018.
 Sun et al. [2017] Liting Sun, Cheng Peng, Wei Zhan, and Masayoshi Tomizuka. A fast integrated planning and control framework for autonomous driving via imitation learning. arXiv preprint arXiv:1707.02515, 2017.
 Sutton [1990] Richard S Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the seventh international conference on machine learning, pages 216–224, 1990.
 Tamar et al. [2016] Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel. Value iteration networks. In Advances in Neural Information Processing Systems, pages 2154–2162, 2016.
 Tamar et al. [2017] Aviv Tamar, Garrett Thomas, Tianhao Zhang, Sergey Levine, and Pieter Abbeel. Learning from the hindsight plan—episodic MPC improvement. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 336–343. IEEE, 2017.
 Tassa et al. [2014] Yuval Tassa, Nicolas Mansard, and Emo Todorov. Control-limited differential dynamic programming. In Robotics and Automation (ICRA), 2014 IEEE International Conference on, pages 1168–1175. IEEE, 2014.
 Theodorou et al. [2010] Evangelos Theodorou, Jonas Buchli, and Stefan Schaal. A generalized path integral control approach to reinforcement learning. Journal of Machine Learning Research, 11(Nov):3137–3181, 2010.
 Tieleman and Hinton [2012] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.
 Venkatraman et al. [2016] Arun Venkatraman, Roberto Capobianco, Lerrel Pinto, Martial Hebert, Daniele Nardi, and J Andrew Bagnell. Improved learning of dynamics models for control. In International Symposium on Experimental Robotics, pages 703–713. Springer, 2016.
 Watter et al. [2015] Manuel Watter, Jost Springenberg, Joschka Boedecker, and Martin Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in Neural Information Processing Systems, pages 2746–2754, 2015.
 Weber et al. [2017] Théophane Weber, Sébastien Racanière, David P Reichert, Lars Buesing, Arthur Guez, Danilo Jimenez Rezende, Adria Puigdomènech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li, et al. Imagination-augmented agents for deep reinforcement learning. arXiv preprint arXiv:1707.06203, 2017.
 Williams et al. [2016] Grady Williams, Paul Drews, Brian Goldfain, James M Rehg, and Evangelos A Theodorou. Aggressive driving with model predictive path integral control. In Robotics and Automation (ICRA), 2016 IEEE International Conference on, pages 1433–1440. IEEE, 2016.
 Williams et al. [2017] Grady Williams, Andrew Aldrich, and Evangelos A Theodorou. Model predictive path integral control: From theory to parallel computation. Journal of Guidance, Control, and Dynamics, 40(2):344–357, 2017.
 Xie et al. [2017] Zhaoming Xie, C. Karen Liu, and Kris Hauser. Differential dynamic programming with nonlinear constraints. In International Conference on Robotics and Automation (ICRA), 2017.