Reinforcement Learning (强化学习)


Advanced Artificial Intelligence, Chapter 10: Reinforcement Learning. 史忠植 (Shi Zhongzhi), Institute of Computing Technology, Chinese Academy of Sciences.

Outline
Introduction; the reinforcement learning model; dynamic programming; Monte Carlo methods; temporal-difference learning; Q-learning; function approximation in reinforcement learning; applications.

Introduction
Humans usually learn from interaction with the external environment. Reinforcement learning is the learning of a mapping from environment states to actions, so as to maximize the cumulative reward the system obtains from the environment. In reinforcement learning we design algorithms that turn the environment's signals into reward-maximizing actions. The agent is not told directly what to do or which action to take; instead it must discover for itself which actions yield the most reward by trying them.

The effects of the agent's actions are not limited to the immediate reward; they also influence subsequent actions and the final reward. Trial-and-error search and delayed reinforcement are the two most important characteristics of reinforcement learning.

Introduction
Reinforcement learning techniques developed out of control theory, statistics, psychology and related disciplines, and can be traced back to Pavlov's conditioned-reflex experiments. Only in the late 1980s and early 1990s, however, did reinforcement learning become widely studied and applied in artificial intelligence, machine learning and automatic control, where it is now regarded as one of the core techniques for designing intelligent systems. Especially after breakthroughs in its mathematical foundations, research on and applications of reinforcement learning expanded rapidly, making it one of the current hot topics in machine learning.

Introduction
The idea of reinforcement first came from psychology. In 1911 Thorndike proposed the Law of Effect: a behavior that makes an animal feel comfortable in a given situation becomes more strongly associated with that situation (is reinforced), so that when the situation recurs the behavior is more likely to recur; conversely, a behavior that makes the animal uncomfortable weakens its association with the situation and becomes harder to reproduce. Put differently, which behavior is "remembered" and associated with a stimulus depends on the effect the behavior produces. An animal's trial-and-error learning involves two aspects, selection (selectional) and association (associative), which correspond computationally to search and memory. Accordingly, in 1954 Minsky implemented computational trial-and-error learning in his doctoral thesis.

In the same year Farley and Clark also studied it computationally. The term "reinforcement learning" first appeared in the technical literature in Minsky's 1961 paper "Steps Toward Artificial Intelligence", after which it came into wide use. In 1969 Minsky received the Turing Award for his contributions to artificial intelligence.

Introduction
From 1953 to 1957 Bellman proposed an effective method for solving optimal-control problems: dynamic programming. In 1957 Bellman also formulated the stochastic, discrete-time version of the optimal-control problem, the well-known Markov decision process (MDP).

In 1960 Howard introduced the policy-iteration method for Markov decision processes; these results form the theoretical foundations of modern reinforcement learning. In 1972 Klopf combined trial-and-error learning with temporal differences, and from 1978 onward Sutton, Barto, Moore, Klopf and others studied this combination in depth. In 1989 Watkins proposed Q-learning [Watkins 1989], tying the three main threads of reinforcement learning together. In 1992 Tesauro applied reinforcement learning successfully to backgammon, producing TD-Gammon.

Outline
Introduction; the reinforcement learning model; dynamic programming; Monte Carlo methods; temporal-difference learning; Q-learning; function approximation in reinforcement learning; applications.

The reinforcement learning model
(Diagram: the agent observes state s_i and reward r_i from the environment, emits action a_i, and the environment returns the next state s_{i+1} and reward r_{i+1}. i: input, r: reward, s: state, a: action.)

Describing an environment (problem)
Accessible vs. inaccessible; deterministic vs. non-deterministic; episodic vs. non-episodic; static vs. dynamic; discrete vs. continuous. The most complex general class of environments is inaccessible, non-deterministic, non-episodic, dynamic, and continuous.

The reinforcement learning problem
Agent-environment interaction: states, actions, rewards. To define a finite MDP, give the state and action sets S and A, and the one-step "dynamics" defined by the transition probabilities (Markov property)
P(s' | s, a) = Pr{ s_{t+1} = s' | s_t = s, a_t = a }
and the reward probabilities
R(s, a, s') = E{ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s' }.
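To make these two ingredients concrete, the sketch below writes out a tiny finite MDP as explicit transition and reward tables. The two-state, two-action numbers are made up for illustration; they are not from the slides.

```python
# A tiny finite MDP written out as explicit tables (values are illustrative).
# P[(s, a)] gives the next-state distribution, R[(s, a, s_next)] the expected
# immediate reward, matching the definitions P(s'|s,a) and R(s,a,s') above.
states = ["s0", "s1"]
actions = ["a0", "a1"]
gamma = 0.9  # discount rate

P = {("s0", "a0"): {"s0": 0.7, "s1": 0.3},
     ("s0", "a1"): {"s1": 1.0},
     ("s1", "a0"): {"s0": 1.0},
     ("s1", "a1"): {"s1": 1.0}}

R = {("s0", "a0", "s0"): 0.0, ("s0", "a0", "s1"): 1.0,
     ("s0", "a1", "s1"): 0.5,
     ("s1", "a0", "s0"): 0.0,
     ("s1", "a1", "s1"): 2.0}
```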

Comparison with supervised learning
Reinforcement learning: learn from interaction. The agent learns from its own experience, and the objective is to get as much reward as possible. The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them.

(Diagram: the RL system receives inputs and produces outputs ("actions"); the training information consists of evaluations, i.e. "rewards" / "penalties".)
Supervised learning: learn from examples provided by a knowledgeable external supervisor.

Elements of reinforcement learning
Policy: a stochastic rule for selecting actions (the agent's method). Return/reward: the function of future rewards the agent tries to maximize (the agent's goal).

Value: what is good because it predicts reward (what the agent can actually obtain). Model: what follows what (typically unknown).

Bellman equation under a policy π
The basic idea: the return is R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ..., where γ is the discount rate. So
V^π(s) = E_π{ R_t | s_t = s } = E_π{ r_{t+1} + γ V^π(s_{t+1}) | s_t = s },
or, written without the expectation operator,
V^π(s) = Σ_a π(a|s) Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ V^π(s') ].
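The Bellman equation above translates directly into iterative policy evaluation. Below is a minimal sketch over plain transition and reward tables shaped like the toy MDP sketched earlier (P[(s, a)] is a dict of next-state probabilities, R[(s, a, s')] an expected reward); it is an illustration, not code from the slides.

```python
# Iterative policy evaluation: sweep the Bellman backup for V^pi until the
# largest change in V falls below `theta`.
def policy_evaluation(states, actions, P, R, policy, gamma=0.9, theta=1e-8):
    """policy[(s, a)] = pi(a|s); P[(s, a)] = {s': prob}; R[(s, a, s')] = reward."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = 0.0
            for a in actions:
                pi_sa = policy.get((s, a), 0.0)
                if pi_sa == 0.0:
                    continue
                for s2, p in P.get((s, a), {}).items():
                    v_new += pi_sa * p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```

For example, evaluating the uniform random policy on the toy tables above amounts to calling policy_evaluation(states, actions, P, R, {(s, a): 0.5 for s in states for a in actions}).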

Bellman optimality equation
V*(s) = max_a Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ V*(s') ],
where V* is the optimal state-value mapping, S the set of environment states, R the reward function, P the state-transition probability function, and γ the discount factor.

Markov decision process (MDP)
An MDP is defined by the four-tuple <S, A, R, P>: the set of environment states S, the set of system actions A, the reward function R: S × A → ℝ, and the state-transition function P: S × A → PD(S). R(s, a, s') denotes the immediate reward obtained when the system takes action a in state s and the environment moves to state s'; P(s, a, s') denotes the probability that taking action a in state s moves the environment to state s'.

The essence of a Markov decision process is that the probability of moving from the current state to the next state, and the reward received, depend only on the current state and the chosen action, not on the history of states and actions. Therefore, when the environment model, that is, the state-transition function P and the reward function R, is known, dynamic programming can be used to solve for the optimal policy. Reinforcement learning focuses instead on how the system can learn an optimal policy when P and R are unknown.

Characteristics of an MDP
A set of states S; a set of actions A; a reward function R: S × A → ℝ; a state-transition function T: S × A → Π(S), where T(s, a, s') is the probability of a transition from s to s' using action a.

Markov decision process (figure)

MDP example
(Figure: transition function, states and rewards, Bellman equation, greedy policy selection.)

MDP graphical representation
(Figure: arcs labeled T(s, action, s'); note the similarity to hidden Markov models (HMMs).)

Reinforcement learning
Transitions may be deterministic or stochastic; the transition model gives the probability of reaching state j when taking action a in state i. A simple environment that presents the agent with a sequential decision problem:

(Figure: a grid world in which each move costs 0.04.)
The (temporal) credit-assignment problem: reinforcement is sparse. Offline algorithm: the action sequence is determined ex ante. Online algorithm: the action sequence is conditional on observations along the way; this is important in stochastic environments (e.g. flying a jet).

Reinforcement learning
Transition model M: with probability 0.8 the agent moves in the direction it intends, and with probability 0.2 it slips perpendicular to that direction. A policy is a mapping from states to actions. (Figure: an optimal policy for the stochastic environment and the resulting utilities of the states.)
Markov property: transition probabilities depend only on the current state, not on the path taken to reach it; this gives a Markov decision problem (MDP). In a partially observable MDP (POMDP), the percepts do not carry enough information to identify the transition probabilities.
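To make the stochastic motion model concrete, the sketch below implements the 0.8 / 0.1 / 0.1 transition rule on a small grid. The 4x3 grid shape is an assumption for illustration (the slides only give the slip probabilities and the 0.04 move cost), and the walls and terminal states of the original example are omitted.

```python
# Stochastic grid-world transition model: 0.8 chance of moving as intended,
# 0.1 chance of slipping to each perpendicular direction. Bumping into the
# boundary leaves the agent where it is. The 4x3 size is an illustrative
# assumption; the per-step cost in the example above would be -0.04.
WIDTH, HEIGHT = 4, 3
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
PERPENDICULAR = {"up": ("left", "right"), "down": ("left", "right"),
                 "left": ("up", "down"), "right": ("up", "down")}

def step_distribution(state, action):
    """Return {next_state: probability} for taking `action` in `state`."""
    def move(s, d):
        x, y = s
        dx, dy = MOVES[d]
        nx, ny = x + dx, y + dy
        return (nx, ny) if 0 <= nx < WIDTH and 0 <= ny < HEIGHT else s

    dist = {}
    for d, p in [(action, 0.8)] + [(d, 0.1) for d in PERPENDICULAR[action]]:
        s2 = move(state, d)
        dist[s2] = dist.get(s2, 0.0) + p
    return dist

# Example: from (0, 0), trying to move right.
print(step_distribution((0, 0), "right"))
```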

Dynamic programming
Dynamic programming computes the value function by backing up from successor states to predecessor states; states are updated one after another based on a model of the next-state distribution. Dynamic programming for reinforcement learning rests on the fact that for any policy π and any state s the iterative consistency equation
V^π(s) = Σ_a π(a|s) Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ V^π(s') ]
holds.

Here π(a|s) is the probability of selecting action a in state s under the stochastic policy π, and P(s'|s, a) is the probability that action a takes state s to state s'. This is Bellman's (1957) equation for V^π.

Dynamic programming: the problem
A discrete-time dynamic system with states 1, ..., n plus a termination state 0; controls U(i); transition probabilities p_ij(u); an accumulative cost structure; and policies.

Dynamic programming: iterative solution
The finite-horizon problem, the infinite-horizon problem, and value iteration (the formulas are given as figures in the slides).

Policy iteration / value iteration in dynamic programming
(Figure: the policy-iteration and value-iteration update schemes.)

Dynamic programming methods (figure)

Adaptive dynamic programming (ADP)
Idea: use the constraints (state-transition probabilities) between states to speed up learning. Solve the resulting system of equations using dynamic programming; this is value determination. There is no maximization over actions, because the agent is passive, unlike in value iteration. The state space can be large: backgammon, for example, gives about 10^50 equations in 10^50 variables.

Value iteration algorithm
An alternative iteration (Singh, 1993), important for model-free learning. Stop the iteration when V(s) changes by less than ε; the slide also bounds the resulting policy-difference ratio.
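The value-iteration loop referred to above can be written in a few lines. The sketch below works over plain transition and reward tables shaped like the toy MDP sketched earlier, stops when the largest change in V falls below a threshold, and then reads off the greedy policy; it is an illustration, not the slides' own code.

```python
# Value iteration: apply the Bellman optimality backup
#   V(s) <- max_a sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * V(s'))
# until V changes by less than epsilon, then extract the greedy policy.
def value_iteration(states, actions, P, R, gamma=0.9, epsilon=1e-6):
    def q(V, s, a):
        return sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                   for s2, p in P.get((s, a), {}).items())

    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(q(V, s, a) for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < epsilon:
            break
    policy = {s: max(actions, key=lambda a: q(V, s, a)) for s in states}
    return V, policy
```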

Monte Carlo methods
(The Monte Carlo slides survive only as figures and equations; the remaining text mentions the assumption of exploring starts.)

Monte Carlo control
How to select policies (similar to policy evaluation): MC policy iteration performs policy evaluation using MC methods followed by policy improvement; the policy-improvement step greedifies with respect to the value (or action-value) function.
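As one concrete reading of MC control with exploring starts, the sketch below alternates first-visit Monte Carlo evaluation of Q with greedy policy improvement. The `sample_episode(policy, start_state, start_action)` simulator is a hypothetical stand-in returning a list of (state, action, reward) triples; nothing here is code from the slides.

```python
# Monte Carlo control with exploring starts (illustrative sketch).
import random
from collections import defaultdict

def mc_control_es(sample_episode, states, actions, gamma=0.9, n_episodes=10000):
    Q = defaultdict(float)                 # Q[(s, a)]
    returns_count = defaultdict(int)
    policy = {s: random.choice(actions) for s in states}

    for _ in range(n_episodes):
        # Exploring starts: every (state, action) pair can begin an episode.
        s0, a0 = random.choice(states), random.choice(actions)
        episode = sample_episode(policy, s0, a0)

        # Index of the first occurrence of each (state, action) pair.
        first_visit = {}
        for t, (s, a, _r) in enumerate(episode):
            first_visit.setdefault((s, a), t)

        # Walk the episode backwards, accumulating discounted returns.
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if first_visit[(s, a)] == t:   # first-visit update
                returns_count[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / returns_count[(s, a)]
                # Policy improvement: greedify with respect to Q.
                policy[s] = max(actions, key=lambda act: Q[(s, act)])
    return Q, policy
```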

Temporal-difference (TD) learning
TD learning needs no environment model: it learns from experience, updating at every step rather than waiting for the task to finish. As a predictive control method it uses historical information to judge future inputs and outputs, emphasizing the function of the model rather than its structure. Like Monte Carlo methods, TD methods sample the immediate reward feedback obtained in each learning cycle; like dynamic programming, they bootstrap, using current estimates to estimate the state-value function. Over many iterations the learned values approach the true state-value function.

Temporal-difference learning (figure)

Temporal-difference learning
Monte Carlo update: V(s_t) ← V(s_t) + α [ R_t - V(s_t) ], where the target R_t is the actual return after time t.
Simplest TD update, TD(0): V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) - V(s_t) ], where the target r_{t+1} + γ V(s_{t+1}) is an estimate of the return.

Temporal-difference learning (TD)
Idea: do ADP backups on a per-move basis, not for the whole state space.
Theorem: the average value of U(i) converges to the correct value.

Theorem: if α is decreased appropriately as a function of the number of times a state has been visited (N_i), then U(i) itself converges to the correct value.

TD(λ): a forward view
TD(λ) averages all n-step backups, weighting the n-step backup by λ^(n-1) (time since visitation). The λ-return is
R_t^λ = (1 - λ) Σ_{n=1..∞} λ^(n-1) R_t^(n).

The backup using the λ-return is ΔV_t(s_t) = α [ R_t^λ - V_t(s_t) ].

Temporal-difference learning algorithm TD(λ)
Idea: update from the whole epoch, not just on a single state transition. Special cases: λ = 1 gives least-mean-square (LMS), i.e. Monte Carlo; λ = 0 gives TD(0). An intermediate choice of λ between 0 and 1 is usually best; it interacts with the step size α.
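The slides describe the forward view; an equivalent, implementable backward view uses eligibility traces. The sketch below is a minimal illustration of TD(λ) with accumulating traces for evaluating a fixed policy; `env.reset()`, `env.step(a)` and `policy(s)` are hypothetical stand-ins for an episodic simulator and the policy being evaluated.

```python
# TD(lambda), backward view with accumulating eligibility traces.
from collections import defaultdict

def td_lambda(env, policy, gamma=0.9, lam=0.8, alpha=0.1, n_episodes=1000):
    V = defaultdict(float)
    for _ in range(n_episodes):
        e = defaultdict(float)             # eligibility traces
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s2, r, done = env.step(a)
            delta = r + gamma * (0.0 if done else V[s2]) - V[s]
            e[s] += 1.0                    # accumulating trace for the visited state
            for x in list(e):              # update every traced state, then decay
                V[x] += alpha * delta * e[x]
                e[x] *= gamma * lam
            s = s2
    return V
```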

Algorithm 10.1: TD(0) learning algorithm
Initialize V(s) arbitrarily, and let π be the policy to be evaluated.
Repeat (for each episode):
  Initialize s
  Repeat (for each step of the episode):
    Choose a from s using a policy derived from V (e.g., ε-greedy)
    Take action a; observe r and the next state s'
    V(s) ← V(s) + α [ r + γ V(s') - V(s) ]
    s ← s'
  Until s is terminal

Temporal-difference learning algorithm (figure)
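The TD(0) pseudocode above maps directly onto a few lines of code. The sketch below evaluates a fixed policy with tabular TD(0); `env.reset()`, `env.step(a)` and `policy(s)` are hypothetical stand-ins for an episodic simulator and the policy π being evaluated, not an API from the slides.

```python
# Tabular TD(0) policy evaluation:
#   V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))
from collections import defaultdict

def td0(env, policy, gamma=0.9, alpha=0.1, n_episodes=1000):
    V = defaultdict(float)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            a = policy(s)                        # action from the policy being evaluated
            s2, r, done = env.step(a)
            target = r + (0.0 if done else gamma * V[s2])
            V[s] += alpha * (target - V[s])      # one-step TD update
            s = s2
    return V
```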

Convergence of the TD(λ) learning algorithm
Theorem: TD(λ) converges with probability 1 under certain boundary conditions, provided the step sizes α_i(t) are decreased so that Σ_t α_i(t) = ∞ and Σ_t α_i(t)^2 < ∞. In practice a fixed α is often used for all i and t.

Temporal-difference learning (figure)

Q-learning
Watkins, 1989. In Q-learning, the backup starts from an action node and maximizes over all possible actions of the next state and their rewards. In the fully recursive definition of Q-learning, the bottom nodes of the backup tree are all the terminal nodes reachable from the root by an action and the sequences of successor actions and their rewards.

Online Q-learning expands forward from the possible actions and does not need to build a complete world model; Q-learning can also be run offline. As we shall see, Q-learning is a temporal-difference method.

Q-learning
In Q-learning, Q is a function from state-action pairs to learned values. For all states and actions:
Q: (state × action) → value.
One step of Q-learning updates
Q(s_t, a_t) ← Q(s_t, a_t) + c [ r_{t+1} + γ max_a Q(s_{t+1}, a) - Q(s_t, a_t) ]   (10.15)
where c and γ are both less than 1 and r_{t+1} is the reward received at state s_{t+1}.

Q-learning
Estimate the Q-function using some approximator (for example, linear regression, neural networks, or decision trees).

Derive the estimated policy as an argmax of the estimated Q-function. Allow different parameter vectors at different time points. Let us illustrate the algorithm with linear regression as the approximator and, of course, squared error as the loss function.

Q-learning (figure)

Exploration
Seek a trade-off between exploitation (control) and exploration (identification). Extremes: greedy versus random acting (n-armed bandit models). Q-learning converges to the optimal Q-values if
* every state can be visited infinitely often (thanks to exploration),
* action selection becomes greedy as time approaches infinity, and
* the learning rate α is decreased fast enough, but not too fast (as discussed for TD learning).

Common exploration methods
In value iteration in an ADP agent: use an optimistic estimate of utility U+(i). The ε-greedy method: take a non-greedy action with small probability, otherwise the greedy action. Boltzmann exploration.
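A minimal sketch of the two selection rules named above, ε-greedy and Boltzmann (softmax) exploration, over a tabular Q-function; the ε and temperature values are illustrative defaults, not values from the slides.

```python
# epsilon-greedy and Boltzmann (softmax) action selection over a tabular Q.
import math
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    # With probability eps take a non-greedy (random) action, else the greedy one.
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

def boltzmann(Q, s, actions, temperature=1.0):
    # Sample an action with probability proportional to exp(Q(s,a) / T).
    prefs = [math.exp(Q.get((s, a), 0.0) / temperature) for a in actions]
    total = sum(prefs)
    return random.choices(actions, weights=[p / total for p in prefs])[0]
```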

Q-learning algorithm
Initialize Q(s, a) arbitrarily.
Repeat (for each episode):
  Initialize s
  Repeat (for each step of the episode):
    Choose a from s using a policy derived from Q (e.g., ε-greedy)
    Take action a; observe r and the next state s'
    Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') - Q(s, a) ]
    s ← s'
  Until s is terminal
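The tabular algorithm above, combined with the ε-greedy rule from the previous sketch, gives the following illustrative implementation; `env.reset()` and `env.step(a)` are hypothetical simulator calls, not an API from the slides.

```python
# Tabular Q-learning with epsilon-greedy exploration:
#   Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
from collections import defaultdict
import random

def q_learning(env, actions, gamma=0.9, alpha=0.1, eps=0.1, n_episodes=5000):
    Q = defaultdict(float)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s2, r, done = env.step(a)
            best_next = 0.0 if done else max(Q[(s2, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    # Greedy policy derived from the learned Q-values.
    greedy = lambda state: max(actions, key=lambda act: Q[(state, act)])
    return Q, greedy
```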

Q-learning algorithm (continued)
(The slide's batch formulation ("Set ...", "For ...") and the condition the estimated policy satisfies are given as equations in the original.)

What is the intuition?
The Bellman equation gives a target for Q. If certain conditions hold and the training set were infinite, then Q-learning minimizes a squared-error criterion (given on the slide), which is equivalent to minimizing the corresponding loss.

A-learning
Murphy, 2003 and Robins, 2004. Estimate the A-function (advantages) using some approximator, as in Q-learning. Derive the estimated policy as an argmax of the estimated A-function. Allow different parameter vectors at different time points.

Let us illustrate the algorithm with linear regression as the approximator and squared error as the loss function.

A-learning algorithm (inefficient version)
(The iteration ("For ...") and the condition the estimated policy satisfies are given as equations on the slide.)

Differences between Q-learning and A-learning
Q-learning: at time t we model the main effects of the history (S_t, A_{t-1}) and of the action A_t, and their interaction; our Y_{t-1} is affected by how we modeled the main effect of the history at time t, (S_t, A_{t-1}).
A-learning: at time t we model only the effects of A_t and its interaction with (S_t, A_{t-1}); our Y_{t-1} does not depend on a model of the main effect of the history at time t, (S_t, A_{t-1}).

Q-learning vs. A-learning
The relative advantages and disadvantages are not yet fully understood. Q-learning has low variance and high bias; A-learning has high variance and low bias. Comparing Q-learning with A-learning therefore comes down to a bias-variance trade-off.

High-level robot behavior control using partially observable Markov decision processes

(Diagram: USER + WORLD + ROBOT; the robot receives observations, maintains a belief state over the underlying state, and issues actions.)

POMDP: partially observable Markov decision processes
Rather than observing the state, we observe some function of the state: Ob, the observable function, a random variable for each state. Problem: different states may look similar, so the optimal strategy might need to consider the history.

Framework of POMDP
A POMDP is defined by a six-tuple <S, A, R, P, Ω, O>, where <S, A, R, P> defines the environment's underlying Markov decision model, Ω is the set of observations, i.e. the set of world states the system can perceive, and O: S × A → PD(Ω) is the observation function. When the system takes action a and moves to state s', the observation function determines a probability distribution over the possible observations, written O(s', a, o). (Ω may be a subset of S, or unrelated to S.)

POMDPs
What if the state information (from sensors) is noisy? This is mostly the case! MDP techniques are then suboptimal: two halls that look the same are not the same.

POMDPs: a solution strategy
(Diagram: SE, a belief-state estimator that can be based on an HMM, feeds a policy π computed with MDP techniques.)

POMDP: the belief-state approach
Idea: given a history of actions and observations, compute a posterior distribution over the state we are in (the belief state).

The belief-state MDP: its states are distributions over S (the states of the POMDP); its actions are those of the POMDP; its transition takes the belief to the posterior distribution given the observation. Open problem: how to deal with the continuous distribution?
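The belief-state transition mentioned above is just a Bayes update of the distribution over hidden states. The sketch below assumes tabular transition probabilities P[(s, a)][s'] and an observation model O[(s', a)][o], mirroring the <S, A, R, P, Ω, O> tuple; it is an illustration, not code from the slides.

```python
# Bayes update of a POMDP belief state after taking action a and observing o:
#   b'(s') proportional to O(s', a, o) * sum_s P(s'|s, a) * b(s)
def belief_update(belief, action, observation, P, O, states):
    """belief: {s: prob}; P[(s, a)] -> {s': prob}; O[(s2, a)] -> {o: prob}."""
    new_belief = {}
    for s2 in states:
        predicted = sum(P.get((s, action), {}).get(s2, 0.0) * b
                        for s, b in belief.items())
        new_belief[s2] = O.get((s2, action), {}).get(observation, 0.0) * predicted
    norm = sum(new_belief.values())
    if norm == 0.0:
        raise ValueError("Observation has zero probability under the model.")
    return {s2: p / norm for s2, p in new_belief.items()}
```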

The learning process of the belief MDP (figure)

Major methods for solving POMDPs (figure)

Function approximation in reinforcement learning
Generalize the value function to the entire state space: the learned values are produced by composing a TD (backup) operator with a function-approximation operator M.

Two parallel iterative processes
A value-function iteration process and a value-function approximation process. How should the M function be constructed? Using state clustering, interpolation, decision trees or neural networks?

Function approximator: V(s) = f(s, w), where w is the weight vector. Update (gradient-descent Sarsa):
w ← w + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) ] ∇_w f(s_t, a_t, w).
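A minimal sketch of the gradient-descent Sarsa update above with a linear approximator Q(s, a; w) = w · φ(s, a); the feature function `phi`, the environment calls and the ε-greedy choice are hypothetical stand-ins for illustration, not part of the original slides.

```python
# Semi-gradient Sarsa with a linear function approximator:
#   Q(s, a; w) = w . phi(s, a)
#   w <- w + alpha * (r + gamma * Q(s', a'; w) - Q(s, a; w)) * phi(s, a)
import random

def q_hat(w, phi, s, a):
    feats = phi(s, a)
    return sum(w[i] * feats[i] for i in range(len(w)))

def sarsa_linear(env, actions, phi, n_features,
                 gamma=0.9, alpha=0.01, eps=0.1, n_episodes=2000):
    w = [0.0] * n_features

    def choose(s):
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: q_hat(w, phi, s, a))

    for _ in range(n_episodes):
        s, done = env.reset(), False
        a = choose(s)
        while not done:
            s2, r, done = env.step(a)
            target = r
            if not done:
                a2 = choose(s2)
                target += gamma * q_hat(w, phi, s2, a2)
            delta = target - q_hat(w, phi, s, a)
            feats = phi(s, a)
            for i in range(n_features):     # gradient of a linear Q is phi(s, a)
                w[i] += alpha * delta * feats[i]
            if not done:
                s, a = s2, a2
    return w
```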
