Featured Article
LIDAR: learning from imperfect demonstrations with advantage rectification
Abstract:
In actor-critic reinforcement learning (RL) algorithms, function estimation errors are known to cause ineffective random exploration at the beginning of training and lead to overestimated value estimates and suboptimal policies. In this paper, we address this problem by performing advantage rectification with imperfect demonstrations, thereby reducing function estimation errors. Pretraining with expert demonstrations has been widely adopted to accelerate deep reinforcement learning when simulations are expensive to obtain. However, existing methods such as behavior cloning often assume that the demonstrations carry additional information or performance labels, such as an optimality assumption, which is usually incorrect and unavailable in the real world. In this paper, we explicitly handle imperfect demonstrations within the actor-critic RL framework and propose a new method called learning from imperfect demonstrations with advantage rectification (LIDAR). LIDAR uses a rectified loss function to learn only from selected demonstrations, derived from the minimal assumption that the demonstrating policies perform better than the current policy. LIDAR learns from the contradictions caused by estimation errors and in turn reduces those errors. We apply LIDAR to three popular actor-critic algorithms, DDPG, TD3, and SAC, and experiments show that our method observably reduces function estimation errors, effectively leverages demonstrations far from optimal, and consistently outperforms state-of-the-art baselines in all scenarios.
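The rectified loss described in the abstract can be sketched as follows. This is a minimal illustration under the assumption that "rectification" means masking out demonstration actions whose critic-estimated advantage over the current policy is non-positive, so only demonstrations believed to outperform the current policy contribute an imitation term. All names here (`rectified_demo_loss`, its arguments) are hypothetical and not taken from the paper's implementation.

```python
import numpy as np

def rectified_demo_loss(q_demo, q_pi, actions_demo, actions_pi):
    """Advantage-rectified imitation loss (sketch).

    q_demo       : Q(s, a_demo), critic value of each demonstration action
    q_pi         : Q(s, pi(s)), critic value of the current policy's action
    actions_demo : demonstration actions, shape (batch, action_dim)
    actions_pi   : current policy's actions, same shape

    Demonstrations with non-positive estimated advantage
    A(s, a_demo) = Q(s, a_demo) - Q(s, pi(s)) are ignored; the rest
    contribute a behavior-cloning-style squared error.
    """
    advantage = q_demo - q_pi                       # estimated A(s, a_demo)
    mask = (advantage > 0).astype(float)            # keep only "better" demos
    sq_err = np.sum((actions_demo - actions_pi) ** 2, axis=-1)
    n_kept = max(mask.sum(), 1.0)                   # avoid division by zero
    return float(np.sum(mask * sq_err) / n_kept)
```

In a full actor-critic setup this term would be added to the policy loss, so the actor is pulled toward demonstration actions only where the critic judges them advantageous.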
Keywords:
CLC Number:
Authors:
Xiaoqin ZHANG;Huimin MA;Xiong LUO;Jian YUAN
Affiliations:
Department of EE, Tsinghua University, Beijing 100084, China; School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China
Source:
Citation:
[1] Xiaoqin ZHANG; Huimin MA; Xiong LUO; Jian YUAN. LIDAR: learning from imperfect demonstrations with advantage rectification [J]. Frontiers of Computer Science, 2022(01): 53-62.