Featured Paper
Choice of discount rate in reinforcement learning with long-delay rewards
Abstract:
Most successes in the world are the result of long-term effort. The reward for success can be extremely high, but it is preceded by a long period of investment. People who are "myopic" value only short-term rewards and are unwilling to make early-stage investments, so they rarely achieve ultimate success or the high rewards that come with it. Similarly, in a reinforcement learning (RL) model with long-delay rewards, the discount rate determines how "farsighted" the agent is. To enable the trained agent to make a chain of correct choices and ultimately succeed, this paper first derives mathematically the feasible region of the discount rate that satisfies the agent's "farsightedness" requirement. Next, to avoid solving complicated implicit equations when choosing a feasible value, a simple method is explored and verified through theoretical demonstration and mathematical experiments. Then, a series of RL experiments is designed and implemented to verify the validity of the theory. Finally, the model is extended from the finite process to the infinite process, and the validity of the extended model is verified both theoretically and experimentally. The study not only reveals the significance of the discount rate but also provides a theoretical basis and a practical method for choosing the discount rate in future research.
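The link between the discount rate and "farsightedness" can be seen in a simplified two-option setting. This is a minimal illustrative sketch, not the feasible region derived in the paper (which involves implicit equations): it assumes the agent either takes a small reward now or spends `delay` zero-reward steps investing and then receives a large reward. Waiting is preferred only when γ^delay · R_big > r_small, i.e. γ > (r_small / R_big)^(1/delay). The function names and reward values are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Discounted sum of gamma**t * r_t over a finite reward sequence (t starts at 0)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def min_farsighted_gamma(small_reward, big_reward, delay):
    """Hypothetical helper: smallest gamma for which waiting `delay` steps for
    `big_reward` beats taking `small_reward` immediately.

    gamma**delay * big_reward > small_reward
    =>  gamma > (small_reward / big_reward) ** (1 / delay)
    """
    if big_reward <= small_reward:
        raise ValueError("the delayed reward must exceed the immediate one")
    return (small_reward / big_reward) ** (1.0 / delay)


def prefers_delayed(gamma, small_reward, big_reward, delay):
    """True if the discounted value of investing beats the myopic choice."""
    wait = [0.0] * delay + [big_reward]  # `delay` zero-reward steps, then the payoff
    grab = [small_reward]                # immediate, "myopic" payoff
    return discounted_return(wait, gamma) > discounted_return(grab, gamma)


if __name__ == "__main__":
    threshold = min_farsighted_gamma(1.0, 100.0, 50)
    print(f"gamma must exceed {threshold:.3f}")
    for gamma in (0.90, 0.95):
        print(gamma, prefers_delayed(gamma, 1.0, 100.0, 50))
```

With these assumed values the threshold is about 0.912, so γ = 0.90 leaves the agent myopic while γ = 0.95 makes it wait for the delayed reward. Increasing `delay` pushes the threshold toward 1, consistent with the abstract's point that longer reward delays demand a more farsighted agent.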
Keywords:
Authors:
LIN Xiangyang;XING Qinghua;LIU Fuxian
Affiliation:
Department of Air Defense and Anti-Missile,Air Force Engineering University,Xi'an 710051,China
Citation:
[1] LIN Xiangyang, XING Qinghua, LIU Fuxian. Choice of discount rate in reinforcement learning with long-delay rewards[J]. Journal of Systems Engineering and Electronics, 2022(02): 381-392.
Similar Papers
Dynamic analysis of heat extraction rate by supercritical carbon dioxide in fractured rock mass based on a thermal-hydraulic-mechanics coupled model
Chunguang Wang; Xingkai Shi; Wei Zhang; Derek Elsworth; Guanglei Cui; Shuqing Liu; Hongxu Wang; Weiqiang Song; Songtao Hu; Peng Zheng
College of Energy and Mining Engineering, Shandong University of Science and Technology, Qingdao 266590, China; New-energy Development Center of Sinopec Shengli Oilfield, Dongying 257001, China; Energy and Mineral Engineering and G3 Center, Penn State University, University Park, PA 16802, USA; Key Laboratory of Ministry of Education on Safe Mining of Deep Metal Mines, Northeastern University, Shenyang 110004, China; Shandong Provincial Geo-Mineral Engineering Co., Ltd, Jinan 250013, China; Qingdao Wofu New Energy Science and Technology Co., Ltd, Qingdao 266010, China