Efficient Implementation of an Asymptotic Approximation of Q(λ)
Frédérick Garcia and Florent Serre
INRA - Unité de Biométrie et Intelligence Artificielle. BP 27, Auzeville. 31326 Castanet-Tolosan cedex, France.
fgarcia@toulouse.inra.fr
Abstract
Q(λ) is a reinforcement learning algorithm that combines Q-learning and TD(λ). Online implementations of Q(λ) that use eligibility traces have been shown to speed up basic Q-learning. In this paper we present an asymptotic analysis of Watkins' Q(λ) with accumulative eligibility traces. We first introduce an asymptotic approximation of Q(λ) that appears to be a gain matrix variant of basic Q-learning. Using the ODE method, we then determine an optimal gain matrix for Q-learning that maximizes its rate of convergence toward the optimal value function. The similarity between this optimal gain and the asymptotic gain of Q(λ) explains the relative efficiency of the latter. Furthermore, by minimizing the difference between these two gains, optimal values for the λ parameter and the decreasing learning rates can be determined. This optimal λ strongly depends on the exploration policy used during learning. A robust approximation of these learning parameters leads to the definition of a new efficient algorithm called AQ-learning (Average Q-learning), which shows a close resemblance to Schwartz' R-learning. Our results have been demonstrated through numerical simulations.
1 Introduction
TD(λ) and Q-learning are undoubtedly two of the most important methods that have been proposed in Reinforcement Learning. Q-learning is a simple direct algorithm for learning optimal policies, and the TD(λ) rule is used in many existing reinforcement learning and stochastic dynamic programming algorithms for efficiently evaluating a given policy.
The Q-learning algorithm basically implements the TD(0) update rule. In order to improve the convergence rate of Q-learning, it has been proposed to integrate the TD(λ) update, λ > 0, with the Q-learning update rule. That resulted in the family of Q(λ) algorithms [Watkins, 1989; Peng and Williams, 1994]. It is commonly believed that Q(λ), λ > 0, improves the Q-learning behaviour, but no definite theoretical analysis of Q(λ) has come in support of this conjecture: there is no proof of convergence for Q(λ), and as for TD(λ), the issue of the rational choice of λ remains open [Singh and Dayan, 1998; Kearns and Singh, 2000].
The use of Q(λ) is hampered by another important difficulty. Eligibility trace implementations of Q(λ) based on look-up tables require many updates of Q-value and trace components after each transition. The time complexity of this approach may be very high for large state and action spaces. In spite of several recent attempts [Cichosz, 1995; Wiering and Schmidhuber, 1997], the multi-component update rule of Q(λ) remains problematic.
In this paper we propose a new algorithm that eliminates the above-mentioned shortcoming of Q(λ). Our approach relies on an asymptotic approximation of Watkins' Q(λ) (sections 2 and 3), that appears to be a gain matrix variant of basic Q-learning. Then an optimal convergence rate analysis of Q-learning based on the ODE method leads us to determine an optimal gain matrix for Q-learning (section 4). The similar structure of these two gains explains the relative efficiency of Q(λ). Furthermore, by minimizing the difference between these two gains, we are able to determine near-optimal λ and learning rates that allow an efficient implementation of a new reinforcement learning algorithm called AQ-learning (Average Q-learning), which shows a close resemblance to Schwartz' R-learning [Schwartz, 1993] (section 5). We present in section 6 some experimental results that confirm the soundness of the asymptotic analysis and the near-optimal behaviour of AQ-learning.
2 Watkins' Q(λ) algorithm
Like the TD(λ) and Q-learning reinforcement learning algorithms, Q(λ) can be described within the framework of Markov Decision Processes (MDP) [Puterman, 1994]. Assume a stationary infinite-horizon MDP model defined by a state space $S$ of size $N$ and an action space $A$ of size $M$, by a Markovian dynamics characterized by the transition probabilities $p(s' \mid s,a)$ of moving from $s$ to $s'$ after applying the action $a$, and by the instantaneous reward functions $r(s,a)$ associated to each transition.

A policy $\pi$ is a function that assigns an action $\pi(s)$ to any possible state $s$. Given an initial state $s_0$, following a policy $\pi$ defines a random trajectory $s_0, a_0, r_0, s_1, a_1, r_1, \ldots$ according to the transition probabilities.

The optimization problem associated to a discounted Markov Decision Problem is to find a policy $\pi^{*}$ that maximizes for any initial state $s$ the expected sum of the discounted rewards along a trajectory,
$$V^{\pi}(s) = E_{\pi}\Big[\sum_{t \ge 0} \gamma^{t} r_t \;\Big|\; s_0 = s\Big],$$
where $0 \le \gamma < 1$ is the discount factor.
The value function $V^{\pi}$ of a policy $\pi$ can be directly learnt from observed trajectories by using the TD(λ) algorithm, without maintaining an estimation of the transition probabilities [Sutton, 1988]. It is also possible to directly learn an estimation of the optimal value function $V^{*}$ of the problem with the Q-learning algorithm [Watkins, 1989]. Q-learning regularly updates an estimation $Q_t$ of the optimal Q-value function $Q^{*}$, defined by
$$Q^{*}(s,a) = r(s,a) + \gamma \sum_{s'} p(s' \mid s,a)\, V^{*}(s'),$$
which is characterized by the optimality equations:
$$Q^{*}(s,a) = r(s,a) + \gamma \sum_{s'} p(s' \mid s,a)\, \max_{b} Q^{*}(s',b).$$
An optimal policy $\pi^{*}$ can then be directly derived from $Q^{*}$ by $\pi^{*}(s) = \arg\max_{a} Q^{*}(s,a)$.
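As an illustration, the tabular Q-learning update just described can be sketched as follows (a minimal sketch in Python; the array layout and variable names are ours, not the paper's):

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha, gamma):
    """One tabular Q-learning update: move Q[s, a] toward the
    one-step target r + gamma * max_b Q[s_next, b]."""
    td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error
```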
The principle of Q(λ) is to combine Q-learning and the temporal difference error used in the TD(λ) method, in order to accelerate the convergence of Q-learning. For the discounted value function, the potentially infinite length of the trajectories leads us to consider the Q(λ) methods based on eligibility traces. After each observed transition $(s_t, a_t, r_t, s_{t+1})$, these algorithms update the estimated value function by
$$Q_{t+1}(s,a) = Q_t(s,a) + \alpha_t\, \delta_t\, e_t(s,a) \quad \text{for all } (s,a), \qquad (1)$$
where $\delta_t$, the current temporal difference, is
$$\delta_t = r_t + \gamma \max_{b} Q_t(s_{t+1}, b) - Q_t(s_t, a_t),$$
$\alpha_t$ is the step size and $e_t$ is the eligibility trace. Usually, the step sizes decay regularly toward 0, or are very small constants. The dynamics of the eligibility trace is more complex. Like in TD(λ), the trace value is increased for the current pair $(s_t, a_t)$, and decays exponentially according to the parameter λ for all pairs. Following this general principle, several eligibility trace dynamics have been proposed for Q(λ) [Watkins, 1989; Peng and Williams, 1994; Singh and Sutton, 1996]. The simplest one, due to Watkins [Watkins, 1989], is directly derived from the accumulative eligibility trace of TD(λ), and has the supplementary effect of resetting the whole trace vector to 0 whenever the current action is not a greedy action, that is when $a_t \notin \arg\max_a Q_t(s_t, a)$. The complete definition of this trace is given by:
$$e_t(s,a) = \begin{cases} \gamma\lambda\, e_{t-1}(s,a) + 1 & \text{if } (s,a) = (s_t,a_t) \text{ and } a_t \text{ is greedy,} \\ \gamma\lambda\, e_{t-1}(s,a) & \text{if } (s,a) \neq (s_t,a_t) \text{ and } a_t \text{ is greedy,} \\ 1 & \text{if } (s,a) = (s_t,a_t) \text{ and } a_t \text{ is not greedy,} \\ 0 & \text{if } (s,a) \neq (s_t,a_t) \text{ and } a_t \text{ is not greedy,} \end{cases} \qquad (2)$$
with $e_{-1}(s,a) = 0$ for all $(s,a)$.
Q(λ) with accumulative eligibility trace is a direct generalization of Q-learning; Q(0) exactly corresponds to Q-learning, where $Q_t(s_t,a_t)$ is the only component updated after each transition. Unlike the convergence of Q-learning, which has been proved [Bertsekas and Tsitsiklis, 1996; Borkar and Meyn, 2000], the convergence of Q(λ) and its improvement over Q-learning have only been shown experimentally [Peng and Williams, 1994; Wiering and Schmidhuber, 1997]. In practice, the main drawback of Q(λ) lies in the update costs of $Q_t$ and $e_t$, which are proportional to the size $NM$ of the state-action space; for large problems, the computation time can be very high.
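To make this per-transition cost concrete, here is a minimal sketch (in Python, with our own variable names) of one transition of tabular Watkins' Q(λ) combining the accumulating trace (2) with the update (1); the sweep over all $N \times M$ trace and Q-value components at every step is exactly the cost discussed above.

```python
import numpy as np

def watkins_q_lambda_step(Q, e, s, a, r, s_next, alpha, gamma, lam):
    """One transition of Watkins' Q(lambda) with accumulating traces.
    Q and e are (N, M) float arrays indexed by state-action pairs."""
    greedy = Q[s, a] == np.max(Q[s])      # was the executed action greedy?
    if greedy:
        e *= gamma * lam                  # exponential decay of all traces
    else:
        e[:] = 0.0                        # Watkins' cut after an exploratory action
    e[s, a] += 1.0                        # accumulate trace for the current pair
    delta = r + gamma * np.max(Q[s_next]) - Q[s, a]   # temporal difference
    Q += alpha * delta * e                # update every component: cost ~ N * M
    return delta
```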
3 Asymptotic approximation of Q(λ)
In this section we intend to define an asymptotic approximation of Watkins' Q(λ), valid when the number of iterations becomes large. The following assumptions are made. At each iteration the action is chosen with a semi-uniform exploration policy: a greedy action is selected with a fixed probability, or any other action in $A$ is chosen randomly. Asymptotically, when $Q_t$ has converged toward $Q^{*}$, this process defines a Markov chain on the state-action pairs. We assume that this Markov chain is regular (recurrent, aperiodic, irreducible) and therefore has a unique invariant distribution.
3.1 Asymptotic accumulating trace
Let us consider a trajectory obtained by Q(λ), such that $Q_t$ converges toward $Q^{*}$. Generalizing our work developed for TD(λ) [Garcia and Serre, 2000], the average asymptotic behaviour of the accumulative trace $e_t$ on this trajectory can be calculated (Proposition 1).
3.2 Asymptotic Q(λ)

We consider the asymptotic approximation AQ(λ) of Q(λ) obtained by substituting, in the Q(λ) update rule (1), the trace $e_t$ by its asymptotic approximation of Proposition 1. Writing this substitution in vector notation involves a diagonal matrix whose diagonal entries are given by this approximation, an eligibility matrix in which the state-action pairs are ordered by blocks (the column headings indicating the block dimensions), and the temporal difference vector built from $\delta_t$. With these notations, the asymptotic approximation AQ(λ) of Q(λ) is defined by the update rule (3), in which the vector of Q-values is corrected, with step size $\alpha_t$, by the product of the resulting gain matrix with the temporal difference vector.
4 Optimal gain matrix for Q-learning
One can see that (3) is a gain matrix variant of the basic Q-learning algorithm, viewed as a stochastic approximation scheme in which $Q_t$ is the current estimation of the target $Q^{*}$, and the observed transition is the input random vector at time $t$. Classic reinforcement learning algorithms like Q-learning or TD(λ) have already been analysed with the ODE method [Bertsekas and Tsitsiklis, 1996; Borkar and Meyn, 2000], in order to prove their convergence.
Another potential application of the ODE method concerns the optimization of the rate of convergence of stochastic algorithms. One can show that under some general stability conditions [Benveniste et al., 1990], the algorithm obtained by multiplying the update term of Q-learning by a suitable gain matrix converges at the best possible rate. The resulting optimal gain (4) is built from a diagonal matrix of coefficients and a rectangular matrix whose coefficients are the probabilities of moving from one state to another by applying the first action and then by following the optimal policy on the next transitions. This optimal gain leads to an “ideal” algorithm, called Optimal-Q-learning.
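For context, the classical optimal-gain result that underlies this construction can be recalled in a generic form; the notation below ($\theta$, $H$, $h$, $A$, $G$) is the standard one of stochastic approximation theory [Benveniste et al., 1990], not the paper's. For a matrix-gain recursion with mean ODE
$$\theta_{t+1} = \theta_t + \frac{1}{t}\, G\, H(\theta_t, X_t), \qquad \dot{\theta} = G\, h(\theta), \quad h(\theta) = E\big[H(\theta, X)\big],$$
and with $A$ the Jacobian of the mean field $h$ at the solution $\theta^{*}$, a central limit theorem holds when the eigenvalues of $GA$ have real parts smaller than $-1/2$, and the asymptotic covariance of $\sqrt{t}\,(\theta_t - \theta^{*})$ is minimized by the “stochastic Newton” gain
$$G^{*} = -A^{-1}.$$
This is the sense in which an optimal gain matrix maximizes the rate of convergence of the recursion.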
5 Average Q-learning
Optimal-Q-learning cannot be directly implemented, since it would require the costly online estimation of the coefficients of the optimal gain. However, as shown next, it is possible to determine some λ and step-size parameters that minimize the distance between the gain matrix of AQ(λ) and the optimal gain, and that allow an efficient implementation of the corresponding AQ(λ) algorithm.
5.1 Optimization of AQ(λ)
To be consistent with the definition of the optimal gain, the step sizes must take a specific decreasing form, and making the lower right blocks of the two gains equal (these blocks have coefficients equal to 1) fixes this choice. A numerical analysis of the upper left blocks, similar to the one developed in [Garcia and Serre, 2000] for TD(λ), then determines the remaining coefficients. The resulting expression of the AQ(λ) gain exhibits some similarity with equation (4): as we can note, the structures of the two gains are identical, which might explain the efficiency of Q(λ) for λ > 0. Minimizing the distance between the two gains leads to a value of λ greater than 1, which could be surprising. In fact, the convergence of eligibility trace methods requires λ ≤ 1 [Bertsekas and Tsitsiklis, 1996], so this value is not admissible; the distance is then minimized at the boundary of the admissible domain, which yields the update rule (6).
5.2 Efficient implementation of AQ-learning
A direct implementation of the update rule (6) would have a complexity similar to Q(λ), proportional to the size of the state-action space. Fortunately, an equivalent expression can be derived (Proposition 3): the update rule (6) can be rewritten as two simple updates, one on a relative value function and one on a scaling factor, so that each iteration only requires a small number of operations, independent of the number of states and actions. The temporal difference error can be expressed as a function of these two quantities, the greedy action being defined with respect to the relative value function, and the update of the scaling factor can be implemented by replacing its step size with the inverse of the number of times the current state has been visited.

The resulting AQ-learning algorithm repeats, along a learning trajectory of fixed length, the following steps: choose the current action according to the semi-uniform exploration policy, simulate the corresponding transition, and perform the updates of Proposition 3.

AQ-learning is particularly interesting, since it only requires 3 updates per iteration. The performance of AQ-learning should therefore improve on Watkins' Q(λ) for two distinct reasons. First, AQ-learning is an optimized approximation of Q(λ) that minimizes the asymptotic variance around $Q^{*}$. Second, like Q-learning, AQ-learning is an asynchronous algorithm that only updates one component of the value function per iteration. It is clear that for large state-action space problems, this feature is of crucial importance regarding the time complexity of the algorithm, as shown in the simulations of the next section.
Amazingly, it appears that AQ-learning is very close to the R-learning algorithm proposed by Schwartz [Schwartz, 1993]. R-learning is supposed to converge to gain-optimal policies, which maximize the average reward per time step.
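For comparison, here is a minimal sketch of the standard R-learning update of [Schwartz, 1993] (Python; this is the textbook form of R-learning, with our own variable names, not the AQ-learning update rule itself). R-learning maintains a relative value table R and an estimate rho of the average reward per step, which is adjusted only after greedy actions:

```python
import numpy as np

def r_learning_step(R, rho, s, a, r, s_next, alpha, beta):
    """One R-learning transition: R approximates the average-adjusted
    value function, rho the average reward per step. Returns updated rho."""
    greedy = R[s, a] == np.max(R[s])      # was the executed action greedy?
    delta = r - rho + np.max(R[s_next]) - R[s, a]
    R[s, a] += alpha * delta
    if greedy:                            # average reward tracked on greedy steps only
        rho += beta * (r - rho + np.max(R[s_next]) - np.max(R[s]))
    return rho
```

Like AQ-learning, each transition touches only one component of the value table plus a scalar, so the per-iteration cost is independent of the number of states and actions.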
Figure 2: Computation time (in seconds) needed to reach a relative learning error of 5%, as a function of the number of states, for AQ-learning, Q(λ), Optimal-Q-learning, and Q-learning.
6 Experimental results

We chose a semi-uniform exploration strategy with a fixed greedy-selection probability. For each problem, the quantities appearing in the optimal gain were exactly calculated from the transition probabilities and the rewards.
Figure 1 shows the relative learning error as a function of the number of iterations, averaged over random trajectories, for an MDP of fixed size. This graph illustrates the different rates of convergence of Q-learning, AQ-learning, Q(λ), and Optimal-Q-learning. These experimental results confirm that the fastest rate of convergence is obtained with Optimal-Q-learning, and that AQ-learning converges nearly as fast as this ideal algorithm. AQ-learning even performs slightly better than Q(λ) with the same learning rates and the same value of λ. These three algorithms appear to be considerably faster than Q-learning.
Figure 2 illustrates, for the same algorithms, the computation times required to achieve a relative learning error of 5%, as a function of the size of the state space, for a constant number of actions. We first observe on that graph that Q-learning is very slow, despite the fact that it only updates one component per iteration. But the main observation is that AQ-learning can be really more efficient than Q(λ), especially when the number of states is large: for 200 states, AQ-learning is 10 times faster than Q(λ).
7 Conclusions
We have shown that an asymptotic approximation of Q(λ) can be efficiently implemented through two simple update rules, on a relative value function and on a scaling factor. This new algorithm, AQ-learning, can be described as an average variant of Q-learning for the discounted criterion. Its update complexity is independent of the number of states and actions, and experimental results show that its convergence toward the optimal value function is always faster than for Q-learning or Q(λ).
Our results have been obtained for Watkins' Q(λ) with accumulating eligibility traces. It would be interesting to apply the same approach to the case of replacing traces [Singh and Sutton, 1996], or to Peng and Williams' Q(λ) [Peng and Williams, 1994], where the trace is never reset to 0, and for which an efficient implementation has been proposed by Wiering and Schmidhuber [Wiering and Schmidhuber, 1997]. Another direction for future work concerns the development of new reinforcement learning algorithms that approximate Optimal-Q-learning. Furthermore, we have seen that the optimal gain for Q-learning depends on the exploration policy and its greedy-selection probability. The question of finding the best value of this parameter remains open.
Finally, we need a better understanding of the relation between AQ-learning and Schwartz' R-learning algorithm [Schwartz, 1993].
References
[Benveniste et al., 1990] A. Benveniste, M. Metivier, and P. Priouret. Adaptive Algorithms and Stochastic Approximation. Springer-Verlag, Berlin, New York, 1990.

[Bertsekas and Tsitsiklis, 1996] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont (MA), 1996.

[Borkar and Meyn, 2000] V. S. Borkar and S. P. Meyn. The O.D.E. method for convergence of stochastic approximation and reinforcement learning. SIAM Journal of Control and Optimization, 38:447–469, 2000.

[Borkar, 1996] V. S. Borkar. Stochastic approximation with two time scales. System and Control Letters, 29:291–294, 1996.

[Cichosz, 1995] P. Cichosz. Truncating temporal differences: On the efficient implementation of TD(λ) for reinforcement learning. Journal of Artificial Intelligence Research (JAIR), 2:287–318, 1995.

[Garcia and Ndiaye, 1998] F. Garcia and S. Ndiaye. A Learning Rate Analysis of Reinforcement-Learning Algorithms in Finite-Horizon. In International Conference on Machine Learning, volume 15, Madison, USA, 1998.

[Garcia and Serre, 2000] F. Garcia and F. Serre. Efficient asymptotic approximation in temporal difference learning. In Proceedings of the 14th European Conference on Artificial Intelligence (ECAI 2000), Berlin, 2000.

[Kearns and Singh, 2000] M. Kearns and S. P. Singh. "Bias-Variance" error bounds for temporal difference updates. In Proceedings of the COLT 2000 conference, 2000.

[Kushner and Yin, 1997] H. J. Kushner and G. G. Yin. Stochastic Approximation Algorithms and Applications. Springer, 1997.

[Peng and Williams, 1994] J. Peng and R. J. Williams. Incremental Multi-Step Q-Learning. In International Conference on Machine Learning, volume 11, pages 226–232, 1994.

[Puterman, 1994] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley-Interscience, New York, 1994.

[Schwartz, 1993] A. Schwartz. A Reinforcement Learning Method for Maximizing Undiscounted Rewards. In International Conference on Machine Learning, volume 10, 1993.

[Singh and Dayan, 1998] S. P. Singh and P. Dayan. Analytical Mean Squared Error Curves for Temporal Difference Learning. Machine Learning, 1998.

[Singh and Sutton, 1996] S. P. Singh and R. S. Sutton. Reinforcement learning with replacing eligibility traces. Machine Learning, 1996.

[Sutton, 1988] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44, 1988.
[Watkins, 1989] C. J. Watkins. Learning from Delayed Rewards. PhD thesis, Cambridge University, Cambridge, England, 1989.
[Wiering and Schmidhuber, 1997] M. A. Wiering and J. Schmidhuber. Fast online Q(λ). Machine Learning, 1997.