From Q(λ) to Average Q-learning:

Efficient Implementation of an Asymptotic Approximation

Frédérick Garcia and Florent Serre

INRA - Unité de Biométrie et Intelligence Artificielle. BP 27, Auzeville. 31326 Castanet Tolosan cedex, France.

fgarcia@toulouse.inra.fr

Abstract

Q(λ) is a reinforcement learning algorithm that combines Q-learning and TD(λ). Online implementations of Q(λ) that use eligibility traces have been shown to speed up basic Q-learning. In this paper we present an asymptotic analysis of Watkins' Q(λ) with accumulative eligibility traces. We first introduce an asymptotic approximation of Q(λ) that appears to be a gain matrix variant of basic Q-learning. Using the ODE method, we then determine an optimal gain matrix for Q-learning that maximizes its rate of convergence toward the optimal value function. The similarity between this optimal gain and the asymptotic gain of Q(λ) explains the relative efficiency of the latter. Furthermore, by minimizing the difference between these two gains, optimal values for the λ parameter and the decreasing learning rates can be determined. This optimal λ strongly depends on the exploration policy during learning. A robust approximation of these learning parameters leads to the definition of a new efficient algorithm called AQ-learning (Average Q-learning), that shows a close resemblance to Schwartz's R-learning. Our results have been demonstrated through numerical simulations.

1 Introduction

TD(λ) and Q-learning are undoubtedly two of the most important methods that have been proposed in Reinforcement Learning. Q-learning is a simple direct algorithm for learning optimal policies, and the TD(λ) rule is used in many existing reinforcement learning and stochastic dynamic programming algorithms for efficiently evaluating a given policy.

The Q-learning algorithm basically implements the TD(0) update rule. In order to improve the convergence rate of Q-learning, it has been proposed to integrate the TD(λ) rule, λ > 0, with the Q-learning update rule. That resulted in the family of Q(λ) algorithms [Watkins, 1989; Peng and Williams, 1994]. It is commonly believed that Q(λ), λ > 0, improves the Q-learning behaviour, but no definite theoretical analysis of Q(λ) has come in support of this conjecture: there is no proof of convergence for Q(λ), and as for TD(λ), the issue of the rational choice of λ remains [Singh and Dayan, 1998; Kearns and Singh, 2000].

The use of Q(λ) is hampered by another important difficulty. Eligibility trace implementations of Q(λ) based on look-up tables require many updates of Q-value and trace components after each transition. The time complexity of this approach may be very high for large state and action spaces. In spite of several recent attempts [Cichosz, 1995; Wiering and Schmidhuber, 1997], the multi-component update rule of Q(λ) remains problematic.

In this paper we propose a new algorithm that eliminates the above-mentioned shortcoming of Q(λ). Our approach relies on an asymptotic approximation of Watkins' Q(λ) (sections 2 and 3), that appears to be a gain matrix variant of basic Q-learning. Then an optimal convergence rate analysis of Q-learning based on the ODE method leads us to determine an optimal gain matrix for Q-learning (section 4). The similar structure of these two gains explains the relative efficiency of Q(λ). Furthermore, by minimizing the difference between these two gains, we are able to determine near-optimal λ and learning rates that allow an efficient implementation of a new reinforcement learning algorithm called AQ-learning (Average Q-learning), which shows a close resemblance to Schwartz's R-learning [Schwartz, 1993] (section 5). We present in section 6 some experimental results that confirm the soundness of the asymptotic analysis and the near-optimal behaviour of AQ-learning.

2 The Watkins' Q(λ) algorithm

Like the TD(λ) and Q-learning reinforcement learning algorithms, Q(λ) can be described within the framework of Markov Decision Processes (MDP) [Puterman, 1994]. Assume a stationary infinite-horizon MDP model that is defined by a state space S of size N and an action space A of size M, by a Markovian dynamics characterized by the transition probabilities p(s'|s, a) of moving from s to s' after applying the action a, and by the instantaneous reward functions r(s, a) associated to each transition.

A policy π is a function that assigns an action π(s) to each possible state s. Given an initial state s_0, following a policy π defines a random trajectory s_0, a_0, r_0, s_1, a_1, r_1, ... according to the probabilities p(·|s_t, π(s_t)).

The optimization problem associated to a discounted Markov Decision Problem is to find a policy that maximizes, for any initial state s, the expected sum of the discounted rewards along a trajectory,

V^π(s) = E[ Σ_{t≥0} γ^t r_t | s_0 = s ],

where γ ∈ [0, 1) is the discount factor.
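To fix ideas for the algorithmic sketches that follow, here is a minimal tabular MDP simulator matching this definition; the class name, the array layout P[s, a, s'] and r[s, a], and the reset/step interface are illustrative choices of ours, not part of the paper.

```python
import numpy as np

class TabularMDP:
    """Minimal tabular MDP: transition probabilities P[s, a, s'] and rewards r[s, a]
    (illustrative sketch; the interface is an assumption, not taken from the paper)."""

    def __init__(self, P, r, rng=None):
        self.P, self.r = P, r
        self.rng = rng if rng is not None else np.random.default_rng(0)
        self.s = 0

    def reset(self, s0=0):
        self.s = s0
        return self.s

    def step(self, a):
        reward = self.r[self.s, a]
        # sample the next state from the row P[s, a, :]
        self.s = int(self.rng.choice(self.P.shape[2], p=self.P[self.s, a]))
        return self.s, reward
```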

The value function V^π of a policy π can be directly learnt from observed trajectories by using the TD(λ) algorithm, without maintaining an estimation of the transition probabilities [Sutton, 1988]. It is also possible to directly learn an estimation of the optimal value function of the problem with the Q-learning algorithm [Watkins, 1989]. Q-learning regularly updates an estimation Q_t of the optimal Q-value function Q*, which is characterized by the optimality equations:

Q*(s, a) = r(s, a) + γ Σ_{s'} p(s'|s, a) max_{a'} Q*(s', a'),   for all (s, a).

After each observed transition (s_t, a_t, r_t, s_{t+1}), the Q-learning update is

Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + α_t [ r_t + γ max_{a'} Q_t(s_{t+1}, a') − Q_t(s_t, a_t) ].

An optimal policy can then be directly derived from Q* by π*(s) ∈ argmax_a Q*(s, a).
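A minimal sketch of this tabular Q-learning rule with ε-greedy exploration, assuming the TabularMDP-style reset/step interface sketched above; the constant step size and the hyper-parameter values are illustrative, not the paper's.

```python
import numpy as np

def q_learning(env, n_states, n_actions, gamma=0.95, alpha=0.1,
               epsilon=0.1, n_steps=10_000, rng=None):
    """Tabular Q-learning with a constant step size (illustrative sketch)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    s = env.reset()
    for _ in range(n_steps):
        # epsilon-greedy action choice
        a = int(np.argmax(Q[s])) if rng.random() >= epsilon else int(rng.integers(n_actions))
        s_next, r = env.step(a)
        # TD(0) update toward the bootstrapped target r + gamma * max_a' Q(s', a')
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next
    return Q
```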

The principle of Q(λ) is to combine Q-learning and the temporal difference error used in the TD(λ) method, in order to accelerate the convergence of Q-learning. For the discounted value function, the potentially infinite length of the trajectories leads us to consider the Q(λ) methods based on eligibility traces. After each observed transition (s_t, a_t, r_t, s_{t+1}), these algorithms update the estimated value function by

Q_{t+1}(s, a) = Q_t(s, a) + α_t e_t(s, a) δ_t,   for all (s, a),   (1)

where δ_t, the current temporal difference, is

δ_t = r_t + γ max_{a'} Q_t(s_{t+1}, a') − Q_t(s_t, a_t).

Here α_t is the step size and e_t(s, a) is the eligibility trace. The step sizes decay regularly toward 0, or are very small constants. The dynamics of the eligibility trace is more complex. Like in TD(λ), the trace value is increased for the current pair (s_t, a_t), and decays exponentially for all pairs according to the parameter λ. Following this general principle, several eligibility trace dynamics have been proposed for Q(λ) [Watkins, 1989; Peng and Williams, 1994; Singh and Sutton, 1996]. The simplest one, due to Watkins [Watkins, 1989], is directly derived from the accumulative eligibility trace of TD(λ), and has the supplementary effect of resetting to 0 the whole trace vector when the current action is not a greedy action, that is when a_t ∉ argmax_a Q_t(s_t, a). The complete definition of this trace is given by:

e_t(s, a) = γλ e_{t−1}(s, a) + 1   if (s, a) = (s_t, a_t) and a_t is greedy,
e_t(s, a) = 1                      if (s, a) = (s_t, a_t) and a_t is not greedy,
e_t(s, a) = γλ e_{t−1}(s, a)       if (s, a) ≠ (s_t, a_t) and a_t is greedy,
e_t(s, a) = 0                      if (s, a) ≠ (s_t, a_t) and a_t is not greedy,   (2)

with e_{−1}(s, a) = 0 for all (s, a).

Q(λ) with accumulative eligibility trace is a direct generalization of Q-learning; Q(0) exactly corresponds to Q-learning, where after each transition the visited pair (s_t, a_t) is the only component updated. Unlike the convergence of Q-learning, which has been proved [Bertsekas and Tsitsiklis, 1996; Borkar and Meyn, 2000], the convergence of Q(λ) and its overcoming of Q-learning have only been shown experimentally [Peng and Williams, 1994; Wiering and Schmidhuber, 1997]. In practice, the main drawback of Q(λ) is relative to the update costs of Q_t and e_t: for large state-action space problems, the computation time per transition, proportional to the size NM of the state-action space, can be very high.
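The following sketch shows one common way to implement Watkins' Q(λ) with accumulating traces and the reset on non-greedy actions, over the same assumed tabular interface; note that the trace and value updates touch the whole table at every step, which is exactly the per-transition cost discussed above. Step sizes and exploration are illustrative.

```python
import numpy as np

def watkins_q_lambda(env, n_states, n_actions, gamma=0.95, lam=0.8,
                     alpha=0.1, epsilon=0.1, n_steps=10_000, rng=None):
    """Watkins' Q(lambda) with accumulating eligibility traces (illustrative sketch)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    e = np.zeros_like(Q)                        # eligibility traces
    s = env.reset()
    for _ in range(n_steps):
        a_greedy = int(np.argmax(Q[s]))
        a = a_greedy if rng.random() >= epsilon else int(rng.integers(n_actions))
        s_next, r = env.step(a)
        # temporal difference with the max over next actions, as in rule (1)
        delta = r + gamma * np.max(Q[s_next]) - Q[s, a]
        e[s, a] += 1.0                          # accumulating trace, as in rule (2)
        Q += alpha * delta * e                  # every (state, action) pair is updated
        if a == a_greedy:
            e *= gamma * lam                    # exponential decay after a greedy action
        else:
            e[:] = 0.0                          # reset the whole trace vector otherwise
        s = s_next
    return Q
```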

3 Asymptotic approximation of Q(λ)

In this section we intend to define an asymptotic approximation of Watkins' Q(λ) when t → ∞. The following assumptions are made. At each iteration the action is chosen with a semi-uniform exploration policy: a greedy action is selected with a given probability, or any other action in A is chosen randomly. Asymptotically, when Q_t has converged toward Q*, this process defines a Markov chain on the state-action pairs. We assume that this Markov chain is regular (recurrent, aperiodic, irreducible) and therefore has a unique invariant distribution μ.

3.1 Asymptotic accumulating trace

Let us consider a trajectory obtained by Q(λ), such that Q_t converges toward Q*. Generalizing our work developed for TD(λ) [Garcia and Serre, 2000], the average asymptotic behaviour of the accumulative trace e_t on this trajectory can be calculated (Proposition 1).

3.2 Asymptotic Q(λ)

We consider the asymptotic approximation AQ(λ) of Q(λ) obtained by substituting, in the Q(λ) update rule (1), the trace factor e_t by its asymptotic approximation of Proposition 1. Let us denote by D the corresponding diagonal matrix, by E the eligibility matrix, where the state-action pairs are ordered in blocks (the column headings indicate the block dimensions), and by δ_t the temporal difference vector. In vector notation, the asymptotic approximation AQ(λ) of Q(λ) is defined by an update of the form

Q_{t+1} = Q_t + D E δ_t,   (3)

where D and E together play the role of a gain applied to the temporal difference vector.

4 Optimal gain matrix for Q-learning

One can see that (3) is a gain matrix variant of the basic Q-learning algorithm, which can itself be written as a stochastic approximation update

Q_{t+1} = Q_t + α_t H(Q_t, X_t),

where Q_t is the current estimation of the target and X_t = (s_t, a_t, r_t, s_{t+1}) is the observed input random vector at time t. Classic reinforcement learning algorithms like Q-learning or TD(λ) have already been analysed with the ODE method [Bertsekas and Tsitsiklis, 1996; Borkar and Meyn, 2000], in order to prove their convergence.
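To make the gain-matrix reading concrete, a single step of such an update can be written in vector form over the flattened Q-table; with the identity gain it reduces to basic Q-learning on the visited component, while a non-trivial gain mixes components, as AQ(λ) does. The gain used here is generic, not the paper's.

```python
import numpy as np

def gain_matrix_q_step(q_vec, visited_idx, delta, alpha, K=None):
    """One vector-form update q <- q + alpha * K @ (delta * e), where e is the
    indicator of the visited (state, action) pair.  K = I gives basic Q-learning;
    other gains give gain-matrix variants (illustrative sketch)."""
    e = np.zeros_like(q_vec)
    e[visited_idx] = 1.0
    K = np.eye(q_vec.size) if K is None else K
    return q_vec + alpha * K @ (delta * e)
```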

Another potential application of the ODE method concerns the optimization of the rate of convergence of stochastic algorithms. One can show that under some general stability conditions [Benveniste et al., 1990], the new algorithm obtained by multiplying the update by a suitably chosen gain matrix maximizes the rate of convergence toward the optimal value function. For Q-learning, this optimal gain combines a diagonal matrix with a rectangular matrix whose coefficients are the probabilities of moving from s to s' by applying the first action and then following the optimal policy on the next transitions.

This optimal gain leads to an "ideal" algorithm, called Optimal-Q-learning.
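The general principle behind such an optimal gain is the classical one from stochastic approximation theory, recalled here for orientation (e.g. [Benveniste et al., 1990; Kushner and Yin, 1997]); it is not a restatement of the paper's own equations. For an update

θ_{t+1} = θ_t + (1/t) K h(θ_t, X_t),

with mean field h̄(θ) = E_μ[h(θ, X)] and equilibrium h̄(θ*) = 0, the associated mean ODE is θ̇ = K h̄(θ). Asymptotic normality at rate 1/√t holds when the eigenvalues of K ∇h̄(θ*) have real parts smaller than −1/2, and the asymptotic covariance is minimized by the "stochastic Newton" choice

K* = −[∇h̄(θ*)]^{−1}.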

5 Average Q-learning

Optimal-Q-learning cannot be directly implemented, since it would require the costly online estimation of the coefficients of the optimal gain. However, as shown next, it is possible to determine some λ and step-size parameters that minimize the distance between the gain matrix of AQ(λ) and the optimal gain, and that allow an efficient implementation of the corresponding AQ(λ) algorithm.

5.1 Optimization of AQ(λ)

To be consistent with the definition of the optimal gain, which assumes a specific decreasing step size, the step sizes of AQ(λ) are chosen of the same form. Making the lower right blocks of the two gain matrices equal leads to coefficients equal to 1. A numerical analysis of the upper left blocks, similar to the one developed in [Garcia and Serre, 2000] for TD(λ), results in the choice of the remaining learning rates, and thus of the asymptotic gain of AQ(λ). This gain exhibits some similarity with the optimal gain of section 4: as we can note, the structures of the two gains are identical, which might explain the efficiency of Q(λ). Minimizing the distance between the two gains yields an optimal value of λ greater than 1.

This value greater than 1 could be surprising. In fact the convergence of eligibility trace methods requires that λ ≤ 1 [Bertsekas and Tsitsiklis, 1996], and so this optimum is no more an admissible value for λ; the distance is then minimized at the boundary of the admissible range, which leads to the AQ-learning update rule.

5.2 Efficient implementation of AQ-learning

A direct implementation of this update rule would have a complexity similar to Q(λ), proportional to NM. Fortunately, an equivalent expression can be derived: it only requires a small number of updates per iteration, independent of the number of states and actions (Proposition 3). The temporal difference error can be expressed as a function of a relative value function and of a scaling factor, and the greedy action is then defined from these quantities. The resulting update rule can be implemented by replacing the step size by one based on n_t(s_t), where n_t(s) is the number of times the state s has been visited at time t.

AQ-learning Algorithm

for t = 0, 1, ..., T − 1 do
    Choose a_t in A (semi-uniform exploration policy);
    Simulate the transition (s_t, a_t, r_t, s_{t+1});
    Update the relative value function for the visited pair (s_t, a_t);
    Update the scaling factor;
    Update the visit count n(s_t);

In that algorithm, T is the length of the learning trajectory. AQ-learning is particularly interesting, since it only requires 3 updates per iteration. AQ-learning should therefore improve the performance of Watkins' Q(λ) for two distinct reasons. First, AQ-learning is an optimized approximation of Q(λ) that minimizes the asymptotic variance around Q*. Second, like Q-learning, AQ-learning is an asynchronous algorithm that only updates one component of the value function per iteration. It is clear that for large state-action space problems, this feature is of crucial importance regarding the time complexity of the algorithm, as is shown by the simulations in the next section.

Amazingly, it appears that AQ-learning is very close to the R-learning algorithm proposed by Schwartz [Schwartz, 1993]. R-learning is supposed to converge to gain optimal policies, that maximize the average reward criterion.
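For comparison, a compact sketch of Schwartz's R-learning update as it is usually presented (two step sizes, and the average-reward estimate updated only after greedy actions); the precise correspondence with AQ-learning is the subject of the discussion above, not of this sketch.

```python
import numpy as np

def r_learning_step(R, rho, s, a, r, s_next, alpha=0.1, beta=0.01):
    """One tabular R-learning update (Schwartz, 1993): relative action values R[s, a]
    and average-reward estimate rho (illustrative sketch)."""
    was_greedy = a == int(np.argmax(R[s]))
    # relative value update toward the average-adjusted target
    R[s, a] += alpha * (r - rho + np.max(R[s_next]) - R[s, a])
    # the average-reward estimate is updated only after greedy actions
    if was_greedy:
        rho += beta * (r - rho + np.max(R[s_next]) - np.max(R[s]))
    return R, rho
```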

6 Experimental results

Figure 2: Computation time (s) as a function of the number of states, for Q-learning, Q(λ), Optimal-Q-learning and AQ-learning.

We chose a semi-uniform exploration strategy with a fixed exploration probability. For each problem, the optimal value function and the optimal gain were exactly calculated from the transition probabilities and rewards.
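For reference, computing the optimal Q-value function exactly from the model data can be done by standard Q-value iteration; this sketch assumes the P[s, a, s'] and r[s, a] arrays introduced earlier and is not necessarily the procedure used for the paper's experiments.

```python
import numpy as np

def q_value_iteration(P, r, gamma=0.95, tol=1e-10, max_iter=100_000):
    """Compute Q* from transition probabilities P[s, a, s'] and rewards r[s, a]
    by iterating the Bellman optimality operator (illustrative sketch)."""
    Q = np.zeros_like(r, dtype=float)
    for _ in range(max_iter):
        # Bellman backup: Q(s,a) = r(s,a) + gamma * sum_s' P(s'|s,a) * max_a' Q(s',a')
        Q_new = r + gamma * P @ Q.max(axis=1)
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
    return Q
```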

Figure 1 shows the relative learning error as a function of time, averaged over random trajectories, for an MDP of fixed size. This graph illustrates the rates of convergence of Q-learning, AQ-learning, Q(λ), and Optimal-Q-learning. These experimental results confirm that the fastest rate of convergence is obtained with Optimal-Q-learning, and that AQ-learning converges nearly as fast as this ideal algorithm. AQ-learning even performs slightly better than Q(λ) with the same learning rates and the same value of λ. These three algorithms appear to be considerably faster than Q-learning.

Figure 2 illustrates for the same algorithms the different computation times required to achieve a relative learning error equal to 5%, as a function of the size of the state space, for a constant number of actions. We first observe on that graph that Q-learning is very slow, despite the fact that it only updates one component per iteration. But the main observation is that AQ-learning can be really more efficient than Q(λ), especially when the number of states is important; for 200 states, AQ-learning is 10 times faster than Q(λ).

7 Conclusions

We have shown that an asymptotic approximation of Q(λ) could be efficiently implemented through two simple update rules on a relative value function and on a scaling factor. This new algorithm AQ-learning can be described as an average variant of Q-learning for the discounted criterion. Its update complexity is independent of the number of states and actions, and experimental results show that its convergence toward the optimal value function is always faster than for Q-learning or Q(λ).

Our results have been obtained for Watkins' Q(λ) with accumulating eligibility trace. It would be interesting to apply the same approach to the case of replacing traces [Singh and Sutton, 1996], or to Peng and Williams' Q(λ) [Peng and Williams, 1994], where the trace is never reset to 0, and for which an efficient implementation has been proposed by Wiering and Schmidhuber [Wiering and Schmidhuber, 1997]. Another direction for future work concerns the development of new reinforcement learning algorithms that approximate Optimal-Q-learning. Furthermore, we have seen that the optimal gain for Q-learning depends on the exploration policy and its exploration probability. The question of finding the best value of this parameter remains open.

Finally, we need a better understanding of the relation between AQ-learning and Schwartz's R-learning algorithm [Schwartz, 1993].

References

[Benveniste et al., 1990] A. Benveniste, M. Metivier, and P. Priouret. Adaptive Algorithms and Stochastic Approximation. Springer-Verlag, Berlin, New York, 1990.

[Bertsekas and Tsitsiklis, 1996] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont (MA), 1996.

[Borkar and Meyn, 2000] V. S. Borkar and S. P. Meyn. The O.D.E. method for convergence of stochastic approximation and reinforcement learning. SIAM Journal of Control and Optimization, 38:447–469, 2000.

[Borkar, 1996] V. S. Borkar. Stochastic approximation with two time scales. System and Control Letters, 29:291–294, 1996.

[Cichosz, 1995] P. Cichosz. Truncating temporal differences: On the efficient implementation of TD(λ) for reinforcement learning. Journal of Artificial Intelligence Research (JAIR), 2:287–318, 1995.

[Garcia and Ndiaye, 1998] F. Garcia and S. Ndiaye. A Learning Rate Analysis of Reinforcement-Learning Algorithms in Finite-Horizon. In International Conference on Machine Learning, volume 15, Madison, USA, 1998.

[Garcia and Serre, 2000] F. Garcia and F. Serre. Efficient asymptotic approximation in temporal difference learning. In Proceedings of the 14th European Conference on Artificial Intelligence (ECAI 2000), Berlin, 2000.

[Kearns and Singh, 2000] M. Kearns and S. P. Singh. "Bias-Variance" error bounds for temporal difference updates. In Proceedings of the COLT 2000 conference, 2000.

[Kushner and Yin, 1997] H. J. Kushner and G. G. Yin. Stochastic Approximation Algorithms and Applications. Springer, 1997.

[Peng and Williams, 1994] J. Peng and R. J. Williams. Incremental Multi-Step Q-Learning. In International Conference on Machine Learning, volume 11, pages 226–232, 1994.

[Puterman, 1994] M. L. Puterman. Markov decision processes: discrete stochastic dynamic programming. Wiley-Interscience, New York, 1994.

[Schwartz, 1993] A. Schwartz. A Reinforcement Learning Method for Maximizing Undiscounted Rewards. In International Conference on Machine Learning, volume 10, 1993.

[Singh and Dayan, 1998] S. P. Singh and P. Dayan. Analytical Mean Squared Error Curves for Temporal Difference Learning. Machine Learning, 1998.

[Singh and Sutton, 1996] S. P. Singh and R. S. Sutton. Reinforcement learning with replacing eligibility traces. Machine Learning, 1996.

[Sutton, 1988] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44, 1988.

[Watkins, 1989] C. J. Watkins. Learning from Delayed Rewards. PhD thesis, Cambridge University, Cambridge, England, 1989.

[Wiering and Schmidhuber, 1997] M. A. Wiering and J. Schmidhuber. Fast online Q(λ). Machine Learning, 1997.
