From Q(λ) to Average Q-learning:

Efficient Implementation of an Asymptotic Approximation

Frédérick Garcia and Florent Serre

INRA - Unité de Biométrie et Intelligence Artificielle. BP 27, Auzeville. 31326 Castanet Tolosan cedex, France.

fgarcia@toulouse.inra.fr

Abstract

Q(λ) is a reinforcement learning algorithm that combines Q-learning and TD(λ). Online implementations of Q(λ) that use eligibility traces have been shown to speed up basic Q-learning. In this paper we present an asymptotic analysis of Watkins' Q(λ) with accumulative eligibility traces. We first introduce an asymptotic approximation of Q(λ) that appears to be a gain matrix variant of basic Q-learning. Using the ODE method, we then determine an optimal gain matrix for Q-learning that maximizes its rate of convergence toward the optimal value function. The similarity between this optimal gain and the asymptotic gain of Q(λ) explains the relative efficiency of the latter. Furthermore, by minimizing the difference between these two gains, optimal values for the λ parameter and the decreasing learning rates can be determined. This optimal λ strongly depends on the exploration policy during learning. A robust approximation of these learning parameters leads to the definition of a new efficient algorithm called AQ-learning (Average Q-learning), that shows a close resemblance to Schwartz's R-learning. Our results have been demonstrated through numerical simulations.

1 Introduction

TD(λ) and Q-learning are undoubtedly two of the most important methods that have been proposed in Reinforcement Learning. Q-learning is a simple direct algorithm for learning optimal policies, and the TD(λ) rule is used in many existing reinforcement learning and stochastic dynamic programming algorithms for efficiently evaluating a given policy.

The Q-learning algorithm basically implements the TD(0) update rule. In order to improve the convergence rate of Q-learning, it has been proposed to integrate the TD(λ) rule, λ > 0, with the Q-learning update rule. That resulted in the family of Q(λ) algorithms [Watkins, 1989; Peng and Williams, 1994]. It is commonly believed that Q(λ), λ > 0, improves the Q-learning behaviour, but no definite theoretical analysis of Q(λ) has come in support of this conjecture: there is no proof of convergence for Q(λ), and as for TD(λ), the issue of the rational choice of λ remains [Singh and Dayan, 1998; Kearns and Singh, 2000].

The use of Q(λ) is hampered by another important difficulty. Eligibility trace implementations of Q(λ) based on look-up tables require many updates of Q-value and trace components after each transition. The time complexity of this approach may be very high for large state and action spaces. In spite of several recent attempts [Cichosz, 1995; Wiering and Schmidhuber, 1997], the multi-component update rule of Q(λ) remains problematic.

In this paper we propose a new algorithm that eliminates the above-mentioned shortcoming of Q(λ). Our approach relies on an asymptotic approximation of Watkins' Q(λ) (sections 2 and 3), that appears to be a gain matrix variant of basic Q-learning. Then an optimal convergence rate analysis of Q-learning based on the ODE method leads us to determine an optimal gain matrix for Q-learning (section 4). The similar structure of these two gains explains the relative efficiency of Q(λ). Furthermore, by minimizing the difference between these two gains, we are able to determine near-optimal λ and learning rates that allow an efficient implementation of a new reinforcement learning algorithm called AQ-learning (Average Q-learning), which shows a close resemblance to Schwartz's R-learning [Schwartz, 1993] (section 5). We present in section 6 some experimental results that confirm the soundness of the asymptotic analysis and the near-optimal behaviour of AQ-learning.

2 The Watkins' Q(λ) algorithm

Like the TD(λ) and Q-learning reinforcement learning algorithms, Q(λ) can be described within the framework of Markov Decision Processes (MDP) [Puterman, 1994]. Assume a stationary infinite-horizon MDP model that is defined by a state space S of size N and an action space A of size M, by a Markovian dynamics characterized by the transition probabilities p(s'|s, a) of moving from s to s' after applying the action a, and by the instantaneous reward functions r(s, a) associated to each transition.

A policy π is a function that assigns an action π(s) to each possible state s. Given an initial state s_0, following a policy π defines a random trajectory s_0, a_0, r_0, s_1, a_1, r_1, ... according to the probabilities p(·|s_t, π(s_t)).

The optimization problem associated to a discounted Markov Decision Problem is to find a policy that maximizes, for any initial state s, the expected sum of the discounted rewards along a trajectory,

V^π(s) = E[ Σ_{t≥0} γ^t r_t | s_0 = s ],

where γ ∈ [0, 1) is the discount factor.
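To fix ideas for the algorithmic sketches that follow, here is a minimal tabular MDP simulator matching this definition; the class name, the array layout P[s, a, s'] and r[s, a], and the reset/step interface are illustrative choices of ours, not part of the paper.

```python
import numpy as np

class TabularMDP:
    """Minimal tabular MDP: transition probabilities P[s, a, s'] and rewards r[s, a]
    (illustrative sketch; the interface is an assumption, not taken from the paper)."""

    def __init__(self, P, r, rng=None):
        self.P, self.r = P, r
        self.rng = rng if rng is not None else np.random.default_rng(0)
        self.s = 0

    def reset(self, s0=0):
        self.s = s0
        return self.s

    def step(self, a):
        reward = self.r[self.s, a]
        # sample the next state from the row P[s, a, :]
        self.s = int(self.rng.choice(self.P.shape[2], p=self.P[self.s, a]))
        return self.s, reward
```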

The value function V^π of a policy π can be directly learnt from observed trajectories by using the TD(λ) algorithm, without maintaining an estimation of the transition probabilities [Sutton, 1988]. It is also possible to directly learn an estimation of the optimal value function of the problem with the Q-learning algorithm [Watkins, 1989]. Q-learning regularly updates an estimation Q_t of the optimal Q-value function Q*, which is characterized by the optimality equations:

Q*(s, a) = r(s, a) + γ Σ_{s'} p(s'|s, a) max_{a'} Q*(s', a'),   for all (s, a).

After each observed transition (s_t, a_t, r_t, s_{t+1}), the Q-learning update is

Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + α_t [ r_t + γ max_{a'} Q_t(s_{t+1}, a') − Q_t(s_t, a_t) ].

An optimal policy can then be directly derived from Q* by π*(s) ∈ argmax_a Q*(s, a).
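A minimal sketch of this tabular Q-learning rule with ε-greedy exploration, assuming the TabularMDP-style reset/step interface sketched above; the constant step size and the hyper-parameter values are illustrative, not the paper's.

```python
import numpy as np

def q_learning(env, n_states, n_actions, gamma=0.95, alpha=0.1,
               epsilon=0.1, n_steps=10_000, rng=None):
    """Tabular Q-learning with a constant step size (illustrative sketch)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    s = env.reset()
    for _ in range(n_steps):
        # epsilon-greedy action choice
        a = int(np.argmax(Q[s])) if rng.random() >= epsilon else int(rng.integers(n_actions))
        s_next, r = env.step(a)
        # TD(0) update toward the bootstrapped target r + gamma * max_a' Q(s', a')
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next
    return Q
```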

The principle of Q(λ) is to combine Q-learning and the temporal difference error used in the TD(λ) method, in order to accelerate the convergence of Q-learning. For the discounted value function, the potentially infinite length of the trajectories leads us to consider the Q(λ) methods based on eligibility traces. After each observed transition (s_t, a_t, r_t, s_{t+1}), these algorithms update the estimated value function by

Q_{t+1}(s, a) = Q_t(s, a) + α_t e_t(s, a) δ_t,   for all (s, a),   (1)

where δ_t, the current temporal difference, is

δ_t = r_t + γ max_{a'} Q_t(s_{t+1}, a') − Q_t(s_t, a_t).

Here α_t is the step size and e_t(s, a) is the eligibility trace. The step sizes decay regularly toward 0, or are very small constants. The dynamics of the eligibility trace is more complex. Like in TD(λ), the trace value is increased for the current pair (s_t, a_t), and decays exponentially for all pairs according to the parameter λ. Following this general principle, several eligibility trace dynamics have been proposed for Q(λ) [Watkins, 1989; Peng and Williams, 1994; Singh and Sutton, 1996]. The simplest one, due to Watkins [Watkins, 1989], is directly derived from the accumulative eligibility trace of TD(λ), and has the supplementary effect of resetting to 0 the whole trace vector when the current action is not a greedy action, that is when a_t ∉ argmax_a Q_t(s_t, a). The complete definition of this trace is given by:

e_t(s, a) = γλ e_{t−1}(s, a) + 1   if (s, a) = (s_t, a_t) and a_t is greedy,
e_t(s, a) = 1                      if (s, a) = (s_t, a_t) and a_t is not greedy,
e_t(s, a) = γλ e_{t−1}(s, a)       if (s, a) ≠ (s_t, a_t) and a_t is greedy,
e_t(s, a) = 0                      if (s, a) ≠ (s_t, a_t) and a_t is not greedy,   (2)

with e_{−1}(s, a) = 0 for all (s, a).

Q(λ) with accumulative eligibility trace is a direct generalization of Q-learning; Q(0) exactly corresponds to Q-learning, where after each transition the visited pair (s_t, a_t) is the only component updated. Unlike the convergence of Q-learning, which has been proved [Bertsekas and Tsitsiklis, 1996; Borkar and Meyn, 2000], the convergence of Q(λ) and its overcoming of Q-learning have only been shown experimentally [Peng and Williams, 1994; Wiering and Schmidhuber, 1997]. In practice, the main drawback of Q(λ) is relative to the update costs of Q_t and e_t: for large state-action space problems, the computation time per transition, proportional to the size NM of the state-action space, can be very high.
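The following sketch shows one common way to implement Watkins' Q(λ) with accumulating traces and the reset on non-greedy actions, over the same assumed tabular interface; note that the trace and value updates touch the whole table at every step, which is exactly the per-transition cost discussed above. Step sizes and exploration are illustrative.

```python
import numpy as np

def watkins_q_lambda(env, n_states, n_actions, gamma=0.95, lam=0.8,
                     alpha=0.1, epsilon=0.1, n_steps=10_000, rng=None):
    """Watkins' Q(lambda) with accumulating eligibility traces (illustrative sketch)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    e = np.zeros_like(Q)                        # eligibility traces
    s = env.reset()
    for _ in range(n_steps):
        a_greedy = int(np.argmax(Q[s]))
        a = a_greedy if rng.random() >= epsilon else int(rng.integers(n_actions))
        s_next, r = env.step(a)
        # temporal difference with the max over next actions, as in rule (1)
        delta = r + gamma * np.max(Q[s_next]) - Q[s, a]
        e[s, a] += 1.0                          # accumulating trace, as in rule (2)
        Q += alpha * delta * e                  # every (state, action) pair is updated
        if a == a_greedy:
            e *= gamma * lam                    # exponential decay after a greedy action
        else:
            e[:] = 0.0                          # reset the whole trace vector otherwise
        s = s_next
    return Q
```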

3 Asymptotic approximation of Q(λ)

In this section we intend to define an asymptotic approximation of Watkins' Q(λ) when t → ∞. The following assumptions are made. At each iteration the action is chosen with a semi-uniform exploration policy: a greedy action is selected with a given probability, or any other action in A is chosen randomly. Asymptotically, when Q_t has converged toward Q*, this process defines a Markov chain on the state-action pairs. We assume that this Markov chain is regular (recurrent, aperiodic, irreducible) and therefore has a unique invariant distribution μ.

3.1 Asymptotic accumulating trace

Let us consider a trajectory obtained by Q(λ), such that Q_t converges toward Q*. Generalizing our work developed for TD(λ) [Garcia and Serre, 2000], the average asymptotic behaviour of the accumulative trace e_t on this trajectory can be calculated (Proposition 1).

3.2 Asymptotic Q(λ)

We consider the asymptotic approximation AQ(λ) of Q(λ) obtained by substituting, in the Q(λ) update rule (1), the trace factor e_t by its asymptotic approximation of Proposition 1. Let us denote by D the corresponding diagonal matrix, by E the eligibility matrix, where the state-action pairs are ordered in blocks (the column headings indicate the block dimensions), and by δ_t the temporal difference vector. In vector notation, the asymptotic approximation AQ(λ) of Q(λ) is defined by an update of the form

Q_{t+1} = Q_t + D E δ_t,   (3)

where D and E together play the role of a gain applied to the temporal difference vector.

4 Optimal gain matrix for Q-learning

One can see that (3) is a gain matrix variant of the basic Q-learning algorithm, which can itself be written as a stochastic approximation update

Q_{t+1} = Q_t + α_t H(Q_t, X_t),

where Q_t is the current estimation of the target and X_t = (s_t, a_t, r_t, s_{t+1}) is the observed input random vector at time t. Classic reinforcement learning algorithms like Q-learning or TD(λ) have already been analysed with the ODE method [Bertsekas and Tsitsiklis, 1996; Borkar and Meyn, 2000], in order to prove their convergence.
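To make the gain-matrix reading concrete, a single step of such an update can be written in vector form over the flattened Q-table; with the identity gain it reduces to basic Q-learning on the visited component, while a non-trivial gain mixes components, as AQ(λ) does. The gain used here is generic, not the paper's.

```python
import numpy as np

def gain_matrix_q_step(q_vec, visited_idx, delta, alpha, K=None):
    """One vector-form update q <- q + alpha * K @ (delta * e), where e is the
    indicator of the visited (state, action) pair.  K = I gives basic Q-learning;
    other gains give gain-matrix variants (illustrative sketch)."""
    e = np.zeros_like(q_vec)
    e[visited_idx] = 1.0
    K = np.eye(q_vec.size) if K is None else K
    return q_vec + alpha * K @ (delta * e)
```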

Another potential application of the ODE method concerns the optimization of the rate of convergence of stochastic algorithms. One can show that under some general stability conditions [Benveniste et al., 1990], the new algorithm obtained by multiplying the update by a suitably chosen gain matrix maximizes the rate of convergence toward the optimal value function. For Q-learning, this optimal gain combines a diagonal matrix with a rectangular matrix whose coefficients are the probabilities of moving from s to s' by applying the first action and then following the optimal policy on the next transitions.

This optimal gain leads to an "ideal" algorithm, called Optimal-Q-learning.
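The general principle behind such an optimal gain is the classical one from stochastic approximation theory, recalled here for orientation (e.g. [Benveniste et al., 1990; Kushner and Yin, 1997]); it is not a restatement of the paper's own equations. For an update

θ_{t+1} = θ_t + (1/t) K h(θ_t, X_t),

with mean field h̄(θ) = E_μ[h(θ, X)] and equilibrium h̄(θ*) = 0, the associated mean ODE is θ̇ = K h̄(θ). Asymptotic normality at rate 1/√t holds when the eigenvalues of K ∇h̄(θ*) have real parts smaller than −1/2, and the asymptotic covariance is minimized by the "stochastic Newton" choice

K* = −[∇h̄(θ*)]^{−1}.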

5 Average Q-learning

Optimal-Q-learning cannot be directly implemented, since it would require the costly online estimation of the coefficients of the optimal gain. However, as shown next, it is possible to determine some λ and step-size parameters that minimize the distance between the gain matrix of AQ(λ) and the optimal gain, and that allow an efficient implementation of the corresponding AQ(λ) algorithm.

5.1 Optimization of AQ(λ)

To be consistent with the definition of the optimal gain, which assumes a specific decreasing step size, the step sizes of AQ(λ) are chosen of the same form. Making the lower right blocks of the two gain matrices equal leads to coefficients equal to 1. A numerical analysis of the upper left blocks, similar to the one developed in [Garcia and Serre, 2000] for TD(λ), results in the choice of the remaining learning rates, and thus of the asymptotic gain of AQ(λ). This gain exhibits some similarity with the optimal gain of section 4: as we can note, the structures of the two gains are identical, which might explain the efficiency of Q(λ). Minimizing the distance between the two gains yields an optimal value of λ greater than 1.

This value greater than 1 could be surprising. In fact the convergence of eligibility trace methods requires that λ ≤ 1 [Bertsekas and Tsitsiklis, 1996], and so this optimum is no more an admissible value for λ; the distance is then minimized at the boundary of the admissible range, which leads to the AQ-learning update rule.

5.2 Efficient implementation of AQ-learning

A direct implementation of this update rule would have a complexity similar to Q(λ), proportional to NM. Fortunately, an equivalent expression can be derived: it only requires a small number of updates per iteration, independent of the number of states and actions (Proposition 3). The temporal difference error can be expressed as a function of a relative value function and of a scaling factor, and the greedy action is then defined from these quantities. The resulting update rule can be implemented by replacing the step size by one based on n_t(s_t), where n_t(s) is the number of times the state s has been visited at time t.

AQ-learning Algorithm

for t = 0, 1, ..., T − 1 do
    Choose a_t in A (semi-uniform exploration policy);
    Simulate the transition (s_t, a_t, r_t, s_{t+1});
    Update the relative value function for the visited pair (s_t, a_t);
    Update the scaling factor;
    Update the visit count n(s_t);

In that algorithm, T is the length of the learning trajectory. AQ-learning is particularly interesting, since it only requires 3 updates per iteration. AQ-learning should therefore improve the performance of Watkins' Q(λ) for two distinct reasons. First, AQ-learning is an optimized approximation of Q(λ) that minimizes the asymptotic variance around Q*. Second, like Q-learning, AQ-learning is an asynchronous algorithm that only updates one component of the value function per iteration. It is clear that for large state-action space problems, this feature is of crucial importance regarding the time complexity of the algorithm, as is shown by the simulations in the next section.

Amazingly, it appears that AQ-learning is very close to the R-learning algorithm proposed by Schwartz [Schwartz, 1993]. R-learning is supposed to converge to gain optimal policies, that maximize the average reward criterion.
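For comparison, a compact sketch of Schwartz's R-learning update as it is usually presented (two step sizes, and the average-reward estimate updated only after greedy actions); the precise correspondence with AQ-learning is the subject of the discussion above, not of this sketch.

```python
import numpy as np

def r_learning_step(R, rho, s, a, r, s_next, alpha=0.1, beta=0.01):
    """One tabular R-learning update (Schwartz, 1993): relative action values R[s, a]
    and average-reward estimate rho (illustrative sketch)."""
    was_greedy = a == int(np.argmax(R[s]))
    # relative value update toward the average-adjusted target
    R[s, a] += alpha * (r - rho + np.max(R[s_next]) - R[s, a])
    # the average-reward estimate is updated only after greedy actions
    if was_greedy:
        rho += beta * (r - rho + np.max(R[s_next]) - np.max(R[s]))
    return R, rho
```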

6 Experimental results

Figure 2: Computation time (s) as a function of the number of states, for Q-learning, Q(λ), Optimal-Q-learning and AQ-learning.

We chose a semi-uniform exploration strategy with a fixed exploration probability. For each problem, the optimal value function and the optimal gain were exactly calculated from the transition probabilities and rewards.
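For reference, computing the optimal Q-value function exactly from the model data can be done by standard Q-value iteration; this sketch assumes the P[s, a, s'] and r[s, a] arrays introduced earlier and is not necessarily the procedure used for the paper's experiments.

```python
import numpy as np

def q_value_iteration(P, r, gamma=0.95, tol=1e-10, max_iter=100_000):
    """Compute Q* from transition probabilities P[s, a, s'] and rewards r[s, a]
    by iterating the Bellman optimality operator (illustrative sketch)."""
    Q = np.zeros_like(r, dtype=float)
    for _ in range(max_iter):
        # Bellman backup: Q(s,a) = r(s,a) + gamma * sum_s' P(s'|s,a) * max_a' Q(s',a')
        Q_new = r + gamma * P @ Q.max(axis=1)
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
    return Q
```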

Figure 1 shows the relative learning error as a function of time, averaged over random trajectories, for an MDP of fixed size. This graph illustrates the rates of convergence of Q-learning, AQ-learning, Q(λ), and Optimal-Q-learning. These experimental results confirm that the fastest rate of convergence is obtained with Optimal-Q-learning, and that AQ-learning converges nearly as fast as this ideal algorithm. AQ-learning even performs slightly better than Q(λ) with the same learning rates and the same value of λ. These three algorithms appear to be considerably faster than Q-learning.

Figure 2 illustrates for the same algorithms the different computation times required to achieve a relative learning error equal to 5%, as a function of the size of the state space, for a constant number of actions. We first observe on that graph that Q-learning is very slow, despite the fact that it only updates one component per iteration. But the main observation is that AQ-learning can be really more efficient than Q(λ), especially when the number of states is important; for 200 states, AQ-learning is 10 times faster than Q(λ).

7 Conclusions

We have shown that an asymptotic approximation of Q(λ) could be efficiently implemented through two simple update rules on a relative value function and on a scaling factor. This new algorithm AQ-learning can be described as an average variant of Q-learning for the discounted criterion. Its update complexity is independent of the number of states and actions, and experimental results show that its convergence toward the optimal value function is always faster than for Q-learning or Q(λ).

Our results have been obtained for Watkins' Q(λ) with accumulating eligibility trace. It would be interesting to apply the same approach to the case of replacing traces [Singh and Sutton, 1996], or to Peng and Williams' Q(λ) [Peng and Williams, 1994], where the trace is never reset to 0, and for which an efficient implementation has been proposed by Wiering and Schmidhuber [Wiering and Schmidhuber, 1997]. Another direction for future work concerns the development of new reinforcement learning algorithms that approximate Optimal-Q-learning. Furthermore, we have seen that the optimal gain for Q-learning depends on the exploration policy and its exploration probability. The question of finding the best value of this parameter remains open.

Finally, we need a better understanding of the relation between AQ-learning and Schwartz's R-learning algorithm [Schwartz, 1993].

References

[Benveniste et al., 1990] A. Benveniste, M. Metivier, and P. Priouret. Adaptive Algorithms and Stochastic Approximation. Springer-Verlag, Berlin, New York, 1990.

[Bertsekas and Tsitsiklis, 1996] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont (MA), 1996.

[Borkar and Meyn, 2000] V. S. Borkar and S. P. Meyn. The O.D.E. method for convergence of stochastic approximation and reinforcement learning. SIAM Journal of Control and Optimization, 38:447–469, 2000.

[Borkar, 1996] V. S. Borkar. Stochastic approximation with two time scales. System and Control Letters, 29:291–294, 1996.

[Cichosz, 1995] P. Cichosz. Truncating temporal differences: On the efficient implementation of TD(λ) for reinforcement learning. Journal of Artificial Intelligence Research (JAIR), 2:287–318, 1995.

[Garcia and Ndiaye, 1998] F. Garcia and S. Ndiaye. A Learning Rate Analysis of Reinforcement-Learning Algorithms in Finite-Horizon. In International Conference on Machine Learning, volume 15, Madison, USA, 1998.

[Garcia and Serre, 2000] F. Garcia and F. Serre. Efficient asymptotic approximation in temporal difference learning. In Proceedings of the 14th European Conference on Artificial Intelligence (ECAI 2000), Berlin, 2000.

[Kearns and Singh, 2000] M. Kearns and S. P. Singh. "Bias-Variance" error bounds for temporal difference updates. In Proceedings of the COLT 2000 conference, 2000.

[Kushner and Yin, 1997] H. J. Kushner and G. G. Yin. Stochastic Approximation Algorithms and Applications. Springer, 1997.

[Peng and Williams, 1994] J. Peng and R. J. Williams. Incremental Multi-Step Q-Learning. In International Conference on Machine Learning, volume 11, pages 226–232, 1994.

[Puterman, 1994] M. L. Puterman. Markov decision processes: discrete stochastic dynamic programming. Wiley-Interscience, New York, 1994.

[Schwartz, 1993] A. Schwartz. A Reinforcement Learning Method for Maximizing Undiscounted Rewards. In International Conference on Machine Learning, volume 10, 1993.

[Singh and Dayan, 1998] S. P. Singh and P. Dayan. Analytical Mean Squared Error Curves for Temporal Difference Learning. Machine Learning, 1998.

[Singh and Sutton, 1996] S. P. Singh and R. S. Sutton. Reinforcement learning with replacing eligibility traces. Machine Learning, 1996.

[Sutton, 1988] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44, 1988.

[Watkins, 1989] C. J. Watkins. Learning from Delayed Rewards. PhD thesis, Cambridge University, Cambridge, England, 1989.

[Wiering and Schmidhuber, 1997] M. A. Wiering and J. Schmidhuber. Fast online Q(λ). Machine Learning, 1997.
