搜索
您的当前位置:首页Pattern matching in text compressed by using antidictionaries

Pattern matching in text compressed by using antidictionaries

来源:飒榕旅游知识分享网
PatternMatchinginTextCompressed

byUsingAntidictionaries

YusukeShibata,MasayukiTakeda,AyumiShinohara,andSetsuoArikawa

DepartmentofInformatics,KyushuUniversity33

Fukuoka812-8581,Japan

{yusuke,takeda,ayumi,arikawa}@i.kyushu-u.ac.jp

http://www.i.kyushu-u.ac.jp

Abstract.Inthispaperwefocusontheproblemofcompressedpatternmatchingforthetextcompressionusingantidictionaries,whichisanewcompressionschemeproposedrecentlybyCrochemoreetal.(1998).WeshowanalgorithmwhichpreprocessesapatternoflengthmandanantidictionaryMinO(m2+󰀄M󰀄)time,andthenscansacompressedtextoflengthninO(n+r)timetofindallpatternoccurrences,where󰀄M󰀄isthetotallengthofstringsinMandristhenumberofthepatternoccurrences.

1Introduction

Compressedpatternmatchingisoneofthemostinterestingtopicsinthecom-binatorialpatternmatching,andmanystudieshavebeenundertakenonthisproblemforseveralcompressionmethodsfromboththeoreticalandpracticalviewpoints.SeeTable1.Oneimportantgoalofcompressedpatternmatchingistoachievealineartimecomplexitythatisproportionalnottotheoriginaltextlengthbuttothecompressedtextlength.

Recently,Crochemoreetal.proposedanewcompressionscheme:textcom-pressionusingantidictionary[8].Contrarytothecompressionmethodsthatmakeuseofdictionaries,whichareparticularsetsofstringsoccurringintexts,thenewschemeexploitsanantidictionarythatisafinitesetofstringsthatdonotoccurasfactorsintext,i.e.thatareforbidden.Leta1...an∈{0,1}+bethetexttobecompressed.Supposewehavereadaprefixa1...ajatacertainmoment.Ifthestringai...ajb(i≤j,b∈{0,1})isaforbiddenword,namely,isintheantidictionary,thenthenextsymbolaj+1cannotbeb.Inotherwords,thenextsymbolaj+1ispredictable.Basedonthisidea,thecompressionmethodremovessuchpredictablesymbolsfromthetext.Thecompressionandthede-compressionareperformedbyusingtheautomatonacceptingthesetofstringsinwhichnoforbiddenwordsoccurasfactors.

Inthispaperwefocusontheproblemofcompressedpatternmatchingforthetextcompressionusingantidictionaries.WepresentanalgorithmthatsolvestheprobleminO(m2+󰀆M󰀆+n+r)timeusingO(m2+󰀆M󰀆)space,wheremandnarethepatternlengthandthecompressedtextlength,respectively,󰀆M󰀆

M.Crochemore,M.Paterson(Eds.):CPM’99,LNCS1645,pp.37–49,1999.cSpringer-VerlagBerlinHeidelberg1999󰀄

38YusukeShibataetal.

Table1.Compressedpatternmatching.

compressionmethodcompressedpatternmatchingalgorithmsrun-lengthEilam-TzoreffandVishkin[11]

run-length(twodim.)Amir,Landau,andVishkin[6];AmirandBenson

[2,3];Amir,Benson,andFarach[5]

LZ77FarachandThorup[12];G¸asieniec,Karpinski,

Plandowski,andRytter[14]

LZWAmir,Benson,andFarach[4];Kida,Takeda,Shi-nohara,Miyazaki,andArikawa[17];Kida,Takeda,

Shinohara,andArikawa[16]

straight-lineprogramKarpinski,Rytter,andShinohara[15];Miyazaki,

Shinohara,andTakeda[20]

HuffmanFukamachi,Shinohara,andTakeda[13];Miyazaki,

Fukamachi,Takeda,andShinohara[19]

finitestateencodingTakeda[22]

wordbasedencodingMoura,Navarro,Ziviani,andBaeza-Yates[9,10]patternsubstitutionManber[18];Shibata,Kida,Fukamachi,Takeda,

A.Shinohara,T.Shinohara,Arikawa[21]

denotesthetotallengthofstringsinantidictionaryM,andristhenumberofpatternoccurrences.SinceMisapartofthecompressedrepresentationoftext,thetextscanningtimeisO(󰀆M󰀆+n+r),whichislinearinthecompressedtextlength󰀆M󰀆+n,whenignoringr.Moreover,inthecasewhereasetoftextfilesshareacommonantidictionary[8],wecanregardtheO(󰀆M󰀆)timeprocessingofMasapreprocessing.ThentheO(n+r)timetextscanningwillbefastinpractice.Theproposedalgorithmthushasdesirableproperties.

2Preliminaries

Stringsx,y,andzaresaidtobeaprefix,factor,andsuffixofthestringu=xyz,respectively.Thesetsofprefixes,factors,andsuffixesofastringuaredenotedbyPrefix(u),Factor(u),andSuffix(u),respectively.Aprefix,factor,andsuffixofastringuissaidtobeproperifitisnotu.Thelengthofastringuisdenotedby|u|.Theemptystringisdenotedbyε,thatis,|ε|=0.Theithsymbolofastringuisdenotedbyu[i]for1≤i≤|u|,andthefactorofastringuthatbeginsatpositioniandendsatpositionjisdenotedbyu[i:j]for1≤i≤j≤|u|.ThereversedstringofastringuisdenotedbyuR.ThetotallengthofstringsofasetSisdenotedby󰀆S󰀆.Forstringsxandy,denotebyOcc(x,y)thesetofoccurrencesofxiny.Thatis,

󰀃󰀅󰀆

Occ(x,y)=|x|≤i≤|y|󰀃x=y[i−|x|+1:i].Thenextlemmafollowsfromtheperiodicitylemma.

Lemma1.IfOcc(x,y)hasmorethantwoelementsandthedifferenceofthemaximumandtheminimumelementsisatmost|x|,thenitformsanarithmeticprogression,inwhichthestepisthesmallestperiodofx.

PatternMatchinginTextCompressedbyUsingAntidictionaries

0039

10210100113040, 1

15106170, 1

80, 1911001110120, 1

130, 1Fig.1.AutomatonA(M)forM={0000,111,011,0101,1100}.Circlesandsquaresdenotethefinalandthenonfinalstates,respectively.Shadedcirclesdenotethepredictstates.

3TextCompressionUsingAntidictionary

InthissectionwedescribethetextcompressionschemerecentlyproposedbyCrochemoreetal.[8].3.1

Method

LetB={0,1}.SupposethatT∈B+bethetexttobecompressed.AforbiddenwordforTisastringu∈B+thatisnotafactorofT.Aforbiddenwordissaidtobeminimalifithasnoproperfactorthatisforbidden.AnantidictionaryforTisasetofminimalforbiddenwordsforT.

LetMbeanantidictionaryforT.ThenthetextTisinthesetB∗\\B∗MB∗.TheautomatonacceptingthesetB∗\\B∗MB∗canbebuiltfromMinO(󰀆M󰀆)timeinasimilarwaytotheconstructionoftheAho-Corasickpatternmatchingmachine[1].Wedenotetheautomatonby

A(M)=(Q,B,δ,ε,M),

whereQ=Prefix(M)isthesetofstates;Bisthealphabet;δisthestatetransitionfunctionfromQ×BtoQdefinedas

󰀁u,ifu∈M;

δ(u,a)=

longeststringinQ∩Suffix(ua),otherwise;εistheinitialstate;Misthesetoffinalstates.Figure1showstheautomatonA(M)forM={0000,111,011,0101,1100},whichisanantidictionaryfortextT=11010001.

40YusukeShibataetal.

Theencoderandthedecoderinthiscompressionschemeareobtaineddi-rectlyfromtheautomatonA(M).TheencoderE(M)isageneralizedsequentialmachinebasedonA(M)withoutputfunctionλ:Q×Bdefinedby

󰀁

a,ifDeg(u)=2;

λ(u,a)=

ε,otherwise,

󰀃󰀃󰀃whereDeg(u)={a∈B|δ(u,a)∈M}󰀃.ThedecoderD(M)isageneralized

sequentialmachineobtainedbyswappingtheinputlabelandtheoutputlabeloneacharcoftheencoderE(M).Figure2illustratesthemoveoftheencoderE(M)basedonA(M)ofFig.1whichtakesasinputT=11010001andemits110.Itshouldbenotedthat,anyprefixof1101000100withlengthgreaterthan6iscompressedintothesamestring110.ForadecompressionwethereforeneedthelengthofTtogetherwiththeencodedstringitself.Formally,thecompressedrepresentationofTisatriple󰀘M,b1...bn,N󰀁,whereMisanantidictionary,b1...bnisoutputfromtheencoder,andNisthelengthofT.

LetusdenotebyMF(T)thesetofallminimalforbiddenwordsforT.Inthecaseofbinaryalphabetwehave|MF(T)|≤2·|T|−2asshownin[7].Toshortentherepresentationsizeoftheabovetriple,weneedawaytobuilda‘good’antidictionaryasasubsetofMF(T).Crochemoreetal.presentedin[8]asimplemethodinwhichantidictionaryisthesetofforbiddenwordsoflengthatmostk,wherekisaparameter.Itisreportedin[8]thatthecompressionratioinpracticeiscomparabletopkzip.

input:11010001state:0→9→10→11→5→6→2→3→5output:11εεεε0ε

Fig.2.MoveofencoderE(M)forT=11010001.

3.2Decoderwithoutε-Moves

NotethatthedecoderD(M)mentionedabovehasε-moves.Forasimplepresen-tationofouralgorithm,weshalldefineageneralizedsequentialmachineG(M)

obtainedbyeliminatingtheε-movesfromthedecoderD(M).

LetuspartitionthesetQintofourdisjointsubsetsM,Q0,Q1,andQ2by

󰀃󰀅󰀆

(i=0,1,2).Qi=u∈Q\\M󰀃Deg(u)=iAstatepinQ1iscalledapredictstatebecauseoftheuniquenessofoutgoing

arcwhenignoringthearcsintostatesinM.Namely,thereexistsexactlyonesymbolasuchthatδ(p,a)∈M.WedenotesuchsymbolabyNextSymbol(p),anddenotebyNextState(p)thestateδ(p,a).

Consider,forp∈Q1,thesequencep1,p2,...ofstatesinQ1definedbyp1=pandpi+1=NextState(pi)(i=1,2,...).Therearetwocases:Oneisthecasethat

PatternMatchinginTextCompressedbyUsingAntidictionaries41

thereexistsanintegerm>0suchthat,fori=1,2,...,m−1,pi∈Q1,andpm∈Q0∪Q2.Theotheristhecaseofnosuchintegerm,namely,thesequencecontinuesinfinitely.Letuscallthesequencethepredictpathofp,anddenotebyTerminal(p)thelaststatepm.Intheinfinitecase,letTerminal(p)=⊥,where⊥isaspecialstatenotinQ.(Therefore,Terminal(p)∈Q0∪Q2∪{⊥}.)Thefinite/semi-infinitestringspelledoutbythepredictpathofp∈Q1isdenotedbySequence(p).Itiseasytoseethat:

Lemma2.Foranyp∈Q1,thereexistu,v∈B∗with|uv|<|Q1|suchthat

Sequence(p)=uvv···.

NowwearereadytodefineageneralizedsequentialmachineG(M),wherethesetofstatesisQ0∪Q2∪{⊥};thestatetransitionfunctionisδG:Q2×B→Q0∪Q2∪{⊥}definedby

󰀁

Terminal(δ(u,a)),δ(u,a)∈Q1;

δG(u,a)=

δ(u,a),otherwise;theoutputfunctionisλG:Q2×B→B+∪B∞definedby󰀁

a·Sequence(δ(u,a)),δ(u,a)∈Q1;

λG(u,a)=

a,otherwise,

whereB∞denotesthesetofsemi-infinitestringsoverB.Figure3showsthe

decoderG(M)obtainedinthiswayfromtheautomatonA(M)ofFig.1.

DecompressionalgorithmusingG(M)isshowninFig.4.Itshouldbeem-phasizedthat,ifthedecoderG(M)entersastateqandthenreadsasymbolasuchthatλG(q,a)issemi-infinite,thesymbolisthelastsymboloftheoutputfromtheencoderE(M).InthiscasethedecoderG(M)haltsafteremittinganappropriatelengthprefixofλG(q,a)accordingtothevalueofN.

4MainResult

Generally,mostoftextcompressionmethodscanberecognizedasmechanismstofactorizeatextintoseveralblocksasT=u1u2...unandtostorease-quenceof‘representations’ofblocksui.IntheLZWcompression,forexample,

0/01001/10000/010/00/01/10021/11/101009Fig.3.DecoderG(M)forM={0000,111,011,0101,1100}.

42YusukeShibataetal.

Input.Acompressedrepresentation󰀎M,b1...bn,N󰀋ofatextT=T[1:N].Output.TextT.begin

󰀇:=0;q:=ε;

fori:=1ton−1dobegin

u:=λG(q,bi);q:=δG(q,bi);󰀇:=󰀇+|u|;printuend;

u:=λG(q,bn);

printtheprefixofuwithlengthN−󰀇end.

Fig.4.DecompressionbyG(M).

therepresentationofablockuiisjustanintegerwhichindicatesthenodeofdictionarytrierepresentingthestringui.Inthecaseofthecompressionusingantidictionaries,thewayofrepresentationofblockisslightlycomplicated.

ConsiderhowtosimulatethemoveoftheKMPautomatonforapatternPrunningontheuncompressedtextT.LetδKMP:{0,1,...,m}×B→{0,1,...,m}bethestatetransitionfunctionoftheKMPautomatonforP=P[1:m].WeextendδKMPtothedomain{0,1,...,m}×B∗inthestandardmanner.WealsodefinethefunctionλKMPon{0,1,...,m}×B∗by

󰀃󰀅󰀆

λKMP(j,u)=1≤i≤|u|󰀃PisasuffixofstringP[1:j]·u[1:i].Wewanttodeviseapatternmatchingalgorithmwhichtakesasinputasequenceofrepresentationsofblocksu1,u2,...,unofTandreportsalloccurrencesofPinTinO(n+r)time,wherer=|Occ(P,T)|.ThenweneedamechanismforobtaininginO(1)timethevalueδKMP(j,u)andalinearsizerepresentationofthesetλKMP(j,u).InthecaseoftheLZWcompressionsuchmechanismcanberealizedinO(m2+n)timeusingO(m2+n)spaceasstatedin[4]and[17].Similarideacanalsobeappliedtothecaseoftextcompressionbyantidictionaries,exceptthatblockui,whichwillbeaninputtothesecondargumentsofδKMPandλKMP,isrepresentedinadifferentmanner.

InourcaseablockuiisrepresentedasapairofthecurrentstateqofG(M)andthefirstsymbolbiofui.ThereforewehavetokeepthestatetransitionsofG(M).AnoverviewofouralgorithmisshowninFig.5.ThealgorithmmakesG(M)runonb1...bntoknowinputsu1,u2,...,untotheKMPautomatonbeingsimulated.Figure6illustratesthemoveofthealgorithmsearchingthecompressedtext110forthepatternP=0001.

Wehavethefollowingtheoremswhichwillbeprovedinthenextsection.

PatternMatchinginTextCompressedbyUsingAntidictionaries

Input.

Acompressedrepresentation󰀎M,b1b2...bn,N󰀋ofatextT=T[1:N],andapatternP=P[1:m].

AllpositionsatwhichPoccursinT.

43

Output.begin

/*Preprocessing*/

ConstructtheKMPautomataandthesuffixtriesforPandPR;ConstructtheautomatonA(M)fromM;

ConstructthepredictpathgraphfromA(M);

PerformtheprocessingrequiredforδG,δKMP,andλKMP(SeeSection5.);/*Textscanning*/󰀇:=0;q:=ε;state:=0;

fori:=1ton−1dobegin

u:=λG(q,bi);q:=δG(q,bi);

foreachp∈λKMP(state,u)do

Reportapatternoccurrencethatendsatposition󰀇+p;state:=δKMP(state,u);󰀇:=󰀇+|u|end;

u:=λG(q,bn);

foreachp∈λKMP(state,u)suchthat󰀇+p≤Ndo

Reportapatternoccurrencethatendsatposition󰀇+p

end.

Fig.5.Patternmatchingalgorithm.

Theorem1.Thefunctionwhichtakesasinput(q,a)∈Q2×BandreturnsinO(1)timethevalueδG(q,a),canberealizedinO(󰀆M󰀆)timeusingO(󰀆M󰀆)space.

Theorem2.Thefunctionwhichtakesasinputatriple(j,q,a)∈{0,...,m}×Q2×BandreturnsinO(1)timethevalue

δKMP(j,u)

(u=λG(q,a)),

canberealizedinO(󰀆M󰀆+m2)timeusingO(󰀆M󰀆+m2)space.

Theorem3.Thefunctionwhichtakesasinputatriple(j,q,a)∈{0,...,m}×Q2×BandreturnsinO(1)timealinearsizerepresentationoftheset

λKMP(j,u)

(u=λG(q,a)),

canberealizedinO(󰀆M󰀆+m2)timeusingO(󰀆M󰀆+m2)space.Thenwehavethefollowingresult.

44YusukeShibataetal.

input:110stateofG(M):0−→9−→2−→2u:1101000100stateofKMPautomaton:0−→0−→2−→2output:∅∅{8}

Fig.6.MoveofpatternmatchingalgorithmwhenT=110100010andP=0001.Theorem4.Theproblemofcompressedpatternmatchingforthetextcompres-sionusingantidictionariescanbesolvedinO(󰀆M󰀆+n+m2+r)timeusing

O(󰀆M󰀆+m2)space.

5AlgorithminDetail

ThissectiongivesadetailedpresentationofthealgorithmtoproveTheorems1,2,and3.5.1

ProofofTheorem1

ForarealizationofδG,wehavetofind,foreachq∈Q0∪Q2∪{⊥},thepairs(p,b)∈Q2×Bsuchthatδ(p,b)=p󰀅∈Q1andTerminal(p󰀅)=q.Firstofall,wementionthegraphconsistingofthepredictpaths,whichplaysanimportantroleinthisproof.

ConsiderthesubgraphofA(M)inwhichthearcsarelimitedtotheoutgoingarcsfrompredictnodes.Weaddauxiliarynodesv=󰀘p,b󰀁andnewarcslabelledbfromvtoq∈Q1suchthatp∈Q2,b∈B,andδ(p,b)=qtothesubgraph.Wecalltheresultinggraphpredictpathgraph.Figure7showsthepredictpathgraphobtainedfromA(M)ofFig.1.

Thepredictpathgraphillustrates,for(p,b)∈Q2×B,thestringλG(p,b)asapathwhichstartsattheauxiliarynode󰀘p,b󰀁,passesthroughnodesinQ1,andeitherfinallyencountersanodeinQ0∪Q2,orflowsintoaloopconsistingonlyofnodesinQ1.Aconnectedcomponentofthepredictpathgraphfallsintotwoclasses:(a)atreewhichhasasrootanodeinQ0∪Q2andhasasleaves

<2,1><2,0>0<9,1>11311<1,1>11110050602Fig.7.Predictpathgraph.Rectanglesdenotetheauxiliarynodes.

PatternMatchinginTextCompressedbyUsingAntidictionaries45

(a)(b)

Fig.8.Connectedcomponentsofpredictpathgraph.

auxiliarynodes,and(b)aloopwithtrees,eachofwhichhasasrootanodeontheloopandhasleavesauxiliarynodes.SeeFig.8.

NowwearereadytoproveTheorem1.ConstructionofδGisasfollows:First,wesetδG(p,b)=δ(p,b)forevery(p,b)∈Q2×Bwithδ(p,b)∈Q0∪Q2.Next,foreverynodeq∈Q0∪Q2ofthepredictpathgraph,wetraversethetreethathasqasroot.Notethattheleavesofthetreeareauxiliarynodes󰀘p,b󰀁suchthatTerminal(δ(p,b))=q,andwecansetδG(p,b)=q.Finally,foreverynodeqonloopsofthepredictpathgraph,wetraversethetreethathasqasroot.Theleavesofthetreeareauxiliarynodes󰀘p,b󰀁suchthatTerminal(δ(p,b))=⊥,andhencewesetδG(p,b)=⊥.Thetotaltimecomplexityislinearinthenumberofnodesofthepredictpathgraph,i.e.O(󰀆M󰀆).Theproofisnowcomplete.5.2

ProofofTheorem2

Inthefollowingdiscussions,wearefrequentlyfacedwiththeneedtogetsomevalueasafunctionofu,thestringsthatarespelledoutbythepathsfromauxiliarynodes.Evenwhenthevalueforeachpathcanbecomputedintimeproportionaltothepathlength,thetotaltimecomplexityisnotO(󰀆M󰀆)sincemorethanonepathcansharecommonarcs.

Supposethatthevalueforeachpathcanbecomputedbymakinganau-tomatonrunonthepathinthereversedirection.Then,wecancomputethevaluesforsuchpathsbytraversingeverytreeinthedepth-first-orderusingastack.Sincethismethodenablesusto‘share’thecomputationforacommonsuffixoftwostrings,thetotaltimecomplexityislinearinthenumberofarcs,i.e.O(󰀆M󰀆).Thistechniqueplaysakeyroleinthefollowingproofs.

Foranintegerjwith0≤j≤mandforafactoruofP,letusdenotebyN1(j,u)thelargestintegerkwith0≤k≤jsuchthatP[j−k+1:j]·uisaprefixofP.LetN1(j,u)=nil,ifnosuchintegerexists.Then,wehave:

󰀁

N1(j,u)+|u|,ifuisafactorofPandN1(j,u)=nil;

δKMP(j,u)=

δKMP(0,u),otherwise.WeassumethatthesecondargumentuofN1isgivenasanodeofthesuffixtrieforP.Amiretal.[4]showedthefollowingfact.

Lemma3(Amiretal.1996).Thefunctionwhichtakesasinput(j,u)∈{0,...,m}×Factor(P)andreturnsthevalueN1(j,u)inO(1)time,canbere-alizedinO(m2)timeusingO(m2)space.

46YusukeShibataetal.

Wehavealsothenextlemma.

Lemma4.Thefunctionwhichtakesasinput(q,a)∈Q2×Bandreturnsu=λG(q,a)asanodeofthesuffixtrieforPwhenu∈Factor(P),canberealizedinO(󰀆M󰀆+m2)timeusingO(󰀆M󰀆+m2)space.

Proof.Weusethetechniquementionedabove.Wecanignoretheinfinitestrings.Thatis,wecanignorethetreesinwhicharootisonaloop.ConsidertheproblemofdeterminingwhetheruRisafactorofPR.ItcanbesolvedinO(min{|u|,m})timeusingthesuffixtrieforPR.IfuRisafactorofPR,thenodeuofthesuffixtrieforPcanbedetermineddirectlyfromthenodeuRofthesuffixtrieforPRassumingatrivialone-to-onemappingbetweenthetwosuffixtries,whichcanbecomputedinO(m2)time.

Lemma5.Thefunctionwhichtakesasinput(q,a)∈Q2×Bsuchthatu=λG(q,a)isfiniteandreturnsinO(1)timethevalueδKMP(0,u),canberealizedinO(󰀆M󰀆+m)timeusingO(󰀆M󰀆+m)space.

Proof.Weusethetechniquementionedaboveagain.WehavetoconsidertheproblemoffindingthelengthoflongestsuffixofuthatisalsoaprefixofP.ThisisequivalenttofindingthelengthoflongestprefixofuRthatisalsoasuffixofPR.ItissolvedinO(min{|u|,m})timeusingthesuffixtreeforPR.Wecanignorethetreesinwhicharootisonaloop.Theorem2followsfromthelemmasabove.5.3

ProofofTheorem3

AccordingtowhetherapatternoccurrencecoverstheboundarybetweenthestringsP[1:j]andu,wecanpartitionthesetλKMP(j,u)intotwodisjointsubsetsasfollows.

˜)∪X(u),λKMP(j,u)=λKMP(j,uwhere

󰀃󰀅󰀆

X(u)=|P|≤i≤|u|󰀃Pisasuffixofu[1:i],

andu˜isthelongestprefixofuthatisalsoapropersuffixofP.Let

󰀂󰀄

Y(j,󰀉)=OccP,P[1:j]·P[m−󰀉+1:m]󰀒j,

˜)=where󰀒denotestheelement-wisesubtraction.ItiseasytoseeλKMP(j,u

Y(j,|u˜|).ItfollowsfromLemma1thatthesetY(j,󰀉)hasthefollowingproperty:Lemma6.IfY(j,󰀉)hasmorethantwoelements,itformsanarithmeticpro-gression,wherethestepisthesmallestperiodofP.

PatternMatchinginTextCompressedbyUsingAntidictionaries47

Lemma7.Thefunctionwhichtakesasinput(j,󰀉)∈{0,...,m}×{0,...,m}andreturnsinO(1)timeanO(1)spacerepresentationofthesetY(j,󰀉),canberealizedinO(m2)timeusingO(m2)space.

Proof.ItfollowsfromLemma6thatY(j,󰀉)canbestoredinO(1)spaceasapairoftheminimumandthemaximumvaluesinit.ThetablestoringtheminimumvaluesofY(j,󰀉)forall(j,󰀉)canbecomputedinO(m2)timeasstatedin[4].(TableN2definedin[4]satisfiesmin(Y(j,󰀉))=m−N2(j,󰀉).)ByreversingthepatternP,thetablethemaximumvaluesisalsocomputedinO(m2)time.ThesmallestperiodofPiscomputedinO(m)time.

Lemma8.Thefunctionwhichtakesasinput(q,a)∈Q2×BandreturnsinO(1)timethevalue|u˜|withu=λG(q,a),canberealizedinO(󰀆M󰀆+m)timeusingO(󰀆M󰀆+m)space.

Proof.WeshallconsidertheproblemoffindingthelengthoflongestsuffixofuRthatisalsoaproperprefixofPR.ThiscanbesolvedbyusingtheKMPautomatonforPR.Butwehavetoconsiderthecasewhereuissemi-infinite.Inthefinitestringcase,wemaketheautomatonstartattherootoftreewithinitialstate.Butintheinfinitestringcase,wemustchangethevalueoftheinitialstate.Letvbethestringspelledoutbytheloopstartingattherootofthetreebeingconsidered.Wemustpayattentiontothecasewhereapatternsuffixisalsoaprefixofthestringv󰀁with󰀉>0.Todeterminethecorrectvalueoftheinitialstateattherootnode,wemaketheautomatongoaroundtheloopexactly󰀉timesandstopitattherootnodethatisthestartingpoint,where󰀉isthesmallestintegerwith󰀉·|v|>|P|.Thestateoftheautomatonatthatmomentisthedesiredvalue.

Lemma9.Thefunctionwhichtakesasinput(q,a)∈Q2×BandreturnsinO(1)timealinearsizerepresentationofthesetX(u)withu=λG(q,a),canberealizedinO(󰀆M󰀆+m)timeusingO(󰀆M󰀆+m)space.

Proof.ByusingtheKMPautomatonforthereversedpattern,wemarkthepredictnodesatwhichthepatternbegins.Supposethateverypredictnodehasapointertothenearestproperancestorthatismarked.SuchpointersarerealizedusingO(󰀆M󰀆)timeandspace.ThisenablesustogettheelementsofX(u)inO(|X(u)|)time.

Theorem3followsfromthelemmasabove.

6ConcludingRemarks

InthispaperwefocusedontheproblemofcompressedpatternmatchingforthetextcompressionusingantidictionariesproposedrecentlyCrochemoreetal.[8].Wepresentedanalgorithmwhichhasalineartimecomplexityproportional

48YusukeShibataetal.

tothecompressedtextlength,whenweexcludethepatternpreprocessing.Wearenowimplementingthealgorithmtoevaluateitsperformancefrompracticalviewpoints.In[16]weshowedthattheShift-AndapproachiseffectiveinthecompressedpatternmatchingfortheLZWcompression.WethinkthattheShift-AndapproachwillbesubstitutedfortheKMPautomatonapproachpresentedinthispaperandshowagoodperformaceinpracticewhenthepatternlengthmisnotsolarge,saym≤32.

Foralongpatternwecanalsoconsiderthefollowingmethod.Letkbethelengthofthelongestforbiddenwordintheantidictionary.Byusingthesyncronizingproperty[8],weobtain:

Lemma10.If|P|≥k−1,thenδ(u,P)=δ(ε,P)foranystateuinQsuchthatδ(u,P)∈M.

Letp=δ(ε,P).Sincep∈MimpliesthatPcannotoccurinT,wecanassumep∈M.IfpisinQ1,thenletq=Terminal(p).Otherwise,letq=p.WecanmonitorwhetherthestateofA(M)isinstatepbyusingthefunctionδGtocheckG(M)isinstateq.Ifso,weshallconfirmit.Ourpreliminaryexperimentssuggestthatthissearchmethodisefficientinpractice.

References

1.A.V.AhoandM.Corasick.Efficientstringmatching:Anaidtobibliographicsearch.Comm.ACM,18(6):333–340,1975.

2.A.AmirandG.Benson.Efficienttwo-dimensionalcompressedmatching.InProc.DataCompressionConference’92,page279,1992.

3.A.AmirandG.Benson.Two-dimensionalperiodicityanditsapplication.InProc.3rdAnn.ACM-SIAMSymp.onDiscreteAlgorithms,pages440–452,1992.

4.A.Amir,G.Benson,andM.Farach.Letsleepingfileslie:PatternmatchinginZ-compressedfiles.JournalofComputerandSystemSciences,52:299–307,1996.5.A.Amir,G.Benson,andM.Farach.Optimaltwo-dimensionalcompressedmatch-ing.JournalofAlgorithms,24(2):354–379,1997.

6.A.Amir,G.M.Landau,andU.Vishkin.Efficientpatternmatchingwithscaling.JournalofAlgorithms,13(1):2–32,1992.

7.M.Crochemore,F.Mignosi,andA.Restivo.Minimalforbiddenwordsandfactorautomata.InL.Brim,J.Gruska,andJ.Zlatuska,editors,Proc.23rdInternationialSymp.onMathematicalFoundationsofComputerScience,volume1450ofLectureNotesinComputerScience,pages665–673.Springer-Verlag,1998.

8.M.Crochemore,F.Mignosi,A.Restivo,andS.Salemi.Textcompressionusingantidictionaries.TechnicalReportIGM-98-10,InstitutGaspard-Monge,1998.9.E.S.deMoura,G.Navarro,N.Ziviani,andR.Baeza-Yates.Directpatternmatch-ingoncompressedtext.InProc.5thInternationalSymp.onStringProcessingandInformationRetrieval,pages90–95.IEEEComputerSociety,1998.

10.E.S.deMoura,G.Navarro,N.Ziviani,andR.Baeza-Yates.Fastsequencial

searchingoncompressedtextsallowingerrors.InProc.21stAnn.InternationalACMSIGIRConferenceonResearchandDevelopmentinInformationRetrieval,pages298–306.YorkPress,1998.

11.T.Eilam-TzoreffandU.Vishkin.Matchingpatternsinstringssubjecttomulti-lineartransformations.TheoreticalComputerScience,60(3):231–254,1988.

PatternMatchinginTextCompressedbyUsingAntidictionaries49

12.M.FarachandM.Thorup.String-matchinginLempel-Zivcompressedstrings.In

Proc.27thAnn.ACMSymp.onTheoryofComputing,pages703–713,1995.

13.S.Fukamachi,T.Shinohara,andM.Takeda.Stringpatternmatchingforcom-presseddatausingvariablelengthcodes.Submitted,1998.14.L.G¸asieniec,M.Karpinski,W.Plandowski,andW.Rytter.Efficientalgorithms

forLempel-Zivencoding.InProc.4thScandinavianWorkshoponAlgorithmThe-ory,volume1097ofLectureNotesinComputerScience,pages392–403.Springer-Verlag,1996.

15.M.Karpinski,W.Rytter,andA.Shinohara.Anefficientpattern-matchingalgo-rithmforstringswithshortdescriptions.NordicJournalofComputing,4:172–186,1997.

16.T.Kida,M.Takeda,A.Shinohara,andS.Arikawa.Shift-Andapproachtopattern

matchinginLZWcompressedtext.InProc.10thAnn.Symp.onCombinatorialPatternMatching,LectureNotesinComputerScience.Springer-Verlag,1999.toappear.

17.T.Kida,M.Takeda,A.Shinohara,M.Miyazaki,andS.Arikawa.Multiplepattern

matchinginLZWcompressedtext.InProc.DataCompressionConference’98,pages103–112.IEEEComputerSociety,1998.

18.U.Manber.Atextcompressionschemethatallowsfastsearchingdirectlyin

thecompressedfile.InProc.5thAnn.Symp.onCombinatorialPatternMatching,volume807ofLectureNotesinComputerScience,pages113–124.Springer-Verlag,1994.

19.M.Miyazaki,S.Fukamachi,M.Takeda,andT.Shinohara.Speedingupthepattern

matchingmachineforcompressedtexts.TransactionsofInformationProcessingSocietyofJapan,39(9):2638–2648,1998.(inJapanese).

20.M.Miyazaki,A.Shinohara,andM.Takeda.Animprovedpatternmatchingal-gorithmforstringsintermsofstraight-lineprograms.InProc.8thAnn.Symp.onCombinatorialPatternMatching,volume1264ofLectureNotesinComputerScience,pages1–11.Springer-Verlag,1997.

21.Y.Shibata,T.Kida,S.Fukamachi,M.Takeda,A.Shinohara,T.Shinohara,and

S.Arikawa.Bytepairencoding:atextcompressionschemethatacceleratespatternmatching.TechnicalReportDOI-TR-161,DepartmentofInformatics,KyushuUniversity,April1999.

22.M.Takeda.Patternmatchingmachinefortextcompressedusingfinitestatemodel.

TechnicalReportDOI-TR-142,DepartmentofInformatics,KyushuUniversity,October1997.

因篇幅问题不能全部显示,请点此查看更多更全内容

Top