byUsingAntidictionaries
YusukeShibata,MasayukiTakeda,AyumiShinohara,andSetsuoArikawa
DepartmentofInformatics,KyushuUniversity33
Fukuoka812-8581,Japan
{yusuke,takeda,ayumi,arikawa}@i.kyushu-u.ac.jp
http://www.i.kyushu-u.ac.jp
Abstract.Inthispaperwefocusontheproblemofcompressedpatternmatchingforthetextcompressionusingantidictionaries,whichisanewcompressionschemeproposedrecentlybyCrochemoreetal.(1998).WeshowanalgorithmwhichpreprocessesapatternoflengthmandanantidictionaryMinO(m2+M)time,andthenscansacompressedtextoflengthninO(n+r)timetofindallpatternoccurrences,whereMisthetotallengthofstringsinMandristhenumberofthepatternoccurrences.
1Introduction
Compressedpatternmatchingisoneofthemostinterestingtopicsinthecom-binatorialpatternmatching,andmanystudieshavebeenundertakenonthisproblemforseveralcompressionmethodsfromboththeoreticalandpracticalviewpoints.SeeTable1.Oneimportantgoalofcompressedpatternmatchingistoachievealineartimecomplexitythatisproportionalnottotheoriginaltextlengthbuttothecompressedtextlength.
Recently,Crochemoreetal.proposedanewcompressionscheme:textcom-pressionusingantidictionary[8].Contrarytothecompressionmethodsthatmakeuseofdictionaries,whichareparticularsetsofstringsoccurringintexts,thenewschemeexploitsanantidictionarythatisafinitesetofstringsthatdonotoccurasfactorsintext,i.e.thatareforbidden.Leta1...an∈{0,1}+bethetexttobecompressed.Supposewehavereadaprefixa1...ajatacertainmoment.Ifthestringai...ajb(i≤j,b∈{0,1})isaforbiddenword,namely,isintheantidictionary,thenthenextsymbolaj+1cannotbeb.Inotherwords,thenextsymbolaj+1ispredictable.Basedonthisidea,thecompressionmethodremovessuchpredictablesymbolsfromthetext.Thecompressionandthede-compressionareperformedbyusingtheautomatonacceptingthesetofstringsinwhichnoforbiddenwordsoccurasfactors.
Inthispaperwefocusontheproblemofcompressedpatternmatchingforthetextcompressionusingantidictionaries.WepresentanalgorithmthatsolvestheprobleminO(m2+M+n+r)timeusingO(m2+M)space,wheremandnarethepatternlengthandthecompressedtextlength,respectively,M
M.Crochemore,M.Paterson(Eds.):CPM’99,LNCS1645,pp.37–49,1999.cSpringer-VerlagBerlinHeidelberg1999
38YusukeShibataetal.
Table1.Compressedpatternmatching.
compressionmethodcompressedpatternmatchingalgorithmsrun-lengthEilam-TzoreffandVishkin[11]
run-length(twodim.)Amir,Landau,andVishkin[6];AmirandBenson
[2,3];Amir,Benson,andFarach[5]
LZ77FarachandThorup[12];G¸asieniec,Karpinski,
Plandowski,andRytter[14]
LZWAmir,Benson,andFarach[4];Kida,Takeda,Shi-nohara,Miyazaki,andArikawa[17];Kida,Takeda,
Shinohara,andArikawa[16]
straight-lineprogramKarpinski,Rytter,andShinohara[15];Miyazaki,
Shinohara,andTakeda[20]
HuffmanFukamachi,Shinohara,andTakeda[13];Miyazaki,
Fukamachi,Takeda,andShinohara[19]
finitestateencodingTakeda[22]
wordbasedencodingMoura,Navarro,Ziviani,andBaeza-Yates[9,10]patternsubstitutionManber[18];Shibata,Kida,Fukamachi,Takeda,
A.Shinohara,T.Shinohara,Arikawa[21]
denotesthetotallengthofstringsinantidictionaryM,andristhenumberofpatternoccurrences.SinceMisapartofthecompressedrepresentationoftext,thetextscanningtimeisO(M+n+r),whichislinearinthecompressedtextlengthM+n,whenignoringr.Moreover,inthecasewhereasetoftextfilesshareacommonantidictionary[8],wecanregardtheO(M)timeprocessingofMasapreprocessing.ThentheO(n+r)timetextscanningwillbefastinpractice.Theproposedalgorithmthushasdesirableproperties.
2Preliminaries
Stringsx,y,andzaresaidtobeaprefix,factor,andsuffixofthestringu=xyz,respectively.Thesetsofprefixes,factors,andsuffixesofastringuaredenotedbyPrefix(u),Factor(u),andSuffix(u),respectively.Aprefix,factor,andsuffixofastringuissaidtobeproperifitisnotu.Thelengthofastringuisdenotedby|u|.Theemptystringisdenotedbyε,thatis,|ε|=0.Theithsymbolofastringuisdenotedbyu[i]for1≤i≤|u|,andthefactorofastringuthatbeginsatpositioniandendsatpositionjisdenotedbyu[i:j]for1≤i≤j≤|u|.ThereversedstringofastringuisdenotedbyuR.ThetotallengthofstringsofasetSisdenotedbyS.Forstringsxandy,denotebyOcc(x,y)thesetofoccurrencesofxiny.Thatis,
Occ(x,y)=|x|≤i≤|y|x=y[i−|x|+1:i].Thenextlemmafollowsfromtheperiodicitylemma.
Lemma1.IfOcc(x,y)hasmorethantwoelementsandthedifferenceofthemaximumandtheminimumelementsisatmost|x|,thenitformsanarithmeticprogression,inwhichthestepisthesmallestperiodofx.
PatternMatchinginTextCompressedbyUsingAntidictionaries
0039
10210100113040, 1
15106170, 1
80, 1911001110120, 1
130, 1Fig.1.AutomatonA(M)forM={0000,111,011,0101,1100}.Circlesandsquaresdenotethefinalandthenonfinalstates,respectively.Shadedcirclesdenotethepredictstates.
3TextCompressionUsingAntidictionary
InthissectionwedescribethetextcompressionschemerecentlyproposedbyCrochemoreetal.[8].3.1
Method
LetB={0,1}.SupposethatT∈B+bethetexttobecompressed.AforbiddenwordforTisastringu∈B+thatisnotafactorofT.Aforbiddenwordissaidtobeminimalifithasnoproperfactorthatisforbidden.AnantidictionaryforTisasetofminimalforbiddenwordsforT.
LetMbeanantidictionaryforT.ThenthetextTisinthesetB∗\\B∗MB∗.TheautomatonacceptingthesetB∗\\B∗MB∗canbebuiltfromMinO(M)timeinasimilarwaytotheconstructionoftheAho-Corasickpatternmatchingmachine[1].Wedenotetheautomatonby
A(M)=(Q,B,δ,ε,M),
whereQ=Prefix(M)isthesetofstates;Bisthealphabet;δisthestatetransitionfunctionfromQ×BtoQdefinedas
u,ifu∈M;
δ(u,a)=
longeststringinQ∩Suffix(ua),otherwise;εistheinitialstate;Misthesetoffinalstates.Figure1showstheautomatonA(M)forM={0000,111,011,0101,1100},whichisanantidictionaryfortextT=11010001.
40YusukeShibataetal.
Theencoderandthedecoderinthiscompressionschemeareobtaineddi-rectlyfromtheautomatonA(M).TheencoderE(M)isageneralizedsequentialmachinebasedonA(M)withoutputfunctionλ:Q×Bdefinedby
a,ifDeg(u)=2;
λ(u,a)=
ε,otherwise,
whereDeg(u)={a∈B|δ(u,a)∈M}.ThedecoderD(M)isageneralized
sequentialmachineobtainedbyswappingtheinputlabelandtheoutputlabeloneacharcoftheencoderE(M).Figure2illustratesthemoveoftheencoderE(M)basedonA(M)ofFig.1whichtakesasinputT=11010001andemits110.Itshouldbenotedthat,anyprefixof1101000100withlengthgreaterthan6iscompressedintothesamestring110.ForadecompressionwethereforeneedthelengthofTtogetherwiththeencodedstringitself.Formally,thecompressedrepresentationofTisatripleM,b1...bn,N,whereMisanantidictionary,b1...bnisoutputfromtheencoder,andNisthelengthofT.
LetusdenotebyMF(T)thesetofallminimalforbiddenwordsforT.Inthecaseofbinaryalphabetwehave|MF(T)|≤2·|T|−2asshownin[7].Toshortentherepresentationsizeoftheabovetriple,weneedawaytobuilda‘good’antidictionaryasasubsetofMF(T).Crochemoreetal.presentedin[8]asimplemethodinwhichantidictionaryisthesetofforbiddenwordsoflengthatmostk,wherekisaparameter.Itisreportedin[8]thatthecompressionratioinpracticeiscomparabletopkzip.
input:11010001state:0→9→10→11→5→6→2→3→5output:11εεεε0ε
Fig.2.MoveofencoderE(M)forT=11010001.
3.2Decoderwithoutε-Moves
NotethatthedecoderD(M)mentionedabovehasε-moves.Forasimplepresen-tationofouralgorithm,weshalldefineageneralizedsequentialmachineG(M)
obtainedbyeliminatingtheε-movesfromthedecoderD(M).
LetuspartitionthesetQintofourdisjointsubsetsM,Q0,Q1,andQ2by
(i=0,1,2).Qi=u∈Q\\MDeg(u)=iAstatepinQ1iscalledapredictstatebecauseoftheuniquenessofoutgoing
arcwhenignoringthearcsintostatesinM.Namely,thereexistsexactlyonesymbolasuchthatδ(p,a)∈M.WedenotesuchsymbolabyNextSymbol(p),anddenotebyNextState(p)thestateδ(p,a).
Consider,forp∈Q1,thesequencep1,p2,...ofstatesinQ1definedbyp1=pandpi+1=NextState(pi)(i=1,2,...).Therearetwocases:Oneisthecasethat
PatternMatchinginTextCompressedbyUsingAntidictionaries41
thereexistsanintegerm>0suchthat,fori=1,2,...,m−1,pi∈Q1,andpm∈Q0∪Q2.Theotheristhecaseofnosuchintegerm,namely,thesequencecontinuesinfinitely.Letuscallthesequencethepredictpathofp,anddenotebyTerminal(p)thelaststatepm.Intheinfinitecase,letTerminal(p)=⊥,where⊥isaspecialstatenotinQ.(Therefore,Terminal(p)∈Q0∪Q2∪{⊥}.)Thefinite/semi-infinitestringspelledoutbythepredictpathofp∈Q1isdenotedbySequence(p).Itiseasytoseethat:
Lemma2.Foranyp∈Q1,thereexistu,v∈B∗with|uv|<|Q1|suchthat
Sequence(p)=uvv···.
NowwearereadytodefineageneralizedsequentialmachineG(M),wherethesetofstatesisQ0∪Q2∪{⊥};thestatetransitionfunctionisδG:Q2×B→Q0∪Q2∪{⊥}definedby
Terminal(δ(u,a)),δ(u,a)∈Q1;
δG(u,a)=
δ(u,a),otherwise;theoutputfunctionisλG:Q2×B→B+∪B∞definedby
a·Sequence(δ(u,a)),δ(u,a)∈Q1;
λG(u,a)=
a,otherwise,
whereB∞denotesthesetofsemi-infinitestringsoverB.Figure3showsthe
decoderG(M)obtainedinthiswayfromtheautomatonA(M)ofFig.1.
DecompressionalgorithmusingG(M)isshowninFig.4.Itshouldbeem-phasizedthat,ifthedecoderG(M)entersastateqandthenreadsasymbolasuchthatλG(q,a)issemi-infinite,thesymbolisthelastsymboloftheoutputfromtheencoderE(M).InthiscasethedecoderG(M)haltsafteremittinganappropriatelengthprefixofλG(q,a)accordingtothevalueofN.
4MainResult
Generally,mostoftextcompressionmethodscanberecognizedasmechanismstofactorizeatextintoseveralblocksasT=u1u2...unandtostorease-quenceof‘representations’ofblocksui.IntheLZWcompression,forexample,
0/01001/10000/010/00/01/10021/11/101009Fig.3.DecoderG(M)forM={0000,111,011,0101,1100}.
42YusukeShibataetal.
Input.AcompressedrepresentationM,b1...bn,NofatextT=T[1:N].Output.TextT.begin
:=0;q:=ε;
fori:=1ton−1dobegin
u:=λG(q,bi);q:=δG(q,bi);:=+|u|;printuend;
u:=λG(q,bn);
printtheprefixofuwithlengthN−end.
Fig.4.DecompressionbyG(M).
therepresentationofablockuiisjustanintegerwhichindicatesthenodeofdictionarytrierepresentingthestringui.Inthecaseofthecompressionusingantidictionaries,thewayofrepresentationofblockisslightlycomplicated.
ConsiderhowtosimulatethemoveoftheKMPautomatonforapatternPrunningontheuncompressedtextT.LetδKMP:{0,1,...,m}×B→{0,1,...,m}bethestatetransitionfunctionoftheKMPautomatonforP=P[1:m].WeextendδKMPtothedomain{0,1,...,m}×B∗inthestandardmanner.WealsodefinethefunctionλKMPon{0,1,...,m}×B∗by
λKMP(j,u)=1≤i≤|u|PisasuffixofstringP[1:j]·u[1:i].Wewanttodeviseapatternmatchingalgorithmwhichtakesasinputasequenceofrepresentationsofblocksu1,u2,...,unofTandreportsalloccurrencesofPinTinO(n+r)time,wherer=|Occ(P,T)|.ThenweneedamechanismforobtaininginO(1)timethevalueδKMP(j,u)andalinearsizerepresentationofthesetλKMP(j,u).InthecaseoftheLZWcompressionsuchmechanismcanberealizedinO(m2+n)timeusingO(m2+n)spaceasstatedin[4]and[17].Similarideacanalsobeappliedtothecaseoftextcompressionbyantidictionaries,exceptthatblockui,whichwillbeaninputtothesecondargumentsofδKMPandλKMP,isrepresentedinadifferentmanner.
InourcaseablockuiisrepresentedasapairofthecurrentstateqofG(M)andthefirstsymbolbiofui.ThereforewehavetokeepthestatetransitionsofG(M).AnoverviewofouralgorithmisshowninFig.5.ThealgorithmmakesG(M)runonb1...bntoknowinputsu1,u2,...,untotheKMPautomatonbeingsimulated.Figure6illustratesthemoveofthealgorithmsearchingthecompressedtext110forthepatternP=0001.
Wehavethefollowingtheoremswhichwillbeprovedinthenextsection.
PatternMatchinginTextCompressedbyUsingAntidictionaries
Input.
AcompressedrepresentationM,b1b2...bn,NofatextT=T[1:N],andapatternP=P[1:m].
AllpositionsatwhichPoccursinT.
43
Output.begin
/*Preprocessing*/
ConstructtheKMPautomataandthesuffixtriesforPandPR;ConstructtheautomatonA(M)fromM;
ConstructthepredictpathgraphfromA(M);
PerformtheprocessingrequiredforδG,δKMP,andλKMP(SeeSection5.);/*Textscanning*/:=0;q:=ε;state:=0;
fori:=1ton−1dobegin
u:=λG(q,bi);q:=δG(q,bi);
foreachp∈λKMP(state,u)do
Reportapatternoccurrencethatendsatposition+p;state:=δKMP(state,u);:=+|u|end;
u:=λG(q,bn);
foreachp∈λKMP(state,u)suchthat+p≤Ndo
Reportapatternoccurrencethatendsatposition+p
end.
Fig.5.Patternmatchingalgorithm.
Theorem1.Thefunctionwhichtakesasinput(q,a)∈Q2×BandreturnsinO(1)timethevalueδG(q,a),canberealizedinO(M)timeusingO(M)space.
Theorem2.Thefunctionwhichtakesasinputatriple(j,q,a)∈{0,...,m}×Q2×BandreturnsinO(1)timethevalue
δKMP(j,u)
(u=λG(q,a)),
canberealizedinO(M+m2)timeusingO(M+m2)space.
Theorem3.Thefunctionwhichtakesasinputatriple(j,q,a)∈{0,...,m}×Q2×BandreturnsinO(1)timealinearsizerepresentationoftheset
λKMP(j,u)
(u=λG(q,a)),
canberealizedinO(M+m2)timeusingO(M+m2)space.Thenwehavethefollowingresult.
44YusukeShibataetal.
input:110stateofG(M):0−→9−→2−→2u:1101000100stateofKMPautomaton:0−→0−→2−→2output:∅∅{8}
Fig.6.MoveofpatternmatchingalgorithmwhenT=110100010andP=0001.Theorem4.Theproblemofcompressedpatternmatchingforthetextcompres-sionusingantidictionariescanbesolvedinO(M+n+m2+r)timeusing
O(M+m2)space.
5AlgorithminDetail
ThissectiongivesadetailedpresentationofthealgorithmtoproveTheorems1,2,and3.5.1
ProofofTheorem1
ForarealizationofδG,wehavetofind,foreachq∈Q0∪Q2∪{⊥},thepairs(p,b)∈Q2×Bsuchthatδ(p,b)=p∈Q1andTerminal(p)=q.Firstofall,wementionthegraphconsistingofthepredictpaths,whichplaysanimportantroleinthisproof.
ConsiderthesubgraphofA(M)inwhichthearcsarelimitedtotheoutgoingarcsfrompredictnodes.Weaddauxiliarynodesv=p,bandnewarcslabelledbfromvtoq∈Q1suchthatp∈Q2,b∈B,andδ(p,b)=qtothesubgraph.Wecalltheresultinggraphpredictpathgraph.Figure7showsthepredictpathgraphobtainedfromA(M)ofFig.1.
Thepredictpathgraphillustrates,for(p,b)∈Q2×B,thestringλG(p,b)asapathwhichstartsattheauxiliarynodep,b,passesthroughnodesinQ1,andeitherfinallyencountersanodeinQ0∪Q2,orflowsintoaloopconsistingonlyofnodesinQ1.Aconnectedcomponentofthepredictpathgraphfallsintotwoclasses:(a)atreewhichhasasrootanodeinQ0∪Q2andhasasleaves
<2,1><2,0>0<9,1>11311<1,1>11110050602Fig.7.Predictpathgraph.Rectanglesdenotetheauxiliarynodes.
PatternMatchinginTextCompressedbyUsingAntidictionaries45
(a)(b)
Fig.8.Connectedcomponentsofpredictpathgraph.
auxiliarynodes,and(b)aloopwithtrees,eachofwhichhasasrootanodeontheloopandhasleavesauxiliarynodes.SeeFig.8.
NowwearereadytoproveTheorem1.ConstructionofδGisasfollows:First,wesetδG(p,b)=δ(p,b)forevery(p,b)∈Q2×Bwithδ(p,b)∈Q0∪Q2.Next,foreverynodeq∈Q0∪Q2ofthepredictpathgraph,wetraversethetreethathasqasroot.Notethattheleavesofthetreeareauxiliarynodesp,bsuchthatTerminal(δ(p,b))=q,andwecansetδG(p,b)=q.Finally,foreverynodeqonloopsofthepredictpathgraph,wetraversethetreethathasqasroot.Theleavesofthetreeareauxiliarynodesp,bsuchthatTerminal(δ(p,b))=⊥,andhencewesetδG(p,b)=⊥.Thetotaltimecomplexityislinearinthenumberofnodesofthepredictpathgraph,i.e.O(M).Theproofisnowcomplete.5.2
ProofofTheorem2
Inthefollowingdiscussions,wearefrequentlyfacedwiththeneedtogetsomevalueasafunctionofu,thestringsthatarespelledoutbythepathsfromauxiliarynodes.Evenwhenthevalueforeachpathcanbecomputedintimeproportionaltothepathlength,thetotaltimecomplexityisnotO(M)sincemorethanonepathcansharecommonarcs.
Supposethatthevalueforeachpathcanbecomputedbymakinganau-tomatonrunonthepathinthereversedirection.Then,wecancomputethevaluesforsuchpathsbytraversingeverytreeinthedepth-first-orderusingastack.Sincethismethodenablesusto‘share’thecomputationforacommonsuffixoftwostrings,thetotaltimecomplexityislinearinthenumberofarcs,i.e.O(M).Thistechniqueplaysakeyroleinthefollowingproofs.
Foranintegerjwith0≤j≤mandforafactoruofP,letusdenotebyN1(j,u)thelargestintegerkwith0≤k≤jsuchthatP[j−k+1:j]·uisaprefixofP.LetN1(j,u)=nil,ifnosuchintegerexists.Then,wehave:
N1(j,u)+|u|,ifuisafactorofPandN1(j,u)=nil;
δKMP(j,u)=
δKMP(0,u),otherwise.WeassumethatthesecondargumentuofN1isgivenasanodeofthesuffixtrieforP.Amiretal.[4]showedthefollowingfact.
Lemma3(Amiretal.1996).Thefunctionwhichtakesasinput(j,u)∈{0,...,m}×Factor(P)andreturnsthevalueN1(j,u)inO(1)time,canbere-alizedinO(m2)timeusingO(m2)space.
46YusukeShibataetal.
Wehavealsothenextlemma.
Lemma4.Thefunctionwhichtakesasinput(q,a)∈Q2×Bandreturnsu=λG(q,a)asanodeofthesuffixtrieforPwhenu∈Factor(P),canberealizedinO(M+m2)timeusingO(M+m2)space.
Proof.Weusethetechniquementionedabove.Wecanignoretheinfinitestrings.Thatis,wecanignorethetreesinwhicharootisonaloop.ConsidertheproblemofdeterminingwhetheruRisafactorofPR.ItcanbesolvedinO(min{|u|,m})timeusingthesuffixtrieforPR.IfuRisafactorofPR,thenodeuofthesuffixtrieforPcanbedetermineddirectlyfromthenodeuRofthesuffixtrieforPRassumingatrivialone-to-onemappingbetweenthetwosuffixtries,whichcanbecomputedinO(m2)time.
Lemma5.Thefunctionwhichtakesasinput(q,a)∈Q2×Bsuchthatu=λG(q,a)isfiniteandreturnsinO(1)timethevalueδKMP(0,u),canberealizedinO(M+m)timeusingO(M+m)space.
Proof.Weusethetechniquementionedaboveagain.WehavetoconsidertheproblemoffindingthelengthoflongestsuffixofuthatisalsoaprefixofP.ThisisequivalenttofindingthelengthoflongestprefixofuRthatisalsoasuffixofPR.ItissolvedinO(min{|u|,m})timeusingthesuffixtreeforPR.Wecanignorethetreesinwhicharootisonaloop.Theorem2followsfromthelemmasabove.5.3
ProofofTheorem3
AccordingtowhetherapatternoccurrencecoverstheboundarybetweenthestringsP[1:j]andu,wecanpartitionthesetλKMP(j,u)intotwodisjointsubsetsasfollows.
˜)∪X(u),λKMP(j,u)=λKMP(j,uwhere
X(u)=|P|≤i≤|u|Pisasuffixofu[1:i],
andu˜isthelongestprefixofuthatisalsoapropersuffixofP.Let
Y(j,)=OccP,P[1:j]·P[m−+1:m]j,
˜)=wheredenotestheelement-wisesubtraction.ItiseasytoseeλKMP(j,u
Y(j,|u˜|).ItfollowsfromLemma1thatthesetY(j,)hasthefollowingproperty:Lemma6.IfY(j,)hasmorethantwoelements,itformsanarithmeticpro-gression,wherethestepisthesmallestperiodofP.
PatternMatchinginTextCompressedbyUsingAntidictionaries47
Lemma7.Thefunctionwhichtakesasinput(j,)∈{0,...,m}×{0,...,m}andreturnsinO(1)timeanO(1)spacerepresentationofthesetY(j,),canberealizedinO(m2)timeusingO(m2)space.
Proof.ItfollowsfromLemma6thatY(j,)canbestoredinO(1)spaceasapairoftheminimumandthemaximumvaluesinit.ThetablestoringtheminimumvaluesofY(j,)forall(j,)canbecomputedinO(m2)timeasstatedin[4].(TableN2definedin[4]satisfiesmin(Y(j,))=m−N2(j,).)ByreversingthepatternP,thetablethemaximumvaluesisalsocomputedinO(m2)time.ThesmallestperiodofPiscomputedinO(m)time.
Lemma8.Thefunctionwhichtakesasinput(q,a)∈Q2×BandreturnsinO(1)timethevalue|u˜|withu=λG(q,a),canberealizedinO(M+m)timeusingO(M+m)space.
Proof.WeshallconsidertheproblemoffindingthelengthoflongestsuffixofuRthatisalsoaproperprefixofPR.ThiscanbesolvedbyusingtheKMPautomatonforPR.Butwehavetoconsiderthecasewhereuissemi-infinite.Inthefinitestringcase,wemaketheautomatonstartattherootoftreewithinitialstate.Butintheinfinitestringcase,wemustchangethevalueoftheinitialstate.Letvbethestringspelledoutbytheloopstartingattherootofthetreebeingconsidered.Wemustpayattentiontothecasewhereapatternsuffixisalsoaprefixofthestringvwith>0.Todeterminethecorrectvalueoftheinitialstateattherootnode,wemaketheautomatongoaroundtheloopexactlytimesandstopitattherootnodethatisthestartingpoint,whereisthesmallestintegerwith·|v|>|P|.Thestateoftheautomatonatthatmomentisthedesiredvalue.
Lemma9.Thefunctionwhichtakesasinput(q,a)∈Q2×BandreturnsinO(1)timealinearsizerepresentationofthesetX(u)withu=λG(q,a),canberealizedinO(M+m)timeusingO(M+m)space.
Proof.ByusingtheKMPautomatonforthereversedpattern,wemarkthepredictnodesatwhichthepatternbegins.Supposethateverypredictnodehasapointertothenearestproperancestorthatismarked.SuchpointersarerealizedusingO(M)timeandspace.ThisenablesustogettheelementsofX(u)inO(|X(u)|)time.
Theorem3followsfromthelemmasabove.
6ConcludingRemarks
InthispaperwefocusedontheproblemofcompressedpatternmatchingforthetextcompressionusingantidictionariesproposedrecentlyCrochemoreetal.[8].Wepresentedanalgorithmwhichhasalineartimecomplexityproportional
48YusukeShibataetal.
tothecompressedtextlength,whenweexcludethepatternpreprocessing.Wearenowimplementingthealgorithmtoevaluateitsperformancefrompracticalviewpoints.In[16]weshowedthattheShift-AndapproachiseffectiveinthecompressedpatternmatchingfortheLZWcompression.WethinkthattheShift-AndapproachwillbesubstitutedfortheKMPautomatonapproachpresentedinthispaperandshowagoodperformaceinpracticewhenthepatternlengthmisnotsolarge,saym≤32.
Foralongpatternwecanalsoconsiderthefollowingmethod.Letkbethelengthofthelongestforbiddenwordintheantidictionary.Byusingthesyncronizingproperty[8],weobtain:
Lemma10.If|P|≥k−1,thenδ(u,P)=δ(ε,P)foranystateuinQsuchthatδ(u,P)∈M.
Letp=δ(ε,P).Sincep∈MimpliesthatPcannotoccurinT,wecanassumep∈M.IfpisinQ1,thenletq=Terminal(p).Otherwise,letq=p.WecanmonitorwhetherthestateofA(M)isinstatepbyusingthefunctionδGtocheckG(M)isinstateq.Ifso,weshallconfirmit.Ourpreliminaryexperimentssuggestthatthissearchmethodisefficientinpractice.
References
1.A.V.AhoandM.Corasick.Efficientstringmatching:Anaidtobibliographicsearch.Comm.ACM,18(6):333–340,1975.
2.A.AmirandG.Benson.Efficienttwo-dimensionalcompressedmatching.InProc.DataCompressionConference’92,page279,1992.
3.A.AmirandG.Benson.Two-dimensionalperiodicityanditsapplication.InProc.3rdAnn.ACM-SIAMSymp.onDiscreteAlgorithms,pages440–452,1992.
4.A.Amir,G.Benson,andM.Farach.Letsleepingfileslie:PatternmatchinginZ-compressedfiles.JournalofComputerandSystemSciences,52:299–307,1996.5.A.Amir,G.Benson,andM.Farach.Optimaltwo-dimensionalcompressedmatch-ing.JournalofAlgorithms,24(2):354–379,1997.
6.A.Amir,G.M.Landau,andU.Vishkin.Efficientpatternmatchingwithscaling.JournalofAlgorithms,13(1):2–32,1992.
7.M.Crochemore,F.Mignosi,andA.Restivo.Minimalforbiddenwordsandfactorautomata.InL.Brim,J.Gruska,andJ.Zlatuska,editors,Proc.23rdInternationialSymp.onMathematicalFoundationsofComputerScience,volume1450ofLectureNotesinComputerScience,pages665–673.Springer-Verlag,1998.
8.M.Crochemore,F.Mignosi,A.Restivo,andS.Salemi.Textcompressionusingantidictionaries.TechnicalReportIGM-98-10,InstitutGaspard-Monge,1998.9.E.S.deMoura,G.Navarro,N.Ziviani,andR.Baeza-Yates.Directpatternmatch-ingoncompressedtext.InProc.5thInternationalSymp.onStringProcessingandInformationRetrieval,pages90–95.IEEEComputerSociety,1998.
10.E.S.deMoura,G.Navarro,N.Ziviani,andR.Baeza-Yates.Fastsequencial
searchingoncompressedtextsallowingerrors.InProc.21stAnn.InternationalACMSIGIRConferenceonResearchandDevelopmentinInformationRetrieval,pages298–306.YorkPress,1998.
11.T.Eilam-TzoreffandU.Vishkin.Matchingpatternsinstringssubjecttomulti-lineartransformations.TheoreticalComputerScience,60(3):231–254,1988.
PatternMatchinginTextCompressedbyUsingAntidictionaries49
12.M.FarachandM.Thorup.String-matchinginLempel-Zivcompressedstrings.In
Proc.27thAnn.ACMSymp.onTheoryofComputing,pages703–713,1995.
13.S.Fukamachi,T.Shinohara,andM.Takeda.Stringpatternmatchingforcom-presseddatausingvariablelengthcodes.Submitted,1998.14.L.G¸asieniec,M.Karpinski,W.Plandowski,andW.Rytter.Efficientalgorithms
forLempel-Zivencoding.InProc.4thScandinavianWorkshoponAlgorithmThe-ory,volume1097ofLectureNotesinComputerScience,pages392–403.Springer-Verlag,1996.
15.M.Karpinski,W.Rytter,andA.Shinohara.Anefficientpattern-matchingalgo-rithmforstringswithshortdescriptions.NordicJournalofComputing,4:172–186,1997.
16.T.Kida,M.Takeda,A.Shinohara,andS.Arikawa.Shift-Andapproachtopattern
matchinginLZWcompressedtext.InProc.10thAnn.Symp.onCombinatorialPatternMatching,LectureNotesinComputerScience.Springer-Verlag,1999.toappear.
17.T.Kida,M.Takeda,A.Shinohara,M.Miyazaki,andS.Arikawa.Multiplepattern
matchinginLZWcompressedtext.InProc.DataCompressionConference’98,pages103–112.IEEEComputerSociety,1998.
18.U.Manber.Atextcompressionschemethatallowsfastsearchingdirectlyin
thecompressedfile.InProc.5thAnn.Symp.onCombinatorialPatternMatching,volume807ofLectureNotesinComputerScience,pages113–124.Springer-Verlag,1994.
19.M.Miyazaki,S.Fukamachi,M.Takeda,andT.Shinohara.Speedingupthepattern
matchingmachineforcompressedtexts.TransactionsofInformationProcessingSocietyofJapan,39(9):2638–2648,1998.(inJapanese).
20.M.Miyazaki,A.Shinohara,andM.Takeda.Animprovedpatternmatchingal-gorithmforstringsintermsofstraight-lineprograms.InProc.8thAnn.Symp.onCombinatorialPatternMatching,volume1264ofLectureNotesinComputerScience,pages1–11.Springer-Verlag,1997.
21.Y.Shibata,T.Kida,S.Fukamachi,M.Takeda,A.Shinohara,T.Shinohara,and
S.Arikawa.Bytepairencoding:atextcompressionschemethatacceleratespatternmatching.TechnicalReportDOI-TR-161,DepartmentofInformatics,KyushuUniversity,April1999.
22.M.Takeda.Patternmatchingmachinefortextcompressedusingfinitestatemodel.
TechnicalReportDOI-TR-142,DepartmentofInformatics,KyushuUniversity,October1997.
因篇幅问题不能全部显示,请点此查看更多更全内容