Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew’s tutorials: http://www.cs.cmu.edu/~awm/tutorials. Comments and corrections gratefully received.

Information Gain
Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
awm@cs.cmu.edu
412-268-7599
Copyright © 2001, 2003, Andrew W. Moore

Bits
You are watching a set of independent random samples of X. You see that X has four possible values:
P(X=A) = 1/4   P(X=B) = 1/4   P(X=C) = 1/4   P(X=D) = 1/4
So you might see: BAACBADCDADDDA…
You transmit data over a binary serial link. You can encode each reading with two bits (e.g. A = 00, B = 01, C = 10, D = 11):
0100001001001110110011111100…

Fewer Bits
Someone tells you that the probabilities are not equal:
P(X=A) = 1/2   P(X=B) = 1/4   P(X=C) = 1/8   P(X=D) = 1/8
It’s possible to invent a coding for your transmission that only uses 1.75 bits on average per symbol. How? Here is one of several ways:
A   0
B   10
C   110
D   111

Fewer Bits
Suppose there are three equally likely values:
P(X=A) = 1/3   P(X=B) = 1/3   P(X=C) = 1/3
Here’s a naïve coding, costing 2 bits per symbol:
A   00
B   01
C   10
Can you think of a coding that would need only 1.6 bits per symbol on average? In theory, it can in fact be done with 1.58496 bits per symbol.

General Case
Suppose X can have one of m values… V1, V2, … Vm
P(X=V1) = p1, P(X=V2) = p2, …, P(X=Vm) = pm
What’s the smallest possible number of bits, on average, per symbol, needed to transmit a stream of symbols drawn from X’s distribution? It’s

H(X) = −p1 log2 p1 − p2 log2 p2 − … − pm log2 pm = −Σj pj log2 pj

H(X) = the entropy of X
• “High Entropy” means X is from a uniform (boring) distribution: a histogram of the frequency distribution of values of X would be flat, and so the values sampled from it would be all over the place.
• “Low Entropy” means X is from a varied (peaks and valleys) distribution: a histogram of the frequency distribution of values of X would have many lows and one or two highs, and so the values sampled from it would be more predictable.
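A minimal Python sketch of the entropy formula above, included for illustration; the function name entropy and the example calls are my own assumptions, not part of the original slides.

    import math

    def entropy(probs):
        """H(X) = -sum_j p_j * log2(p_j); zero-probability values contribute nothing."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # Four equally likely values (the "Bits" slide): 2 bits per symbol.
    print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0

    # The skewed distribution (the first "Fewer Bits" slide): 1.75 bits per symbol.
    print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75

    # Three equally likely values: about 1.58496 bits per symbol.
    print(entropy([1/3, 1/3, 1/3]))            # ~1.585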
Entropy in a nut-shell
Low entropy: the values (locations of soup) are sampled entirely from within the soup bowl.
High entropy: the values (locations of soup) are unpredictable… almost uniformly sampled throughout our dining room.

Specific Conditional Entropy H(Y|X=v)
Suppose I’m trying to predict output Y and I have input X.
X = College Major
Y = Likes “Gladiator”

X         Y
Math      Yes
History   No
CS        Yes
Math      No
Math      No
CS        Yes
History   No
Math      Yes

Let’s assume this data reflects the true probabilities. E.g. from this data we estimate:
• P(LikeG = Yes) = 0.5
• P(Major = Math & LikeG = No) = 0.25
• P(Major = Math) = 0.5
• P(LikeG = Yes | Major = History) = 0
Note:
• H(X) = 1.5
• H(Y) = 1

Definition of Specific Conditional Entropy:
H(Y|X=v) = the entropy of Y among only those records in which X has value v.
Example:
• H(Y|X=Math) = 1
• H(Y|X=History) = 0
• H(Y|X=CS) = 0

Conditional Entropy H(Y|X)
Definition of Conditional Entropy:
H(Y|X) = the average specific conditional entropy of Y
= if you choose a record at random, what will be the conditional entropy of Y, conditioned on that row’s value of X
= the expected number of bits to transmit Y if both sides will know the value of X
= Σj Prob(X=vj) H(Y|X=vj)
Example:

vj        Prob(X=vj)   H(Y|X=vj)
Math      0.5          1
History   0.25         0
CS        0.25         0

H(Y|X) = 0.5 * 1 + 0.25 * 0 + 0.25 * 0 = 0.5

Information Gain
Definition of Information Gain:
IG(Y|X) = I must transmit Y. How many bits on average would it save me if both ends of the line knew X?
IG(Y|X) = H(Y) − H(Y|X)
Example:
• H(Y) = 1
• H(Y|X) = 0.5
• Thus IG(Y|X) = 1 − 0.5 = 0.5

Information Gain Example

Another example
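A short Python sketch of the conditional-entropy and information-gain calculations, under the assumption that the eight-record Major/Gladiator table above is the whole dataset; the helper names (records, entropy, h_y_given_x) are my own, not from the slides.

    import math
    from collections import Counter

    # The eight (Major, Likes "Gladiator") records from the table above.
    records = [("Math", "Yes"), ("History", "No"), ("CS", "Yes"), ("Math", "No"),
               ("Math", "No"), ("CS", "Yes"), ("History", "No"), ("Math", "Yes")]

    def entropy(values):
        """Empirical entropy of a list of symbols."""
        counts = Counter(values)
        n = len(values)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    xs = [x for x, _ in records]
    ys = [y for _, y in records]

    # Specific conditional entropy H(Y|X=v) for each major.
    for v in sorted(set(xs)):
        print(v, entropy([y for x, y in records if x == v]))  # CS 0.0, History 0.0, Math 1.0

    # Conditional entropy H(Y|X) = sum_v P(X=v) * H(Y|X=v).
    h_y_given_x = sum((xs.count(v) / len(xs)) * entropy([y for x, y in records if x == v])
                      for v in set(xs))
    print(h_y_given_x)                  # 0.5

    # Information gain IG(Y|X) = H(Y) - H(Y|X).
    print(entropy(ys) - h_y_given_x)    # 0.5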
Relative Information Gain
X = College Major
Y = Likes “Gladiator”
Definition of Relative Information Gain:
RIG(Y|X) = I must transmit Y. What fraction of the bits on average would it save me if both ends of the line knew X?
RIG(Y|X) = (H(Y) − H(Y|X)) / H(Y)
Example:
• H(Y|X) = 0.5
• H(Y) = 1
• Thus RIG(Y|X) = (1 − 0.5)/1 = 0.5

What is Information Gain used for?
Suppose you are trying to predict whether someone is going to live past 80 years. From historical data you might find:
• IG(LongLife | HairColor) = 0.01
• IG(LongLife | Smoker) = 0.2
• IG(LongLife | Gender) = 0.25
• IG(LongLife | LastDigitOfSSN) = 0.00001
IG tells you how interesting a 2-d contingency table is going to be.
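A final Python sketch reusing the values computed above; the relative_information_gain helper and the attribute-ranking loop are illustrative assumptions, while the IG numbers are the ones quoted on the last slide.

    def relative_information_gain(h_y, h_y_given_x):
        """RIG(Y|X) = (H(Y) - H(Y|X)) / H(Y); assumes H(Y) > 0."""
        return (h_y - h_y_given_x) / h_y

    print(relative_information_gain(1.0, 0.5))   # 0.5

    # Ranking candidate attributes by information gain, using the slide's figures.
    ig = {"HairColor": 0.01, "Smoker": 0.2, "Gender": 0.25, "LastDigitOfSSN": 0.00001}
    for attr, gain in sorted(ig.items(), key=lambda kv: kv[1], reverse=True):
        print(attr, gain)   # Gender, then Smoker, then HairColor, then LastDigitOfSSN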