Bayes' Theorem (Stanford Encyclopedia of Philosophy)

First published Sat Jun 28, 2003; substantive revision Tue Sep 30, 2003

Bayes' Theorem is a simple mathematical formula used for calculating conditional probabilities. It figures prominently in subjectivist or Bayesian approaches to epistemology, statistics, and inductive logic. Subjectivists, who maintain that rational belief is governed by the laws of probability, lean heavily on conditional probabilities in their theories of evidence and their models of empirical learning. Bayes' Theorem is central to these enterprises both because it simplifies the calculation of conditional probabilities and because it clarifies significant features of the subjectivist position. Indeed, the Theorem's central insight, that a hypothesis is confirmed by any body of data that its truth renders probable, is the cornerstone of all subjectivist methodology.

1. Conditional Probabilities and Bayes' Theorem
2. Special Forms of Bayes' Theorem
3. The Role of Bayes' Theorem in Subjectivist Accounts of Evidence
4. The Role of Bayes' Theorem in Subjectivist Models of Learning
Bibliography
Other Internet Resources
Related Entries

1. Conditional Probabilities and Bayes' Theorem

The probability of a hypothesis H conditional on a given body of data E is the ratio of the unconditional probability of the conjunction of the hypothesis with the data to the unconditional probability of the data alone.

(1.1) Definition.
The probability of H conditional on E is defined as P_E(H) = P(H & E)/P(E), provided that both terms of this ratio exist and P(E) > 0.[1]

To illustrate, suppose J. Doe is a randomly chosen American who was alive on January 1, 2000. According to the United States Centers for Disease Control, roughly 2.4 million of the 275 million Americans alive on that date died during the 2000 calendar year. Among the approximately 16.6 million senior citizens (age 75 or greater) about 1.36 million died. The unconditional probability of the hypothesis that our J. Doe died during 2000, H, is just the population-wide mortality rate P(H) = 2.4M/275M = 0.00873. To find the probability of J. Doe's death conditional on the information, E, that he or she was a senior citizen, we divide the probability that he or she was a senior who died, P(H & E) = 1.36M/275M = 0.00495, by the probability that he or she was a senior citizen, P(E) = 16.6M/275M = 0.06036. Thus, the probability of J. Doe's death given that he or she was a senior is P_E(H) = P(H & E)/P(E) = 0.00495/0.06036 = 0.082. Notice how the size of the total population factors out of this equation, so that P_E(H) is just the proportion of seniors who died. One should contrast this quantity, which gives the mortality rate among senior citizens, with the "inverse" probability of E conditional on H, P_H(E) = P(H & E)/P(H) = 0.00495/0.00873 = 0.57, which is the proportion of deaths in the total population that occurred among seniors.

Here are some straightforward consequences of (1.1):

Probability. P_E is a probability function.[2]
Logical Consequence. If E entails H, then P_E(H) = 1.
Preservation of Certainties. If P(H) = 1, then P_E(H) = 1.
Mixing. P(H) = P(E)P_E(H) + P(~E)P_{~E}(H).[3]

The most important fact about conditional probabilities is undoubtedly Bayes' Theorem, whose significance was first appreciated by the British cleric Thomas Bayes in his posthumously published masterwork, "An Essay Toward Solving a Problem in the Doctrine of Chances" (Bayes 1764).
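As a rough check on the J. Doe arithmetic above, the following Python sketch recomputes the direct and inverse conditional probabilities from definition (1.1) and confirms the Mixing identity. The variable and function names are illustrative assumptions, not part of the entry; the counts are the ones quoted in the text.

```python
# Counts quoted in the text, in millions. Names are illustrative only.
POPULATION = 275.0     # Americans alive on January 1, 2000
DEATHS = 2.4           # died during 2000 (hypothesis H)
SENIORS = 16.6         # senior citizens (data E)
SENIOR_DEATHS = 1.36   # seniors who died (H & E)


def conditional(p_joint: float, p_condition: float) -> float:
    """Definition (1.1): P_E(H) = P(H & E) / P(E), defined only when P(E) > 0."""
    if p_condition <= 0:
        raise ValueError("the conditioning event must have positive probability")
    return p_joint / p_condition


p_h = DEATHS / POPULATION                # P(H)
p_e = SENIORS / POPULATION               # P(E)
p_h_and_e = SENIOR_DEATHS / POPULATION   # P(H & E)

p_h_given_e = conditional(p_h_and_e, p_e)   # mortality rate among seniors
p_e_given_h = conditional(p_h_and_e, p_h)   # share of all deaths that were seniors

# Mixing: P(H) = P(E) P_E(H) + P(~E) P_~E(H)
p_h_given_not_e = conditional(p_h - p_h_and_e, 1 - p_e)
mixed = p_e * p_h_given_e + (1 - p_e) * p_h_given_not_e

print(f"P(H)   = {p_h:.5f}")
print(f"P(E)   = {p_e:.5f}")
print(f"P_E(H) = {p_h_given_e:.3f}")
print(f"P_H(E) = {p_e_given_h:.2f}")
print(f"Mixing recovers P(H): {abs(mixed - p_h) < 1e-12}")
```

Because the population size cancels in each ratio, the same two conditional probabilities could be computed directly from the raw counts (1.36/16.6 and 1.36/2.4).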
Bayes' Theorem relates the "direct" probability of a hypothesis conditional on a given body of data, P_E(H), to the "inverse" probability of the data conditional on the hypothesis, P_H(E).

(1.2) Bayes' Theorem. P_E(H) = [P(H)/P(E)] P_H(E)

In an unfortunate, but now unavoidable, choice of terminology, statisticians refer to the inverse probability P_H(E) as the "likelihood" of H on E. It expresses the degree to which the hypothesis predicts the data given the background information codified in the probability P.

In the example discussed above, the condition that J. Doe died during 2000 is a fairly strong predictor of senior citizenship. Indeed, the equation P_H(E) = 0.57 tells us that 57% of the total deaths occurred among seniors that year. Bayes' Theorem lets us use this information to compute the "direct" probability of J. Doe dying given that he or she was a senior citizen. We do this by multiplying the "prediction term" P_H(E) by the ratio of the total number of deaths in the population to the number of senior citizens in the population, P(H)/P(E) = 2.4M/16.6M = 0.144. The result is P_E(H) = 0.57 × 0.144 = 0.082, just as expected.

Though a mathematical triviality, Bayes' Theorem is of great value in calculating conditional probabilities because inverse probabilities are typically both easier to ascertain and less subjective than direct probabilities. People with different views about the unconditional probabilities of E and H often disagree about E's value as an indicator of H. Even so, they can agree about the degree to which the hypothesis predicts the data if they know any of the following intersubjectively available facts: (a) E's objective probability given H, (b) the frequency with which events like E will occur if H is true, or (c) the fact that H logically entails E. Scientists often design experiments so that likelihoods can be known in one of these "objective" ways.
Bayes' Theorem then ensures that any dispute about the significance of the experimental results can be traced to "subjective" disagreements about the unconditional probabilities of H and E.

When both P_H(E) and P_{~H}(E) are known an experimenter need not even know E's probability to determine a value for P_E(H) using Bayes' Theorem.

(1.3) Bayes' Theorem (2nd form).[4] P_E(H) = P(H)P_H(E) / [P(H)P_H(E) + P(~H)P_{~H}(E)]

In this guise Bayes' Theorem is particularly useful for inferring causes from their effects since it is often fairly easy to discern the probability of an effect given the presence or absence of a putative cause. For instance, physicians often screen for diseases of known prevalence using diagnostic tests of recognized sensitivity and specificity. The sensitivity of a test, its "true positive" rate, is the fraction of times that patients with the disease test positive for it. The test's specificity, its "true negative" rate, is the proportion of healthy patients who test negative. If we let H be the event of a given patient having the disease, and E be the event of her testing positive for it, then the test's sensitivity and specificity are given by the likelihoods P_H(E) and P_{~H}(~E), respectively, and the "baseline" prevalence of the disease in the population is P(H). Given these inputs about the effects of the disease on the outcome of the test, one can use (1.3) to determine the probability of disease given a positive test. For a more detailed illustration of this process, see Example 1 in the Supplementary Document "Examples, Tables, and Proof Sketches".

2. Special Forms of Bayes' Theorem

Bayes' Theorem can be expressed in a variety of forms that are useful for different purposes. One version employs what Rudolf Carnap called the relevance quotient or probability ratio (Carnap 1962, 466). This is the factor PR(H, E) = P_E(H)/P(H) by which H's unconditional probability must be multiplied to get its probability conditional on E.
Bayes' Theorem is equivalent to a simple symmetry principle for probability ratios.

(1.4) Probability Ratio Rule. PR(H, E) = PR(E, H)

The term on the right provides one measure of the degree to which H predicts E. If we think of P(E) as expressing the "baseline" predictability of E given the background information codified in P, and of P_H(E) as E's predictability when H is added to this background, then PR(E, H) captures the degree to which knowing H makes E more or less predictable relative to the baseline: PR(E, H) = 0 means that H categorically predicts ~E; PR(E, H) = 1 means that adding H does not alter the baseline prediction at all; PR(E, H) = 1/P(E) means that H categorically predicts E. Since P(E) = P_T(E), where T is any truth of logic, we can think of (1.4) as telling us that

The probability of a hypothesis conditional on a body of data is equal to the unconditional probability of the hypothesis multiplied by the degree to which the hypothesis surpasses a tautology as a predictor of the data.

In our J. Doe example, PR(H, E) is obtained by comparing the predictability of senior status given that J. Doe died in 2000 to its predictability given no information whatever about his or her mortality. Dividing the former "prediction term" by the latter yields PR(H, E) = P_H(E)/P(E) = 0.57/0.06036 = 9.44. Thus, as a predictor of senior status in 2000, knowing that J. Doe died is more than nine times better than not knowing whether he or she lived or died.

Another useful form of Bayes' Theorem is the Odds Rule. In the jargon of bookies, the "odds" of a hypothesis is its probability divided by the probability of its negation: O(H) = P(H)/P(~H). So, for example, a racehorse whose odds of winning a particular race are 7-to-5 has a 7/12 chance of winning and a 5/12 chance of losing.
To understand the difference between odds and probabilities it helps to think of probabilities as fractions of the distance between the probability of a contradiction and that of a tautology, so that P(H) = p means that H is p times as likely to be true as a tautology. In contrast, writing O(H) = [P(H) − P(F)]/[P(T) − P(H)] (where F is some logical contradiction) makes it clear that O(H) expresses this same quantity as the ratio of the amount by which H's probability exceeds that of a contradiction to the amount by which it is exceeded by that of a tautology. Thus, the difference between "probability talk" and "odds talk" corresponds to the difference between saying "we are two thirds of the way there" and saying "we have gone twice as far as we have yet to go."

The analogue of the probability ratio is the odds ratio OR(H, E) = O_E(H)/O(H), the factor by which H's unconditional odds must be multiplied to obtain its odds conditional on E. Bayes' Theorem is equivalent to the following fact about odds ratios:

(1.5) Odds Ratio Rule. OR(H, E) = P_H(E)/P_{~H}(E)

Notice the similarity between (1.4) and (1.5). While each employs a different way of expressing probabilities, each shows how its expression for H's probability conditional on E can be obtained by multiplying its expression for H's unconditional probability by a factor involving inverse probabilities.

The quantity LR(H, E) = P_H(E)/P_{~H}(E) that appears in (1.5) is the likelihood ratio of H given E. In testing situations like the one described in Example 1, the likelihood ratio is the test's true positive rate divided by its false positive rate: LR = sensitivity/(1 − specificity). As with the probability ratio, we can construe the likelihood ratio as a measure of the degree to which H predicts E. Instead of comparing E's probability given H with its unconditional probability, however, we now compare it with its probability conditional on ~H. LR(H, E) is thus the degree to which the hypothesis surpasses its negation as a predictor of the data.
Once more, Bayes' Theorem tells us how to factor conditional probabilities into unconditional probabilities and measures of predictive power.

The odds of a hypothesis conditional on a body of data is equal to the unconditional odds of the hypothesis multiplied by the degree to which it surpasses its negation as a predictor of the data.

In our running J. Doe example, LR(H, E) is obtained by comparing the predictability of senior status given that J. Doe died in 2000 to its predictability given that he or she lived out the year. Dividing the former "prediction term" by the latter yields LR(H, E) = P_H(E)/P_{~H}(E) = 0.567/0.0559 ≈ 10.1. Thus, as a predictor of senior status in 2000, knowing that J. Doe died is more than ten times better than knowing that he or she lived.

The similarities between the "probability ratio" and "odds ratio" versions of Bayes' Theorem can be developed further if we express H's probability as a multiple of the probability of some other hypothesis H* using the relative probability function B(H, H*) = P(H)/P(H*). It should be clear that B generalizes both P and O since P(H) = B(H, T) and O(H) = B(H, ~H). By comparing the conditional and unconditional values of B we obtain the Bayes' Factor:

BR(H, H*; E) = B_E(H, H*)/B(H, H*) = [P_E(H)/P_E(H*)]/[P(H)/P(H*)].

We can also generalize the likelihood ratio by setting LR(H, H*; E) = P_H(E)/P_{H*}(E). This compares E's predictability on the basis of H with its predictability on the basis of H*. We can use these two quantities to formulate an even more general form of Bayes' Theorem.

(1.6) Bayes' Theorem (General Form). BR(H, H*; E) = LR(H, H*; E)

The message of (1.6) is this:

The ratio of probabilities for two hypotheses conditional on a body of data is equal to the ratio of their unconditional probabilities multiplied by the degree to which the first hypothesis surpasses the second as a predictor of the data.
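The equivalences (1.2), (1.4), (1.5) and (1.6) can be spot-checked numerically. The sketch below does so for the J. Doe counts; the helper names (`odds`, `bayes_factor`) are illustrative assumptions rather than anything defined in the entry. Because the counts are used unrounded, the printed ratios come out near 9.4 and 10.1, slightly different from figures obtained with the rounded value 0.57.

```python
import math

# J. Doe counts from the text, in millions.
POPULATION, DEATHS, SENIORS, SENIOR_DEATHS = 275.0, 2.4, 16.6, 1.36

p_h = DEATHS / POPULATION                     # P(H)
p_e = SENIORS / POPULATION                    # P(E)
p_he = SENIOR_DEATHS / POPULATION             # P(H & E)

p_h_given_e = p_he / p_e                      # P_E(H)
p_e_given_h = p_he / p_h                      # P_H(E)
p_e_given_not_h = (p_e - p_he) / (1 - p_h)    # P_~H(E)


def odds(p: float) -> float:
    """O(H) = P(H) / P(~H)."""
    return p / (1 - p)


def bayes_factor(p_a: float, p_b: float, p_a_given_e: float, p_b_given_e: float) -> float:
    """BR(A, B; E) = [P_E(A)/P_E(B)] / [P(A)/P(B)]."""
    return (p_a_given_e / p_b_given_e) / (p_a / p_b)


# (1.2) Bayes' Theorem
bayes_ok = math.isclose(p_h_given_e, (p_h / p_e) * p_e_given_h)

# (1.4) Probability Ratio Rule: PR(H, E) = PR(E, H)
pr_h_e = p_h_given_e / p_h
pr_e_h = p_e_given_h / p_e

# (1.5) Odds Ratio Rule: OR(H, E) = LR(H, E)
odds_ratio = odds(p_h_given_e) / odds(p_h)
likelihood_ratio = p_e_given_h / p_e_given_not_h

# (1.6) General form, with H* = ~H and with H* = T (a tautology, probability 1).
br_vs_negation = bayes_factor(p_h, 1 - p_h, p_h_given_e, 1 - p_h_given_e)
br_vs_tautology = bayes_factor(p_h, 1.0, p_h_given_e, 1.0)
lr_vs_tautology = p_e_given_h / p_e           # P_T(E) = P(E)

print(f"(1.2) holds: {bayes_ok}")
print(f"PR(H, E) = {pr_h_e:.2f}, PR(E, H) = {pr_e_h:.2f}")
print(f"OR(H, E) = {odds_ratio:.2f}, LR(H, E) = {likelihood_ratio:.2f}")
print(f"BR(H, ~H; E) = {br_vs_negation:.2f}, BR(H, T; E) = {br_vs_tautology:.2f}")
```

Choosing H* = T recovers the probability ratio and choosing H* = ~H recovers the odds ratio, which is the sense in which B generalizes both P and O.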
The various versions of Bayes' Theorem differ only with respect to the functions used to express unconditional probabilities (P(H), O(H), B(H, H*)) and in the likelihood term used to represent predictive power (PR(E, H), LR(H, E), LR(H, H*; E)). In each case, though, the underlying message is the same:

conditional probability = unconditional probability × predictive power

(1.2) through (1.6) are multiplicative forms of Bayes' Theorem that use division to compare the disparities between unconditional and conditional probabilities. Sometimes these comparisons are best expressed additively by replacing ratios with differences. The following table gives the additive analogue of each ratio measure.

Table 1

| Ratio | Difference |
|---|---|
| Probability Ratio: PR(H, E) = P_E(H)/P(H) | Probability Difference: PD(H, E) = P_E(H) − P(H) |
| Odds Ratio: OR(H, E) = O_E(H)/O(H) | Odds Difference: OD(H, E) = O_E(H) − O(H) |
| Bayes' Factor: BR(H, H*; E) = B_E(H, H*)/B(H, H*) | Bayes' Difference: BD(H, H*; E) = B_E(H, H*) − B(H, H*) |

We can use Bayes' Theorem to obtain additive analogues of (1.4) through (1.6), which are here displayed along with their multiplicative counterparts:

Table 2

| | Ratio | Difference |
|---|---|---|
| (1.4) | PR(H, E) = PR(E, H) = P_H(E)/P(E) | PD(H, E) = P(H) [PR(E, H) − 1] |
| (1.5) | OR(H, E) = LR(H, E) = P_H(E)/P_{~H}(E) | OD(H, E) = O(H) [OR(H, E) − 1] |
| (1.6) | BR(H, H*; E) = LR(H, H*; E) = P_H(E)/P_{H*}(E) | BD(H, H*; E) = B(H, H*) [BR(H, H*; E) − 1] |

Notice how each additive measure is obtained by multiplying H's unconditional probability, expressed on the relevant scale, P, O or B, by the associated multiplicative measure diminished by 1.

While the results of this section are useful to anyone who employs the probability calculus, they have a special relevance for subjectivist or "Bayesian" approaches to statistics, epistemology, and inductive inference.[5] Subjectivists lean heavily on conditional probabilities in their theory of evidential support and their account of empirical learning.
Given that Bayes' Theorem is the single most important fact about conditional probabilities, it is not at all surprising that it should figure prominently in subjectivist methodology.

3. The Role of Bayes' Theorem in Subjectivist Accounts of Evidence

Subjectivists maintain that beliefs come in varying gradations of strength, and that an ideally rational person's graded beliefs can be represented by a subjective probability function P. For each hypothesis H about which the person has a firm opinion, P(H) measures her level of confidence (or "degree of belief") in H's truth.[6] Conditional beliefs are represented by conditional probabilities, so that P_E(H) measures the person's confidence in H on the supposition that E is a fact.[7]

One of the most influential features of the subjectivist program is its account of evidential support. The guiding ideas of this Bayesian confirmation theory are these:

Confirmational Relativity. Evidential relationships must be relativized to individuals and their degrees of belief.
Evidence Proportionism.[8] A rational believer will proportion her confidence in a hypothesis H to her total evidence for H, so that her subjective probability for H reflects the overall balance of her reasons for or against its truth.
Incremental Confirmation.[9] A body of data provides incremental evidence for H to the extent that conditioning on the data raises H's probability.

The first principle says that statements about evidentiary relationships always make implicit reference to people and their degrees of belief, so that, e.g., "E is evidence for H" should really be read as "E is evidence for H relative to the information encoded in the subjective probability P".

According to evidence proportionism, a subject's level of confidence in H should vary directly with the strength of her evidence in favor of H's truth.
Likewise, her level of confidence in H conditional on E should vary directly with the strength of her evidence for H's truth when this evidence is augmented by the supposition of E. It is a matter of some delicacy to say precisely what constitutes a person's evidence,[10] and to explain how her beliefs should be "proportioned" to it. Nevertheless, the idea that incremental evidence is reflected in disparities between conditional and unconditional probabilities only makes sense if differences in subjective probability mirror differences in total evidence.

An item of data provides a subject with incremental evidence for or against a hypothesis to the extent that receiving the data increases or decreases her total evidence for the truth of the hypothesis. When probabilities measure total evidence, the increment of evidence that E provides for H is a matter of the disparity between P_E(H) and P(H). When odds are used it is a matter of the disparity between O_E(H) and O(H). See Example 2 in the supplementary document "Examples, Tables, and Proof Sketches", which illustrates the difference between total and incremental evidence, and explains the "baserate fallacy" that can result from failing to properly distinguish the two.

It will be useful to distinguish two subsidiary concepts related to total evidence.

The net evidence in favor of H is the degree to which a subject's total evidence in favor of H exceeds her total evidence in favor of ~H.
The balance of total evidence for H over H* is the degree to which a subject's total evidence in favor of H exceeds her total evidence in favor of H*.

The precise content of these notions will depend on how total evidence is understood and measured, and on how disparities in total evidence are characterized. For example, if total evidence is given in terms of probabilities and disparities are treated as ratios, then the net evidence for H is P(H)/P(~H).
If total evidence is expressed in terms of odds and differences are used to express disparities, then the net evidence for H will be O(H) − O(~H). Readers may consult Table 3 (in the supplementary document) for a complete list of the possibilities.

As these remarks make clear, one can interpret O(H) either as a measure of net evidence or as a measure of total evidence. To see the difference, imagine that 750 red balls and 250 black balls have been drawn at random and with replacement from an urn known to contain 10,000 red or black balls. Assuming that this is our only evidence about the urn's contents, it is reasonable to set P(Red) = 0.75 and P(~Red) = 0.25. On a probability-as-total-evidence reading, these assignments reflect both the fact that we have a great deal of evidence in favor of Red (namely, that 750 of 1,000 draws were red) and the fact that we also have some evidence against it (namely, that 250 of the draws were black). The net evidence for Red is then the disparity between our total evidence for Red and our total evidence against Red. This can be expressed multiplicatively by saying that we have seen three times as many red draws as black draws, which is just to say that O(Red) = 3. Alternatively, we can use O(Red) as a measure of the total evidence by taking our evidence for Red to be the ratio of red to black draws, rather than the total number of red draws, and our evidence for ~Red to be the ratio of black to red draws, rather than the total number of black draws. While the decision whether to use O as a measure of total or net evidence makes little difference to questions about the absolute amount of total evidence for a hypothesis (since O(H) is an increasing function of P(H)), it can make a major difference when one is considering the incremental changes in total evidence brought about by conditioning on new information.
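The urn example can be worked through with exact fractions. The sketch below is only an illustration under the assignments stated above, P(Red) = 0.75 and P(~Red) = 0.25; the variable names are assumptions. It computes the net evidence for Red under the two readings mentioned in the text: probabilities as total evidence with disparities as ratios, and odds as total evidence with disparities as differences.

```python
from fractions import Fraction

RED_DRAWS, BLACK_DRAWS = 750, 250
total_draws = RED_DRAWS + BLACK_DRAWS

p_red = Fraction(RED_DRAWS, total_draws)   # P(Red) = 3/4
p_not_red = 1 - p_red                      # P(~Red) = 1/4


def odds(p: Fraction) -> Fraction:
    """O(H) = P(H) / P(~H)."""
    return p / (1 - p)


o_red = odds(p_red)          # odds of Red
o_not_red = odds(p_not_red)  # odds of ~Red

# Probabilities as total evidence, disparities as ratios:
# net evidence for Red = P(Red) / P(~Red), which coincides with O(Red).
net_evidence_ratio = p_red / p_not_red

# Odds as total evidence, disparities as differences:
# net evidence for Red = O(Red) - O(~Red).
net_evidence_difference = o_red - o_not_red

print("O(Red) =", o_red)
print("O(~Red) =", o_not_red)
print("net evidence (probability ratio) =", net_evidence_ratio)
print("net evidence (odds difference)   =", net_evidence_difference)
```

The first reading treats O(Red) itself as the net evidence, while the second treats O(Red) as total evidence and subtracts the odds of the negation to get a net figure.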
Philosophers interested in characterizing correct patterns of inductive reasoning and in providing "rational reconstructions" of scientific methodology have tended to focus on incremental evidence as crucial to their enterprise. When scientists (or ordinary folk) say that E supports or confirms H what they generally mean is that learning of E's truth will increase the total amount of evidence for H's truth. Since subjectivists characterize total evidence in terms of subjective probabilities or odds, they analyze incremental evidence in terms of changes in these quantities. On such views, the simplest way to characterize the strength of incremental evidence is by making ordinal comparisons of conditional and unconditional probabilities or odds.

(2.1) A Comparative Account of Incremental Evidence. Relative to a subjective probability function P,
E incrementally confirms (disconfirms, is irrelevant to) H if and only if P_E(H) is greater than (less than, equal to) P(H).
H receives a greater increment (or lesser decrement) of evidential support from E than from E* if and only if P_E(H) exceeds P_{E*}(H).

Both these equivalences continue to hold with probabilities replaced by odds. So, this part of the subjectivist theory of evidence does not depend on how total evidence is measured.

Bayes' Theorem helps to illuminate the content of (2.1) by making it clear that E's status as incremental evidence for H is enhanced to the extent that H predicts E. This observation serves as the basis for the following conclusions about incremental confirmation (which hold so long as 1 > P(H), P(E) > 0).

(2.1a) If E incrementally confirms H, then H incrementally confirms E.
(2.1b) If E incrementally confirms H, then E incrementally disconfirms ~H.
(2.1c) If H entails E, then E incrementally confirms H.
(2.1d) If P_H(E) = P_H(E*), then H receives more incremental support from E than from E* if and only if E is unconditionally less probable than E*.
(2.1e) Weak Likelihood Principle. E provides incremental evidence for H if and only if P_H(E) > P_{~H}(E). More generally, if P_H(E) > P_{H*}(E) and P_{~H}(~E) ≥ P_{~H*}(~E), then E provides more incremental evidence for H than for H*.

(2.1a) tells us that incremental confirmation is a matter of mutual reinforcement: a person who sees E as evidence for H invests more confidence in the possibility that both propositions are true than in either possibility in which only one obtains.

(2.1b) says that relevant evidence must be capable of discriminating between the truth and falsity of the hypothesis under test.

(2.1c) provides a subjectivist rationale for the hypothetico-deductive model of confirmation. According to this model, hypotheses are incrementally confirmed by any evidence they entail. While subjectivists reject the idea that evidentiary relations can be characterized in a belief-independent manner (Bayesian confirmation is always relativized to a person and her subjective probabilities), they seek to preserve the basic insight of the H-D model by pointing out that hypotheses are incrementally supported by evidence they entail for anyone who has not already made up her mind about the hypothesis or the evidence. More precisely, if H entails E, then P_E(H) = P(H)/P(E), which exceeds P(H) whenever 1 > P(E), P(H) > 0. This explains why scientists so often seek to design experiments that fit the H-D paradigm. Even when evidentiary relations are relativized to subjective probabilities, experiments in which the hypothesis under test entails the data will be regarded as evidentially relevant by anyone who has not yet made up her mind about the hypothesis or the data. The degree of incremental confirmation will vary among people depending on their prior levels of confidence in H and E, but everyone will agree that the data incrementally support the hypothesis to at least some degree.
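The step behind (2.1c) can be written out explicitly. The derivation below is a sketch that uses only (1.2) and the Logical Consequence property of section 1; it assumes the `amsmath` package for the `align*` environment.

```latex
\begin{align*}
\text{If } H \text{ entails } E \text{, then } P_H(E) &= 1
  && \text{(Logical Consequence)} \\
P_E(H) &= \frac{P(H)}{P(E)}\, P_H(E) = \frac{P(H)}{P(E)}
  && \text{(Bayes' Theorem, 1.2)} \\
P_E(H) - P(H) &= P(H)\left[\frac{1}{P(E)} - 1\right] > 0
  && \text{whenever } P(H) > 0 \text{ and } 0 < P(E) < 1 .
\end{align*}
```

The size of the increment depends on the prior values P(H) and P(E), but its sign does not, which is why the verdict of confirmation is shared by everyone whose probabilities for H and E fall strictly between 0 and 1.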
Subjectivists invoke (2.1d) to explain why scientists so often regard improbable or surprising evidence as having more confirmatory potential than evidence that is antecedently known. While it is not true in general that improbable evidence has more confirming potential, it is true that E's incremental confirming power relative to H varies inversely with E's unconditional probability when the value of the inverse probability P_H(E) is held fixed. If H entails both E and E*, say, then Bayes' Theorem entails that the less probable of the two supports H more strongly. For example, even if heart attacks are invariably accompanied by severe chest pain and shortness of breath, the former symptom is far better evidence for a heart attack than the latter simply because severe chest pain is so much less common than shortness of breath.

(2.1e) captures one core message of Bayes' Theorem for theories of confirmation. Let's say that H is uniformly better than H* as a predictor of E's truth-value when (a) H predicts E more strongly than H* does, and (b) ~H predicts ~E more strongly than ~H* does. According to the weak likelihood principle, hypotheses that are uniformly better predictors of the data are better supported by the data. For example, the fact that little Johnny is a Christian is better evidence for thinking that his parents are Christian than for thinking that they are Hindu because (a) a far higher proportion of Christian parents than Hindu parents have Christian children, and (b) a far higher proportion of non-Christian parents than non-Hindu parents have non-Christian children.

Bayes' Theorem can also be used as the basis for developing and evaluating quantitative measures of evidential support. The results listed in Table 2 entail that all four of the functions PR, OR, PD and OD agree with one another on the simplest question of confirmation: Does E provide incremental evidence for H?

(2.2) Corollary.
Each of the following is equivalent to the assertion that E provides incremental evidence in favor of H: PR(H, E) > 1, OR(H, E) > 1, PD(H, E) > 0, OD(H, E) > 0.

Thus, all four measures agree with the comparative account of incremental evidence given in (2.1).

Given all this agreement it should not be surprising that PR(H, E), OR(H, E) and PD(H, E) have all been proposed as measures of the degree of incremental support that E provides for H.[11] While OD(H, E) has not been suggested for this purpose, we will consider it for reasons of symmetry. Some authors maintain that one or another of these functions is the unique correct measure of incremental evidence; others think it best to use a variety of measures that capture different evidential relationships. While this is not the place to adjudicate these issues, we can look to Bayes' Theorem for help in understanding what the various functions measure and in characterizing the formal relationships among them.

All four measures agree in their conclusions about the comparative amount of incremental evidence that different items of data provide for a fixed hypothesis. In particular, they agree ordinally about the following concepts derived from incremental evidence:

The effective increment of evidence[12] that E provides for H is the amount by which the incremental evidence that E provides for H exceeds the incremental evidence that ~E provides for H.
The differential in the incremental evidence that E and E* provide for H is the amount by which the incremental evidence that E provides for H exceeds the incremental evidence that E* provides for H.
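Corollary (2.2) lends itself to a brute-force sanity check. The sketch below enumerates a small grid of joint distributions over H and E, built from hypothetical integer cell counts (an assumption made purely for illustration), and confirms that PR, OR, PD and OD always return the same verdict on whether E incrementally confirms H. Exact fractions are used so that borderline cases of irrelevance are not blurred by floating-point rounding.

```python
from fractions import Fraction
from itertools import product


def odds(p: Fraction) -> Fraction:
    """O(H) = P(H) / P(~H); requires p < 1."""
    return p / (1 - p)


def verdicts(p_h: Fraction, p_h_given_e: Fraction) -> tuple:
    """Return, for each of PR, OR, PD, OD, whether it says E confirms H."""
    pr = p_h_given_e / p_h                 # probability ratio
    o_r = odds(p_h_given_e) / odds(p_h)    # odds ratio
    pd = p_h_given_e - p_h                 # probability difference
    od = odds(p_h_given_e) - odds(p_h)     # odds difference
    return (pr > 1, o_r > 1, pd > 0, od > 0)


checked = 0
disagreements = 0
# Hypothetical cell counts for (H & E, H & ~E, ~H & E, ~H & ~E); all positive,
# so every probability involved lies strictly between 0 and 1.
for a, b, c, d in product(range(1, 6), repeat=4):
    n = a + b + c + d
    p_h = Fraction(a + b, n)           # P(H)
    p_h_given_e = Fraction(a, a + c)   # P_E(H)
    checked += 1
    if len(set(verdicts(p_h, p_h_given_e))) != 1:
        disagreements += 1

print(f"distributions checked: {checked}, disagreements: {disagreements}")
```

A finite grid is of course no proof; the agreement follows in general because P_E(H) > P(H) holds exactly when O_E(H) > O(H), odds being an increasing function of probability.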
Effective evidence is a matter of the degree to which a person's total evidence for H depends on her opinion about E. When P_E(H) and P_{~E}(H) (or O_E(H) and O_{~E}(H)) are far apart the person's belief about E has a great effect on her belief about H: from her point of view, a great deal hangs on E's truth-value when it comes to questions about H's truth-value. A large differential in incremental evidence between E and E* tells us that learning E increases the subject's total evidence for H by a larger amount than learning E* does. Readers may consult Table 4 (in the supplement) for quantitative measures of effective and differential evidence.

The second clause of (2.1) tells us that E provides more incremental evidence than E* does for H just in case the probability of H conditional on E exceeds the probability of H conditional on E*. It is then a simple step to show that all four measures of incremental support agree ordinally on questions of effective evidence and of differentials in incremental evidence.

(2.3) Corollary. For any H, E* and E with positive probability, the following are equivalent:
E provides more incremental evidence than E* does for H
PR(H, E) > PR(H, E*)
OR(H, E) > OR(H, E*)
PD(H, E) > PD(H, E*)
OD(H, E) > OD(H, E*)

The four measures of incremental support can disagree over the comparative degree to which a single item of data incrementally confirms two distinct hypotheses. Example 3, Example 4, and Example 5 (in the supplement) show the various ways in which this can happen.

All the differences between the measures have ultimately to do with (a) whether the total evidence in favor of a hypothesis should be measured in terms of probabilities or in terms of odds, and (b) whether disparities in total evidence are best captured as ratios or as differences. Rows in the following table correspond to different measures of total evidence.
Columns correspond to different ways of treating disparities.

Table 5: Four measures of incremental evidence

| | Ratio | Difference |
|---|---|---|
| P = Total | PR(H, E) = P_E(H)/P(H) | PD(H, E) = P_E(H) − P(H) |
| O = Total | OR(H, E) = O_E(H)/O(H) | OD(H, E) = O_E(H) − O(H) |

Similar tables can be constructed for measures of net evidence and measures of balances in total evidence. See Table 5A in the supplement.

We can use the various forms of Bayes' Theorem to clarify the similarities and differences among these measures by rewriting each of them in terms of likelihood ratios.

Table 6: The four measures expressed in terms of likelihood ratios

| | Ratio | Difference |
|---|---|---|
| P = Total | PR(H, E) = LR(H, T; E) | PD(H, E) = P(H)[LR(H, T; E) − 1] |
| O = Total | OR(H, E) = LR(H, ~H; E) | OD(H, E) = O(H)[LR(H, ~H; E) − 1] |

This table shows that there are two differences between each multiplicative measure and its additive counterpart. First, the likelihood term that appears in a given multiplicative measure is diminished by 1 in its associated additive measure. Second, in each additive measure the diminished likelihood term is multiplied by an expression for H's probability: P(H) or O(H), as the case may be. The first difference marks no distinction; it is due solely to the fact that the multiplicative and additive measures employ a different zero point from which to measure evidence. If we settle on the point of probabilistic independence P_E(H) = P(H) as a natural common zero, and so subtract 1 from each multiplicative measure,[13] then equivalent likelihood terms appear in both columns.

The real difference between the measures in a given row concerns the effect of unconditional probabilities on relations of incremental confirmation. Down the right column, the degree to which E provides incremental evidence for H is directly proportional to H's probability expressed in units of P(T) or P(~H).
In the left column, H's probability makes no difference to the amount of incremental evidence that E provides for H once P_H(E) and either P(E) or P_{~H}(E) are fixed.[14] In light of Bayes' Theorem, then, the difference between the ratio measures and the difference measures boils down to one question:

Does a given piece of data provide a greater increment of evidential support for a more probable hypothesis than it does for a less probable hypothesis when both hypotheses predict the data equally well?

The difference measures answer yes, the ratio measures answer no.

Bayes' Theorem can also help us understand the difference between rows. The measures within a given row agree about the role of predictability in incremental confirmation. In the top row the incremental evidence that E provides for H increases linearly with P_H(E)/P(E), whereas in the bottom row it increases linearly with P_H(E)/P_{~H}(E). Thus, when probabilities measure total evidence what matters is the degree to which H exceeds T as a predictor of E, but when odds measure total evidence it is the degree to which H exceeds ~H as a predictor of E that matters.

The central issue here concerns the status of the likelihood ratio. While everyone agrees that it should play a leading role in any quantitative theory of evidence, there are conflicting views about precisely what evidential relationship it captures. There are three possible interpretations.

Table 7: Three interpretations of the likelihood ratio

Probability as total evidence reading
PR(H, E) measures incremental change in total evidence.
LR(H, E) measures incremental change in net evidence.
LR(H, H*; E) measures incremental change in the balance of evidence that E provides for H over H*.

Odds as total evidence reading
LR(H, E) measures incremental change in total evidence.
LR(H, E)^2 measures incremental change in net evidence.
LR(H, H*; E)/LR(~H, ~H*; E) measures incremental change in the balance of evidence that E provides for H over H*.

"Likelihoodist" reading
Neither P nor O measures total evidence because evidential relations are essentially comparative; they always involve the balance of evidence.
LR(H, E) measures the balance of evidence that E provides for H over ~H.
LR(H, H*; E) measures the balance of evidence that E provides for H over H*.

On the first reading there is no conflict whatsoever between using probability ratios and using likelihood ratios to measure evidence. Once we get clear on the distinctions between total evidence, net evidence and the balance of evidence, we see that each of PR(H, E), LR(H, E) and LR(H, H*; E) measures an important evidential relationship, but that the relationships they measure are importantly different.

When odds measure total evidence neither PR(H, E) nor LR(H, H*; E) plays a fundamental role in the theory of evidence. Changes in the probability ratio for H given E only indicate changes in incremental evidence in the presence of information about changes in the probability ratio for ~H given E. Likewise, changes in the likelihood ratio for H and H* given E only indicate changes in the balance of evidence in light of information about changes in the likelihood ratio for ~H and ~H* given E. Thus, while each of the two functions can figure as one component in a meaningful measure of confirmation, neither tells us anything about incremental evidence when taken by itself.

The third view, "likelihoodism," is popular among non-Bayesian statisticians. Its proponents deny evidence proportionism. They maintain that a person's subjective probability for a hypothesis merely reflects her degree of uncertainty about its truth; it need not be tied in any way to the amount of evidence she has in its favor.[15] It is likelihood ratios, not subjective probabilities, which capture the "scientifically meaningful" evidential relations. Here are two classic statements of the position.
  All the information which the data provide concerning the relative merits of two hypotheses is contained in the likelihood ratio of the hypotheses on the data. (Edwards 1972, 30)

  The ‘evidential meaning’ of experimental results is characterized fully by the likelihood function… Reports of experimental results in scientific journals should in principle be descriptions of likelihood functions. (Birnbaum 1962, 272)

On this view, everything that can be said about the evidential import of E for H is embodied in the following generalization of the weak likelihood principle:

  The "Law of Likelihood". If H implies that the probability of E is x, while H* implies that the probability of E is x*, then E is evidence supporting H over H* if and only if x exceeds x*, and the likelihood ratio, x/x*, measures the strength of this support. (Hacking 1965, 106-109; Royall 1997, 3)

The biostatistician Richard Royall is a particularly lucid defender of likelihoodism (Royall 1997). He maintains that any scientifically respectable concept of evidence must analyze the evidential impact of E on H solely in terms of likelihoods; it should not advert to anyone's unconditional probabilities for E or H. This is supposed to be because likelihoods are both better known and more objective than unconditional probabilities. Royall argues strenuously against the idea that incremental evidence can be measured in terms of the disparity between unconditional and conditional probabilities. Here is the gist of his complaint:

  Whereas [LR(H, H*; E)] measures the support for one hypothesis H relative to a specific alternative H*, without regard either to the prior probabilities of the two hypotheses or to what other hypotheses might also be considered, the law of changing probability [as measured by PR(H, E)] measures support for H relative to a specific prior distribution over H and its alternatives...
  The law of changing probability is of limited usefulness in scientific discourse because of its dependence on the prior probability distribution, which is generally unknown and/or personal. Although you and I agree (on the basis of the law of likelihood) that given evidence supports H over H*, and H** over both H and H*, we might disagree about whether it is evidence supporting H (on the basis of the law of changing probability) purely on the basis of our different judgments of the a priori probability of H, H*, and H**. (Royall 1997, 10-11, with slight changes in notation)

Royall's point is that neither the probability ratio nor the probability difference will capture the sort of objective evidence required by science because their values depend on the "subjective" terms P(E) and P(H), and not just on the "objective" likelihoods P_H(E) and P_~H(E).

Whether one agrees with this assessment will be a matter of philosophical temperament, in particular of one's willingness to tolerate subjective probabilities in one's account of evidential relations. It will also depend crucially on the extent to which one is convinced that likelihoods are better known and more objective than ordinary subjective probabilities. Cases like the one envisioned in the law of likelihood, where a hypothesis deductively entails a definite probability for the data, are relatively rare. So, unless one is willing to adopt a theory of evidence with a very restricted range of application, a great deal will turn on how easy it is to determine objective likelihoods in situations where the predictive connection from hypothesis to data is itself the result of inductive inferences. However one comes down on these issues, though, there is no denying that likelihood ratios will play a central role in any probabilistic account of evidence.

In fact, the weak likelihood principle (2.1e) encapsulates a minimal form of Bayesianism to which all parties can agree. This is clearest when it is restated in terms of likelihoods.
(2.1e) The Weak Likelihood Principle (expressed in terms of likelihood ratios)
If LR(H, H*; E) ≥ 1 and LR(~H, ~H*; ~E) ≥ 1, with one inequality strict, then E provides more incremental evidence for H than for H* and ~E provides more incremental evidence for ~H than for ~H*.

Likelihoodists will endorse (2.1e) because the relationships described in its antecedent depend only on inverse probabilities. Proponents of both the "probability" and "odds" interpretations of total evidence will accept (2.1e) because satisfaction of its antecedent ensures that conditioning on E increases H's probability and its odds strictly more than those of H*. Indeed, the weak likelihood principle must be an integral part of any account of evidential relevance that deserves the title "Bayesian". To deny it is to misunderstand the central message of Bayes' Theorem for questions of evidence: namely, that hypotheses are confirmed by data they predict. As we shall see in the next section, this "minimal" form of Bayesianism figures importantly in subjectivist models of learning from experience.

4. The Role of Bayes' Theorem in Subjectivist Models of Learning

Subjectivists think of learning as a process of belief revision in which a "prior" subjective probability P is replaced by a "posterior" probability Q that incorporates newly acquired information. This process proceeds in two stages. First, some of the subject's probabilities are directly altered by experience, intuition, memory, or some other non-inferential learning process. Second, the subject "updates" the rest of her opinions to bring them into line with her newly acquired knowledge.

Many subjectivists are content to regard the initial belief changes as sui generis and independent of the believer's prior state of opinion.
However, as long as the first phase of the learning process is understood to be non-inferential, subjectivism can be made compatible with an "externalist" epistemology that allows for criticism of belief changes in terms of the reliability of the causal processes that generate them. It can even accommodate the thought that the direct effect of experience might depend causally on the believer's prior probability.

Subjectivists have studied the second, inferential phase of the learning process in great detail. Here immediate belief changes are seen as imposing constraints of the form "the posterior probability Q has such-and-such properties." The objective is to discover what sorts of constraints experience tends to impose, and to explain how the person's prior opinions can be used to justify the choice of a posterior probability from among the many that might satisfy a given constraint. Subjectivists approach the latter problem by assuming that the agent is justified in adopting whatever eligible posterior departs minimally from her prior opinions. This is a kind of "no jumping to conclusions" requirement. We explain it here as a natural result of the idea that rational learners should proportion their beliefs to the strength of the evidence they acquire.

The simplest learning experiences are those in which the learner becomes certain of the truth of some proposition E about which she was previously uncertain. Here the constraint is that all hypotheses inconsistent with E must be assigned probability zero. Subjectivists model this sort of learning as simple conditioning, the process in which the prior probability of each proposition H is replaced by a posterior that coincides with the prior probability of H conditional on E.

(3.1) Simple Conditioning
If a person with a "prior" such that 0 < P(E) < 1 has a learning experience whose sole immediate effect is to raise her subjective probability for E to 1, then her post-learning "posterior" for any proposition H should be Q(H) = P_E(H).
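As an illustration of (3.1), here is a minimal Python sketch of simple conditioning over a finite set of possible worlds. The representation (a dictionary from worlds to probabilities, with propositions as sets of worlds), the helper names prob and condition, and the fair-die example are assumptions made for this sketch; none of them comes from the entry itself.

```python
from fractions import Fraction


def prob(dist, proposition):
    """Probability of a proposition, modeled as the set of worlds where it holds."""
    return sum(p for world, p in dist.items() if world in proposition)


def condition(prior, evidence):
    """Simple conditioning: return Q with Q(H) = P(H & E)/P(E) for every H."""
    p_e = prob(prior, evidence)
    if p_e == 0:
        raise ValueError("cannot condition on a proposition of probability zero")
    return {
        world: (p / p_e if world in evidence else Fraction(0))
        for world, p in prior.items()
    }


# Hypothetical example: a fair six-sided die.
prior = {face: Fraction(1, 6) for face in range(1, 7)}
E = {2, 4, 6}   # "the roll is even"
H = {4, 5, 6}   # "the roll is at least 4"

posterior = condition(prior, E)
print(prob(prior, H))      # 1/2
print(prob(posterior, H))  # 2/3
print(prob(posterior, E))  # 1
```

Worlds inconsistent with E receive posterior probability zero, and the worlds inside E keep their relative proportions, which is the sense in which the posterior departs minimally from the prior.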
In short, a rational believer who learns for certain that E is true should factor this information into her doxastic system by conditioning on it.

Though useful as an ideal, simple conditioning is not widely applicable because it requires the learner to become absolutely certain of E's truth. As Richard Jeffrey has argued (Jeffrey 1987), the evidence we receive is often too vague or ambiguous to justify such "dogmatism." On more realistic models, the direct effect of a learning experience will be to alter the subjective probability of some proposition without raising it to 1 or lowering it to 0. Experiences of this sort are appropriately modeled by what has come to be called Jeffrey conditioning (though Jeffrey's preferred term is "probability kinematics").

(3.2) Jeffrey Conditioning
If a person with a prior such that 0 < P(E) < 1 has a learning experience whose sole immediate effect is to change her subjective probability for E to q, then her post-learning posterior for any H should be Q(H) = qP_E(H) + (1 − q)P_~E(H).

Obviously, Jeffrey conditioning reduces to simple conditioning when q = 1.

A variety of arguments for conditioning (simple or Jeffrey-style) can be found in the literature, but we cannot consider them here.[16] There is, however, one sort of justification in which Bayes' Theorem figures prominently. It exploits connections between belief revision and the notion of incremental evidence to show that conditioning is the only belief revision rule that allows learners to correctly proportion their posterior beliefs to the new evidence they receive.

The key to the argument lies in marrying the "minimal" version of Bayesianism expressed in (2.1e) to a very modest "proportioning" requirement for belief revision rules.
(3.3) The Weak Evidence Principle
If, relative to a prior P, E provides at least as much incremental evidence for H as for H*, and if H is antecedently more probable than H*, then H should remain more probable than H* after any learning experience whose sole immediate effect is to increase the probability of E.

This requires an agent to retain his views about the relative probability of two hypotheses when he acquires evidence that supports the more probable hypothesis more strongly. It rules out obviously irrational belief revisions such as this: George is more confident that the New York Yankees will win the American League Pennant than he is that the Boston Red Sox will win it, but he reverses himself when he learns (only) that the Yankees beat the Red Sox in last night's game.

Combining (3.3) with minimal Bayesianism yields the following:

(3.4) Consequence
If a person's prior is such that LR(H, H*; E) ≥ 1, LR(~H, ~H*; ~E) ≥ 1, and P(H) > P(H*), then any learning experience whose sole immediate effect is to raise her subjective probability for E should result in a posterior such that Q(H) > Q(H*).

On the reasonable assumption that Q is defined on the same set of propositions over which P is defined, this condition suffices to pick out simple conditioning as the unique correct method of belief revision for learning experiences that make E certain. It picks out Jeffrey conditioning as the unique correct method when learning merely alters one's subjective probability for E. The argument for these conclusions makes use of the following two facts about probabilities.

(3.5) Lemma
If H and H* both entail E and P(H) > P(H*), then LR(H, H*; E) = 1 and LR(~H, ~H*; ~E) > 1.
[Proof Sketch]

(3.6) Lemma
Simple conditioning on E is the only rule for revising subjective probabilities that yields a posterior with the following properties for any prior such that P(E) > 0:
• Q(E) = 1.
• Ordinal Similarity.
  If H and H* both entail E, then P(H) ≥ P(H*) if and only if Q(H) ≥ Q(H*).
[Proof Sketch]

From here the argument for simple conditioning is a matter of using (3.4) and (3.5) to establish ordinal similarity. Suppose that H and H* entail E and that P(H) > P(H*). It follows from (3.5) that LR(H, H*; E) = 1 and LR(~H, ~H*; ~E) > 1. (3.4) then entails that any learning experience that raises E's probability must result in a posterior with Q(H) > Q(H*). Thus, Q and P are ordinally similar with respect to hypotheses that entail E. If we go on to suppose that the learning experience raises E's probability to 1, then (3.6) guarantees that Q arises from P by simple conditioning on E.

The case for Jeffrey conditioning is similarly direct. Since the argument for ordinal similarity did not depend at all on the assumption that Q(E) = 1, we have really established

(3.7) Corollary
• If H and H* entail E, then P(H) > P(H*) if and only if Q(H) > Q(H*).
• If H and H* entail ~E, then P(H) > P(H*) if and only if Q(H) > Q(H*).

So, Q is ordinally similar to P both when restricted to hypotheses that entail E and when restricted to hypotheses that entail ~E. Moreover, since dividing by positive numbers does not disturb ordinal relationships, it also follows that Q_E is ordinally similar to P when restricted to hypotheses that entail E, and that Q_~E is ordinally similar to P when restricted to hypotheses that entail ~E. Since Q_E(E) = 1 = Q_~E(~E), (3.6) then entails:

(3.8) Consequence
For every proposition H, Q_E(H) = P_E(H) and Q_~E(H) = P_~E(H).

It is easy to show that (3.8) is necessary and sufficient for Q to arise from P by Jeffrey conditioning on E. Subject to the constraint Q(E) = q, it guarantees that Q(H) = qP_E(H) + (1 − q)P_~E(H).

The general moral is clear.

  The basic Bayesian insight embodied in the weak likelihood principle (2.1e) entails that simple and Jeffrey conditioning on E are the only rational ways to revise beliefs in response to a learning experience whose sole immediate effect is to alter E's probability.
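The relationship between Jeffrey conditioning (3.2) and the rigidity condition (3.8) can be checked numerically. The following Python sketch is illustrative only: the finite-worlds representation, the function names prob, conditional and jeffrey_update, and the fair-die example with q = 3/4 are all assumptions introduced here, not part of the entry.

```python
from fractions import Fraction


def prob(dist, proposition):
    """Probability of a proposition, modeled as the set of worlds where it holds."""
    return sum(p for world, p in dist.items() if world in proposition)


def conditional(dist, hypothesis, given):
    """Conditional probability of hypothesis given `given`, by the ratio definition."""
    return prob(dist, hypothesis & given) / prob(dist, given)


def jeffrey_update(prior, evidence, q):
    """Jeffrey conditioning: shift P(E) to q, rescaling worlds inside and outside E."""
    p_e = prob(prior, evidence)
    if not 0 < p_e < 1:
        raise ValueError("Jeffrey conditioning requires 0 < P(E) < 1")
    return {
        world: (q * p / p_e if world in evidence else (1 - q) * p / (1 - p_e))
        for world, p in prior.items()
    }


# Hypothetical example: a fair die glimpsed in poor light.
prior = {face: Fraction(1, 6) for face in range(1, 7)}
E = {2, 4, 6}            # "the roll is even"
not_E = set(prior) - E   # "the roll is odd"
H = {4, 5, 6}            # "the roll is at least 4"
q = Fraction(3, 4)       # new probability for E after the glimpse

posterior = jeffrey_update(prior, E, q)

# Q(H) = q * P_E(H) + (1 - q) * P_~E(H) = 3/4 * 2/3 + 1/4 * 1/3
print(prob(posterior, H))  # 7/12

# Rigidity, as in (3.8): probabilities conditional on E and on ~E are unchanged.
print(conditional(posterior, H, E) == conditional(prior, H, E))          # True
print(conditional(posterior, H, not_E) == conditional(prior, H, not_E))  # True

# With q = 1 the update coincides with simple conditioning on E.
print(prob(jeffrey_update(prior, E, Fraction(1)), H))  # 2/3
```

Exact rational arithmetic (Fraction) is used so that the equalities in (3.8) can be tested without floating-point tolerance.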
While much more can be said about simple conditioning, Jeffrey conditioning and other forms of belief revision, these remarks should give the reader a sense of the importance of Bayes' Theorem in subjectivist accounts of learning and evidential support. Though a mathematical triviality, the Theorem's central insight, that a hypothesis is supported by any body of data it renders probable, lies at the heart of all subjectivist approaches to epistemology, statistics, and inductive logic.

Bibliography

Armendt, B. 1980. "Is There a Dutch Book Argument for Probability Kinematics?", Philosophy of Science 47, 583-588.
Bayes, T. 1764. "An Essay Toward Solving a Problem in the Doctrine of Chances", Philosophical Transactions of the Royal Society of London 53, 370-418. [Facsimile available online: the original essay with an introduction by his friend Richard Price]
Birnbaum, A. 1962. "On the Foundations of Statistical Inference", Journal of the American Statistical Association 53, 259-326.
Carnap, R. 1962. Logical Foundations of Probability, 2nd edition. Chicago: University of Chicago Press.
Chihara, C. 1987. "Some Problems for Bayesian Confirmation Theory", British Journal for the Philosophy of Science 38, 551-560.
Christensen, D. 1999. "Measuring Evidence", Journal of Philosophy 96, 437-61.
Dale, A. I. 1989. "Thomas Bayes: A Memorial", The Mathematical Intelligencer 11, 18-19.
----- 1999. A History of Inverse Probability, 2nd edition. New York: Springer-Verlag.
Earman, J. 1992. Bayes or Bust? Cambridge, MA: MIT Press.
Edwards, A. W. F. 1972. Likelihood. Cambridge: Cambridge University Press.
Glymour, Clark. 1980. Theory and Evidence. Princeton: Princeton University Press.
Hacking, Ian. 1965. Logic of Statistical Inference. Cambridge: Cambridge University Press.
Hájek, A. 2003. "Interpretations of the Probability Calculus", in the Stanford Encyclopedia of Philosophy (Summer 2003 Edition), Edward N. Zalta (ed.), URL = <http://plato.stanford.edu/archives/sum2003/entries/probability-interpret/>
Hammond, P.
1994. "Elementary non-Archimedean Representations of Probability for Decision Theory and Games," in P. Humphreys, ed., Patrick Suppes: Scientific Philosopher, vol. 1. Dordrecht: Kluwer Publishers, 25-62.
Harper, W. 1976. "Rational Belief Change, Popper Functions and Counterfactuals," in W. Harper and C. Hooker, eds., Foundations of Probability Theory, Statistical Inference, and Statistical Theories of Science, vol. I. Dordrecht: Reidel, 73-115.
Hartigan, J. A. 1983. Bayes Theory. New York: Springer-Verlag.
Howson, Colin. 1985. "Some Recent Objections to the Bayesian Theory of Support", British Journal for the Philosophy of Science 36, 305-309.
Jeffrey, R. 1987. "Alias Smith and Jones: The Testimony of the Senses", Erkenntnis 26, 391-399.
----- 1992. Probability and the Art of Judgment. New York: Cambridge University Press.
Joyce, J. M. 1999. The Foundations of Causal Decision Theory. New York: Cambridge University Press.
Kahneman, D. and Tversky, A. 1973. "On the psychology of prediction", Psychological Review 80, 237-251.
Kaplan, M. 1996. Decision Theory as Philosophy. Cambridge: Cambridge University Press.
Levi, I. 1985. "Imprecision and Indeterminacy in Probability Judgment", Philosophy of Science 53, 390-409.
Maher, P. 1996. "Subjective and Objective Confirmation", Philosophy of Science 63, 149-174.
McGee, V. 1994. "Learning the Impossible," in E. Eells and B. Skyrms, eds., Probability and Conditionals. New York: Cambridge University Press, 179-200.
Mortimer, Halina. 1988. The Logic of Induction, Ellis Horwood Series in Artificial Intelligence. New York: Halsted Press.
Nozick, R. 1981. Philosophical Explanations. Cambridge: Harvard University Press.
Renyi, A. 1955. "On a New Axiomatic Theory of Probability", Acta Mathematica Academiae Scientiarum Hungaricae 6, 285-335.
Royall, R. 1997. Statistical Evidence: A Likelihood Paradigm. New York: Chapman & Hall/CRC.
Skyrms, B. 1987. "Dynamic Coherence and Probability Kinematics", Philosophy of Science 54, 1-20.
Sober, E. 2002.
"Bayesianism — its Scope and Limits", inSwinburne (2002), 21-38.Sphon, W. 1986. "The Representation of Popper Measures",Topoi 5, 69-74.Stigler, S. M. 1982. "Thomas Bayes' Bayesian Inference",Journal of the Royal Statistical Society, series A145, 250-258.Swinburne, R. 2002. Bayes' Theorem. Oxford: OxfordUniversity Press (published for the British Academy).Talbot, W. 2001. "Bayesian Epistemology", Stanford Encyclopedia of Philosophy (Fall2001 Edition), Edward N. Zalta (ed.), URL = <http://plato.stanford.edu/archives/fall2001/entries/epistemology-bayesian/>Teller, P. 1976. "Conditionalization, Observation, and Change ofPreference", in W. Harper and C.A. Hooker, eds., Foundations ofProbability Theory, Statistical Inference, and Statistical Theories ofScience. Dordrecht: D. Reidel.Williamson, T. 2000. Knowledge and its Limits. Oxford:Oxford University Press.Van Fraassen, B. 1999. "A New Argument forConditionalization". Topoi 18, 93-96.Other Internet ResourcesFitelson, B. 2001. Studies in Bayesian ConfirmationTheory, Ph.D. Dissertation, University of Wisconsin. [Preprint in PDF available online] (750K download)Bayes' Original Essay (in PDF) (UCLA Statistics Department/History of Statistics)A Short Biography of Thomas Bayes (University of St. Andrews, MacTutor History of Mathematics Archive)The International Society for Bayesian Analysis (ISBA)Related Entries epistemology: Bayesian | probability, interpretations of Copyright © 2003 byJames Joyce<jjoyce@umich.edu> |