DECONSTRUCTING RECONSTRUCTION:
WHEN ARE THE RESULTS OF PARSIMONY AND STATISTICAL
PHYLOGENETIC ANALYSIS GENUINE ADVANCES?
Richard H. Zander
Buffalo Museum of Science
1020 Humboldt Pkwy
Buffalo, NY 14211
February 12, 2001
(moved to http://www.mobot.org/plantscience/ResBot/ April 28, 2003)


 


DECONSTRUCTING RECONSTRUCTION:
WHEN ARE THE RESULTS OF PARSIMONY AND
STATISTICAL PHYLOGENETIC ANALYSIS GENUINE ADVANCES?

Bryology Seminar, Missouri Botanical Garden

Richard H. Zander, 19 Feb. 1999 (Revised Feb. 12, 2001)

INTRODUCTION

This work is intended for the average systematist with a cautious interest in phylogenetics or the phylogeneticist with concern for reliability. The past 30 years have seen the publication of many exact—even if sometimes billed as "not well supported"—phylogenetic solutions. Publishing an exact solution that is not well supported carries a high probability of Type I error: the acceptance, even temporarily, of a wrong phylogenetic hypothesis (e.g., a particular internal branch, subclade, or tree) because the null hypothesis of no support is wrongly considered false. The philosophy that supports this practice is hypothetico-deductivism (or falsificationism), which accepts a theory if it resists falsification. This is opposed to verificationism, which rejects a theory that is not supported by numerous observed instances.

It might be said that the two philosophical attitudes are pragmatically equivalent, since one chooses for use those theories that are fruitful, predictive in practice, and promote understanding, whether supported hypothetico-deductively or by verification. The problem arises with retrodiction (or postdiction), the reconstruction of one-time past evolutionary events that cannot be directly observed. Any internal branch in a rooted tree has three possible arrangements of the two terminal and one basal branch (= "nearest neighbor interchange"). Given that evolution happened and limiting our attention to only those three lineages, only one arrangement is true. Thus, for each branch there is twice the chance of a Type I error (acceptance of a false phylogenetic hypothesis) as of a Type II error (rejection of a true phylogenetic hypothesis because one wrongly decided the null—of no support for the optimal tree—was true). Clearly, hypothetico-deductivism is a philosophy that allows or even encourages publication of exact solutions, with their concomitant Type I errors, no matter how poorly supported.

The relatively new microcomputer-based ability to analyze massive amounts of data in developing exact solutions has been encouraging to systematists. Phenetics, however, has been something of a dead end. Cladistics, in modeling evolution (yes, it does . . . that is its attraction), is a popular way of analyzing data, but statistical analysis and philosophy-based parsimony analysis are generally considered oil and water. Statistics is the spine of science. The rejection of statistics by most past and many present-day phylogeneticists is a major problem, but "statistical phylogenetics" has its own problems.

How did this situation come about? Is there a solution? This work is a reprise of a presentation given at a Missouri Botanical Garden Bryology Seminar (well attended by non-bryologists), criticizing some of the philosophical bases, or at least tendentious arguments, for present-day cladism and statistical estimation of phylogenetic relationships. There is also a handout.


 

THE TALK

Botanists have, in the past, had little familiarity with statistics. When I was introduced to botany, the symbol for any number greater than 64 was:

click here

Presently, however, to really be in touch with the cutting edge of evolutionary analysis, one needs to understand at least in a general way:

what one needs to understand at least in a general way

This is a working method of calculating the probability that a particular evolutionary scenario happened, given certain information and certain assumptions. For non-adepts, here are some simple examples of calculating probabilities:

PROBLEM:
You have a coin (a two-sided die) and a 20-sided die.
The coin has a "1" and "2" on its sides; the 20-sided die has a "1" and numbers up to "20" on its sides.
Someone randomly tosses one or the other until a "1" comes up.
What is your chance of selecting which object generated the data set "1"?

ANSWERS:
Parsimony: It is easier to generate a "1" with the coin than with the die, and the coin is the simplest explanation.

Bayesian Estimation: One follows the dictum that the probability of an hypothesis is proportional to the probability of the data set given that hypothesis. Assuming prior probabilities are uniform for both coin and die (fair throws and no loading), these simplified formulas apply:

     Coin probability: (1/2) / (1/2 + 1/20) = 10/11 = 0.91

     20-sided die probability: (1/20) / (1/2 + 1/20) = 1/11 = 0.09

Conclusion: The coin has a posterior probability of 0.91 of generating the data set "1", while for the 20-sided die, the probability is only 0.09. The coin is ten times more likely to have generated the "1".

Classical Frequency-Based Statistics: Statistics based on single throws are meaningless.
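
For readers who want to check the Bayesian arithmetic above, here is a minimal Python sketch (the variable names are mine, chosen only for illustration):

     # Likelihood of the data set "1" under each hypothesis
     likelihoods = {"coin": 1/2, "d20": 1/20}

     # Uniform prior probabilities (fair throws, no loading)
     priors = {"coin": 0.5, "d20": 0.5}

     # Posterior = prior times likelihood, normalized over all hypotheses
     unnormalized = {h: priors[h] * likelihoods[h] for h in likelihoods}
     total = sum(unnormalized.values())
     posteriors = {h: u / total for h, u in unnormalized.items()}

     print(posteriors)   # {'coin': 0.909..., 'd20': 0.0909...}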

 

MAXIMUM LIKELIHOOD AND THE ABOVE EXAMPLE

Modern phylogenetic analysis places much emphasis on maximum likelihood as a means of reconstructing evolution based on DNA sequence data. There are problems with this, however.

The coin's likelihood of generating the data set "1" is 1/2. This is greater than 1/20, and is the hypothesis of maximum likelihood. A measure of statistical support for this solution is the difference between likelihoods, with the likelihoods usually expressed as natural logarithms.

     ln 1/2 = -0.69
     ln 1/20 = -3.00

The difference = 2.31, which is the natural log of the likelihood ratio. In non-logarithmic terms this is a factor of 10: ln 10 = 2.31.

So the coin is ten times more likely than the 20-sided die to generate the data set "1". Using natural logarithms is overkill here but convenient when dealing with the very, very small likelihoods of particular tree topologies.
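
As a quick check of the logarithm arithmetic in Python:

     import math

     lnL_coin = math.log(1/2)    # -0.69
     lnL_d20 = math.log(1/20)    # -3.00

     diff = lnL_coin - lnL_d20   # 2.30, the natural log of the likelihood ratio
     ratio = math.exp(diff)      # 10.0: the coin is ten times as likely

     print(diff, ratio)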

We will get back to maximum likelihood in a while, but first let's look at:

 

CLADISTICS

Although cladistics offers "parsimony," "simplicity," and "best explanation" as logical bases for choosing the shortest tree, these are only admissible as sole justification when there are clear and few alternatives, and there is no possible great loss involved in accepting the simplest solution. Wittgenstein has pointed out that simplicity has no logical justification:

click here for Wittgenstein

Hennigian cladists, however, have promoted the use of simplicity.

simplicity

It is evident that accepting any exact solution leads to Type I errors (accepting as right a wrong phylogenetic hypothesis). One can always choose, however, not to choose when there is not far more evidence for than against. The key, of course, lies in the words "in the absence of contrary evidence" in the above quote. There is almost always contrary evidence, usually not detailed or emphasized in press.

Cladistics also implies that a cladogram is a "discovered" nest of relationships, something quite real in nature. But even Karl Popper, a promoter of hypothetico-deductivism, was leery of realism:

click here for Popper and realism

Popper plunked for better explanatory power or evidence for such, as in "approaching the truth," but he never had to choose among, say, 10 to the fortieth power topologies.

click here for Popper and better explanatory power

Philosophers of science commonly urge particular logical or at least rhetorically impressive criteria for choosing among competing theories, but seldom suffer from any wrong scientific choices they make, since . . . they don't make them. Those who actually make such choices can affect the quality of their science with Type I errors unless a reliability measure is a standard part of a study.

Salmon is another philosopher of science who championed simplicity and best explanations:

click here for Salmon and best explanation

Salmon (as well as Popper and others) is generally speaking of the statistics of single events (which are usually dealt with in Bayesian statistics by non-philosophers), such as reconstructing a past event. Salmon also encouraged seeing an increase in probability with additional data as indicating a correct choice even though the probability remains small.

click here for Salmon and increased probability

Cladists and others who espouse the doctrine of "approximating" truth with best explanations also strongly advocate the philosophy of realism. Scientists are mostly realists, in that we believe that there really are "things out there" that are sampled in our collections, are represented in our data sets, and are modeled in our theories. There is a significant problem with "approximating" truth, however, that can blind us when there is more summed evidence against than for a "best theory," or when there exists a well supported second-best alternative. There is the additional problem that a scientist can easily mistake a theory for a reality.

Click here for realism and approximating truth

 So when is there enough evidence for one hypothesis that one can reject another, or all others?

Consider the following cladogram:

 

The data set is given on the left. If there are five (advanced) traits shared by A and B, and none by B and C or A and C, then there is no contradiction in the data set.

How about:

Here B and C share one character ("1"). Thus we have 4 characters for and 1 against the above optimal cladogram ((AB)C). If all shared characters are equal evidence for phylogenetic relationship, then the researcher has a 4:1 chance of being right in selecting this cladogram rather than ((BC)A).

 

How about:

Now A and B share 5 traits (numbers 1-5) but A and C share 4 traits (numbers 6-9). Though ((AB)C) is a "best" explanation, the chance of selecting the correct cladogram is 5:4. The cladogram on the left above is one that philosophical cladists might term "poorly supported." Now, is it poorly supported or actually little more than the result of flipping a coin?

This example applies to one internal branch connecting A & B as the cladogram ((AB)C) and its two alternatives ((AC)B) and ((BC)A). But cladograms generally have many internal branches. Consider:

Suppose these were connected into a big cladogram, with A, B and C representing different lineages attached to each internal branch. If each optimal internal branch were selected as correct with a 5:4 (or 5 out of 9) chance, then the whole cladogram has a chance of being correct of 5/9 to the sixth power, about 0.03. This is a minuscule chance of being totally correct, though the cladogram still manages to meet the optimality criteria of most parsimonious, best explanation, approximating the truth, and so on.
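
For those who like to see the number, a tiny Python check of that compounding (the six internal branches are taken from the example above):

     p_branch = 5 / 9          # chance each optimal internal branch is correct
     p_tree = p_branch ** 6    # chance that all six internal branches are correct
     print(p_tree)             # about 0.029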

If this was all there was to it, there would be no problem. We would tend to reject all cladograms as improbable. One can, however, make probabilistic or otherwise assured theories about relationships: there is the class of relationships called "uncontested groups" that no one would dispute. Extreme example: cows and horses are more closely related to each other than either is to the sponge. Most cladograms have support (by some measure) somewhere in between that of uncontested groups and arrangements more akin to games of chance. But how to tell what support any one cladogram really has? [Note: I present an exact method of gauging support for each internal branch in issue 3 of Systematic Biology, 2001. See citation in the bibliography of this talk's handout.]

Note: What about Bremer support (= decay index) and subsampling (bootstrapping and jackknifing)? These are commonly used measures of branch support. The reader is referred to the handout, especially the citations of Oxelman et al. (1999), Rice et al. (1997) and Yee (2000), for disconcerting criticisms of these support values.

"Corroboration" has been a watchword in cladistics, but it is commonly used when only congruence is meant:

CORROBORATION is significantly increasing support, or maintaining the very high probability of one tree, with additional data, and ultimately implies "no reasonable doubt."

CONGRUENCE is additional data maintaining the same balance of support both for and against a hypothesis.

CONSILIENCE is congruence of data sets produced with somewhat different natural processes, such as data on morphology and molecular analysis.

In "Congruence and corroboration" above, two data sets (of advanced traits) corroborate the conclusion that A & B are more closely related to each other than they are to C. But in "Congruence and no corroboration," the two data sets are congruent, but do not corroborate the hypothesis (there remains equal evidence for and against the optimal solution.

BUT NOW, consider a data set with 6 characters shared by A & B, and 4 other characters shared by A & C; the optimal tree associates A & B as terminal groups. A second data set also has 6 characters supporting the A & B lineage, but 5 other characters supporting A & C. In this case, the second data set corroborates the lineage A & C and falsifies A & B, even though A & B continue to be the terminal groups in the optimal tree. In my opinion, discussing congruence, corroboration and falsification is splitting hairs, because no exact and reliable solution is supported. A & B and A & C have nearly equal support, and one might as well toss a coin. Refusing to choose the exact, optimal tree may or may not be a Type II error (rejecting the optimal tree when it is correct), but it is a fail safe solution for the researcher, students and scientists in other fields looking for reliable estimates of phylogenetic relationships.

QUESTION: If two consilient data sets produce the same shortest tree, even though that one tree is poorly supported, surely that cannot be rejected as random or sampling error?

ANSWER: Congruence supports all reasonable trees. The same two data sets can together support two or more different hypotheses, and these can be contradictory.

In a treatment of the taxonomy of giant lobelias in Africa, using chloroplast DNA, researchers got the following optimization:

click here for a cladogram of giant lobelia relationships

Note that branch length, though uncomfortably short (for gene trees) in many branches, is often about the same as the decay index (= Bremer support). This means that there is little conflicting evidence. In other branches, however, Bremer support is rather less than the branch length. This implies the existence of a contrary alternative branch of considerable length (the branch length minus the Bremer support). In addition, it is well known that what one gene gives, another takes away, since gene histories often conflict (lineage sorting and the like). Check the following conclusion of these authors:

click here for a pictorial review of the evolution of giant African lobelias

In this case, chloroplast DNA gene history is presented as equivalent to species evolutionary history. Out of context, this is a beautiful and compelling illustration, sure to grace the pages of a textbook someday. There are some qualifying words in the text, yet the potential for Type I error (accepting a wrong hypothesis as true) is large not only for the authors of this paper, but for fellow evolutionary phylogeneticists, students, teachers and for scientists who might use these results to guide conclusions in research efforts in other fields. The picture is, of course, not all wrong, and there is good support for many lineages, but how is the ambiguity presented? Only in the interpretation of the interrelationship of branch length and Bremer support (something not all readers will bother to do), and in a context of multifarious problems and assumptions familiar to phylogenetic experts but scarcely to all readers (see handout).

Let's combine two different data sets on a different group of taxa. Here are the published cladograms resulting from:

data set A (rbcL)

data set B (18s rDNA)

Result: Combined data set

The combined data set looks good! But how did it get that way?

Did conflicting data cancel each other out? It is well known that one can get nice, sometimes well supported trees from totally random data. (Note: I wrote a little DOS program that generates random data sets for those who want to experiment. Write me at rzander@sciencebuff.org for a copy.) My own take on this is that if one uses optimality criteria, there are almost always "best" results, no matter how the data are combined. Chance will increase support for some results, decrease it for others. This is especially problematic when differential lineage sorting (different gene histories, as mentioned above) is ignored.

When combining data sets, a "multiple tests" statistical problem arises. A statistician can manipulate data in a myriad ways, looking for a "significant" result, i.e. something that meets or exceeds a pre-established confidence level. Correction for multiple tests is commonly done (using Bonferroni correction) by dividing alpha (your toleration for Type I errors) by the number of tests. Thus, some branches will be found to be better supported with combined data sets, but that increase in support must be lowered by some correction factor. Again, researchers who tend to tolerate Type I errors also tend to ignore contrary evidence.
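
To make the Bonferroni idea concrete (the numbers below are invented for illustration, not from any study): with 10 tests and a tolerance for Type I error of alpha = 0.05, each individual result must clear the stricter threshold alpha/10.

     n_tests = 10
     alpha = 0.05                       # overall tolerance for Type I errors
     alpha_per_test = alpha / n_tests   # Bonferroni-corrected threshold, 0.005

     p_values = [0.04, 0.01, 0.003, 0.20, 0.06, 0.008, 0.001, 0.15, 0.30, 0.02]
     significant = [p for p in p_values if p < alpha_per_test]
     print(alpha_per_test, significant)  # 0.005 [0.003, 0.001]: most "significant" results do not survive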

Conflicting data have been thrown out if the authors decide that, for instance, only two particular data sets out of the three available combine to make a nice composite cladogram, and that these are sufficient. Note in the following that the authors feel it appropriate to use the ITS data set only to address intragroup relationships . . . clearly a decision made to view exact results as "approximating truth."

data thrown out (ITS)

It is possible, of course, to throw out data sets for all kinds of good reasons (wrong rate of evolution, tracking a demonstrably divergent gene history, sample error, and so on). When such reasons are not given, or are specious, the reader must be cautious about any exact solution.

The principle of total evidence is commonly cited as the justification for combining data sets:

click here for Hempel's explanation of total evidence

Why not use all evidence? My answer is that only data sets produced by the same process can be logically combined. Thus, more data about a particular gene history is good, but one cannot combine data about different gene histories unless an impressive theory is available that explains what happens when gene histories are somehow averaged to get a species phylogeny, and such a theory is not available. Given the prevalence of conflicting gene histories, a large number of data sets is probably needed to distinguish those that track species history (as noted by Nei and others, see handout).

Also, optimality methods look only at the positive side of the results of combining all evidence. If one actually weighs the evidence both for and against a solution (one particular solution chosen beforehand) and if support increases, then using total evidence can be okay. Choosing one solution after combining data sets is a multiple tests problem.

 Now:

MAXIMUM LIKELIHOOD

This is the method of hope of statistical phylogeneticists (versus the philosophy-based analytical procedures of cladists).

Edwards is the big gun on maximum likelihood:

click here for a simple statement by Edwards

LIKELIHOOD RATIOS
are used to gauge support for a particular solution in maximum likelihood calculations:

further explanation by Edwards

Simple example:
You have a 4-sided die and a 6-sided die. You roll them randomly and look for the first to generate a "1". The likelihoods for each die of generating the data set "1" are:

     ln 1/4 = -1.38
     ln 1/6 = -1.79 (difference between the log-likelihoods = 0.41)

The 4-sided die is 1.5 times as likely as the second most likely solution to generate the data set "1" (e to the 0.41 power = 1.5). That factor of 1.5 is the likelihood ratio.

[Note added February 24, 2001:]

BOTH PARSIMONY AND MAXIMUM LIKELIHOOD PRODUCE EXACT SOLUTIONS

A preference for an exact solution gets you something for nothing. Even a bush (a multifurcating tree) is an exact solution that means nothing unless the lineages below it (though not shown) are very well supported (so the branches cannot be thought of as possibly positioned elsewhere).

Example: You suspect a coin is loaded. You toss it 100 times. It comes up 50 tails and 50 heads (a bush). Your best answer is that it is not loaded. A different coin that you check for loading comes up 51 tails and 49 heads, so your best answer is that it is loaded. Yet . . . the answers are scientifically, if statistics means anything, equivalent. The null hypothesis of not being loaded cannot be rejected, and you are left with nothing. With most cladograms, the null hypothesis of no phylogenetic loading cannot be rejected by the data presented (when looking at an entire tree of many taxa).
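
Here is the coin-loading arithmetic as a short Python sketch (a hand-rolled exact binomial test, not taken from any statistics package):

     from math import comb

     n = 100
     p_k = [comb(n, k) * 0.5**n for k in range(n + 1)]   # probability of k heads for a fair coin

     # Two-sided exact test: sum the probabilities of all outcomes at least as unlikely as the observed one
     def p_value(heads):
         return sum(prob for prob in p_k if prob <= p_k[heads])

     print(p_value(50))   # 1.0 -- a 50:50 result gives no evidence of loading
     print(p_value(51))   # about 0.92 -- neither does 51:49; the null of a fair coin stands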

But! What about support values? The decay index is always relative to the length of the branch; thus a branch with length 40 and decay index 10 may have one or two alternative branches (nearest neighbor interchange) that are up to 30 steps in length. Bootstrapping is a wonderful tool (an analog of exact binomial calculations) that gives a good basis for evaluating pre-selected confidence levels but bootstrapping is calculated by examining whole trees, and homoplasy (which is rampant in most molecular data sets) affects it (undoubtedly lowering its calculated value). Also, bootstrapping cannot deal with the problem of two conflicting apparent phylogenetic signals (e.g. when one of the alternative branches calculated after nearest neighbor interchange is nearly as well supported as the optimum branch and both are significantly longer than the shortest branch).

On the other hand, what about the case when longer trees are much less well supported . . . does this mean they can be ignored? Statistics is the spine of science. Consider this example: you have a chicken yard. There is a big chicken and 50 little chicks (each one dyed a different Easter egg color) in the yard. You toss a kernel of corn into the yard and glance away, and fttt it was eaten. Which bird ate the kernel? You toss more kernels randomly and find from the data set you compile that the big chicken is 50 times more likely to eat a kernel than any chick, and each chick is about as likely as any other chick to eat a kernel.

Maximum likelihood analysis would say that the big chicken ate the original kernel with a likelihood ratio of 50! (i.e., comparing likelihoods of the hypothesis of maximum likelihood and the secondmost likely.) Wow! However . . . all the birds contributed to the data set, and any bird that contributed to the data set cannot be ignored, can it? Therefore the chance of the big chicken eating the original kernel was 50%. (Maximum likelihood gives you something for nothing if you trust in likelihood ratios and you have more than two possible hypotheses.)

But! Note that no one chick (alternative hypothesis) had a likelihood anywhere as high as that of the big chicken! What is the chance we can eliminate the likelihoods of the chicks as irrelevant and just too small to matter? We can't, because they all contributed to the data set, and only if we can eliminate them from the data set can we eliminate their summed probabilities (summing to 50%). And there is no empirically based theory that will allow us to do so (or to eliminate long trees, since these also must be considered as contributing to a cladistic data set since any one of them could have been solely responsible for it).
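
For the chicken-yard numbers, a minimal sketch of the difference between the likelihood ratio and the posterior probability (the relative likelihoods are those compiled above):

     # Relative likelihoods of eating a kernel, from the compiled tossing data
     big_chicken = 50.0
     chicks = [1.0] * 50

     likelihood_ratio = big_chicken / max(chicks)               # 50: best hypothesis versus second best
     posterior_big = big_chicken / (big_chicken + sum(chicks))  # 0.5: best hypothesis versus ALL hypotheses

     print(likelihood_ratio, posterior_big)   # 50.0 0.5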

But! What is the chance that a 50:1:1:1...(50 ones)% probability distribution would happen by chance alone? Well, the distribution of likelihoods is not a data set of observations (not a sample), and we can't do chi-squared or other non-parametric analyses on these. This probability distribution would be approximately the same every time you created a data set with these birds.

The situation with cladograms is worse than this extreme example because there is doubtless no sharp distinction between the likelihood of the shortest tree and that of the secondmost short tree and the thirdmost and the fourthmost, etc. (unless we have very, very few taxa in the data set).

Therefore we really can get something for nothing, but not only chickens will squawk. An exact solution is publishable through the magic of the philosophy of parsimony, even though there are doubtless . . . doubtless many almost as well supported alternative trees. I limit this comment to trees of many taxa. Four-taxon trees are a special case and non-parametric tests of support are possible.

Since cladistic and maximum likelihood analyses are optimizations, of course they approximate general intuitive evaluations of phylogenetic relationships (e.g., "uncontested groups"). However, the special qualification for respect and attention of these methods of phylogenetic analysis is that they are more exact than intuition. I submit that such greater precision is largely an artifact of philosophy, rhetoric and statistical gobbledegook. I'm sure that somewhere in published exact results there is greater precision, and as such it is an advance in knowledge, but it is very hard to tell such an advance from nonsense.

From the above discussion, you can estimate my opinion of efforts in creating a Phylocode as a substitute or even as an alternative for the flexible-though-imperfect standard codes we have now.

 

A SPECIAL PROBLEM WITH MAXIMUM LIKELIHOOD ANALYSES IN PHYLOGENETIC ANALYSIS

Note: If you see in a paper that support (of a solution against the second likeliest solution) is -ln L = 5000.0 versus -ln L = 5002.0, this can be interpreted as a difference of 2 natural-log units. Now e to the second power, with e = 2.7, is about 7.4. So the solution of maximum likelihood (5000.0) should be 7.4 times as likely as the second most likely solution (5002.0).

But actual maximum likelihood analysis with sequence data cannot use likelihood ratios to measure relative support for one tree over another. This is because likelihoods of nucleotides are maximized on each topology, and each topology is a different model (see handout for relevant citations).

A really simple example:

You have a coin (labeled "1" and "2" on its sides), a 4-sided die, a 6-sided die and a 20-sided die.

Q: If these four are each randomly selected and thrown randomly, which one has maximum likelihood of turning up a "1"?

A: The coin, with a likelihood of 1/2 and a likelihood ratio of (1/2)/(1/4) = 2 (the coin is twice as likely as the 4-sided die to generate the data set "1").

BUT
Suppose you had the same four objects, and the coin and 6-sided die were in box A, and the 4-sided die and the 20-sided die were in box B. Which Box would have the best chance of generating a "1" if the objects were again randomly selected and randomly thrown? Box A. But what is the support? One cannot simply compare the likelihoods of the coin and the 4-sided die to get a likelihood ratio. Instead, the likelihoods of all the objects need to be taken into account:

     Box A probability: (1/2 × 1/2 + 1/2 × 1/6) / [(1/2 × 1/2 + 1/2 × 1/6) + (1/2 × 1/4 + 1/2 × 1/20)] = 0.33 / 0.48 = 0.69

     Box B probability: (1/2 × 1/4 + 1/2 × 1/20) / [(1/2 × 1/2 + 1/2 × 1/6) + (1/2 × 1/4 + 1/2 × 1/20)] = 0.15 / 0.48 = 0.31

ln 0.69 = -0.37; ln 0.31 = -1.17; difference = 0.8; ln 2.2 = 0.8

SO: Box A is 2.2 times as likely as Box B to generate the data set "1", and 2.2 is the support for Box A as the solution. This is an impossible calculation for sequence data sets involving more than a very few taxa.
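
The box comparison can be written out in a few lines of Python (a sketch of the arithmetic above, nothing more):

     import math

     # Likelihood of turning up "1" for each object, grouped by box
     box_a = [1/2, 1/6]      # coin, 6-sided die
     box_b = [1/4, 1/20]     # 4-sided die, 20-sided die

     # Each object within a box is selected with probability 1/2
     like_a = sum(0.5 * p for p in box_a)   # 0.33
     like_b = sum(0.5 * p for p in box_b)   # 0.15

     post_a = like_a / (like_a + like_b)    # 0.69
     post_b = like_b / (like_a + like_b)    # 0.31

     print(post_a / post_b)                        # about 2.2
     print(math.log(post_a) - math.log(post_b))    # about 0.8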

Monte Carlo sampling has been used in analyzing the relationships of groups of many taxa.

 

FULL BAYESIAN ESTIMATION AND MARKOV CHAIN MONTE CARLO STUDIES

This is a way of using nucleotide sequence data from many taxa to evaluate posterior probabilities of possible phylogenetic trees. It is less favored by statistical phylogeneticists, largely because this method is new and the software is not yet commonly available in an easy-to-use format. There are major problems with this method, however.

Bayesian analysis is mathematically the most complex and abstruse of phylogenetic statistical methods. Most statistical textbooks give a good account of Bayes' Theorem, which is fairly simple. Very, very simply, with Bayes' theorem one takes the likelihood of the data for a hypothesis (times any prior probability), divides it by the sum of the likelihoods of the data for all possible hypotheses (times any prior probabilities), and this equals the "posterior probability" (or the probability of the hypothesis being true). We did this above, with boxes A & B, with the assumption of uniform prior probabilities (no loading and fair throws).

Bayesian analysis concerns the probability of single events, figures that some statisticians do not accept as "real" probabilities (long-run frequency statistics can make very accurate predictions; not so with single events). But Bayesian statisticians win their bets on single events in the long run. The trouble is that science cannot tolerate solutions that are correct only a little more than half the time; that wins in the long run only in a casino.

Because of its complexity, straight Bayesian analysis of phylogenetic relationships is computationally doable for only a very few taxa:

click here for the math

it can come up with impressive results, however

With larger data sets, instead, sampling methods called Monte Carlo methods are used to estimate the probabilities of the most likely tree topologies in explaining the data set.

There is a "credible zone" in Bayesian analysis, similar to the confidence interval in classical statistical analysis. A hypothesis that comprises a researcher-selected 95% credible zone (or interval) is one that is very probable. Several hypotheses may add their probabilities to the credible zone. For instance, suppose you have 7 trees. Their probabilities are:

0.50, 0.30, 0.15, 0.03, 0.01, 0.005, 0.005

Then the first three trees, whose probabilities add to 0.95, comprise the "credible zone." What these trees have in common is the probabilistic solution.
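
A short Python sketch of how such a credible set is assembled from the seven probabilities listed above (sorted largest first):

     tree_probs = [0.50, 0.30, 0.15, 0.03, 0.01, 0.005, 0.005]

     credible_set, running_total = [], 0.0
     for p in tree_probs:              # probabilities are already sorted, largest first
         credible_set.append(p)
         running_total += p
         if running_total >= 0.95:     # stop at the researcher-selected level
             break

     print(credible_set, running_total)   # [0.5, 0.3, 0.15] 0.95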

Here is a typical solution, with the plant species Clarkia:

click here for an interesting Markov chain Monte Carlo analysis

The probabilities of the first three trees in the quoted paper above add up to a pre-selected 99% credible zone, and one tree is much more likely than the second best. Sounds good, even though the best tree has a posterior probability of only 64.9%. There is 35.1% evidence against this solution, but that evidence supports no one alternative tree. At this point one might muse on the question that if there is one weakly supported optimal solution and a host of other solutions that are each very poorly supported, does this mean that one can ignore the other solutions (and in effect raise the probability of the optimal solution to 100%)? Or might the optimal solution at least sometimes be a blip or random combination of data that is unrelated to the phylogenetic history? (Again, random data often can be used to generate well resolved cladograms with some branches well supported by Bremer support, but seldom by bootstrapping. To get a DOS executable of my random data set generator drop me a note at rzander@sciencebuff.org.)

The same researchers did an MCMC study of cichlid fish mtDNA data, of 32 species, yielding 10 to the 40th power different possible topologies:

cichlid fish results

The posterior probability of the most likely tree was low, so the researchers combined particular groups of taxa (that's okay) and found some decent figures:

click here for decent figures

The most likely clade had a posterior probability of 64.5%, and the next most likely had a figure of 10.2%. Although this is an impressive achievement, one must ask oneself if this result is sufficiently probable (reliable, true) to base other research on, say, biogeographic conclusions? It might be, with certain, specified qualifications.

One problem the researchers pointed out is that different techniques gave different results:

click here for comparison of annoyingly different results

An important problem with any such computation is that many trees have probabilities too small to calculate. When there are 10 to the fortieth possible topologies, the sum of the probabilities of the many trees with so-small-as-to-be-non-calculable probabilities may be significant compared to the "credible zone." Thus, the probabilities of the most likely clades in Monte Carlo sampling studies are always relative to the sum of the probabilities of the trees actually calculated, not the sum of the probabilities of all possible trees, including those not sampled and those not calculated.
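
A toy illustration of that point (the numbers are invented, only to show the arithmetic): if the sampler visits trees carrying, say, 80% of the total probability, normalizing over the visited trees alone inflates every reported posterior.

     # Hypothetical probabilities for the trees the sampler actually visited
     visited = {"tree1": 0.40, "tree2": 0.25, "tree3": 0.15}
     unvisited_mass = 0.20   # probability carried by the many trees never calculated

     visited_total = sum(visited.values())
     reported = {t: p / visited_total for t, p in visited.items()}                     # relative to visited trees only
     actual = {t: p / (visited_total + unvisited_mass) for t, p in visited.items()}    # relative to all trees

     print(reported["tree1"], actual["tree1"])   # 0.5 versus 0.4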

There are more kinds of topologies that must be taken into account in likelihood and Bayesian calculations, where statisticians distinguish between trees and dendrograms. The probabilities of these must be summed to get a composite posterior probability of a clade:

summing probabilities

The math of these treatments is impressive, but the results remain optimal solutions that depend on a complex of assumptions and data. But you can do these yourself! Software is now available (PAML by Yang, and BAMBE by Simon and Larget) for anyone to try their hand at Bayesian MCMC analyses. One needs to be able to set a few options:

options in PAML MCMC

or at least understand what the default settings mean, watch the output while thousands of topologies are "visited" by the Monte Carlo program (not too long with fast computers):

streaming output

then, interpret the output summary:

output summary

where the number of times a topology is visited is directly proportional to its likelihood. One major stumbling block with both maximum likelihood and Bayesian MCMC analyses is the number of regularity assumptions and guesses that must be made in regard to the evolutionary model (see handout).
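
A minimal sketch of how posterior probabilities are read off a chain of visited topologies (the topology labels and the visit record are hypothetical, not BAMBE or PAML output):

     from collections import Counter

     # Hypothetical record of which topology the chain visited at each step
     visits = ["((AB)C)", "((AB)C)", "((AC)B)", "((AB)C)", "((BC)A)",
               "((AB)C)", "((AC)B)", "((AB)C)", "((AB)C)", "((AB)C)"]

     counts = Counter(visits)
     posteriors = {topology: n / len(visits) for topology, n in counts.items()}

     print(posteriors)   # {'((AB)C)': 0.7, '((AC)B)': 0.2, '((BC)A)': 0.1}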

 

ERROR

Error is a fact of life. This talk emphasizes that Type I errors are to be avoided, since refusing to accept an exact solution, even when a Type II error (rejecting a true phylogenetic solution because the null hypothesis of no support for the optimal tree cannot be falsified) may be involved, is fail safe. But there is no progress without taking the chance of Type I errors. Some writers assume (incorrectly) that we must make (error-ridden) decisions:

click here for assumption that we must make errors, Type I or Type II

The reason accepting no hypothesis is fail safe is strictly pragmatic:

pragmatism and error

where minimax solutions (minimizing the maximum possible loss) are to be sought.

What are the possible consequences of being in error in phylogenetic reconstruction? What is driving the tendency to publish exact results of phylogenetic analysis whether well supported or not? Why does one brave Type I errors and convince oneself of "discovering" (approximately) an apparently "real" nested hierarchy in nature? The gain/loss table tells all:

 


                                GAIN/LOSS TABLE

                        HYPOTHESIS ACCEPTED       HYPOTHESIS REJECTED

IF PHYLOGENETIC         Type I error.             Eventual satisfaction.
HYPOTHESIS              Satisfaction.             No glory, no grants.
IS REALLY FALSE         Glory, grants.
                        Problems for others.


IF PHYLOGENETIC         Eventual satisfaction.    Type II error.
HYPOTHESIS              Glory, grants.            No problems, but
IS REALLY TRUE                                    no glory, no grants.

 

Note: It is easy to point out error and to comb the literature for supportive quotes for one's own ideas. I have, however, in press (cited in the handout) a means of measuring the reliability of each cladogram branch using a non-parametric test (chi-squared) on the three alternative branch lengths obtained from nearest-neighbor interchange. It points out where some optimal branches may have been chosen by little more than flipping a coin (a three-sided coin), and where there appears to be probabilistic support at an acceptable confidence level for the optimal branch. Yee (2000, cited in handout) recently offered a similarly non-parametric method of evaluating branch support, but his method is problematic for two reasons: the whole cladogram is involved in calculating each branch probability, not just the three alternatives from nearest neighbor interchange, and the signed-ranks probabilities are used throughout as branch probabilities (a Bayes simple proportion should be used to calculate the probability of selecting the correct alternative branch arrangement when a pre-selected confidence level is not attained).
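
As a rough sketch of the kind of test described (my own reading of it, not the published implementation, and the branch lengths are invented for illustration): the three alternative branch lengths from nearest-neighbor interchange are compared with the expectation of equal lengths, using 2 degrees of freedom.

     # Steps supporting the optimal branch and its two nearest-neighbor-interchange alternatives (hypothetical)
     lengths = [12, 4, 3]

     expected = sum(lengths) / 3                                   # equal-length expectation
     chi_squared = sum((obs - expected) ** 2 / expected for obs in lengths)

     # The critical value for 2 degrees of freedom at the 0.05 level is about 5.99
     print(chi_squared, chi_squared > 5.99)   # about 7.7, True: support beyond coin-flipping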

J. Lyons-Weiler (pers. comm.) has pointed out, quite rightly, that the generation of any optimal tree "throws away the error term." Luckily, parsimony still leaves plenty of error for my method to root out. He has championed treeless analysis for phylogenetic signal (RASA), which has much potential.

Many programs are available for phylogenetic reconstruction. Nearly all of them will favor a Type I error over a Type II; thus, they may be fail-safe for the researcher, but not for associated students or workers in other fields who rely on the results. M. E. Siddall has a good summary of the logical and statistical bases behind phylogenetic methods. He is not particularly keen about any one method. On the other hand, the data are there, and I am sure the potential for new knowledge is there when we distinguish a true advance from nonsense.

 


I thank Marshall Crosby and Robert Magill for hosting this seminar. Since this was an informal talk, author citations were largely eliminated, and I hope readers can distinguish between my own ideas and those of others that I present. If in doubt, full attributions are given in papers that are cited in the handout.

Revision History

February 24, 2001
April 10, 2001 Comments on Decay Index and bootstrapping.

 

 
