Buffalo Museum of Science 1020 Humboldt Parkway Buffalo, NY 14214-1293 www.buffalomuseumofscience.org Sept. 11, 2001

Richard H. Zander Buffalo Museum of
Science, 1020 Humboldt Parkway, Buffalo, NY 14214-1293 U.S.A. (present
address:
This
non-parametric method is a modification of the Conditional Probability of
Reconstruction method for gauging reliability of individual branches by
comparing alternative putative branch lengths obtained with nearest neighbor
interchange and recalculation under restraint of a single optimal internal
branch. The CPR uses a chi-squared test on the three alternative branch
lengths and a null of a random distribution of these three lengths. With
fewer than 15 loci, however, one must use an exact binomial calculation and a
probability of 1/3. For probabilistic reconstruction of the species tree
conditional on the data and all methodological assumptions one needs—at
minimum—identical, well supported results from three different loci without
contradiction. This can apply to the whole tree or to lineages or to single
internal branches.
The expectation that DNA data will
provide solutions to previously intractable systematic problems is shared by
many recent authors. This, however, cannot be expected to be the case, at
least not for many years to come. There are major problems with the reliability
of the results of molecular (and morphological) phylogenetic analyses, as
reviewed by Omland (1994). This is especially true because the best answer in
non-statistical cladistic analyses is not necessarily much better supported than
the second best, or the third best, which may be rather different hypotheses
(Zander 1998a). This applies to maximum parsimony in that bootstrapping,
although a direct substitute for an exact binomial calculation or a
chi-squared test, is not directional, i.e. not just one but two (or even
three) of the three possible arrangements of any internal branch are
phylogenetically "loaded", and the gene pool is coarsely
heterogeneous due in part to lineage sorting (different
"speciation" times for genes and species). Also, Bremer Support or
the "Decay Index" (Bremer, 1988, 1994) is a poor measure of branch
support in data sets of very many characters because a branch of length n and
Bremer support of x may have an alternative branch length (through nearest
neighbor interchange) of n minus x, being an alternative, contradictory
branch length that could be comparatively large. These as well as additional
problems are discussed by Oxelman et al. (1999), Rice et al. (1997) and Yee
(2000). Likelihood ratios, the standard
measure of support in maximum likelihood studies, cannot be used in
phylogenetic analysis because the likelihoods are already optimizations (Nei,
1987; Yang, 1996). Full Bayesian studies such as Markov chain Monte Carlo
analysis yield posterior probabilities of different alternative trees that
appear to add to 100%, yet these are relative to the sum of the probabilities
of the myriad trees with likelihoods too small to be calculated (Mau et al.,
1997; Yang & Rannala, 1997), while priors for the chance of a mistake
that affects tree estimation in a regularity assumption or sample error are
not calculated (they are assumed to be "uniform" or
"uninformative"). Steel and Penny (2000) summarize additional
problems. Non-parametric tests of reliability that evaluate the whole tree,
such as the Templeton (Templeton 1987) and similar Kishino-Hasegawa (Kishino
& Hasegawa 1989) tests, may demonstrate that the best tree is
significantly better than the next best, but many branches of that best tree
may be poorly supported. Even if an analytic method can demonstrate that all
alternative trees have probabilities much lower than that of the optimal
tree, because any one of the alternative trees is fully capable of generating
the data set, they cannot be ignored and the sum of their probabilities must
be taken into account. I have introduced (Zander 2001),
for measuring reliability of internal cladogram branches, a new
non-parametric test (Zander, in press) called the Conditional Probability of
Reconstruction (CPR), which uses chi-squared analysis after nearest neighbor
interchange of an internal branch under constraint. It tests whether the best
choice (the longest branch) of the three possible alternative arrangements
for any four nearest neighbor branches has support (measured as length of
branch) distinguishable from a random distribution of support for the three
alternative branch lengths obtained after nearest neighbor interchange at
some chosen level of confidence. The rationale is similar to that of the
likelihood ratio test, which compares the likelihood of the optimal result
with that of the second most likely result; here the comparison is with both the
second and third most likely results. The method is like that of Yee (2000), who used
the similarly non-parametric Templeton signed ranks test to compare each of
the branch lengths of an optimal maximum parsimony tree against that of the
most parsimonious clade inconsistent with it. Yee’s test, however, does not
take into account that there are two shorter alternatives to any one optimal
branch, not one, and these should themselves be demonstrably indistinguishable
from a random distribution of their lengths; otherwise other
factors may be involved (e.g., introgression, or a great chance imbalance of the
shorter alternative branch lengths that makes selection of one branch as
"phylogenetically loaded" impossible). Also, the total tree lengths
were compared in Yee's test, which then makes re-optimization of other branch
lengths relevant to that of the analyzed branch. The results of the CPR test (and
indeed all tests) are conditional on the data used, the assumptions employed
(e.g., models, optimality settings), and problems intrinsic to particular
methods of analysis, e.g. branch length heterogeneity in the true tree
affecting branch length estimates because such estimates are optimizations
(Lyons-Weiler & Takahashi, 1999), or intrinsic to evolutionary processes,
e.g. convergence, long-branch attraction, parallelism and introgression
(Avise, 1994; Doyle, 1992; Templeton, 1986). Essentially, one uses CPR
analysis to judge if a branch length estimate is distinguishable from a
random distribution vis-a-vis the two closest alternative branch lengths at a
particular confidence level, not whether the branch is true or not. This
method simply clarifies a parsimony optimization by eliminating the
possibility of random selection of one of three possible alternatives as a
genuine scientific result (cladomancy, or divination by trees). Thus, poor
support (a resolved but poorly supported tree) can be viewed as no different
than no support (a bush) when internal branch support cannot be
distinguished from a random distribution at a selected level of confidence. Analysis (Zander, in press) of two
published data sets (morphology of the moss The CPR method can also be used to
evaluate the problem of conflicting gene lineages in molecular phylogenetic
analysis, which is also entirely relevant here. For example, there has long
been controversy over the relationships of humans, chimpanzees and gorillas
(Goodman, 1963; Goodman et al., 1989; Hayasaka et al., 1988; Miyamoto et al.,
1987, 1988). A meta-analysis of the What then is the smallest number of
genes that must be investigated to probabilistically determine species
evolution in light of contrary but presumably randomly generated gene
lineages? Nei et al. (1998) in a study of optimization and analytic methods
in phylogenetics, concluded that "...it seems to be necessary to examine
many independently inherited genes in the construction of reliable
phylogenetic trees for different organisms" repeating in different words
the conclusions of Pamilo and Nei (1988).
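The chi-squared comparison at the heart of the CPR test, as characterized earlier in this introduction, can be illustrated with a short Python sketch. It is an illustration only and not the author's implementation: the function name `cpr_chi_squared` and the example branch lengths are hypothetical, and the sketch assumes an ordinary goodness-of-fit test with two degrees of freedom (three categories with a fixed total), which this section does not state explicitly. For two degrees of freedom the upper tail of the chi-squared distribution is exactly exp(-x/2), so no statistics library is needed.

```python
from math import exp

def cpr_chi_squared(lengths):
    """Goodness-of-fit of three alternative branch lengths against an
    equal (random) split. Returns (statistic, p_value).

    Assumption: 2 degrees of freedom (three categories, fixed total),
    for which the chi-squared upper tail is exactly exp(-x / 2).
    """
    if len(lengths) != 3:
        raise ValueError("expected three alternative branch lengths")
    expected = sum(lengths) / 3  # null: support spread equally over the three
    statistic = sum((obs - expected) ** 2 / expected for obs in lengths)
    return statistic, exp(-statistic / 2)

# Hypothetical lengths (steps) for the optimal branch and its two
# alternatives obtained by nearest neighbor interchange.
stat, p_value = cpr_chi_squared([20, 6, 4])
print(f"chi-squared = {stat:.2f}, p = {p_value:.4f}")
```

The statistic measures only departure from an equal split, so one would still need to confirm that the longest of the three lengths is the optimal branch, and that the expected count per category is at least 5, before reading a small p-value as support for that branch.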
Analysis
on a minimal number of gene data sets may be done treating each gene as a
character (Doyle, 1992; Slowinski & Page, 1999). An expected (null)
number of at least 5 per category is usually considered minimal for
chi-squared studies, implying that 15 well-supported gene trees (5 being the
null or expected number per category if randomly generated) are necessary given
equiprobability. If an exact binomial probability test (Lowry 2000; Peladeau
1994) is used, however, the number of gene trees needed for a probabilistic
reconstruction may be less than 15. The formula for this test is:

$$P(k) = \frac{n!}{k!\,(n-k)!}\, p^{k} q^{n-k}$$

where n is the number of trials (loci studied), k is the number of events tested for
(optimal tree), p is the probability of one event in any one trial, and q = 1 - p is the
probability that the event will not occur in any one trial. This test asks
"what is the chance of k results out of n trials?" In this case the
calculated probability is for "k or more in n trials", obtained by applying
the formula to every value from k through n and summing the results. The term "tree" is used
throughout this paper but the discussion will usually equally apply to any
four-taxon unrooted tree, any three nearest neighbor terminal taxa, or to an
interior branch of any tree where the problem can be decomposed into its
simplest form. To simplify calculation in the face of little understanding of
contributory effects, the null hypothesis is that no evolution is occurring
and the three alternative branch lengths (from nearest neighbor interchange
and recalculation via constraint trees) are equiprobable and generated
randomly. With an exact binomial calculation with probability set at 1/3, one
can accept or reject the presence of phylogenetically loaded strong imbalance
of the optimal tree (or branch) versus the two less well supported
alternative trees, i.e., significantly non-random at, e.g., the .95
confidence level. The exact binomial test used is one-tailed because we want
to know the probability that support for one particular gene tree is distributed
this way by chance alone, not the probability that any one of the three
arrangements is so distributed; the particular tree of interest is the one generated
by maximum parsimony. There may be strong imbalance
among the two less well supported alternative trees, and if this is suspected
(e.g., in closely related lineages) an exact binomial test at 1/2 probability
(the optimal branch versus all contrary evidence) is In that lineage sorting can affect
both branch arrangements alternative to the optimal one, the null hypothesis
is that all three arrangements are random and affected equally by lineage
sorting. Thus a 1/3 probability binomial calculation should give
probabilities for identifying the weighting of one arrangement due to shared
characters versus random weighting of the other two due to lineage sorting.
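The one-tailed exact binomial calculation described above can be sketched in a few lines of Python. This is a minimal illustration under the stated null (probability 1/3 for each of three equiprobable arrangements), not the software cited in the text; the function name `binomial_tail` is hypothetical.

```python
from math import comb

def binomial_tail(n: int, k: int, p: float) -> float:
    """Probability of k or more successes in n independent trials.

    n: number of loci (gene trees) examined
    k: number of gene trees agreeing with the optimal arrangement
    p: chance of that arrangement in one trial under the null (1/3 or 1/2)
    """
    q = 1.0 - p
    # Sum the binomial probability over every count from k up to n.
    return sum(comb(n, i) * p**i * q**(n - i) for i in range(k, n + 1))

# Three loci, all agreeing, under the equiprobable (1/3) null.
tail = binomial_tail(3, 3, 1 / 3)
print(f"chance under the null: {tail:.3f}; "
      f"probability of reconstruction: {1 - tail:.3f}")
```

Under these assumptions three agreeing loci give a tail of 1/27 (about .04), while two agreeing loci give 1/9 (about .11). Passing `p=1/2` gives the alternative calculation for the case where the two less well supported arrangements are suspected to be strongly imbalanced.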
A list of total numbers of gene
trees needed for a probabilistic reconstruction at various confidence levels
is given in Table 1, by an exact binomial calculation with probability at 1/3
and 1/2. This information is commonly given in more extensive and general
form in appendices to statistics textbooks as tables of "The Binomial
Distribution Function" (e.g., Richmond, 1964). Summarizing Table 1, if a
minimally acceptable probability of reconstruction is considered to be the
standard .95 (that is, these relative numbers of gene trees would appear no more than 1 in
20 times if the data were generated completely randomly) and there is no
sense judging from the gene trees at hand that the support for the two
alternative less well supported branches obtained from nearest neighbor
interchange is much imbalanced, then the number of gene data sets necessary
for probabilistic reconstruction with small data sets is achieved under the
1/3 probability (equiprobability) model when 3 trees are the same, or 5 if 1
of them contradicts, 7 if 2 contradict, 9 if 3 contradict, and so on. Thus, if
one requires a .95 confidence level, one must plan on analysis of at least
three genes for a chance at probabilistic reconstruction of the species
phylogeny. Two alike gene trees allow an .89 probability of reconstruction,
but inasmuch as this corresponds to a 1 in 9 chance of having the same
result by chance (compared to 1 in 10 for .90 confidence level, or 1 in 20
for .95), the warnings of elementary textbooks against accepting a confidence
level that appears to be slightly less than a pre-selected level are cogent. Given how common contradictory
trees are in the literature, one must expect to analyze at least three well
supported trees representing gene evolution at three loci using exact
binomial analysis. If no satisfactory result is obtained at first, then data
sets of many loci, possibly as many as the 15 trees required for chi-squared
analysis, must be obtained. Even then, no acceptable confidence may be
attained during any of these analyses. One must assume then that confounding
processes are at work that disallow use of non-parametric tests that select a
single non-random (at a particular level of confidence) category from one or
more categories of otherwise apparently randomly generated values. Bayes’
formula with equal priors (in this case, a simple proportion) cannot be used
in evaluation of probability with small data sets that are samples; a fairly
large sample (30 or more unless a closely normal distribution is somehow
logically demonstrable) is required. It may be that the probability of reconstruction
attained in molecular studies is better than that provided by morphological
analysis, in which case the molecular results are to be preferred, but one
must remember that morphological trees are based on data sets that are not
mere samples of a larger universe of data. Although morphological data sets
are small, they are complete, Bayes' formula for probability of
reconstruction can be used, and it means much more when there is no
contradictory evidence among immediate alternative branch lengths against
reconstruction of a morphological lineage.
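The gene-tree counts quoted above from Table 1 (3 trees if all agree, 5 if 1 contradicts, 7 if 2 contradict, 9 if 3 contradict, at the .95 level under the 1/3 model) can be checked with a short search. The sketch below is illustrative only: `binomial_tail` and `min_loci` are hypothetical names, and the search simply applies the one-tailed exact binomial test to increasing totals until the chosen confidence level is reached.

```python
from math import comb

def binomial_tail(n, k, p):
    """Probability of k or more successes in n trials (one-tailed exact binomial)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def min_loci(contradictions, p=1/3, confidence=0.95, max_loci=100):
    """Smallest total number of gene trees that reaches the confidence level
    when `contradictions` of them disagree with the optimal arrangement.
    Returns None if no total up to max_loci suffices."""
    alpha = 1.0 - confidence
    for n in range(contradictions + 1, max_loci + 1):
        agreeing = n - contradictions
        if binomial_tail(n, agreeing, p) <= alpha:
            return n
    return None

for c in range(4):
    print(c, "contradicting ->", min_loci(c), "gene trees in total")
```

Calling `min_loci(c, p=1/2)` repeats the search under the 1/2 model used when the two alternative arrangements are thought to be strongly imbalanced.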
The CPR measure described above
simplifies calculation of branch reliability by using a parallel with the
likelihood ratio test. The CPR judges the significance of the longest branch
versus only the two most likely alternative lengths on the logic that if the
longest branch length is significantly better than the next two most likely,
other comparisons in probability calculations are unnecessary. Patristically
distant lineages may also be excluded from probabilistic calculations when an
analyzed internal branch is bounded by a demonstrably well supported branch.
A likelihood ratio-style rationale based on eliminating from consideration
taxa at a considerable patristic distance obviates the need for multinomial
probability calculations. A large tree is decomposed into several smaller
problems by calculating probabilities of reconstruction for individual
interior branches. Saitou and Nei (1986) used a
trinomial to estimate the least number of gene trees needed for a
reconstruction of species evolution. Similar calculations by Pamilo and Nei
(1988) were affected by their concern with evolutionary time between
speciation events, in that "the topological error introduced by sequence
polymorphism in ancestral species is substantial when the evolutionary time
considered is short and when the effective population size is small."
According to these authors, for a probability of .95 of obtaining the species
tree, when very short evolutionary time is involved, such as 0.5 ×
10 The three different arrangements
due to nearest neighbor interchange on a cladogram are not equivalent to the
three alternative gene trees (two of which have the same topology as the
species tree), and conflict may be due to lineage sorting and to other
processes, such as introgression and hybridization, mistaken orthology,
horizontal transfer (including hybridization), gene duplication and
extinction (Avise, 1994; Doyle, 1992; Maddison, 1997; Templeton, 1986). It
has been demonstrated (Eernisse & Kluge, 1993; Sanderson & Donoghue,
1989) many times that molecular as well as morphological data may be
homoplastic, involving apparent evolutionary convergence and reversal. The literature is replete with
solutions derived from combining conflicting data sets (discussions by Kluge,
1989; Miyamoto, 1985; Omland, 1994; Vane-Wright et al. 1992). Given that the
main problem is thought to be conflicts in gene lineages (versus species
evolution), and given that gene lineages that contradict species lineages can
theoretically be very well supported, then the results can be totally random,
or the most well supported tree may simply overwhelm another, or multiple
genes may undergo identical lineage sorting, or mixed signals may be wrongly
evaluated as low signal (Lyons-Weiler & Milinkovitch 1997). Solutions
attained this way are specious because data is optimized generally (though
not in the needed, critical detail) and cannot be rejected off-hand.
Humphries (1968: 195) has pointed out that a rule of total evidence is only
adequate if genuine statistical laws demonstrate the relevance of the
evidence. Thus, one should not combine data sets unless one is sure they were
generated by the same evolutionary process. In light of the level of
contradictory results in phylogenetic analysis, this cannot simply be assumed
but must be tested for. Comparing one morphological tree
and one gene tree is inappropriate given expected variation in gene tree
analyses of the same taxon. If a confidence level for meaningful results is set
at, say, .95, then any probabilities obtained by non-parametric analysis less
than that are not meaningful, such as many of the branch probabilities given
by Yee (2000). Bayes’ Formula with uniform priors (equivalent to a simple
proportion) may be used for morphological data sets because the three
alternative branch lengths comprise a sample that is equal to the sample
space, and can be used for molecular data sets when the three alternative
branch lengths add to 30 or more, these being sufficient to be very closely
normal in distribution (Spiegel 1988: 176), and matching the binomial
distribution of the much larger sample space. Total samples less than 30,
however, if not demonstrably meaningful through an exact binomial test or a
chi-squared test, cannot represent the distribution of competing evidence
adequately. If a .75 confidence level is acceptable, then two identical gene
trees provide a working hypothesis of species phylogeny, but this could only
support, not contradict, a morphological tree or branch, since molecular data
provide only a "best" choice based on a very small sample of a far
greater, heterogeneous data set. Finally, morphological and gene trees can be
compared on the basis of their aggregate probabilities, that is the product
of multiplying the probabilities of each of the interior branches of the
morphological tree compared with the probability attained by exact binomial,
chi-squared or Bayes formula for however many gene trees are available.
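The aggregate comparison just described amounts to simple arithmetic, sketched below in Python. The sketch rests on two assumptions that this section does not spell out: that the "simple proportion" for a branch is the optimal branch length divided by the sum of the three alternative lengths, and that the per-branch probabilities are treated as independent when multiplied. The function names and the branch lengths are hypothetical.

```python
from math import prod

def simple_proportion(optimal, alt1, alt2):
    """Share of total support held by the optimal branch
    (assumed reading of Bayes' formula with uniform priors)."""
    return optimal / (optimal + alt1 + alt2)

def aggregate_probability(branch_probabilities):
    """Product of the per-branch probabilities of a tree
    (assumes the branches are independent)."""
    return prod(branch_probabilities)

# Hypothetical lengths (optimal, alternative 1, alternative 2)
# for three interior branches of a morphological tree.
branches = [(12, 0, 0), (9, 1, 0), (19, 1, 0)]
per_branch = [simple_proportion(*b) for b in branches]
print("per-branch:", per_branch)
print("aggregate:", aggregate_probability(per_branch))
```

Because the aggregate is a product, a single weakly supported interior branch lowers the probability for the whole tree, which is one reason to report branch probabilities individually as well.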
I appreciate the comments of J.
Lyons-Weiler on a draft of this paper, and the appraisals of two anonymous
reviewers.
Avise, J. C. (1994). "Molecular markers, natural history and evolution." New York.
Bremer, K. (1988). The limits of amino acid sequence data in angiosperm phylogenetic reconstruction.
Bremer, K. (1994). Branch support and tree stability.
Doyle, J. J. (1992). Gene trees and species trees: molecular systematics as one-character taxonomy.
Eubanks, M. W. (1999). Comparative analysis of the genetics of
Goodman, M. (1963). Man’s place in the phylogeny of primates as reflected in serum proteins.
Goodman, M., Koop, B. F., Czelusniak, J., Fitch, D. H. A., Tagle, D. A., and Slightom, J. L. (1989). Molecular phylogeny of the family of apes and humans.
Hayasaka, K., Gojobori, T., and Horai, S. (1988). Molecular phylogeny and evolution of primate mitochondrial DNA.
Humphries, W. (1968). "Anomalies and Scientific Theories." San Francisco.
Kishino, H., and Hasegawa, M. (1989). Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea.
Kluge, A. G. (1989). A concern for evidence and a phylogenetic hypothesis of relationships among
Lowry, R. (2000). "Vassarstats: Web Site For Statistical Computation." Department of Psychology, Vassar College, Poughkeepsie, New York. Jan. 25, 2000.
Lyons-Weiler, J., and Milinkovitch, M. C. (1997). A phylogenetic approach to the problem of differential lineage sorting.
Lyons-Weiler, J., and Takahashi, K. (1999). Branch length heterogeneity leads to non-independent branch length estimates and can decrease the efficiency of methods of phylogenetic inference.
Maddison, W. (1997). Gene trees in species trees.
Mau, B., Newton, M. A., and Larget, B. (1997). Bayesian phylogenetic inference via Markov chain Monte Carlo methods.
Miyamoto, M. M. (1985). Consensus cladograms and general classifications.
Miyamoto, M. M., Koop, B. F., Slightom, J. L., Goodman, M., and Tennant, M. R. (1988). Molecular systematics of higher primates: Genealogical relations and classification.
Miyamoto, M. M., Slightom, J. L., and Goodman, M. (1987). Phylogenetic relations of humans and African apes from DNA sequences in the ψη-globin region.
Nei, M. (1987). "Molecular Evolutionary Genetics." Columbia University Press, New York.
Omland, K. E. (1994). Character congruence between a molecular and a morphological phylogeny for dabbling ducks (
Oxelman, B., Backlund, M., and Bremer, B. (1999). Relationships of the Buddlejaceae s. l. investigated using parsimony, jackknife and branch support analysis of chloroplast
Peladeau, N. (1994). "SIMSTAT for Windows." Ver. 3.5e. Publ. by the author, Montreal.
Pamilo, P., and Nei, M. (1988). Relationships between gene trees and species trees.
Rice, K. A., Donoghue, M. J., and Olmstead, R. G. (1997). Analyzing large data sets:
Richmond, S. B. (1964). "Statistical Analysis." Second Ed. Ronald Press Co., New York.
Saitou, N., and Nei, M. (1986). The number of nucleotides required to determine the branching order of three species with special reference to the human-chimpanzee-gorilla divergence. I.
Satta, Y., Klein, J., and Takahata, N. (2000). DNA archives and our nearest relative: the trichotomy problem revisited.
Slowinski, J. B., and Page, R. D. M. (1999). How should species phylogenies be inferred from sequence data?
Spiegel, M. R. (1988
Templeton, A. (1986). Relation of humans to African apes: A statistical appraisal of diverse types of data.
Templeton, A. (1987). Nonparametric inference from restriction cleavage sites.
Vane-Wright, R. I., Schulz, S., and Boppré, M. (1992). The cladistics of
Yang, Z. (1996). Maximum-likelihood models for combined analyses of multiple sequence data
Yang, Z., and Rannala, B. (1997). Bayesian phylogenetic inference using DNA sequences: a Markov chain Monte Carlo method.
Yee, M. S. Y. (2000). Tree robustness and clade significance.
Zander, R. H. (1998a). Phylogenetic reconstruction, a critique.
Zander, R. H. (1998b). A phylogrammatic evolutionary analysis of the moss genus
Zander, R. H. (2001). "Deconstructing Reconstruction: When are the results of parsimony and statistical phylogenetic analyses genuine analyses?" Buffalo Museum of Science, February 12, 2001 http://www.buffalomuseumofscience.org/BOTANYDECON/moweb.htm.
Zander, R. H. (2001). A conditional probability of reconstruction measure for internal cladogram branches.
______________________________________
Table
1. Probability chart for number of loci and contradictory evidence. Given are
numbers of total gene data sets with various numbers of included sets giving
contradictory results per 4-taxon tree (or for three terminal taxa or for an
interior cladogram branch). Probability values are from an exact binomial
test with probability set at 1/3 (the null is data randomly supporting three
equiprobable alternative trees) for reconstruction of a species tree.
Probability values include those needed for confidence levels at .90, .95 and
.99. ______________________________________
This paper
was declined by editors of two of the Note, September 19,
2003: I thank T. Hedderson for pointing out that chloroplast genes are
inherited as a block, and an analysis of three chloroplast genes is not a
statistical test since they are not independent. The same goes for
mitochondrial genes, which are also inherited as a block. Thus, a valid test
involves nuclear genes.
