THE NUMBER OF GENE TREES NECESSARY FOR A PROBABILISTIC RECONSTRUCTION

Richard H. Zander
Buffalo Museum of Science, 1020 Humboldt Parkway, Buffalo, NY 14214-1293 U.S.A. (present
address:

Summary

This
non-parametric method is a modification of the Conditional Probability of
Reconstruction method for gauging reliability of individual branches by
comparing alternative putative branch lengths obtained with nearest neighbor
interchange and recalculation under restraint of a single optimal internal
branch. The CPR uses a chi-squared test on the three alternative branch
lengths and a null of a random distribution of these three lengths. With
fewer than 15 loci, however, one must use an exact binomial calculation and a
probability of 1/3. For probabilistic reconstruction of the species tree
conditional on the data and all methodological assumptions one needs—at
minimum—identical, well supported results from three different loci without
contradiction. This can apply to the whole tree or to lineages or to single
internal branches.

Introduction

The expectation that DNA data will
provide solutions to previously intractable systematic problems is shared by
many recent authors. This, however, cannot be expected to be the case, at
least not for many years. There are major problems with the reliability of
the results of molecular (and morphological) phylogenetic analyses, as
reviewed by Omland (1994). This is especially true because the best answer in
non-statistical cladistic analyses is not necessarily much better supported
than the second best, or the third best, which may be rather different
hypotheses (Zander 1998a). This applies to maximum parsimony in that
bootstrapping, although a direct substitute for an exact binomial calculation
or a chi-squared test, is not directional, i.e. not just one but two (or even
three) of the three possible arrangements of any internal branch are
phylogenetically "loaded", and the gene pool is coarsely
heterogeneous due in part to lineage sorting (different "speciation"
times for genes and species). Also, Bremer Support or the "Decay
Index" (Bremer, 1988, 1994) is a poor measure of branch support in data
sets of very many characters because a branch of length n and Bremer support
of x may have an alternative branch length (through nearest neighbor
interchange) of n minus x, being an alternative, contradictory branch length
that could be comparatively large. These as well as additional problems are
discussed by Oxelman et al. (1999), Rice et al. (1997) and Yee (2000).

Likelihood ratios, the standard
measure of support in maximum likelihood studies, cannot be used in
phylogenetic analysis because the likelihoods are already optimizations (Nei,
1987; Yang, 1996).

Full Bayesian studies such as Markov chain Monte Carlo analysis
yield posterior probabilities of different alternative trees that appear to
add to 100%, yet these are relative to the sum of the probabilities of the
myriad trees with likelihoods too small to be calculated (Mau et al., 1997;
Yang & Rannala, 1997), while priors for the chance of a mistake that
affects tree estimation in a regularity assumption or sample error are not
calculated (they are assumed to be "uniform" or
"uninformative"). Steel and Penny (2000) summarize additional problems.
Non-parametric tests of reliability that evaluate the whole tree, such as the
Templeton (Templeton 1987) and similar Kishino-Hasegawa (Kishino &
Hasegawa 1989) tests may demonstrate that the best tree is significantly
better than the next best, but many branches of that best tree may be poorly
supported. Even if an analytic method can demonstrate that all alternative
trees have probabilities much lower than that of the optimal tree, because
any one of the alternative trees is fully capable of generating the data set,
they cannot be ignored and the sum of their probabilities must be taken into
account.

I have introduced (Zander 2001),
for measuring reliability of internal cladogram branches, a new
non-parametric test (Zander, in press) called the Conditional Probability of Reconstruction
(CPR), which uses chi-squared analysis after nearest neighbor interchange of
an internal branch under constraint. It tests whether the best choice (the
longest branch) of the three possible alternative arrangements for any four
nearest neighbor branches has support (measured as length of branch)
distinguishable from a random distribution of support for the three
alternative branch lengths obtained after nearest neighbor interchange at
some chosen level of confidence. The rationale is similar to that of
the likelihood ratio test, which compares the likelihood of the optimal result
with that of the second most likely result, in this case with the second and
third most likely results. The method is like that of Yee (2000), who used
the similarly non-parametric Templeton signed ranks test to compare each of
the branch lengths of an optimal maximum parsimony tree against that of the
most parsimonious clade inconsistent with it. Yee’s test, however, does not
take into account that there are two shorter alternatives to any one optimal
branch, not one, and these should themselves be demonstrably indistinguishable
from a random distribution of their lengths; otherwise other
factors may be involved (e.g., introgression, or a chance great imbalance of
the shorter alternative branch lengths that makes selection of one branch as
"phylogenetically loaded" impossible). Also, the total tree lengths
were compared in Yee's test, which then makes re-optimization of other branch
lengths relevant to that of the analyzed branch.

The results of the CPR test (and
indeed all tests) are conditional on the data used, the assumptions employed
(e.g., models, optimality settings), and problems intrinsic to particular
methods of analysis, e.g. branch length heterogeneity in the true tree affecting
branch length estimates because such estimates are optimizations
(Lyons-Weiler & Takahashi, 1999), or intrinsic to evolutionary processes,
e.g. convergence, long-branch attraction, parallelism and introgression
(Avise, 1994; Doyle, 1992; Templeton, 1986). Essentially, one uses CPR
analysis to judge if a branch length estimate is distinguishable from a
random distribution vis-a-vis the two closest alternative branch lengths at a
particular confidence level, not whether the branch is true or not. This
method simply clarifies a parsimony optimization by eliminating the
possibility of random selection of one of three possible alternatives as a
genuine scientific result (cladomancy, or divination by trees). Thus, poor
support (a resolved but poorly supported tree) can be viewed as no different
than no support (a bush) when internal branch support cannot be
distinguished from a random distribution at a selected level of confidence.

Analysis (Zander, in press) of two
published data sets (morphology of the moss Didymodon, Zander, 1998b,
and primate mtDNA, Hayasaka et al., 1988) indicated that many of the optimal
internal branches of both morphological and molecular trees can be
indistinguishable from a random distribution at a reasonable confidence level
(.95) vis-a-vis the lengths of the two immediate shorter alternative
branches. (Note that parsimony finds as optimal the shortest tree, but when
comparing support for alternative configurations of internal branches, the
longest branch is optimal.) Although some lineages are well supported
probabilistically, probabilities of individual branch reconstruction when
multiplied gave rather low summary probabilities to the whole tree. Thus,
cladistic analysis, when recouched in terms of statistical analysis (probability)
rather than an optimality criterion (e.g., parsimony) may demonstrate the
bottle half empty. This problem may be addressed (Zander, 2001) in terms of
the philosophical choices of falsificationism (toleration of Type I errors)
versus verificationism (intolerance of such errors), especially important
because there are twice as many Type I errors possible in selecting the
correct lineage among the three possible through nearest neighbor interchange
as there are Type II errors, which are fail safe.

The CPR method can also be used to
evaluate the problem of conflicting gene lineages in molecular phylogenetic
analysis, which is directly relevant here. For example, there has long
been controversy over the relationships of humans, chimpanzees and gorillas
(Goodman, 1963; Goodman et al., 1989; Hayasaka et al., 1988; Miyamoto et al.,
1987, 1988). A meta-analysis of the Homo-Pan-Gorilla data sets was
recently done by Satta et al. (2000) who surveyed data from the literature
for 45 loci consisting of 46,855 bp. There was conflict between data sets,
with 23 loci supporting the ((Homo Pan) Gorilla) gene tree, 8
supporting ((Homo Gorilla) Pan), 8 supporting ((Gorilla
Pan) Homo), and 6 supporting a (Homo Gorilla Pan) trichotomy.
This incongruence was attributed by them to different gene and species
phylogenies associated with the different loci. A CPR analysis (Zander, in
press) treating each gene as a character (Doyle, 1992; Slowinski & Page,
1999), with alternative branch lengths of 23, 8 and 8, provided a probability
of reconstruction of the species tree of .997. A similar high probability
(.999) is obtained when only data sets with bootstrap values greater than 80%
are used. Thus, species evolution can indeed be probabilistically
reconstructed using molecular techniques (conditional on the data and all
involved assumptions). It also implies, however, that, absent this way of
probabilistically identifying genes that actually track species evolution,
the prior probability that the absolute order of an internal branch of a
molecular tree based on a single data set really reflects species evolution
rather than a contrary gene lineage can be on the order of 23/(23+8+8) or
.59. Note that the probability of selecting a species tree based on
particular proportions of results reasonably not being due to chance
(determined by a chi-squared test) is quite different from the probability of
selecting a correct result determined by a simple proportion of evidence
favoring one result divided by the total evidence favoring all results (given
that all evidence, for and against, is equally persuasive) in the absence of
such a test.

What then is the smallest number of
genes that must be investigated to probabilistically determine species evolution
in light of contrary but presumably randomly generated gene lineages? Nei et
al. (1998) in a study of optimization and analytic methods in phylogenetics,
concluded that "...it seems to be necessary to examine many
independently inherited genes in the construction of reliable phylogenetic
trees for different organisms", repeating in different words the
conclusions of Pamilo and Nei (1988).

Methods

Analysis
on a minimal number of gene data sets may be done treating each gene as a
character (Doyle, 1992; Slowinski & Page, 1999). An expected (null)
number of at least 5 per category is usually considered minimal for
chi-squared studies, implying that 15 well-supported gene trees (5 being the
null or expected number per category if randomly generated) are necessary
given equiprobability. If an exact binomial probability test (Lowry 2000;
Peladeau 1994) is used, however, the number of gene trees needed for a probabilistic
reconstruction may be less than 15. The formula for this test is:

$$P(k) = \frac{n!}{k!\,(n-k)!}\; p^{k}\, q^{\,n-k}$$

where
n is the number of trials (loci studied), k is the number of events tested for
(optimal tree), p is the probability of the event in any one trial, and q = 1 - p is the
probability that the event will not occur in any one trial. This test asks
"what is the chance of k results out of n trials?" In this case the
calculated probability is for "k or more in n trials", obtained by applying
the formula to every value from k up to n and summing the results.

The term "tree" is used
throughout this paper, but the discussion usually applies equally to any
four-taxon unrooted tree, any three nearest neighbor terminal taxa, or to an
interior branch of any tree where the problem can be decomposed into its
simplest form. To simplify calculation in the face of little understanding of
contributory effects, the null hypothesis is that no evolution is occurring
and the three alternative branch lengths (from nearest neighbor interchange
and recalculation via constraint trees) are equiprobable and generated randomly.
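
As an illustration of this null hypothesis, the chi-squared form of the CPR can be sketched in a few lines of Python. The sketch assumes that the reported probability of reconstruction is one minus the chi-squared tail probability with two degrees of freedom (for which the tail is exactly exp(-x/2)). The function names are hypothetical and are not taken from the CPR papers. Under that assumption, the 23, 8 and 8 loci of the Homo-Pan-Gorilla example in the Introduction give about .997.

```python
import math


def chi_squared_statistic(counts):
    """Chi-squared statistic against a null of equal expected counts."""
    total = sum(counts)
    expected = total / len(counts)  # equiprobable null: same expectation per category
    return sum((c - expected) ** 2 / expected for c in counts)


def cpr_chi_squared(counts):
    """Assumed CPR value: 1 minus the chi-squared tail probability.

    Only valid for exactly three categories (2 degrees of freedom),
    where the tail probability is exp(-x / 2).
    """
    if len(counts) != 3:
        raise ValueError("expected support values for exactly three arrangements")
    statistic = chi_squared_statistic(counts)
    p_random = math.exp(-statistic / 2.0)
    return 1.0 - p_random


# Gene-tree counts quoted from Satta et al. (2000) in the Introduction.
print(round(cpr_chi_squared([23, 8, 8]), 3))
```

The closed-form tail avoids any dependency on a statistics library, at the cost of working only for the three-category case that the nearest neighbor interchange produces.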
With an exact binomial calculation with probability set at 1/3, one can
accept or reject the presence of phylogenetically loaded strong imbalance of
the optimal tree (or branch) versus the two less well supported alternative
trees, i.e., significantly non-random at, e.g., the .95 confidence level. The
exact binomial test used is one-tailed because we want to know the chance of
the observed support for one particular gene tree arising by chance alone, not the
chance of any one of the three arrangements being supported this way by chance
alone; the particular tree of interest is the one generated by maximum
parsimony.

There may be strong imbalance
between the two less well supported alternative trees, and if this is suspected
(e.g., in closely related lineages) an exact binomial test at 1/2 probability
(the optimal branch versus all contrary evidence) is not appropriate.
Strong, apparently non-random imbalance among the two branch arrangements
alternative to the optimal arrangement may be common; it is demonstrated in 3
of the 9 interior branches in the CPR analysis (Zander, in press) of primate
mtDNA, while, e.g., Eubanks (1999) reported that the proportion of alleles shared by maize
and Tripsacum (>28%) was greater than expected from lineage sorting
alone, and postulated a certain amount of reticulation in the evolution of
maize, Tripsacum and teosinte to explain it.

A non-parametric test
that evaluates the degree of significance of the support for one of three
branch arrangements is only valid if the support values for the two shorter
alternative branches are indistinguishable from a random distribution.
Otherwise, one has the equivalent of a three-sided coin loaded on two sides.
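
One way to check that requirement, offered here only as a hedged sketch and not as part of the published CPR method, is a two-sided exact binomial test at probability 1/2 applied to the two shorter alternatives alone. The function name and the choice of test are assumptions made for illustration.

```python
from math import comb


def balance_p_value(count_a, count_b):
    """Two-sided exact binomial p-value (p = 1/2) for the two shorter alternatives.

    A small value suggests the two alternatives are NOT randomly balanced,
    i.e. the "coin" may be loaded on more than one side.
    """
    n = count_a + count_b
    if n == 0:
        return 1.0  # no contrary evidence at all, nothing to test
    larger = max(count_a, count_b)
    # Upper tail P(X >= larger) under Binomial(n, 1/2); doubled for two sides.
    upper_tail = sum(comb(n, i) for i in range(larger, n + 1)) / 2 ** n
    return min(1.0, 2.0 * upper_tail)


# The two contrary Homo-Pan-Gorilla arrangements have 8 loci each.
print(balance_p_value(8, 8))
# A hypothetical, strongly unbalanced pair for comparison.
print(balance_p_value(12, 3))
```

Doubling one tail is adequate here because the binomial distribution is symmetric at probability 1/2.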
The appropriate phylogenetic theory assumes only one side is phylogenetically
loaded as the alternative to the null hypothesis of no loading and random
results.

Inasmuch as lineage sorting can affect
both branch arrangements alternative to the optimal one, the null hypothesis
is that all three arrangements are random and affected equally by lineage
sorting. Thus a 1/3 probability binomial calculation should give
probabilities for identifying the weighting of one arrangement due to shared
characters versus random weighting of the other two due to lineage sorting.

Results

A list of total numbers of gene
trees needed for a probabilistic reconstruction at various confidence levels
is given in Table 1, by an exact binomial calculation with probability at 1/3
and 1/2. This information is commonly given in more extensive and general
form in appendices to statistics textbooks as tables of "The Binomial
Distribution Function" (e.g., Richmond, 1964).

Summarizing Table 1, if a
minimally acceptable probability of reconstruction is considered to be the
standard .95 (that is, these relative numbers of gene trees would appear no more than 1 time in
20 if the data were generated completely randomly) and there is no
sense judging from the gene trees at hand that the support for the two
alternative less well supported branches obtained from nearest neighbor
interchange is much imbalanced, then the number of gene data sets necessary
for probabilistic reconstruction with small data sets is achieved under the
1/3 probability (equiprobability) model when 3 trees are the same, or 5 if 1
of them contradicts, 7 if 2 contradict, 9 if 3 contradict, and so on. Thus,
if one requires a .95 confidence level, one must plan on analysis of at least
three genes for a chance at probabilistic reconstruction of the species
phylogeny. Two alike gene trees allow an .89 probability of reconstruction,
but inasmuch as this corresponds to a 1 in 9.1 chance of having the same
result by chance (compared to 1 in 10 for .90 confidence level, or 1 in 20
for .95), the warnings of elementary textbooks against accepting a confidence
level that appears to be slightly less than a pre-selected level are cogent.

Given how common contradictory
trees are in the literature, one must expect to analyze at least three well
supported trees representing gene evolution at three loci using exact
binomial analysis. If no satisfactory result is obtained at first, data
sets of many loci, possibly as many as the 15 trees required for chi-squared
analysis, must be obtained. Even then, no acceptable confidence may be
attained during any of these analyses. One must assume then that confounding
processes are at work that disallow use of non-parametric tests that select a
single non-random (at a particular level of confidence) category from one or
more categories of otherwise apparently randomly generated values. Bayes’
formula with equal priors (in this case, a simple proportion) cannot be used
in evaluation of probability with small data sets that are samples; a fairly
large sample (30 or more unless a closely normal distribution is somehow
logically demonstrable) is required. It may be that the probability of
reconstruction attained in molecular studies is better than that provided by
morphological analysis, in which case the molecular results are to be
preferred, but one must remember that morphological trees are based on data
sets that are not mere samples of a larger universe of data. Although
morphological data sets are small, they are complete; Bayes' formula for
probability of reconstruction can therefore be used, and it means much more when there
is no contradictory evidence among immediate alternative branch lengths
against reconstruction of a morphological lineage.

Discussion

The CPR measure described above
simplifies calculation of branch reliability by using a parallel with the
likelihood ratio test. The CPR judges the significance of the longest branch
versus only the two most likely alternative lengths on the logic that if the
longest branch length is significantly better than the next two most likely,
other comparisons in probability calculations are unnecessary. Patristically
distant lineages may also be excluded from probabilistic calculations when an
analyzed internal branch is bounded by a demonstrably well supported branch.
A likelihood ratio-style rationale based on eliminating from consideration
taxa at a considerable patristic distance obviates the need for multinomial
probability calculations. A large tree is decomposed into several smaller
problems by calculating probabilities of reconstruction for individual
interior branches.

Saitou and Nei (1986) used a
trinomial to estimate the least number of gene trees needed for a
reconstruction of species evolution. Similar calculations by Pamilo and Nei
(1988) were affected by their concern with evolutionary time between
speciation events, in that "the topological error introduced by sequence
polymorphism in ancestral species is substantial when the evolutionary time
considered is short and when the effective population size is small."
According to these authors, for a probability of .95 of obtaining the species
tree, when very short evolutionary time is involved, such as 0.5 × 10^6
yrs, as many as 14 independent loci must be examined, but for
evolution taking place over long periods, such as 2 × 10^6
yrs, as few as 3 are needed (note that 3 alike trees would have a confidence
level of .96 at 1/3 probability with an exact test). They limited their
theoretical study to only three species, however, because they reasoned that
multinomial distributions would be necessary for larger numbers of genes, and
they used a trinomial formula that involved estimates of population size and
evolutionary time between speciation events for their calculations. Their
analysis is overly complex and involves too many estimated parameters to serve as a
much needed rule-of-thumb for reckoning the number of genes needed for probabilistic
reconstruction.

The three different arrangements
due to nearest neighbor interchange on a cladogram are not equivalent to the
three alternative gene trees (two of which have the same topology as the
species tree), and conflict may be due to lineage sorting and to other
processes, such as introgression and hybridization, mistaken orthology,
horizontal transfer (including hybridization), gene duplication and
extinction (Avise, 1994; Doyle, 1992; Maddison, 1997; Templeton, 1986). It
has been demonstrated many times (Eernisse & Kluge, 1993; Sanderson & Donoghue,
1989) that molecular as well as morphological data may be
homoplastic, involving apparent evolutionary convergence and reversal.

The literature is replete with
solutions derived from combining conflicting data sets (discussions by Kluge,
1989; Miyamoto, 1985; Omland, 1994; Vane-Wright et al. 1992). Given that the
main problem is thought to be conflicts in gene lineages (versus species
evolution), and given that gene lineages that contradict species lineages can
theoretically be very well supported, then the results can be totally random,
or the most well supported tree may simply overwhelm another, or multiple
genes may undergo identical lineage sorting, or mixed signals may be wrongly
evaluated as low signal (Lyons-Weiler & Milinkovitch 1997). Solutions
attained this way are specious because data is optimized generally (though
not in the needed, critical detail) and cannot be rejected off-hand.
Humphries (1968: 195) has pointed out that a rule of total evidence is only
adequate if genuine statistical laws demonstrate the relevance of the
evidence. Thus, one should not combine data sets unless one is sure they were
generated by the same evolutionary process. In light of the level of
contradictory results in phylogenetic analysis, this cannot simply be assumed
but must be tested for.

Comparing one morphological tree
and one gene tree is inappropriate given expected variation in gene tree
analyses of the same taxon. If a confidence level for meaningful results is
set at, say, .95, then any probabilities obtained by non-parametric analysis
less than that are not meaningful, such as many of the branch probabilities
given by Yee (2000). Bayes’ Formula with uniform priors (equivalent to a
simple proportion) may be used for morphological data sets because the three
alternative branch lengths comprise a sample that is equal to the sample
space, and can be used for molecular data sets when the three alternative
branch lengths add to 30 or more, these being sufficient to be very closely
normal in distribution (Spiegel 1988: 176), and matching the binomial
distribution of the much larger sample space. Total samples less than 30,
however, if not demonstrably meaningful through an exact binomial test or a
chi-squared test, cannot represent the distribution of competing evidence
adequately. If a .75 confidence level is acceptable, then two identical gene
trees provide a working hypothesis of species phylogeny but this could only
support, not contradict a morphological tree or branch since molecular data
provide only a "best" choice based on a very small sample of a far
greater, heterogeneous data set. Finally, morphological and gene trees can be
compared on the basis of their aggregate probabilities, that is the product
of multiplying the probabilities of each of the interior branches of the
morphological tree compared with the probability attained by exact binomial,
chi-squared or Bayes formula for however many gene trees are available.

Acknowledgements

I appreciate the comments of J.
Lyons-Weiler on a draft of this paper, and the appraisals of two anonymous
reviewers.

References

Avise, J. C. (1994). "Molecular markers, natural history and evolution." New York.
Bremer, K. (1988). The limits of amino acid sequence data in angiosperm phylogenetic reconstruction. Evolution 42: 796–803.
Bremer, K. (1994). Branch support and tree stability. Cladistics 10: 295–304.
Doyle, J. J. (1992). Gene trees and species trees: molecular systematics as one-character taxonomy. Syst. Bot. 17: 144–163.
Eubanks, M. W. (1999). Comparative analysis of the genetics of Zea and Tripsacum. Maize Genetics Cooperation Newsletter 73: 30–32.
Goodman, M. (1963). Man’s place in the phylogeny of primates as reflected in serum proteins. In "Classification and Human Evolution" (S. L. Washburn, ed.), pp. 204–234. Aldine, Chicago.
Goodman, M., Koop, B. F., Czelusniak, J., Fitch, D. H. A., Tagle, D. A., and Slightom, J. L. (1989). Molecular phylogeny of the family of apes and humans. Genome 31: 316–335.
Hayasaka, K., Gojobori, T., and Horai, S. (1988). Molecular phylogeny and evolution of primate mitochondrial DNA. Mol. Biol. Evol. 5: 626–644.
Humphries, W. (1968). "Anomalies and Scientific Theories." San Francisco.
Kishino, H., and Hasegawa, M. (1989). Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea. J. Mol. Evol. 29: 170–179.
Kluge, A. G. (1989). A concern for evidence and a phylogenetic hypothesis of relationships among Epicrates (Boidae, Serpentes). Syst. Zool. 38: 7–25.
Lowry, R. (2000). "Vassarstats: Web Site For Statistical Computation." Department of Psychology, Vassar College, Poughkeepsie, New York. Jan. 25, 2000. http://faculty.vassar.edu/~lowry/VassarStats.html
Lyons-Weiler, J., and Milinkovitch, M. C. (1997). A phylogenetic approach to the problem of differential lineage sorting. Mol. Biol. Evol. 14: 968–975.
Lyons-Weiler, J., and Takahashi, K. (1999). Branch length heterogeneity leads to non-independent branch length estimates and can decrease the efficiency of methods of phylogenetic inference. J. Mol. Evol. 49: 392–405.
Maddison, W. (1997). Gene trees in species trees. Syst. Biol. 46: 523–536.
Mau, B., Newton, M. A., and Larget, B. (1997). Bayesian phylogenetic inference via Markov chain Monte Carlo methods. Mol. Biol. Evol. 14: 717–724.
Miyamoto, M. M. (1985). Consensus cladograms and general classifications. Cladistics 1: 186–189.
Miyamoto, M. M., Koop, B. F., Slightom, J. L., Goodman, M., and Tennant, M. R. (1988). Molecular systematics of higher primates: Genealogical relations and classification. Proc. Natl. Acad. Sci. USA 85: 7627–7631.
Miyamoto, M. M., Slightom, J. L., and Goodman, M. (1987). Phylogenetic relations of humans and African apes from DNA sequences in the ψη-globin region. Science 238: 369–373.
Nei, M. (1987). "Molecular Evolutionary Genetics." Columbia University Press, New York.
Omland, K. E. (1994). Character congruence between a molecular and a morphological phylogeny for dabbling ducks (Anas). Syst. Biol. 43: 369–386.
Oxelman, B., Backlund, M., and Bremer, B. (1999). Relationships of the Buddlejaceae s. l. investigated using parsimony, jackknife and branch support analysis of chloroplast ndhF and rbcL sequence data. Syst. Bot. 24: 164–182.
Pamilo, P., and Nei, M. (1988). Relationships between gene trees and species trees. Mol. Biol. Evol. 5: 568–583.
Peladeau, N. (1994). "SIMSTAT for Windows." Ver. 3.5e. Publ. by the author, Montreal.
Rice, K. A., Donoghue, M. J., and Olmstead, R. G. (1997). Analyzing large data sets: rbcL 500 revisited. Syst. Biol. 46: 554–563.
Richmond, S. B. (1964). "Statistical Analysis." Second Ed. Ronald Press Co., New York.
Saitou, N., and Nei, M. (1986). The number of nucleotides required to determine the branching order of three species with special reference to the human-chimpanzee-gorilla divergence. J. Mol. Evol. 24: 189–204.
Satta, Y., Klein, J., and Takahata, N. (2000). DNA archives and our nearest relative: the trichotomy problem revisited. Mol. Phylog. Evol. 14: 259–275.
Slowinski, J. B., and Page, R. D. M. (1999). How should species phylogenies be inferred from sequence data? Syst. Biol. 48: 814–825.
Spiegel, M. R. (1988). "Schaum’s Outline of Theory and Problems of Statistics." Edition 2. McGraw-Hill, New York.
Templeton, A. (1986). Relation of humans to African apes: A statistical appraisal of diverse types of data. In "Evolutionary Processes and Theory" (S. Karlin and E. Nevo, eds.), pp. 365–388. Academic Press, Orlando, Florida.
Templeton, A. (1987). Nonparametric inference from restriction cleavage sites. Mol. Biol. Evol. 4: 315–319.
Vane-Wright, R. I., Schulz, S., and Boppré, M. (1992). The cladistics of Amauris Butterflies: Congruence, consensus and total evidence. Cladistics 8: 125–138.
Yang, Z. (1996). Maximum-likelihood models for combined analyses of multiple sequence data. J. Mol. Evol. 42: 587–596.
Yang, Z., and Rannala, B. (1997). Bayesian phylogenetic inference using DNA sequences: a Markov chain Monte Carlo method. Mol. Biol. Evol. 14: 717–724.
Yee, M. S. Y. (2000). Tree robustness and clade significance. Syst. Biol. 49: 829–836.
Zander, R. H. (1998a). Phylogenetic reconstruction, a critique. Taxon 47: 681–693.
Zander, R. H. (1998b). A phylogrammatic evolutionary analysis of the moss genus Didymodon in North America north of Mexico. Bull. Buffalo Soc. Nat. Sci. 36: 81–115.
Zander, R. H. (2001). "Deconstructing Reconstruction: When are the results of parsimony and statistical phylogenetic analyses genuine analyses?" Buffalo Museum of Science, February 12, 2001 http://www.buffalomuseumofscience.org/BOTANYDECON/moweb.htm.
Zander, R. H. (2001). A conditional probability of reconstruction measure for internal cladogram
branches. Syst. Biol. 50: 425–437.

______________________________________

Table 1. Probability chart for number of loci and contradictory evidence. Given are
numbers of total gene data sets with various numbers of included sets giving
contradictory results per 4-taxon tree (or for three terminal taxa or for an
interior cladogram branch). Probability values are from an exact binomial
test with probability set at 1/3 (the null is data randomly supporting three
equiprobable alternative trees) for reconstruction of a species tree.
Probability values include those needed for confidence levels at .90, .95 and
.99.

______________________________________

Total loci    Included contrary loci    Probability at 1/3
 1            0                         .33
 2            0                         .89
 2            1                         .45
 3            0                         .96
 3            1                         .74
 4            0                         .99
 4            1                         .89
 5            0                         .99
 5            1                         .96
 5            2                         .79
 6            0                         .99
 6            1                         .98
 6            2                         .90
 7            0                         .99
 7            1                         .99
 7            2                         .96
 7            3                         .82
 8            0                         .99
 8            1                         .99
 8            2                         .98
 8            3                         .91
 9            2                         .99
 9            3                         .96
10            4                         .92
11            1                         .99
11            3                         .99
11            4                         .96
12            3                         .99
13            3                         .99
13            4                         .99
14            4                         .99
16            4                         .99
17            3                         .99
19            4                         .99

______________________________________
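
The entries of Table 1 can be approximated with a short exact binomial calculation. The sketch below is only one reading of the table: it assumes each tabulated value is one minus the one-tailed probability of "k or more in n trials" at probability 1/3, where k is the number of loci that agree (total loci minus contrary loci). The function names are hypothetical. Under this reading most entries are matched after rounding (for example .89 for 2 loci with 0 contrary, .96 for 3 with 0, and .74 for 3 with 1), although a few entries may differ in the second decimal place.

```python
from math import comb


def tail_probability(n, k, p=1 / 3):
    """P(X >= k) for X ~ Binomial(n, p): the 'k or more in n trials' sum."""
    q = 1.0 - p
    return sum(comb(n, i) * p ** i * q ** (n - i) for i in range(k, n + 1))


def reconstruction_probability(total_loci, contrary_loci, p=1 / 3):
    """Assumed Table 1 value: 1 minus the chance of this much agreement at random."""
    agreeing = total_loci - contrary_loci
    return 1.0 - tail_probability(total_loci, agreeing, p)


def minimum_loci(contrary_loci, confidence=0.95, p=1 / 3, max_loci=100):
    """Smallest total number of loci reaching `confidence` with this many contrary loci."""
    for total in range(contrary_loci + 1, max_loci + 1):
        if reconstruction_probability(total, contrary_loci, p) >= confidence:
            return total
    return None  # not reached within max_loci


# Loci needed at the .95 level when 0, 1, 2 or 3 gene trees contradict.
for contrary in range(4):
    print(contrary, minimum_loci(contrary))
```

With the default .95 level, `minimum_loci` returns 3, 5, 7 and 9 for 0 to 3 contrary loci, the same progression given in the Results.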
Explanation of the paper's history:

This paper was
declined by editors of two of the very best journals. One indicated
that the article was "too technical" for its readership. The second
replied that its two reviewers, whose comments were given to me, agreed that
it was "too obvious." Rather than search about for a journal with
subscribers of intermediate sophistication in phylogenetic analysis, I
present it here on this Web site. Except for the discovery of the usual minor
faults, which were corrected, there were no complaints (otherwise) of
substance from the reviewers.

Note, September 19,
2003: I thank T. Hedderson for pointing out that chloroplast genes are
inherited as a block, and an analysis of three chloroplast genes is not a
statistical test since they are not independent. The same goes for
mitochondrial genes, which are also inherited as a block. Thus, a valid test involves
nuclear genes.