THE NUMBER OF GENE TREES NECESSARY FOR A
PROBABILISTIC RECONSTRUCTION
OF THE SPECIES TREE
Richard H. Zander
Buffalo Museum of Science
1020 Humboldt Parkway
Buffalo, NY 14214-1293
www.buffalomuseumofscience.org
Sept. 11, 2001


Present address: Missouri Botanical Garden, P.O. Box 299, St. Louis, MO 63166-0299, U.S.A.
Email: richard.zander@mobot.org

Summary

This non-parametric method is a modification of the Conditional Probability of Reconstruction (CPR) method for gauging reliability of individual branches by comparing alternative putative branch lengths obtained with nearest neighbor interchange and recalculation under constraint of a single optimal internal branch. The CPR uses a chi-squared test on the three alternative branch lengths and a null of a random distribution of these three lengths. With fewer than 15 loci, however, one must use an exact binomial calculation and a probability of 1/3. For probabilistic reconstruction of the species tree conditional on the data and all methodological assumptions one needs, at minimum, identical, well supported results from three different loci without contradiction. This can apply to the whole tree or to lineages or to single internal branches.

Introduction

            The expectation that DNA data will provide solutions to previously intractable systematic problems is shared by many recent authors. This, however, cannot be expected to be the case, at least for many years to come. There are major problems with the reliability of the results of molecular (and morphological) phylogenetic analyses, as reviewed by Omland (1994). This is especially true because the best answer in non-statistical cladistic analyses is not necessarily much better supported than the second best, or the third best, which may be rather different hypotheses (Zander 1998a). This applies to maximum parsimony in that bootstrapping, although a direct substitute for an exact binomial calculation or a chi-squared test, is not directional, i.e. not just one but two (or even three) of the three possible arrangements of any internal branch are phylogenetically "loaded", and the gene pool is coarsely heterogeneous due in part to lineage sorting (different "speciation" times for genes and species). Also, Bremer Support or the "Decay Index" (Bremer, 1988, 1994) is a poor measure of branch support in data sets of very many characters because a branch of length n and Bremer support of x may have an alternative branch length (through nearest neighbor interchange) of n minus x, being an alternative, contradictory branch length that could be comparatively large. These as well as additional problems are discussed by Oxelman et al. (1999), Rice et al. (1997) and Yee (2000).

            Likelihood ratios, the standard measure of support in maximum likelihood studies, cannot be used in phylogenetic analysis because the likelihoods are already optimizations (Nei, 1987; Yang, 1996). Full Bayesian studies such as Markov chain Monte Carlo analysis yield posterior probabilities of different alternative trees that appear to add to 100%, yet these are relative to the sum of the probabilities of the myriad trees with likelihoods too small to be calculated (Mau et al., 1997; Yang & Rannala, 1997), while priors for the chance of a mistake that affects tree estimation in a regularity assumption or sample error are not calculated (they are assumed to be "uniform" or "uninformative"). Steel and Penny (2000) summarize additional problems. Non-parametric tests of reliability that evaluate the whole tree, such as the Templeton (Templeton 1987) and similar Kishino-Hasegawa (Kishino & Hasegawa 1989) tests, may demonstrate that the best tree is significantly better than the next best, but many branches of that best tree may be poorly supported. Even if an analytic method can demonstrate that all alternative trees have probabilities much lower than that of the optimal tree, because any one of the alternative trees is fully capable of generating the data set, they cannot be ignored and the sum of their probabilities must be taken into account.

            I have introduced (Zander 2001), for measuring reliability of internal cladogram branches, a new non-parametric test (Zander, in press) called the Conditional Probability of Reconstruction (CPR), which uses chi-squared analysis after nearest neighbor interchange of an internal branch under constraint. It tests, at some chosen level of confidence, whether the best choice (the longest branch) of the three possible alternative arrangements for any four nearest neighbor branches has support (measured as length of branch) distinguishable from a random distribution of support for the three alternative branch lengths obtained after nearest neighbor interchange. The rationale is similar to that of the likelihood ratio test, which compares the likelihood of the optimal result with that of the second most likely result; in this case the comparison is with the second and third most likely results. The method is like that of Yee (2000), who used the similarly non-parametric Templeton signed ranks test to compare each of the branch lengths of an optimal maximum parsimony tree against that of the most parsimonious clade inconsistent with it. Yee’s test, however, does not take into account that there are two shorter alternatives to any one optimal branch, not one, and these should be demonstrably indistinguishable themselves from a random distribution of their lengths; otherwise other factors may be involved (e.g., introgression, chance great imbalance of shorter alternative branch lengths that makes selection of one branch as "phylogenetically loaded" impossible). Also, the total tree lengths were compared in Yee's test, which then makes re-optimization of other branch lengths relevant to that of the analyzed branch.

            The results of the CPR test (and indeed all tests) are conditional on the data used, the assumptions employed (e.g., models, optimality settings), and problems intrinsic to particular methods of analysis, e.g. branch length heterogeneity in the true tree affecting branch length estimates because such estimates are optimizations (Lyons-Weiler & Takahashi, 1999), or intrinsic to evolutionary processes, e.g. convergence, long-branch attraction, parallelism and introgression (Avise, 1994; Doyle, 1992; Templeton, 1986). Essentially, one uses CPR analysis to judge if a branch length estimate is distinguishable from a random distribution vis-a-vis the two closest alternative branch lengths at a particular confidence level, not whether the branch is true or not. This method simply clarifies a parsimony optimization by eliminating the possibility of random selection of one of three possible alternatives as a genuine scientific result (cladomancy, or divination by trees). Thus, poor support (a resolved but poorly supported tree) can be viewed as no different than no support (a bush) when internal branch support cannot be distinguished from a random distribution at a selected level of confidence.

            Analysis (Zander, in press) of two published data sets (morphology of the moss Didymodon Zander, 1998b, and primate mtDNA, Hayasaka et al., 1988) indicated that many of the optimal internal branches of both morphological and molecular trees can be indistinguishable from a random distribution at a reasonable confidence level (.95) vis-a-vis the lengths of the two immediate shorter alternative branches. (Note that parsimony finds as optimal the shortest tree, but when comparing support for alternative configurations of internal branches, the longest branch is optimal.) Although some lineages are well supported probabilistically, probabilities of individual branch reconstruction when multiplied gave rather low summary probabilities to the whole tree. Thus, cladistic analysis, when recouched in terms of statistical analysis (probability) rather than an optimality criterion (e.g., parsimony) may demonstrate the bottle half empty. This problem may be addressed (Zander, 2001) in terms of the philosophical choices of falsificationism (toleration of Type I errors) versus verificationism (intolerance of such errors), especially important because there are twice as many Type I errors possible in selecting the correct lineage among the three possible through nearest neighbor interchange as there are Type II errors, which are fail safe.

            The CPR method can also be used to evaluate the problem of conflicting gene lineages in molecular phylogenetic analysis. For example, there has long been controversy over the relationships of humans, chimpanzees and gorillas (Goodman, 1963; Goodman et al., 1989; Hayasaka et al., 1988; Miyamoto et al., 1987, 1988). A meta-analysis of the Homo-Pan-Gorilla data sets was recently done by Satta et al. (2000), who surveyed data from the literature for 45 loci consisting of 46,855 bp. There was conflict between data sets, with 23 loci supporting the ((Homo Pan) Gorilla) gene tree, 8 supporting ((Homo Gorilla) Pan), 8 supporting ((Gorilla Pan) Homo), and 6 supporting a (Homo Gorilla Pan) trichotomy. This incongruence was attributed by them to different gene and species phylogenies associated with the different loci. A CPR analysis (Zander, in press) treating each gene as a character (Doyle, 1992; Slowinski & Page, 1999), with alternative branch lengths of 23, 8 and 8, provided a probability of reconstruction of the species tree of .997. A similar high probability (.999) is obtained when only data sets with bootstrap values greater than 80% are used. Thus, species evolution can indeed be probabilistically reconstructed using molecular techniques (conditional on the data and all involved assumptions). It also implies, however, that, absent this way of probabilistically identifying genes that actually track species evolution, the prior probability that the absolute order of an internal branch of a molecular tree based on a single data set really reflects species evolution rather than a contrary gene lineage can be on the order of 23/(23+8+8) or .59.
Note that the probability of selecting a species tree based on particular proportions of results reasonably not being due to chance (determined by a chi-squared test) is quite different from the probability of selecting a correct result determined by a simple proportion of evidence favoring one result divided by the total evidence favoring all results (given that all evidence, for and against, is equally persuasive) in the absence of such a test.
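The contrast drawn above between the chi-squared result and the simple proportion can be checked with a short calculation. The following is a minimal sketch, not the published CPR procedure: it assumes that the reported probability of reconstruction is one minus the p-value of a chi-squared goodness-of-fit test of the three counts against equal expectations (2 degrees of freedom). The function name `cpr_chi_squared` is hypothetical.

```python
import math


def cpr_chi_squared(counts):
    """Chi-squared goodness-of-fit of three counts against equal expectation.

    Returns (chi2, p_value). Assumption: three categories, hence 2 degrees
    of freedom, for which the chi-squared survival function is exactly
    exp(-chi2 / 2), so no statistics library is needed.
    """
    if len(counts) != 3:
        raise ValueError("expected exactly three alternative branch lengths")
    expected = sum(counts) / 3  # null: the three arrangements are equiprobable
    chi2 = sum((c - expected) ** 2 / expected for c in counts)
    return chi2, math.exp(-chi2 / 2)


# Homo-Pan-Gorilla loci from Satta et al. (2000), trichotomy loci excluded.
chi2, p_value = cpr_chi_squared([23, 8, 8])
probability_of_reconstruction = 1 - p_value   # about .997, as in the text
simple_proportion = 23 / (23 + 8 + 8)         # about .59, as in the text

print(f"chi-squared = {chi2:.3f}")
print(f"probability of reconstruction = {probability_of_reconstruction:.3f}")
print(f"simple proportion = {simple_proportion:.2f}")
```

Under these assumptions the sketch gives values that agree with the .997 and .59 quoted above, which illustrates how different the two notions of probability are for the same counts.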

            What, then, is the smallest number of genes that must be investigated to probabilistically determine species evolution in light of contrary but presumably randomly generated gene lineages? Nei et al. (1998), in a study of optimization and analytic methods in phylogenetics, concluded that "...it seems to be necessary to examine many independently inherited genes in the construction of reliable phylogenetic trees for different organisms", repeating in different words the conclusions of Pamilo and Nei (1988).

Methods

Analysis on a minimal number of gene data sets may be done treating each gene as a character (Doyle, 1992; Slowinski & Page, 1999). An expected (null) number of at least 5 per category is usually considered minimal for chi-squared studies, implying that 15 well-supported gene trees (5 being the null or expected number per category if randomly generated) are necessary given equiprobability. If an exact binomial probability test (Lowry 2000; Peladeau 1994) is used, however, the number of gene trees needed for a probabilistic reconstruction may be less than 15. The formula for this test is:

$$P(k \text{ out of } n) = \frac{n!}{k!\,(n-k)!}\; p^{k}\, q^{\,n-k}$$

where n is the number of trials (loci studied), k is the number of events tested for (loci yielding the optimal tree), p is the probability of the event in any one trial, and q = 1 - p is the probability that the event will not occur in any one trial. This test asks "what is the chance of exactly k results out of n trials?" In this case the calculated probability is for "k or more in n trials," obtained by applying the formula to every value from k through n and summing the results.
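The "k or more in n trials" calculation can be sketched in a few lines. This is an illustrative sketch under one assumption: that the probability of reconstruction is the complement of the one-tailed binomial probability at p = 1/3, which is consistent with the .96 quoted in this paper for three concordant loci. The function names are hypothetical.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k events in n trials."""
    q = 1 - p
    return comb(n, k) * p**k * q ** (n - k)


def binomial_upper_tail(k, n, p):
    """Probability of k or more events in n trials (one-tailed)."""
    return sum(binomial_pmf(i, n, p) for i in range(k, n + 1))


def reconstruction_probability(total_loci, contrary_loci, p=1 / 3):
    """Assumed definition: 1 minus the chance that this many concordant
    loci (or more) would arise under the random, equiprobable null."""
    concordant = total_loci - contrary_loci
    return 1 - binomial_upper_tail(concordant, total_loci, p)


# Three loci that all agree: 1 - (1/3)**3 = 26/27
print(f"{reconstruction_probability(3, 0):.3f}")
# Two loci that agree: 1 - (1/3)**2 = 8/9
print(f"{reconstruction_probability(2, 0):.3f}")
```

Setting `p=1/2` in the same function gives the alternative test of the optimal arrangement against all contrary evidence pooled.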

            The term "tree" is used throughout this paper but the discussion will usually apply equally to any four-taxon unrooted tree, any three nearest neighbor terminal taxa, or to an interior branch of any tree where the problem can be decomposed into its simplest form. To simplify calculation in the face of little understanding of contributory effects, the null hypothesis is that no evolution is occurring and the three alternative branch lengths (from nearest neighbor interchange and recalculation via constraint trees) are equiprobable and generated randomly. With an exact binomial calculation with probability set at 1/3, one can accept or reject the presence of phylogenetically loaded strong imbalance of the optimal tree (or branch) versus the two less well supported alternative trees, i.e., significantly non-random at, e.g., the .95 confidence level. The exact binomial test used is one-tailed because we want to know the chance that the observed support for a particular gene tree arose by chance alone, not the chance that any one of the three arrangements would be supported this way by chance alone; the particular tree of interest is the one generated by maximum parsimony.

            There may be strong imbalance between the two less well supported alternative trees, and if this is suspected (e.g., in closely related lineages) an exact binomial test at 1/2 probability (the optimal branch versus all contrary evidence) is not appropriate. Strong, apparently non-random imbalance between the two branch arrangements alternative to the optimal arrangement may be common; it is demonstrated in 3 of the 9 interior branches in the CPR analysis (Zander, in press) of primate mtDNA, while, e.g., Eubanks (1999) reported that the proportion of alleles shared by maize and Tripsacum (>28%) was greater than expected from lineage sorting alone, and postulated a certain amount of reticulation in the evolution of maize, Tripsacum and teosinte to explain it. A non-parametric test that evaluates the degree of significance of the support for one of three branch arrangements is only valid if the support values for the two shorter alternative branches are indistinguishable from a random distribution. Otherwise, one has the equivalent of a three-sided coin loaded on two sides. The appropriate phylogenetic theory assumes only one side is phylogenetically loaded as the alternative to the null hypothesis of no loading and random results.
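One way to screen for imbalance between the two shorter alternatives is sketched below. The paper does not specify a procedure for this check, so the choice here is an assumption: a two-sided exact binomial test at probability 1/2 applied to the two alternative counts alone. The function name `alternative_balance_p_value` and the 14-versus-2 example are hypothetical.

```python
from math import comb


def alternative_balance_p_value(count_a, count_b):
    """Two-sided exact binomial p-value (p = 1/2) for two alternative counts.

    A small value suggests the two shorter alternatives are themselves
    imbalanced, so a test at 1/3 on all three arrangements may not be valid.
    """
    n = count_a + count_b
    larger = max(count_a, count_b)
    # One-sided tail: chance of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(larger, n + 1)) / 2**n
    # The null is symmetric at p = 1/2, so double the tail and cap at 1.
    return min(1.0, 2 * tail)


# The two contrary Homo-Pan-Gorilla topologies were supported by 8 loci each.
print(alternative_balance_p_value(8, 8))
# Hypothetical lopsided alternatives, for contrast.
print(alternative_balance_p_value(14, 2))
```

An even split such as 8 and 8 gives no evidence of a second "loaded" side, whereas a strongly lopsided split would argue against applying the 1/3 test to that branch.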

            In that lineage sorting can affect both branch arrangements alternative to the optimal one, the null hypothesis is that all three arrangements are random and affected equally by lineage sorting. Thus a 1/3 probability binomial calculation should give probabilities for identifying the weighting of one arrangement due to shared characters versus random weighting of the other two due to lineage sorting.

Results

            A list of total numbers of gene trees needed for a probabilistic reconstruction at various confidence levels is given in Table 1, by an exact binomial calculation with probability at 1/3 and 1/2. This information is commonly given in more extensive and general form in appendices to statistics textbooks as tables of "The Binomial Distribution Function" (e.g., Richmond, 1964). Summarizing Table 1, if a minimally acceptable probability of reconstruction is considered to be the standard .95 (that is, these relative numbers of gene trees would appear 1 in 20 times if this data were generated completely randomly) and there is no sense judging from the gene trees at hand that the support for the two alternative less well supported branches obtained from nearest neighbor interchange is much imbalanced, then the number of gene data sets necessary for probabilistic reconstruction with small data sets is achieved under the 1/3 probability (equiprobability) model when 3 trees are the same, or 5 if 1 of them contradicts, 7 if 2 contradict, 9 if 3 contradict, and so on. Thus, if one requires a .95 confidence level, one must plan on analysis of at least three genes for a chance at probabilistic reconstruction of the species phylogeny. Two alike gene trees allow an .89 probability of reconstruction, but inasmuch as this corresponds to a 1 in 9.1 chance of having the same result by chance (compared to 1 in 10 for .90 confidence level, or 1 in 20 for .95), the warnings of elementary textbooks against accepting a confidence level that appears to be slightly less than a pre-selected level are cogent.
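The rule of thumb above (3 loci with no contradiction, 5 with one, 7 with two, 9 with three) can be recovered by a simple search. This is a sketch under the same assumption used for the 1/3 model: the probability of reconstruction is taken as one minus the one-tailed exact binomial probability. The function names and the `max_loci` search limit are hypothetical.

```python
from math import comb


def reconstruction_probability(total_loci, contrary_loci, p=1 / 3):
    """1 minus the chance of this many (or more) concordant loci under the null."""
    q = 1 - p
    concordant = total_loci - contrary_loci
    tail = sum(
        comb(total_loci, i) * p**i * q ** (total_loci - i)
        for i in range(concordant, total_loci + 1)
    )
    return 1 - tail


def minimum_loci(contrary_loci, confidence=0.95, p=1 / 3, max_loci=100):
    """Smallest total number of loci reaching the confidence level, or None."""
    for total in range(contrary_loci + 1, max_loci + 1):
        if reconstruction_probability(total, contrary_loci, p) >= confidence:
            return total
    return None


for contrary in range(4):
    print(contrary, "contrary ->", minimum_loci(contrary), "loci in total")
```

Changing `confidence` to .90 or .99 gives the corresponding thresholds for the other confidence levels discussed.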

            Given how common contradictory trees are in the literature, one must expect to analyze at least three well supported trees representing gene evolution at three loci using exact binomial analysis. If no satisfactory result is obtained at first, data sets of many loci, possibly as many as the 15 trees required for chi-squared analysis, must be obtained. Even then, no acceptable confidence may be attained during any of these analyses. One must assume then that confounding processes are at work that disallow use of non-parametric tests that select a single non-random (at a particular level of confidence) category from one or more categories of otherwise apparently randomly generated values. Bayes’ formula with equal priors (in this case, a simple proportion) cannot be used in evaluation of probability with small data sets that are samples; a fairly large sample (30 or more unless a closely normal distribution is somehow logically demonstrable) is required. It may be that the probability of reconstruction attained in molecular studies is better than that provided by morphological analysis, in which case the molecular results are to be preferred, but one must remember that morphological trees are based on data sets that are not mere samples of a larger universe of data. Although morphological data sets are small, they are complete; Bayes' formula for probability of reconstruction can be used, and it means much more when there is no contradictory evidence among immediate alternative branch lengths against reconstruction of a morphological lineage.

Discussion

            The CPR measure described above simplifies calculation of branch reliability by using a parallel with the likelihood ratio test. The CPR judges the significance of the longest branch versus only the two most likely alternative lengths on the logic that if the longest branch length is significantly better than the next two most likely, other comparisons in probability calculations are unnecessary. Patristically distant lineages may also be excluded from probabilistic calculations when an analyzed internal branch is bounded by a demonstrably well supported branch. A likelihood ratio-style rationale based on eliminating from consideration taxa at a considerable patristic distance obviates the need for multinomial probability calculations. A large tree is decomposed into several smaller problems by calculating probabilities of reconstruction for individual interior branches.     

            Saitou and Nei (1986) used a trinomial to estimate the least number of gene trees needed for a reconstruction of species evolution. Similar calculations by Pamilo and Nei (1988) were affected by their concern with evolutionary time between speciation events, in that "the topological error introduced by sequence polymorphism in ancestral species is substantial when the evolutionary time considered is short and when the effective population size is small." According to these authors, for a probability of .95 of obtaining the species tree, when very short evolutionary time is involved, such as 0.5 × 10^6 yrs, as many as 14 independent loci must be examined, but for evolution taking place over long periods, such as 2 × 10^6 yrs, as few as 3 are needed (note that 3 alike trees would have a confidence level of .96 at 1/3 probability with an exact test). They limited their theoretical study to only three species, however, because they reasoned that multinomial distributions would be necessary for larger numbers of genes, and they used a trinomial formula that involved estimates of population size and evolutionary time between speciation events for their calculations. Their analysis is overly complex and involved too many estimated parameters for a much needed rule-of-thumb for reckoning the number of genes needed for probabilistic reconstruction.

            The three different arrangements due to nearest neighbor interchange on a cladogram are not equivalent to the three alternative gene trees (two of which have the same topology as the species tree), and conflict may be due to lineage sorting and to other processes, such as introgression and hybridization, mistaken orthology, horizontal transfer (including hybridization), gene duplication and extinction (Avise, 1994; Doyle, 1992; Maddison, 1997; Templeton, 1986). It has been demonstrated (Eernisse & Kluge, 1993; Sanderson & Donoghue, 1989) many times that molecular as well as morphological data may be homoplastic, involving apparent evolutionary convergence and reversal.

            The literature is replete with solutions derived from combining conflicting data sets (discussions by Kluge, 1989; Miyamoto, 1985; Omland, 1994; Vane-Wright et al. 1992). Given that the main problem is thought to be conflicts in gene lineages (versus species evolution), and given that gene lineages that contradict species lineages can theoretically be very well supported, then the results can be totally random, or the most well supported tree may simply overwhelm another, or multiple genes may undergo identical lineage sorting, or mixed signals may be wrongly evaluated as low signal (Lyons-Weiler & Milinkovitch 1997). Solutions attained this way are specious because data is optimized generally (though not in the needed, critical detail) and cannot be rejected off-hand. Humphries (1968: 195) has pointed out that a rule of total evidence is only adequate if genuine statistical laws demonstrate the relevance of the evidence. Thus, one should not combine data sets unless one is sure they were generated by the same evolutionary process. In light of the level of contradictory results in phylogenetic analysis, this cannot simply be assumed but must be tested for.

            Comparing one morphological tree and one gene tree is inappropriate given expected variation in gene tree analyses of the same taxon. If a confidence level for meaningful results is set at, say, .95, then any probabilities obtained by non-parametric analysis less than that are not meaningful, such as many of the branch probabilities given by Yee (2000). Bayes’ Formula with uniform priors (equivalent to a simple proportion) may be used for morphological data sets because the three alternative branch lengths comprise a sample that is equal to the sample space, and can be used for molecular data sets when the three alternative branch lengths add to 30 or more, these being sufficient to be very closely normal in distribution (Spiegel 1988: 176), and matching the binomial distribution of the much larger sample space. Total samples less than 30, however, if not demonstrably meaningful through an exact binomial test or a chi-squared test, cannot represent the distribution of competing evidence adequately. If a .75 confidence level is acceptable, then two identical gene trees provide a working hypothesis of species phylogeny, but this could only support, not contradict, a morphological tree or branch, since molecular data provide only a "best" choice based on a very small sample of a far greater, heterogeneous data set. Finally, morphological and gene trees can be compared on the basis of their aggregate probabilities, that is, the product of the probabilities of each of the interior branches of the morphological tree compared with the probability attained by exact binomial, chi-squared or Bayes' formula for however many gene trees are available.
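The aggregate probability described above is a plain product of per-branch probabilities, as the following sketch shows. The branch values are hypothetical and serve only to illustrate how quickly individually respectable branches multiply down to a modest whole-tree figure. The function name `aggregate_probability` is likewise hypothetical.

```python
from math import prod


def aggregate_probability(branch_probabilities):
    """Product of the reconstruction probabilities of the interior branches."""
    for value in branch_probabilities:
        if not 0.0 <= value <= 1.0:
            raise ValueError("each branch probability must lie in [0, 1]")
    return prod(branch_probabilities)


# Hypothetical per-branch probabilities for a tree with four interior branches.
hypothetical_branches = [0.99, 0.96, 0.90, 0.80]
print(f"{aggregate_probability(hypothetical_branches):.3f}")
```

The resulting figure for the morphological tree can then be set beside the single probability obtained from the available gene trees.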

Acknowledgements

            I appreciate the comments of J. Lyons-Weiler on a draft of this paper, and the appraisals of two anonymous reviewers.

References

Avise, J. C. (1994.) "Molecular markers, natural history and evolution." New York.

Bremer, K. (1988.) The limits of amino acid sequence data in angiosperm phylogenetic reconstruction. Evolution 42: 796–803.

Bremer, K. (1994.) Branch support and tree stability. Cladistics 10: 295–304.

Doyle, J. J. (1992.) Gene trees and species trees: molecular systematics as one-character taxonomy. Syst. Bot. 17: 144–163.

Eubanks, M. W. (1999.) Comparative analysis of the genetics of Zea and Tripsacum. Maize Genetics Cooperation Newsletter 73: 30–32.

Goodman, M. (1963.) Man’s place in the phylogeny of primates as reflected in serum proteins. In "Classification and Human Evolution." (S. L. Washburn, ed.), pp. 204–234. Aldine, Chicago.

Goodman, M., Koop, B. F., Czelusniak, J., Fitch, D. H. A., Tagle, D. A., and Slightom, J. L. (1989.) Molecular phylogeny of the family of apes and humans. Genome 31: 316–335.

Hayasaka, K., Gojobori, T., and Horai, S. (1988.) Molecular phylogeny and evolution of primate mitochondrial DNA. Mol. Biol. Evol. 5: 626–644.

Humphries, W. (1968.) "Anomalies and Scientific Theories." San Francisco.

Kishino, H., and Hasegawa, M. (1989.) Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea. J. Mol. Evol. 29: 170–179.

Kluge, A. G. (1989.) A concern for evidence and a phylogenetic hypothesis of relationships among Epicrates (Boidae, Serpentes). Syst. Zool. 38: 7–25.

Lowry, R. (2000.) "Vassarstats: Web Site For Statistical Computation." Department of Psychology, Vassar College, Poughkeepsie, New York. Jan. 25, 2000. http://faculty.vassar.edu/~lowry/VassarStats.html

Lyons-Weiler, J., and Milinkovitch, M. C. (1997.) A phylogenetic approach to the problem of differential lineage sorting. Mol. Biol. Evol. 14: 968–975.

Lyons-Weiler, J., and Takahashi, K. (1999.) Branch length heterogeneity leads to non-independent branch length estimates and can decrease the efficiency of methods of phylogenetic inference. J. Mol. Evol. 49: 392–405.

Maddison, W. (1997.) Gene trees in species trees. Syst. Biol. 46: 523–536.

Mau, B., Newton, M. A., and Larget, B. (1997.) Bayesian phylogenetic inference via Markov chain Monte Carlo methods. Mol. Biol. Evol. 14: 717–724.

Miyamoto, M. M. (1985.) Consensus cladograms and general classifications. Cladistics 1: 186–189.

Miyamoto, M. M., Koop, B. F., Slightom, J. L., Goodman, M., and Tennant, M. R. (1988.) Molecular systematics of higher primates: Genealogical relations and classification. Proc. Natl. Acad. Sci. USA 85: 7627–7631.

Miyamoto, M. M., Slightom, J. L., and Goodman, M. (1987.) Phylogenetic relations of humans and African apes from DNA sequences in the ψη-globin region. Science 238: 369–373.

Nei, M. (1987.) "Molecular Evolutionary Genetics." Columbia University Press, New York.

Omland, K. E. (1994.) Character congruence between a molecular and a morphological phylogeny for dabbling ducks (Anas). Syst. Biol. 43: 369–386.

Oxelman, B., Backlund, M., and Bremer, B. (1999.) Relationships of the Buddlejaceae s. l. investigated using parsimony, jackknife and branch support analysis of chloroplast ndhF and rbcL sequence data. Syst. Bot. 24: 164–182.

Peladeau, N. (1994.) "SIMSTAT for Windows." Ver. 3.5e. Publ. by the author, Montreal.

Pamilo, P., and Nei, M. (1988.) Relationships between gene trees and species trees. Mol. Biol. Evol. 5: 568–583.

Rice, K. A., Donoghue, M. J., and Olmstead, R. G. (1997.) Analyzing large data sets: rbcL 500 revisited. Syst. Biol. 46: 554–563.

Richmond, S. B. (1964.) "Statistical Analysis." Second Ed. Ronald Press Co., New York.

Saitou, N., and Nei, M. (1986.) The number of nucleotides required to determine the branching order of three species with special reference to the human-chimpanzee-gorilla divergence. J. Mol. Evol. 24: 189–204.

Satta, Y., Klein, J., and Takahata, N. (2000.) DNA archives and our nearest relative: the trichotomy problem revisited. Mol. Phylog. Evol. 14: 259–275.

Slowinski, J. B., and Page, R. D. M. (1999.) How should species phylogenies be inferred from sequence data? Syst. Biol. 48: 814–825.

Spiegel, M. R. (1988.) "Schaum’s Outline of Theory and Problems of Statistics." Edition 2. McGraw-Hill, New York.

Templeton, A. (1986.) Relation of humans to African apes: A statistical appraisal of diverse types of data. In "Evolutionary Processes and Theory" (S. Karlin, and Nevo, E., Eds), pp. 365–388. Academic Press: Orlando, Florida.

Templeton, A. (1987.) Nonparametric inference from restriction cleavage sites. Mol. Biol. Evol. 4: 315–319.

Vane-Wright, R. I., Schulz, S., and Boppré, M. (1992.) The cladistics of Amauris Butterflies: Congruence, consensus and total evidence. Cladistics 8: 125–138.

Yang, Z. (1996.) Maximum-likelihood models for combined analyses of multiple sequence data. J. Mol. Evol. 42: 587–596.

Yang, Z., and Rannala, B. (1997.) Bayesian phylogenetic inference using DNA sequences: a Markov chain Monte Carlo method. Mol. Biol. Evol. 14: 717–724.

Yee, M. S. Y. (2000.) Tree robustness and clade significance. Syst. Biol. 49: 829–836.

Zander, R. H. (1998a.) Phylogenetic reconstruction, a critique. Taxon 47: 681–693.

Zander, R. H. (1998b.) A phylogrammatic evolutionary analysis of the moss genus Didymodon in North America north of Mexico. Bull. Buffalo Soc. Nat. Sci. 36: 81–115.

Zander, R. H. (2001.) "Deconstructing Reconstruction: When are the results of parsimony and statistical phylogenetic analyses genuine analyses?" Buffalo Museum of Science, February 12, 2001 http://www.buffalomuseumofscience.org/BOTANYDECON/moweb.htm.

Zander, R. H. (2001.) A conditional probability of reconstruction measure for internal cladogram branches. Syst. Biol. 50: 425–437.

______________________________________

Table 1. Probability chart for number of loci and contradictory evidence. Given are numbers of total gene data sets with various numbers of included sets giving contradictory results per 4-taxon tree (or for three terminal taxa or for an interior cladogram branch). Probability values are from an exact binomial test with probability set at 1/3 (the null is data randomly supporting three equiprobable alternative trees) for reconstruction of a species tree. Probability values include those needed for confidence levels at .90, .95 and .99.

______________________________________

Total loci    Included contrary loci    Probability at 1/3
1             0                         .33
2             0                         .89
2             1                         .45
3             0                         .96
3             1                         .74
4             0                         .99
4             1                         .89
5             0                         .99
5             1                         .96
5             2                         .79
6             0                         .99
6             1                         .98
6             2                         .90
7             0                         .99
7             1                         .99
7             2                         .96
7             3                         .82
8             0                         .99
8             1                         .99
8             2                         .98
8             3                         .91
9             2                         .99
9             3                         .96
10            4                         .92
11            1                         .99
11            3                         .99
11            4                         .96
12            3                         .99
13            3                         .99
13            4                         .99
14            4                         .99
16            4                         .99
17            3                         .99
19            4                         .99
______________________________________
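
The exact binomial values in Table 1 can be approximated with a short calculation. The sketch below is not the author's code. It assumes that each tabulated value is one minus the probability, under the null of 1/3 per locus, that at least as many loci as observed would agree on one arrangement by chance. The function name support_probability and its parameter names are hypothetical. Under that assumption the sketch matches most rows of Table 1 to two decimals but not all of them (for a single locus it gives 2/3 rather than .33, and a few rows differ by .01), so the original procedure may have differed in detail.

```python
from fractions import Fraction
from math import comb


def support_probability(total_loci, contrary_loci, p_null=Fraction(1, 3)):
    """One minus the upper binomial tail under the null (an assumed reading of Table 1).

    total_loci    -- number of gene data sets examined
    contrary_loci -- how many of them contradict the favored arrangement
    p_null        -- chance that a locus supports a given arrangement at random
    """
    agreeing = total_loci - contrary_loci
    # Probability that `agreeing` or more loci support one arrangement by chance.
    tail = sum(
        comb(total_loci, j) * p_null**j * (1 - p_null) ** (total_loci - j)
        for j in range(agreeing, total_loci + 1)
    )
    return 1 - tail


if __name__ == "__main__":
    # Three loci, none contradictory: the minimum case discussed in the Summary.
    print(support_probability(3, 0))
    # Ten loci with four contradictory.
    print(round(float(support_probability(10, 4)), 2))
```

Exact fractions are used so that rounding happens only when a value is displayed. Table 1 appears to report values above .99 as .99, which this sketch does not imitate.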

 

Explanation of the paper's history:

This paper was declined by editors of two of the very best journals. One indicated that the article was "too technical" for its readership. The second replied that its two reviewers, whose comments were given to me, agreed that it was "too obvious." Rather than search about for a journal with subscribers of intermediate sophistication in phylogenetic analysis, I present it here on this Web site. Apart from the usual minor faults, which were corrected, the reviewers raised no complaints of substance.

 

Note, September 19, 2003: I thank T. Hedderson for pointing out that chloroplast genes are inherited as a block, so an analysis of three chloroplast genes is not a statistical test, since the genes are not independent. The same goes for mitochondrial genes, which are also inherited as a block. Thus, a valid test involves nuclear genes.

  

 

 

 

 
