Emerging patterns and emerging challenges of comparative phylogeography

news and update ISSN 1948-6596 commentary. Frontiers of Biogeography 6.4, 2014. © 2014 the authors; journal compilation © 2014 The International Biogeography Society

Biogeographic studies commonly amass distribution datasets for hundreds of species in an attempt to describe biogeographic patterns and their underlying processes (e.g., Keith et al. 2013). In contrast, phylogeographic studies are almost always limited to a small number of species and, while able to detect patterns at a population level, lack the replicative power of biogeography to describe those patterns in terms of relevant processes. This is particularly disadvantageous in phylogeographic studies of marine species, because the ocean contains few conspicuous geographic features that might explain population breaks (Horne 2014).
Large, comparative phylogeographic datasets have long been the daydream of molecular ecologists but have previously not been possible, because the data did not exist and rigorous phylogeographic analyses do not scale well with large datasets (Andrew et al. 2013). For instance, coalescence-based phylogeographic model fitting is able to accurately assess population patterns amid stochastic signals and other noise in the data (Beaumont et al. 2010) but is computationally intensive and can become overwhelmingly time-consuming as datasets grow larger and modelled scenarios become more complex.
Even while data remain limited and coalescence analyses continue to be computationally burdensome, some research groups are expanding the boundaries of comparative phylogeographic and population genetic research. Specifically, a paper published earlier this year by Selkoe and colleagues jointly analyzed population genetic patterns in 35 coral reef-associated species across the main and northwest Hawaiian islands. By biogeography standards this is a tiny dataset, but for a phylogeographic study 35 species is impressive. Selkoe et al. (2014) did not include coalescence-based analyses, instead relying on multivariate analyses (principal components analysis and redundancy analysis) of population genetic summary statistics, k-means clusters, and ecological variables to assess four types of population structuring: long-term panmixia across the 2400-km-long archipelago, chaotic genetic heterogeneity, isolation-by-distance (IBD), and regional genetic structure. The multivariate approach used was based, in part, on another comparative study of 27 high-alpine plants (Meirmans et al. 2011). Since more studies like these can probably be expected in the future, it seems timely to comment on these comparative methods, their drawbacks and their merits.
First, the entire approach rests upon a foundation of summary statistics (e.g., FST), which suffer from well-known shortcomings. For example, several of the datasets in Selkoe et al. did not have enough genetic polymorphism to statistically reject a null hypothesis of panmixia using gene frequencies. At the other extreme, some of the datasets had too much polymorphism to reject panmixia, because if each individual has a unique genetic variant the maximum attainable FST value is 0. Issues of polymorphism might be addressed with additional sampling, which could uncover additional genetic signal (e.g., Horne et al. 2013). Alternatively, one could collect more loci. Regardless, insufficient data is the underlying problem.
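Both extremes can be illustrated with a minimal sketch. It assumes haploid haplotype data (e.g., mtDNA) and a simple two-location estimator, FST = (HT - HS) / HT, built on unbiased gene diversity. The function names and toy samples are hypothetical and are not taken from Selkoe et al. (2014), whose estimators may differ.

```python
from collections import Counter


def gene_diversity(haplotypes):
    """Unbiased gene diversity: n / (n - 1) * (1 - sum of squared frequencies)."""
    n = len(haplotypes)
    counts = Counter(haplotypes)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1.0 - sum_sq)


def fst(pop_a, pop_b):
    """FST = (HT - HS) / HT for two samples; returns 0.0 when HT is 0."""
    hs = (gene_diversity(pop_a) + gene_diversity(pop_b)) / 2.0  # mean within-location diversity
    ht = gene_diversity(pop_a + pop_b)                          # diversity of the pooled sample
    return (ht - hs) / ht if ht > 0 else 0.0


# Hypothetical toy samples, 10 individuals per location.
monomorphic = fst(["A"] * 10, ["A"] * 10)        # no polymorphism at all
fixed_diff = fst(["A"] * 10, ["B"] * 10)         # one private haplotype per location
all_unique = fst([f"a{i}" for i in range(10)],   # every individual carries a unique
                 [f"b{i}" for i in range(10)])   # haplotype; none are shared

print(monomorphic, fixed_diff, all_unique)
```

In the third case the two locations share no haplotypes at all, yet the estimate is approximately zero, because within-location diversity is already at its maximum and leaves no room for among-location differentiation to register.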
Ample genetic variation is the sine qua non of this method, which is a problem considering that researchers are generally forced to rely on available datasets from past surveys, many of which were kept small by the costs of Sanger DNA sequencing. Furthermore, species that are too sparsely sampled, or represented by too few individuals at each location, may not capture enough genetic variation. Selkoe et al. rightly point out that comparative studies demand substantial rigor in sampling coverage. Collating different datasets, with different sampling strategies and intensities, into a single study is no small task; even a single phylogeographic dataset can be influenced by numerous factors. Reconciling the noise in 35 datasets is not simple. It is here that the benefit of the multivariate analyses really shines. In Selkoe et al. (2014) and Meirmans et al. (2011), multivariate analysis accounted for 11% and 17% of the variation in the data, respectively. Though much of the variation in the data remains unexplained, past studies suggest that weak genetic patterns are often biologically significant (Eble et al. 2009, Horne et al. 2013).
Among the more interesting results of Selkoe et al. (2014) was that only four species exhibited IBD, a pertinent result considering the nearly linear array of habitat. IBD was also associated with invertebrates with shallow depth requirements. The idea that depth range is correlated with dispersal ability is an old notion in biogeography (Brown et al. 1996) but one that has gained recent momentum. Keith et al. (2013) found that for every 10 m increase in depth range, coral taxa were 27% more likely to straddle faunal breaks. Von der Heyden et al. (2013) found that the vertical positioning of shoreline fishes was a significant predictor of population structure.
Curiously, all of the species with chaotic genetic heterogeneity were non-endemics and habitat generalists, while most habitat specialists and endemics exhibited regional population structure. It is difficult to explain these results, but the emergence of such patterns showcases the utility of the multivariate comparative approach. Yet perhaps the most valuable result of Selkoe et al. (2014) is the unexplained variation in the data. Some of this unexplained variation is probably attributable to random historical events, such as differences in the location of first colonization or in the number of independent colonizations of each species. Other variation might be attributable to genetic drift and other stochastic processes.
A few summary statistics not included by Selkoe et al. are a neutrality test (e.g., Tajima's D) and mismatch distribution values τ and θ1. Like all summary statistics, these suffer from specific shortcomings, and the latter two are for mtDNA only, but future studies might consider investigating noise in the data caused by selection and the relative impacts of expansion age and effective population size, as these are related to genetic drift. In the meantime, we are left to ponder whether stochastic signals account for most of the variation in spatial genetic patterns of communities, or if there are better ecological frameworks to consider.
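As an illustration of the neutrality test mentioned above, the following is a minimal sketch of Tajima's D in its standard formulation, assuming aligned, equal-length sequences with no missing data. The function name and the toy alignments are hypothetical, and the mismatch distribution parameters (τ and θ1) are not covered here.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(seqs):
    """Tajima's D for aligned sequences; None if there are no segregating sites."""
    n = len(seqs)
    if n < 4:
        # The variance term of the statistic is zero for fewer than 4 sequences.
        raise ValueError("need at least 4 sequences")
    # S: number of segregating (variable) sites in the alignment
    s = sum(1 for column in zip(*seqs) if len(set(column)) > 1)
    if s == 0:
        return None
    # pi: mean number of pairwise differences between sequences
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs) / len(pairs)
    # Constants that depend only on the sample size n
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


# Hypothetical toy alignments.
singletons = ["AAAA", "TAAA", "ATAA", "AATA", "AAAT"]  # every variant is rare
balanced = ["AA", "AA", "TT", "TT"]                    # variants at intermediate frequency

print(tajimas_d(singletons), tajimas_d(balanced))
```

An excess of rare variants yields a negative D, which is commonly read as a signal of population expansion or selection, whereas variants at intermediate frequencies yield a positive D. Either departure from neutrality is a potential source of noise in spatial genetic summaries.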
Large comparative phylogeographic studies now offer us the opportunity to re-examine the relationship between ecology and genetic patterns and question our understanding of population-level processes. Notwithstanding present methodological challenges, the biggest limitation of comparative phylogeography is the availability of data. The data of Selkoe et al. (2014), while one of the largest datasets of its kind, represented only a minute fraction of the total Hawaiian marine biodiversity. Therefore, in spite of the fact that basic phylogeographic studies are not as publishable as they used to be, basic phylogeographic data, of all types, are needed now more than ever.

John B. Horne
Center of Marine Sciences, University of Algarve, Faro, Portugal 8005-139 john.horne@gmail.com