Massively parallel DNA sequencing: the new frontier in biogeography

The advent of Sanger sequencing represented a scientific breakthrough that greatly advanced biogeographic studies. However, this technology has several limitations that have hampered more advanced studies in the field. The development of novel techniques that more fully exploit the potential of Massively Parallel Sequencing (MPS) to deliver sequence data at a fraction of the cost of Sanger sequencing promises to revolutionize biogeographic studies. Approaches like Restriction-site Associated DNA sequencing (RADseq) and UltraConserved Element (UCE) sequencing enable the collection of unprecedented amounts of data for multi-locus studies of population genetics and phylogenetics, respectively, which in turn can be used for biogeographic analysis. Here we review those and other methods related to MPS, and provide examples of how they can be used in tropical Atlantic biogeography.


Introduction
Traditionally, biogeography has relied upon species distributions to understand what factors shape present and historical biodiversity (Ekman 1953, Briggs 1974). The advent of cladistics provided a new tool for the biogeographer, as the comparison between geographical and biological cladograms, and their inferred histories, allowed for the first time the analysis of congruence between those two sets of data and the potential discovery of geographic processes that shape species distributions (Rosen 1978). The advent of PCR-based DNA sequencing accelerated biogeographic analyses by providing new tools and methods designed to test biogeographic hypotheses in the context of molecular phylogenies and geographic distribution of genetic lineages (Avise 2000, Lomolino and Heaney 2004, Floeter et al. 2008, Briggs and Bowen 2012). Although first-generation sequencing technologies (Sanger et al. 1977) greatly advanced biogeographic studies, DNA marker development and sequencing under this technology are labor intensive and expensive relative to MPS. As a result, most biogeographic studies are based upon the analysis of a small number of loci (Shendure and Ji 2008). While the sequencing of a handful of genes has significantly contributed to our understanding of evolutionary relationships and biogeographical processes, filling in the data gaps and resolving intractable phylogenies and phylogeographies will require data sets with more loci and greater taxonomic coverage. The analysis of multiple unlinked DNA loci has the potential to increase our confidence in reconstructing species histories, resolving taxon boundaries, analyzing modes of speciation, and determining evolutionary relationships (Wu and Ting 2004, Dupuis et al. 2012, Hohenlohe et al. 2012).
Massively parallel DNA sequencing (MPS) platforms produce enormous amounts of sequence data, and the development of methods for tagging and pooling genomic libraries has made possible the multiplexing of hundreds of samples (Emerson et al. 2010, Faircloth et al. 2012). These new sequencing technologies are dramatically increasing the efficiency of genomic data acquisition, and falling costs are making MPS feasible for most labs (McCormack et al. 2013a). However, bioinformatics (both in terms of processing power and programming) remains an important limitation.
frontiers of biogeography 5.1, 2013. © 2013 the authors; journal compilation © 2013 The International Biogeography Society. ISSN 1948-6596. Perspective. Luiz A. Rocha 1,*, Moisés A. Bernal 1,2, Michelle R. Gaither 1 and Michael E. Alfaro 3
Here we describe how MPS promises to change the field of biogeography. We will not review sequencing platforms, methods, or analyses, as these have been reviewed extensively elsewhere (Shendure and Ji 2008, Kircher and Kelso 2010, Metzker 2010, McCormack et al. 2013a). Instead, we focus on the applications of MPS to the study of the underlying causes of organismal distributions. We highlight cases where MPS platforms have been employed to improve the resolution of biogeographic questions and consider how the advent of MPS will impact biogeography studies in general and Atlantic studies in particular.

Inferring phylogenies
Resolving the evolutionary relationships between species and the historic events that drive patterns of speciation is one of the cornerstones of biogeographic studies (Ronquist and Sanmartin 2011). Molecular systematics has traditionally relied upon sets of a few genes representing a tiny fraction of the genome to resolve phylogenies. Locus discovery has been a limiting step for many projects due to the time and effort required for primer design, tests to ensure cross-species amplification, and optimization of PCRs. Second- and now third-generation sequencing technologies are rapidly changing this paradigm, allowing the collection of millions of base pairs of data, or even entire genomes, for non-model organisms (Glenn 2011). The push for larger numbers of independent loci is driven in part by the recognition that individual gene trees are often discordant (Maddison 1997), a fact that has resulted in a move toward analytical approaches that attempt to reconcile discordant gene trees within a single phylogeny (Degnan and Rosenberg 2009, Edwards 2009). Further driving the field are developments in analytical software that make analyses of large multi-locus data sets possible (Hey and Nielsen 2007, Heled and Drummond 2010, Yang and Rannala 2010).

Amplicon sequencing
Several methods use MPS platforms to sequence sets of orthologous loci across taxa. Choosing a suitable technique depends on the number of loci desired, the bioinformatic capability available, and whether a specific set of loci of known function is desired. Some phylogenetic studies would benefit from the sequencing of specific genes or the use of a set of known loci across a large number of individuals. This targeted sequencing of previously developed orthologous loci can be accomplished by using a class of methods termed amplicon sequencing or parallel tagged sequencing (Meyer et al. 2008). This method involves the pooling of tagged PCR products from multiple individuals and is particularly useful when a collection of known markers, which reliably amplifies across the taxonomic group of interest, is available (Bybee et al. 2011). The advantage of amplicon sequencing over traditional Sanger sequencing is that data for large numbers of individuals at a few to dozens of loci can be obtained quickly and at low cost. Also, because each MPS read derives from a single DNA molecule, issues involving allelic phase for nuclear loci are resolved, doing away with the need for expensive and time-consuming cloning. However, these methods do not eliminate several of the labor- and cost-intensive aspects of Sanger sequencing: PCRs are needed for each individual at each locus, and locus discovery and primer design remain potentially time-consuming steps (Table 1). Multiplex PCRs could circumvent some of the front-end laboratory work, but these methods have their own problems, such as the production of spurious amplification products and uneven amplification of some target sequences (Elnifro et al. 2000). Avoiding these pitfalls can require time-consuming optimization experiments.
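The pooling of tagged PCR products described above implies a sorting step after sequencing, in which each read is assigned back to its individual by its tag. The following is a minimal sketch of that step, assuming exact-match tags of fixed length at the start of each read; the function name `demultiplex`, the tag sequences and the sample names are hypothetical and are not taken from any of the cited studies.

```python
from collections import defaultdict


def demultiplex(reads, tag_to_sample, tag_length):
    """Assign pooled reads to samples using the tag at the start of each read.

    reads         : iterable of read strings (tag followed by amplicon sequence)
    tag_to_sample : dict mapping tag sequence -> sample name (hypothetical values)
    tag_length    : number of bases occupied by the tag
    Returns (dict of sample -> list of tag-trimmed reads, list of unassigned reads).
    """
    by_sample = defaultdict(list)
    unassigned = []
    for read in reads:
        tag, insert = read[:tag_length], read[tag_length:]
        sample = tag_to_sample.get(tag)
        if sample is None:
            # Tag not recognized (e.g., sequencing error in the tag): keep aside.
            unassigned.append(read)
        else:
            by_sample[sample].append(insert)
    return dict(by_sample), unassigned


if __name__ == "__main__":
    tags = {"ACGT": "fish_01", "TGCA": "fish_02"}  # illustrative tags only
    pooled = ["ACGTGGATTC", "TGCAGGATTA", "ACGTGGATTC", "NNNNGGATTC"]
    assigned, leftover = demultiplex(pooled, tags, tag_length=4)
    print(assigned)
    print(leftover)
```

Real pipelines usually tolerate a small number of mismatches in the tag and check tags at both ends of the amplicon; exact matching is used here only to keep the sketch short.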
Despite the limitations, amplicon sequencing has been a successful strategy in many mitochondrial genome-based studies. Chan et al. (2010) sequenced 51 mitogenomes across three genera of Hylobates gibbons, producing a robust phylogeny. Morin et al. (2010) were able to resolve previously intractable polytomies among ecotypes of killer whales (Orcinus orca) using the full 16,390 base pair mitogenomes of 143 individuals. Amplicon sequencing has shown some utility in sequencing large numbers of nuclear loci, though published studies to date have been primarily of a methodological nature. For instance, in a test case, Bybee et al. (2011)
Table 1. Commonly used genomic methods for phylogenetics and phylogeography of non-model organisms, their applications, benefits and limitations.

Targeted enrichment
Targeted enrichment of DNA (Hodges et al. 2007) "captures" specific DNA sequences by hybridizing target DNA to probes in solution or anchored to a surface (e.g., beads or a microarray). Recent approaches that capitalize on this technology make use of a class of markers that are anchored by ultra-conserved regions of the genome. The function of these ultra-conserved elements (UCEs) is not yet understood, but they are seemingly ubiquitous across animals and, due to their highly conserved nature, they are ideal regions for probe design. A method described by Faircloth et al. (2012) allows sequencing of thousands of loci without locus-specific PCR. Probes are designed by screening alignments of published genomes from divergent taxa for regions of high sequence similarity. During library preparation, the genomic DNA is fragmented, mixed with the DNA (or RNA) probes and allowed to hybridize to them. The "captured" DNA is amplified and then sequenced using Illumina technology. Variation in the regions flanking the UCEs provides thousands of independent orthologous loci distributed across the genome, and because of their high conservation these loci are easy to align across divergent taxa, making UCEs useful for phylogenetic studies (Stephen et al. 2008). In a series of studies, UCEs were employed to recover established phylogenies among basal lineages of birds (McCormack et al. 2013b) and to resolve the placement of turtles as sister to birds and crocodilians. This technique holds great promise for resolving deep-level phylogenies. Lemmon et al. (2012) describe a similar method for phylogenetic reconstruction termed anchored enrichment. However, these authors amplify fewer loci (tens to hundreds) and target less conserved regions for probe design, both measures that allow for the sequencing of more individuals to resolve questions at theoretically shallower time-scales than UCEs. The reported advantages of targeted enrichment methods are the ability to adjust the time scale by increasing target DNA length (longer fragments should equal more variation) and the fact that, once a capture probe set is developed for a taxonomic group, the same set of probes can be used to obtain loci from any collection of organisms within that group. As with amplicon sequencing, locus design can be time-consuming (Table 1), but probe sets are already available for many taxa, including reptiles, birds and mammals.
Table 1 (fragment). Applications: gene expression, selection, phylogenetics and, recently, population genetics. Benefits: because of its conserved nature, allows for the identification of gene function even when genomic resources are scarce. Limitations: tissues require special care before extractions; highly expressed genes may dominate the sequences.
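The probe design step, in which genome alignments are screened for regions of high sequence similarity, can be illustrated with a short sketch. It scans a multiple alignment column by column and reports runs in which all sequences carry the same base. The function name `conserved_windows`, the toy alignment and the length threshold are assumptions chosen for illustration; they do not reproduce the criteria used by Faircloth et al. (2012).

```python
def conserved_windows(alignment, min_length):
    """Find runs of alignment columns that are identical across all sequences.

    alignment  : list of equal-length aligned sequences ('-' marks a gap)
    min_length : shortest run of identical columns to report
    Returns a list of (start, end) column ranges, end exclusive.
    """
    if not alignment:
        return []
    n_cols = len(alignment[0])
    if any(len(seq) != n_cols for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    windows = []
    run_start = None
    for col in range(n_cols):
        column = {seq[col] for seq in alignment}
        identical = len(column) == 1 and "-" not in column
        if identical and run_start is None:
            run_start = col                      # a conserved run begins
        elif not identical and run_start is not None:
            if col - run_start >= min_length:    # run ended; keep it if long enough
                windows.append((run_start, col))
            run_start = None
    if run_start is not None and n_cols - run_start >= min_length:
        windows.append((run_start, n_cols))      # run reaching the alignment end
    return windows


if __name__ == "__main__":
    toy_alignment = [
        "ATGCCGTACGTTAGA",
        "ATGACGTACGTTCGA",
        "TTGCCGTACGTTAGA",
    ]
    print(conserved_windows(toy_alignment, min_length=5))
```

In practice the conserved core would serve as the probe target, while the more variable flanking columns on either side supply the phylogenetic signal.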

Transcriptomes
RNA molecules (the transcriptome) are another source of orthologous loci and offer several advantages. Transcriptomes are by nature a reduced representation of the genome, making sequencing less costly and faster than whole genome sequencing (Table 1). The availability of published transcriptomes and the presence of fewer repetitive elements in transcripts make their assembly easier than that of whole genomes (Grabherr et al. 2011).
Transcriptome studies begin with RNA extraction, reverse transcription of RNA into cDNA, and DNA sequencing for the taxonomic group of interest. Bi et al. (2012) described a modified sequence capture technique in which the sequenced transcriptome is assembled and annotated (using published transcriptomes) and specific exons are chosen for array design. However, because these arrays are designed from taxon-specific transcriptomes, they may be most useful in sets of taxa of low to moderate phylogenetic distances and perhaps even for population genetics and demography studies. In contrast, Hittinger et al. (2010) forwent array design and instead directly sequenced cDNA libraries from species of Anopheles mosquito. The authors demonstrated that hundreds of orthologous genes could be consistently recovered using this strategy, and that these gene sets enable robust phylogenomic analysis. Similar approaches have been applied to deeper level phylogenetic questions involving mollusks (Kocot et al. 2011, Smith et al. 2011b) and the placement of the enigmatic Myzostomida marine invertebrates in the tree of life (Hartmann et al. 2012).
An advantage brought by transcriptome analysis, especially for low-level phylogenetics, is that the RNA being analyzed is all functional, and therefore may be affected by selection. This allows researchers to not only detect outlier loci that might be under selection, but may also reveal the function of those loci, which in turn can be tied to environmental selective pressures (Schwarz et al. 2009). However, transcriptome analyses have one important limitation: tissue sample quality. RNA degrades much faster than DNA, and tissue samples collected years ago that could be used for genomic DNA analysis are almost completely devoid of RNA. Therefore, with few exceptions, fresh tissue collections and special preservation methods are needed for transcriptome studies (Copois et al. 2007).

Whole genomes
As the cost of MPS continues to fall, whole genome sequencing is becoming more feasible for a larger number of laboratories. Complete genome sequencing is now possible for non-model organisms providing an opportunity for genome-wide comparisons to examine structural genomic variation and recombination, chromosomal rearrangements and variations in gene copy number. However, inadequate bioinformatic capacity (both hardware and software) remains the main impediment for large scale whole genome comparisons, and the few studies using this methodology are still limited to model organisms and use only a subset of the data (e.g., Drosophila; Begun et al. 2007).

Defining species boundaries and investigating hybrid zones
Defining species boundaries has long been a contentious topic in evolutionary biology (Sites and Marshall 2003). Over the past several decades, the use of DNA sequences, rather than exclusive reliance on morphology, has become increasingly common in taxonomic studies. However, this use remains somewhat controversial, even when cryptic genetic variation is discovered (Knowlton 2000, Paquin and Hedin 2004). This is especially true for the now widespread practice of using one or a few regions of the genome (DNA barcoding) to diagnose species (Moritz and Cicero 2004). Critics of barcoding point out that 1) genetic markers commonly used to resolve species boundaries have variable power of resolution across different taxonomic groups (Hollingsworth 2011), 2) the commonplace reliance on a single locus ignores introgression and can lead to erroneous identifications (Croucher et al. 2004, Rocha et al. 2008b, DiBattista et al. 2012), and 3) the low mutation rates of most DNA barcoding genes will likely lead to underestimates of biodiversity (Elias et al. 2007). Traditional barcoding methods are likely to be suitable for identifying unknown individuals of well-resolved taxa (e.g., unidentified fish fillets or larvae; Wong and Hanner 2008, Weigt et al. 2012) and, despite their limitations, have been instrumental in the detection of distinct cryptic lineages among morphologically conserved species (Bickford et al. 2007, Burns et al. 2008). However, the small size of most first-generation sequencing datasets often precludes robust resolution of species boundaries in cases where gene flow is ongoing or differentiation is very recent (e.g., African cichlids; Wagner et al. 2013). Even though inconsistencies will probably remain in some cases, the development of MPS-based approaches promises to overcome many of these limitations through the sheer number of loci that can be economically interrogated.

Hybridization
The study of zones in which species hybridize helps us understand how speciation can proceed with gene flow (Hewitt 2008, Payseur 2010). The post-glacial suture zones of North America and Europe are model examples, where climate warming and glacial melting following the end of the last glacial maximum allowed populations to expand, bringing previously isolated species back into secondary contact and leading to hybridization (Hewitt 1999, Swenson and Howard 2005). For example, extensive introgression and replacement of polar bear mtDNA with brown bear mtDNA was detected (Edwards et al. 2011), and this evidence suggested a time to most recent common ancestor (TMRCA) of 150 kya between hybridizing populations (Lindqvist et al. 2010). A recent study applying MPS data to this system (Miller et al. 2012) illustrates the power of this approach to enrich phylogeographic studies. More than one million single nucleotide polymorphisms (SNPs) of nuclear DNA were used to perform phylogenetic analyses among polar, black and brown bears (Miller et al. 2012). These results confirm the hybridization, but also indicate that polar bears and brown bears initially split 4-5 Mya, experienced a long interval with little to no gene flow, and hybridized only recently, likely because of the effects of historical climatic fluctuations. Furthermore, the large number of markers allowed the construction of a detailed time line of the effective population size of polar bear populations through the application of a pairwise sequentially Markovian coalescent (PSMC) model (Li and Durbin 2011). This model estimates TMRCA in diploid sequences using changes in the density of heterozygous sites across the genome, and indicates that the effective population size (Ne) of this species is strongly associated with climatic oscillations (Miller et al. 2012).
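The coalescent model mentioned above works from the local density of heterozygous sites along a diploid genome. The sketch below computes only that input summary, in non-overlapping windows; it does not implement the hidden Markov model of Li and Durbin (2011). The function name `heterozygosity_by_window`, the positions and the window size are hypothetical.

```python
def heterozygosity_by_window(het_positions, genome_length, window_size):
    """Density of heterozygous sites (per base pair) in consecutive windows.

    het_positions : 0-based positions of heterozygous sites in one diploid genome
    genome_length : total number of positions considered
    window_size   : width of each non-overlapping window, in base pairs
    The last window may be shorter than window_size; its density is scaled
    by its actual length.
    """
    n_windows = -(-genome_length // window_size)  # ceiling division
    counts = [0] * n_windows
    for pos in het_positions:
        if not 0 <= pos < genome_length:
            raise ValueError(f"position {pos} lies outside the genome")
        counts[pos // window_size] += 1

    densities = []
    for i, count in enumerate(counts):
        start = i * window_size
        end = min(start + window_size, genome_length)
        densities.append(count / (end - start))
    return densities


if __name__ == "__main__":
    # Toy example: four heterozygous sites along a 100 bp sequence.
    print(heterozygosity_by_window([5, 12, 17, 95], genome_length=100, window_size=10))
```

Windows with few heterozygous sites correspond to segments whose two copies coalesced recently, and windows with many correspond to older coalescence; the full model uses the distribution of such segments to infer how effective population size changed through time.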
It has long been assumed that hybrid zones were not as common in marine systems due to the high dispersal potential of marine organisms. However, the recent discovery of hybrid reef fishes at Christmas and Cocos Keeling islands (eastern Indian Ocean) highlighted the commonness and importance of hybridization across several marine fish families (Hobbs et al. 2009). It is hypothesized that changes in sea level during climatic oscillations led to the exposure of a land bridge between the Pacific and Indian Ocean around the Coral Triangle area, causing differentiation through isolation between species in these two oceans (Woodland 1983, Gaither and. Once sea levels increased, taxa that were separated came into secondary contact, especially around the islands of Christmas and Cocos Keeling, just west of Indonesia. Today, 11 pairs of reef fishes of six different families of the Pacific and Indian Ocean are known to hybridize in coral reefs around those islands (Hobbs et al. 2009). Since hybridization is proving to be much more common in marine fishes than previously thought, the unprecedented ability of recovering thousands of nuclear loci for non-model organisms offers the opportunity to study speciation with gene flow and reticulate evolution on a large scale (Bowen et al. 2013).
Another example where MPS provided new insights on hybridization is in the diverse Heliconius butterflies. Several species of Heliconius are unpalatable and Müllerian mimicry of warning colors enables species to share the cost of educating predators. Multilocus gene genealogies among three species of this genus (two with overlapping distributions, and the third with a distinct range) revealed a striking pattern: gene trees related to mimetic color patterns show a closer relationship between species with sympatric distribution, whereas other genes put two of the allopatric species together. Thus, introgression in this group plays an adaptive role, as the genes with the strongest signal of introgression are those that confer the advantage of Müllerian mimicry (Dasmahapatra et al. 2012, Pardo-Diaz et al. 2012).

Morphological conservatism
The advent of DNA sequencing raised awareness about cryptic species, as it became easy for researchers to detect and differentiate unique genetic lineages within morphologically very similar or even identical species (Bickford et al. 2007). However, species delimitations are usually done with one or a few loci, which does not allow the exploration of the genomic mechanisms underlying morphological conservatism. The mechanisms driving such conservatism are still unknown. Traditionally, three non-mutually exclusive hypotheses are used to explain cases in which single DNA markers show divergence but morphology does not: random mutation and genetic drift that does not affect morphological characters; developmental constraints limiting the evolution of phenotypes; and similar selective pressures in similar environments that result in conserved morphology (Arnold 1992, Smith et al. 2011a). Considering the lowering costs and successful applications of MPS in non-model organisms, it is now feasible to track the evolutionary history of thousands of individual loci (Hohenlohe et al. 2012), develop linkage maps in only one generation (Amores et al. 2011), and determine the role of loci differentially expressed between lineages within individual groups (Baldo et al. 2011). Even though many cases where cryptic species are discovered may not represent true morphological stasis and differences might be revealed by more detailed morphological analyses (Knowlton 2000), MPS has the potential to bring us a much better understanding of cryptic speciation.

Phylogeographic surveys
The vast majority of phylogeographic studies are based upon analysis of one or several tightly linked mitochondrial markers (Avise 2000, Rocha et al. 2007). This bias is a response to the limited availability of primers that can be used across large numbers of taxa, and to the presumed neutrality and fast mutation rates of mitochondrial markers (Avise 2000). The limitations of this approach are now well appreciated. For example, hybridization commonly leads to introgression, masking any signal of divergence between closely related species (Rocha et al. 2008b). Further, because mtDNA is maternally inherited, results may be skewed when females and males have different life histories (e.g., male-mediated gene flow in sea turtles; Bowen and Karl 2007). The more recent trend of including nuclear DNA markers represents an important development in phylogeography.

Single Nucleotide Polymorphisms
Microsatellites have been favored as the standard for population genetics with traditional sequencing methods, but single nucleotide polymorphisms (SNPs) offer a powerful alternative (Coates et al. 2009). On the downside, SNPs have a lower power of resolution: the power given by 10 to 20 microsatellites has been estimated to be equivalent to the use of 100 SNPs (Liu et al. 2005). In addition, considering that F statistics are calculated over all loci, divergence can be overestimated when some of the observed loci are under selection pressure, or linked to such regions (Helyar et al. 2011). With the advent of MPS both of these issues can now be easily resolved, as tens of thousands of SNPs can be recovered via the use of Restriction Site Associated DNA (RAD) methods for multiple individuals at the same time. This methodology reduces the representation of the genome by cutting it in predetermined sizes and orthologous positions, but the large amount of restriction site position variation above the species level limits this technique to population or closely related species level studies (Baird et al. 2008, Peterson et al. 2012, Table 1). Analyzing patterns of population structure from thousands of loci obtained in a single run can overwhelm signals of selection caused by a small number of outlier loci (Emerson et al. 2010) or help us detect outlier loci under the influence of selection (Hohenlohe et al. 2012). This may lead to the replacement of microsatellites by SNPs as the main marker for the study of population genetics in the near future (Seeb et al. 2011).
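The way RAD methods reduce the genome to sequences adjacent to restriction sites can be sketched as an in silico digest. The example below locates each occurrence of a recognition sequence and returns the short flanking sequences on either side. The function name `rad_tags`, the toy sequences and the tag length are hypothetical; EcoRI (recognition sequence GAATTC) is used only as a familiar example, and the sketch ignores where within the site the enzyme cuts.

```python
def rad_tags(sequence, site, tag_length):
    """Return (position, upstream tag, downstream tag) for each restriction site.

    sequence   : DNA sequence to digest in silico
    site       : recognition sequence of the restriction enzyme
    tag_length : number of bases to keep on each side of the site
    Tags near the sequence ends may be shorter than tag_length.
    """
    sequence = sequence.upper()
    site = site.upper()
    tags = []
    start = sequence.find(site)
    while start != -1:
        upstream = sequence[max(0, start - tag_length):start]
        after_site = start + len(site)
        downstream = sequence[after_site:after_site + tag_length]
        tags.append((start, upstream, downstream))
        start = sequence.find(site, start + 1)  # continue after this match
    return tags


if __name__ == "__main__":
    ecori = "GAATTC"
    individual_a = "AAACCGAATTCGGTTTAAAGAATTCCCC"
    # Same region with one substitution inside the second recognition site.
    individual_b = "AAACCGAATTCGGTTTAAAGAGTTCCCC"
    print(rad_tags(individual_a, ecori, tag_length=4))
    print(rad_tags(individual_b, ecori, tag_length=4))
```

The second toy sequence loses one site to a single substitution, so one locus is simply not recovered for that individual. The accumulation of such losses between more divergent genomes is the limitation on taxonomic depth noted above.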
Even though loci under strong divergent selection might inflate the level of differentiation between populations, they are becoming a useful tool for individual assignment tests between fisheries stocks that have low levels of differentiation with traditional markers. For example, the populations of salmon from Eastern Canada have been subdivided into stocks of the inner and outer Bay of Fundy, based on their migratory behavior. Despite the presence of rare mitochondrial haplotypes unique to the Inner Bay, there is limited evidence of genetic differentiation between the two locations, as previous studies have not been able to differentiate members of both populations (Fraser et al. 2007). However, using 320 SNPs, Freamo et al. (2011) were able to find small but significant differences between those populations. Further, four independent tests demonstrated that eight of the observed loci were under divergent selection.
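The effect of a few loci under divergent selection on a multilocus estimate of differentiation can be shown with a small worked example. The sketch assumes biallelic SNPs, two equally weighted populations and known allele frequencies, and computes FST as (HT - HS) / HT summed across loci. The function names and the allele frequencies are invented for illustration and are not data from Freamo et al. (2011).

```python
def locus_heterozygosities(p1, p2):
    """Expected total and mean within-population heterozygosity at one SNP.

    p1, p2 : frequency of the same allele in population 1 and population 2
    """
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return h_total, h_within


def multilocus_fst(freq_pairs):
    """FST over many loci, as the ratio of summed (HT - HS) to summed HT."""
    sum_total = 0.0
    sum_within = 0.0
    for p1, p2 in freq_pairs:
        h_total, h_within = locus_heterozygosities(p1, p2)
        sum_total += h_total
        sum_within += h_within
    if sum_total == 0:
        return 0.0  # every locus monomorphic: no differentiation measurable
    return (sum_total - sum_within) / sum_total


if __name__ == "__main__":
    # Hypothetical data: 98 weakly differentiated loci plus 2 strong outliers.
    neutral_like = [(0.50, 0.52)] * 98
    outliers = [(0.10, 0.90)] * 2
    print("without outliers:", multilocus_fst(neutral_like))
    print("with outliers:   ", multilocus_fst(neutral_like + outliers))
```

In this toy data set, two outliers among one hundred loci raise the overall estimate by more than an order of magnitude, which is why outlier loci are typically identified and analyzed separately from the putatively neutral background.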
In a separate test, nine loci were genotyped through an assay in order to perform assignment tests. The assignment test was concordant with the SNP matrix, as it revealed small but significant divergence between individuals of the inner and outer Bay of Fundy (Freamo et al. 2011). Considering that in recent years there has been a significant decline in the number of adults that return to spawn at the Inner Bay, this tool is very useful for conservation as it enables the determination of the origin of the few adults that reach spawning sites.

MPS and tropical Atlantic biogeography
The tropical Atlantic represents an excellent system in which to test biogeographic and evolutionary hypotheses. It is relatively small in size when compared to the Indo-Pacific and somewhat closed, bordered to the east and west by continental landmasses and to the north and south by cold waters. This relative simplicity has allowed for detailed studies and syntheses about biogeographic processes at work in the Atlantic (reviewed in Floeter et al. 2008). But all of this work has been based either on phylogenies or phylogeographies estimated with Sanger sequencing technology or species distributions, so a crucial question is: which advances can MPS bring to our understanding of diversification patterns in the Atlantic?
One possibility is the use of MPS to compare the signal left in the genome by different speciation modes. In addition to the characteristics mentioned above, the Atlantic was also the stage of a dramatic geologic event, the closure of the Isthmus of Panama. When the Isthmus closed, several species were divided in two, and since the geology of the area is well known, we can date back the time of separation between these pairs to at least 3 My (Lessios 2008). There are also many examples of possible selection-driven speciation with gene flow in Atlantic fishes (Rocha et al. 2005a, Puebla et al. 2007), and some genera contain possible examples of both cases (Rocha et al. 2008b), which provide unique opportunities for comparisons using MPS. The use of SNP analyses through RAD-sequencing can quickly provide much needed additional information to confirm these proposed cases of speciation with gene flow, and when coupled with comparisons of species pairs that diverged in isolation, they might reveal a genomic signature of speciation mode. MPS can also bring better resolution to analyses of events that occurred in a more recent time scale. Several lines of evidence indicate that two of the most important biogeographical barriers in the Atlantic, the Amazon barrier and the Benguela barrier, are porous and their effectiveness changes with sea level and climatic fluctuations: the Amazon is frequently breached in periods of low sea level, and Benguela is crossed during warm interglacials (Rocha 2003, 2005b, 2008a, Robertson et al. 2006). Through RAD-sequencing, thousands of loci can be obtained to estimate migration rates and conduct coalescent analysis with higher confidence than the current estimates based on just a few genes. However, this method suffers from similar limitations as transcriptomes and works better with recently preserved tissues. On the phylogenetic side, relationships among species of many important groups remain either completely or partially unresolved.
Many Atlantic species are missing from phylogenetic hypotheses of several important coral reef fish groups (Rüber et al. 2003, Barber and Bellwood 2005, Fessler and Westneat 2007, Miller and Cribb 2007), and trees are not even presented for other important groups such as the basses (fishes of the family Serranidae). The few complete (or close to complete) phylogenies presented Hastings 2007, Sanciangco et al. 2011) could be better resolved with more loci. Currently, the best tool to resolve phylogenies is the sequencing of areas adjacent to UCEs, which are being used with success in several other groups (McCormack et al. 2013b). But regardless of which tool is used to answer which question, one thing is certain: MPS will revolutionize the way we do biogeography, and we think the changes are welcome, as we will finally be able to explore many questions that were previously impossible to address properly with the tools available.