Abstract
Introduction
Approximately 70% of the Earth’s surface is covered with water, and this aquatic environment is home to a myriad of organisms. These life forms are vital to the sustainability of the global ecosystem. From an evolutionary point of view, many terrestrial organisms are thought to be of aquatic origin and to have evolved over millions of years. 1 It is therefore crucial to study the evolutionary history of aquatic organisms, many of which remain taxonomically challenging to identify correctly, to better understand the origin of life forms on Earth.
The rapid growth of human populations and development (in particular, land reclamation of coastal regions, harvesting of fishery resources, and aquaculture practices) has caused much public concern about its negative impacts on the environment. 2 Master et al 3 reported that the aquatic fauna of the United States is at high risk of extinction, with up to 70% of all freshwater mussels, 49% of freshwater fishes, 30% of plants, and 20% of mammals and birds endangered. Global rates are similar for those groups. 4 Many studies have demonstrated that exotic species, habitat loss, pollution, and unsustainable exploitation account for most extinctions of marine wildlife. 5
Putting aside climate change, this ongoing “biodiversity decline” is truly a catastrophe for all species that rely on this biosphere. Scientists around the world have been working to conserve aquatic biodiversity, and one of the tools at hand is bioinformatics. Bioinformatics tools are useful in studies where the data provide a better understanding of the mechanisms underlying the evolution of organisms at the molecular level, and they can help in designing holistic conservation and management strategies. For instance, the evolutionary processes at play when a species colonizes a new environment provide an opportunity to explore the mechanisms underlying genetic adaptation, which is essential knowledge for understanding evolution and the maintenance of biodiversity. 6
One of the most exciting technological advancements of the past decades was the development of powerful, high-throughput nucleic acid sequencing techniques to address questions in the phylogenetics and molecular evolution of taxa or their complexes. 7 With the advent of these molecular techniques, huge amounts of data have become available. Comprehending these data requires advanced bioinformatics skills and databases to collate them. Furthermore, as these data are stored in numerous databases, both public and private institution-based, there is a need to link the different databases to conduct an exhaustive analysis of the data.
Knowledge about the evolutionary relationships among species has been used in many important biotechnological applications. For example, understanding viral quasi-species variation allows us to trace routes of infectious disease transmission. 8 Analyzing host-pathogen relationships in terms of mutual genetic variation can lead to deeper insights into drug design for medical and agricultural purposes. Structural biologists are now focusing on the phylogeny of related organisms to study sets of homologous proteins, because these reflect different variants preserved in nature and can reveal structural and functional constraints. 9 While huge amounts of data can now be produced relatively easily and cheaply through, for instance, next-generation sequencing (NGS) techniques, the principles underlying their application can only be disentangled with bioinformatics tools.
In this article, we aim to provide insights into the power and usefulness of bioinformatics for understanding molecular evolutionary processes in aquatic animals and how this information can serve as an important input for their conservation. First, we briefly discuss how knowledge of molecular evolution contributes to aquatic animal conservation, followed by some insights into the bioinformatics tools used for this purpose. Finally, we share some of our thoughts on the challenges and future perspectives of bioinformatics in conservation studies.
Importance of Species Identification Techniques as the First Step Toward Aquatic Animal Conservation
Each aquatic organism has its own unique heredity, which makes it special. Each successful species has been able to survive in a changing and challenging environment by adapting to changes, developing immunity to disease, and through selective fitness over generations. 10 Sustaining biodiversity is critical to maintaining the health of our environment and improving the quality of human life. Conserving aquatic animals, plants, and algae will provide food for the growing human population, increase oxygen and reduce carbon dioxide in the atmosphere, facilitate drug discovery, and support numerous other downstream applications. 11 Every living organism has an important role to play in the ecosystem, either independently or in close interaction with the environment, and thus each has its own value. Without rich biodiversity, we would have lower food security, a limited supply of pharmaceutical drugs, a less healthy environment, and poorer economic status.
Accurate species identification is the basis for addressing many molecular ecological questions and is fundamental to management and conservation. While morphology is often the most economical approach to species identification, there are many circumstances in which molecular-based techniques are particularly useful, especially when dealing with cryptic species, juveniles, incomplete specimens, hybrids, and new, unknown species. For species identification, DNA barcoding is a taxonomic tool that uses a short, standardized region of the mitochondrial DNA to identify organisms to the species level. Nucleotide sequences for the selected marker of an unknown specimen are aligned and edited, for example, with a tool such as ClustalW implemented in the Molecular Evolutionary Genetics Analysis (MEGA) software package.12,13
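Once barcode sequences have been aligned, the comparison of an unknown specimen with reference sequences can be illustrated with a very simple distance measure. The sketch below is only an illustration under stated assumptions: it uses the uncorrected proportion of differing sites (p-distance), which is not necessarily the measure used by MEGA or by any public barcode database, and the function names, species labels, and toy sequences are hypothetical.

```python
from __future__ import annotations


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap ('-') or an ambiguous base ('N')
    are skipped, so only positions with a called base in both are compared.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "-N" or b in "-N":
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites between the two sequences")
    return differences / compared


def closest_reference(query: str, references: dict[str, str]) -> tuple[str, float]:
    """Return the reference label with the smallest p-distance to the query."""
    distances = {name: p_distance(query, seq) for name, seq in references.items()}
    best = min(distances, key=distances.get)
    return best, distances[best]


if __name__ == "__main__":
    # Toy, already-aligned fragments (hypothetical; real barcodes are far longer).
    refs = {"species_A": "ACGTACGT", "species_B": "ACGTTCGA"}
    print(closest_reference("ACNTACGT", refs))
```

In practice the closest match alone is not sufficient for identification; a distance threshold and a well-curated reference library are also needed, which is why the quality of the deposited data matters so much.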
The sequences are then compared with a public database developed from the input of researchers worldwide. The accuracy of species identification relies wholly on the data deposited and the discriminatory power of the molecular markers used. For instance, the DNA marker commonly used for species identification in barcoding fishes is a 648-bp (base pair) region of the mitochondrial DNA called cytochrome c oxidase subunit I (
Recent technologies have provided more advanced and sensitive tools for species detection using the genetic material that is present in the environment, the so-called environmental DNA (eDNA). 14 Environmental DNA was initially applied to studies of bacterial community composition and functional diversity, but scientists have recently started to use eDNA in macrofaunal studies to monitor the presence or absence of rare, endangered, indicator, and invasive species through environmental samples such as water and soil, without the need for direct sampling of the target species.15-18 Both shotgun metagenomics and metabarcoding can be used to study eDNA. While both approaches involve NGS of DNA, they serve different purposes, and the choice of method depends on the research question being addressed. In shotgun metagenomics, one sequences the total eDNA present in the sample to understand the community composition and functional diversity, whereas metabarcoding uses one or more barcoding genes to detect the presence of a targeted taxonomic group in soil, water, or air samples, to understand the biodiversity and its abundance. Metabarcoding relies on polymerase chain reaction amplification of gene fragments using a given primer set or sets. Metabarcoding of eDNA has proven to be reliable and cost-effective for monitoring fish, 19 fish and amphibians, 20 and pathogens in aquaculture.21,22
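The presence/absence logic behind metabarcoding can be sketched in a few lines. This is a deliberately simplified illustration, not a description of any published pipeline: it assumes reads have already been trimmed to the same amplicon and length as the reference barcodes, and the identity threshold (0.97), the minimum read count, and all names are hypothetical choices made for the example.

```python
from __future__ import annotations

from collections import Counter


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def assign_reads(reads: list[str], references: dict[str, str],
                 min_identity: float = 0.97) -> Counter:
    """Count reads per taxon, assigning each read to its best-matching barcode.

    A read whose best match falls below min_identity is counted as
    'unassigned'. Ties are resolved in favor of the first reference listed.
    """
    counts: Counter = Counter()
    for read in reads:
        best_taxon, best_score = None, 0.0
        for taxon, ref in references.items():
            if len(ref) != len(read):
                continue  # assumption: only same-length amplicons are compared
            score = identity(read, ref)
            if score > best_score:
                best_taxon, best_score = taxon, score
        if best_taxon is not None and best_score >= min_identity:
            counts[best_taxon] += 1
        else:
            counts["unassigned"] += 1
    return counts


def detected_taxa(counts: Counter, min_reads: int = 2) -> list[str]:
    """Taxa supported by at least min_reads assigned reads."""
    return sorted(t for t, n in counts.items()
                  if t != "unassigned" and n >= min_reads)


if __name__ == "__main__":
    refs = {"taxon_A": "ACGTACGTAC", "taxon_B": "TTGTACGTAA"}
    reads = ["ACGTACGTAC", "ACGTACGTAT", "TTGTACGTAA", "GGGGGGGGGG"]
    counts = assign_reads(reads, refs, min_identity=0.9)
    print(counts, detected_taxa(counts))
```

Requiring a minimum number of supporting reads is one simple guard against sequencing errors and contamination producing false detections; real studies typically combine it with negative controls and replicate samples.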
Some Selected Molecular, Computational, and Bioinformatics Tools
Conservation genetics is an applied science employing molecular tools to study genetic structure, evolutionary patterns, and interaction processes within the context of biodiversity conservation. 23 Neutral molecular markers such as restriction fragment length polymorphisms, amplified fragment length polymorphisms, random amplification of polymorphic DNA, single-strand conformation polymorphisms, minisatellites, microsatellites, and single-nucleotide polymorphisms (SNPs) are used in many conservation genetic studies to unravel the importance of genetic data for taxonomic distinction and the management of conservation units. However, the use of a small number of neutral markers has raised doubts about how reliably these markers represent population and species variation at the level of an entire genome, given that neutral markers are not subject to selection and local adaptation.
These limitations and the considerable reduction in the cost of NGS have pushed forward the transition from conservation genetics to conservation genomics. The application of NGS increases the accuracy of estimates of genetic variation at finer population scales, when genome-wide screening of thousands of markers is conducted. These studies are especially useful for marine species with high dispersal rates that allow gene flow regardless of geographical distance.24,25 Next-generation sequencing facilitates the study of gene interaction with environmental changes, as it can determine variation in both neutral and nonneutral markers. Single-nucleotide polymorphism arrays generated by NGS studies have also revealed that putatively adaptive markers provide stronger differentiation signals than neutral markers, given that the former rely strictly on selective forces for estimating population divergence.26,27 Next-generation sequencing also allows for the study of gene expression. Expression patterns of various genes in host aquatic species in response to multiple environmental stressors or inducers have been well documented.28-32
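To make the idea of a per-marker "differentiation signal" concrete, one widely used summary statistic is Wright's fixation index, F_ST = (H_T - H_S) / H_T, where H_T is the expected heterozygosity of the pooled populations and H_S the mean expected heterozygosity within populations. The text above does not name a specific statistic, so the choice of F_ST, the equal weighting of populations, and the function names below are assumptions made purely for illustration.

```python
from __future__ import annotations


def expected_heterozygosity(p: float) -> float:
    """Expected heterozygosity 2p(1 - p) at a biallelic locus."""
    return 2.0 * p * (1.0 - p)


def fst_biallelic(allele_freqs: list[float]) -> float:
    """F_ST for one biallelic SNP from per-population allele frequencies.

    allele_freqs holds the frequency of the same allele in each population.
    Populations are weighted equally (an assumption; sample-size weighting
    and bias corrections are common in practice).
    """
    if not allele_freqs:
        raise ValueError("at least one population is required")
    if any(p < 0.0 or p > 1.0 for p in allele_freqs):
        raise ValueError("allele frequencies must lie in [0, 1]")
    n_pops = len(allele_freqs)
    p_bar = sum(allele_freqs) / n_pops
    h_total = expected_heterozygosity(p_bar)
    if h_total == 0.0:
        return 0.0  # locus is monomorphic overall, so no differentiation
    h_within = sum(expected_heterozygosity(p) for p in allele_freqs) / n_pops
    return (h_total - h_within) / h_total


if __name__ == "__main__":
    # Hypothetical frequencies: a weakly and a strongly differentiated SNP.
    print(fst_biallelic([0.45, 0.55]))
    print(fst_biallelic([0.2, 0.8]))
```

Scanning such values across thousands of SNPs and looking for loci with unusually high F_ST is one way in which putatively adaptive markers are distinguished from the neutral background.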
Currently used NGS platforms include Illumina (formerly Solexa) sequencing, the SOLiD system of ABI, the Polonator G.007, Helicos Heliscope, PacBio SMRT sequencing, and Oxford Nanopore. The choice of technology depends on the throughput capacity, running time, coverage depth, simultaneous multiplexing, cost, and error rates. Experimental designs and sequencing strategies should depend on 3 main categories of research questions, 33 namely, (1) genome-wide screening of genetic variation, (2) identifying nonneutral variation, and (3) integrating environmental and genetic parameters with gene expression analysis. One of the keys to all conservation genomics studies is the detection and screening of SNPs within the genome, and various approaches are available depending on the final goals and available resources. Whole genome sequencing and transcriptome sequencing are good choices for developing SNPs for many follow-up experiments. 33 However, if only a single experiment is performed for population screening of a nonmodel species, SNPs can be identified using RAD-tag sequencing. 33 The SNPs can then be screened with either an SNP chip or a RAD-tag sequencing procedure. To address the second research question, which is to identify markers involved in adaptation and to screen populations for variation in these markers, NGS-based methods such as genome-wide selection scans, genome-wide association studies, and gene-environment association studies can be applied. These methods reveal genes associated with selection and with the adaptation of physiological mechanisms to the changing aquatic environment.34-36 Finally, to identify genes that are associated with populations with different genetic heritability or environmental quality, transcriptome analysis using an RNA-seq procedure can be performed.37,38
Applications of NGS involve the management of massive data sets, requiring a large data storage facility and bioinformatics pipelines to effectively compile, process, and analyze the sequence data. An extensive list of bioinformatics tools, with their respective functions and usages for downstream population genomic analyses, is available. Identifying genotyping errors and filtering the data are crucial for improving data quality. Erroneous data in SNP data sets can be assessed by a simple estimate of departure from Hardy-Weinberg equilibrium (HWE) and with probabilistic genotype-calling programs such as ANGSD 39 and ngsCovar. 40 Data set filtering can strongly affect downstream summary statistics. Filtering by minor allele frequency thresholds is one of the most common methods for filtering RAD-seq data sets, as it removes sequencing errors and rare alleles. However, to produce robust genetic and demographic inferences, trialing different filtering parameters is crucial. 41 Downstream computational analyses include the estimation and measurement of various parameters depending on the research question asked. In the context of genomic analysis, the parameters most commonly analyzed to detect inbreeding in aquatic species are multiple-locus heterozygosity and genomic relatedness matrices, using Arlequin ver. 3.5.1.2. 42 Using the “adegenet” package, 43 the Mantel test estimates recent demographic history and the correlation between population divergence and geographical distance in populations of aquatic species. Other bioinformatics approaches to infer complex demographic models and population clustering patterns include non-model-based multivariate analyses, such as principal component analysis and discriminant analysis of principal components using the adegenet R package, and model-based methods such as STRUCTURE and ADMIXTURE.44,45 These clustering programs assign individuals of aquatic species to their respective populations of origin.
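The two quality checks described above, departure from HWE and a minor allele frequency (MAF) threshold, can be sketched as follows. This is a minimal illustration and not the procedure implemented in ANGSD, ngsCovar, or any other program cited here: it uses a plain chi-square goodness-of-fit statistic on genotype counts, and the thresholds (MAF of 0.05; chi-square of 3.84, the 5% critical value for 1 degree of freedom) and function names are assumptions that would need tuning for a real data set.

```python
from __future__ import annotations


def allele_frequency(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Frequency of allele A from counts of AA, AB, and BB genotypes."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotyped individuals")
    return (2 * n_aa + n_ab) / (2 * n)


def minor_allele_frequency(n_aa: int, n_ab: int, n_bb: int) -> float:
    p = allele_frequency(n_aa, n_ab, n_bb)
    return min(p, 1.0 - p)


def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Chi-square statistic for departure from Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    p = allele_frequency(n_aa, n_ab, n_bb)
    q = 1.0 - p
    expected = (p * p * n, 2.0 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    # Classes with zero expectation (monomorphic loci) contribute nothing.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


def filter_snps(genotype_counts: dict[str, tuple[int, int, int]],
                min_maf: float = 0.05, max_chi2: float = 3.84) -> list[str]:
    """Keep SNPs that pass both the MAF and the HWE thresholds."""
    kept = []
    for snp, counts in genotype_counts.items():
        if minor_allele_frequency(*counts) < min_maf:
            continue  # rare allele: possibly a sequencing error
        if hwe_chi_square(*counts) > max_chi2:
            continue  # strong HWE departure: possible genotyping artifact
        kept.append(snp)
    return kept


if __name__ == "__main__":
    # Hypothetical genotype counts (AA, AB, BB) for three SNPs.
    data = {"snp1": (25, 50, 25), "snp2": (50, 0, 50), "snp3": (99, 1, 0)}
    print(filter_snps(data))
```

Because the choice of thresholds changes which loci survive, rerunning such a filter with several parameter values and comparing the downstream statistics is one way to carry out the trial on filtering parameters recommended above.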
Comprehensive analysis and interpretation of data based on reliable computational methods are important in producing robust and reliable population genomics data that can then be applied to evolutionary biology and biodiversity conservation.
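As one example of such a computational method, the Mantel test mentioned in this section asks whether genetic distances between populations are correlated with the geographical distances between them. The sketch below is a plain-Python permutation version written for illustration only; it is not the implementation available through adegenet or any other R package, it uses a one-sided test for positive correlation, and the function names, number of permutations, and toy matrices are assumptions.

```python
from __future__ import annotations

import random
from math import sqrt


def _upper_triangle(matrix: list[list[float]], order: list[int]) -> list[float]:
    """Upper-triangle entries after reordering rows and columns by 'order'."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def _pearson(x: list[float], y: list[float]) -> float:
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("distances are constant; correlation is undefined")
    return sxy / sqrt(sxx * syy)


def mantel_test(genetic: list[list[float]], geographic: list[list[float]],
                permutations: int = 999, seed: int = 0) -> tuple[float, float]:
    """Return (observed correlation, one-sided permutation p-value).

    Both inputs are symmetric distance matrices over the same populations.
    Population labels of the genetic matrix are shuffled to build the null
    distribution of the correlation.
    """
    n = len(genetic)
    if n < 3 or len(geographic) != n:
        raise ValueError("need two matrices over the same 3 or more populations")
    original = list(range(n))
    geo = _upper_triangle(geographic, original)
    observed = _pearson(_upper_triangle(genetic, original), geo)
    rng = random.Random(seed)
    at_least_as_large = 1  # the observed arrangement counts as one permutation
    for _ in range(permutations):
        order = original[:]
        rng.shuffle(order)
        if _pearson(_upper_triangle(genetic, order), geo) >= observed:
            at_least_as_large += 1
    return observed, at_least_as_large / (permutations + 1)


if __name__ == "__main__":
    # Four hypothetical populations along a coastline at positions 0 to 3.
    geo = [[abs(i - j) for j in range(4)] for i in range(4)]
    gen = [[0.1 * abs(i - j) for j in range(4)] for i in range(4)]
    print(mantel_test(gen, geo))
```

A significant positive correlation is usually interpreted as isolation by distance, although with only a handful of populations the permutation test has little power and the result should be read cautiously.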
Current Challenges and Future Perspectives
To date, computational methods and NGS have been employed by conservation and evolutionary geneticists to improve conservation management of marine populations, especially for endangered species.46-48 Despite the technological advancements, analyzing genome-wide molecular data poses a major challenge. Handling large-scale and complex data requires high competency in bioinformatics and the ability to analyze and interpret the vast amount of data and translate them into biological applications. 49 This is further exacerbated by the rapid development of new bioinformatics tools. 50 Nevertheless, there are many workshops, seminars, and online courses and materials that are available to assist researchers in this field.49,50
Next-generation sequencing technologies facilitate the sequencing process with improved read length and accuracy, but genome assembly remains a significant challenge. Current challenges in building a genome assembler include handling sequencing errors, the high-throughput nature of sequencers, short read lengths, and genomic repeats.51-53 However, these problems can, at least partly, be resolved by using the newest third-generation sequencing platforms, improved data analysis, and advanced mapping technologies and sequence assembly algorithms.
Another challenge in the genomic analysis of aquatic species is the limited number of available reference genomes. Compared with terrestrial species, genomic studies on aquatic species are still lagging behind. 50 For instance, the variable pattern of the genomic architecture of aquatic animals is difficult to observe because of limited information about the genomic location of assayed markers. However, knowledge about the genomic location of assayed markers can be obtained not only from full-genome reference sequences of the target species but also from those of closely related species. 54 Thus, although the limited number of high-quality genomes for new species is a challenge for researchers, it is still possible to extract useful information from other available genomics resources.55-57 The rapid emergence of genomic resources for aquatic species will facilitate future genomics studies in a broader set of related species, which will provide a better understanding of basic evolutionary and conservation processes in aquatic species.
The upstream part of the pipeline, ie, the conditions and criteria used for sample preparation, is also one of the major challenges in aquatic genomic analysis. This step is crucial in ensuring high-quality downstream bioinformatics analysis. 58 Aquatic animals span a broad range of qualitative classes and criteria, which have to be taken into consideration before proceeding with sequencing. Biological diversity, nonindigenous species, exploited aquatic animals, and contaminants are examples of qualitative descriptors and conditions that might affect data generation. 59 Major efforts will be needed in the future to build biodiversity monitoring and research infrastructures (data generation, data storage and curation, and data analysis) to address this issue.
Conclusions
In conclusion, conservation of aquatic species is vital in ensuring the sustainability of biodiversity. In this era of genomics, bioinformatics and the application of computational and statistical tools play a major role in elucidating the evolutionary processes of aquatic organisms at the molecular level. Ultimately, these tools can provide important indicators for implementing conservation strategies. Although many technological challenges remain in applying bioinformatics as a conservation approach, all stakeholders must be actively involved in making this effort a reality.
