Abstract
Introduction
Animal genomes harbor several tens of thousands of protein-coding and RNA-coding genes, and the rest are regulatory elements adjacent to genes.1
Although there are intergenic sequences that have been called “gene deserts”, it is believed that a majority of them may also be parts of genes that have not yet been discovered.2,3 It is important for the entire genome to be regulated in a timely and accurate manner through a battery of processes with distinct mechanisms. In prokaryotes (such as
Previous studies on minimal gene clustering have largely focused on three basic categories of paired orientation, defined by the relative transcription direction of two neighboring genes: divergently-paired genes (DPGs, positioned head-to-head and transcribed in opposite directions), co-directionally-paired genes (CDPGs, positioned head-to-tail and transcribed in the same direction), and convergently-paired genes (CPGs, positioned tail-to-tail and transcribed toward each other).10,11 It has been suggested that tandem duplication may be the major cause of these paired genes (especially CDPGs), and that promoter sharing is a plausible explanation for the occurrence of DPGs.4,12 It has been reported that the proportion of DPGs is positively correlated with gene density, as DPGs tend to keep their transcription directions over relatively large evolutionary time scales (eg, in a human-to-fugu comparison).10 Compared with CDPGs and CPGs, DPGs tend to perform similar biological functions and to be involved in housekeeping functions, and the expression of DPGs is often positively correlated (with minor exceptions) at different developmental stages and under pathologic conditions.10,11 Furthermore, when comparing dynamic structural features of DPGs between vertebrates and insects, we found that all three categories of paired genes in insects are less conserved than their vertebrate counterparts, although DPGs in insects also tend to form functional clusters and to share promoters.13
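The three orientation classes can be made concrete with a short sketch. The Python function below is illustrative only and is not part of LCGbase; the name `classify_pair` and the input convention (the two genes are given in order of genomic coordinate, with strands written as "+" and "-") are assumptions made for this example.

```python
def classify_pair(left_strand: str, right_strand: str) -> str:
    """Classify two adjacent genes by their relative transcription direction.

    Assumption: `left_strand` belongs to the gene with the smaller genomic
    coordinate, `right_strand` to its right-hand neighbor.
    """
    valid = {"+", "-"}
    if left_strand not in valid or right_strand not in valid:
        raise ValueError("strand must be '+' or '-'")

    if left_strand == right_strand:
        return "CDPG"  # -> -> or <- <- : head-to-tail, same direction
    if left_strand == "-":
        return "DPG"   # <- -> : head-to-head, transcribed away from each other
    return "CPG"       # -> <- : tail-to-tail, transcribed toward each other


if __name__ == "__main__":
    for pair in [("-", "+"), ("+", "-"), ("+", "+"), ("-", "-")]:
        print(pair, classify_pair(*pair))
```

Because the class depends only on the two strands, a genome-wide survey reduces to sorting genes by position on each chromosome and applying this function to every adjacent pair.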
As to the intergenic distance (longer in metazoa and shorter in fungi), although the distance between the transcription starts of two co-regulated DPGs ranges from a few hundred to around one thousand basepairs,12 we recognize the possible function of the sequences between two neighboring DPGs, often tens of kilo-basepairs in length, with respect to co-expression and shared regulatory elements.14 Furthermore, the bimodality of intergenic distances observed among mammalian gene pairs (but not in other vertebrates) suggests that mammals share certain common features in transcription regulation.11 How the length of intergenic regions affects the contiguity in regulating multiple genes remains to be elucidated.
As next-generation sequencing technology matures, both cost and throughput favor more basic data acquisition. In future studies, lineage-based data organization will take over from the “one-covers-all” fashion, and more tools will be developed for handling both larger and more numerous genomes, in addition to those for smaller and single genomes, such as those of the mitochondrion,15 plastid,16 and yeast.17,18 A recent study has expanded a gene order browser to 74 species but covers only four mammals.19 In this study we curated 38 mammalian and 14 other animal genomes (using only one fungus as an out-group) to discover and display conserved gene clusters across mammals and their sub-groups, such as primates, large mammals, and rodents. In particular, we combine two concepts: stringently-defined, lineage-specific conserved core paired genes (based on both orthology and transcriptional direction), and the gene order of the ten consecutive genes flanking the core paired genes. We also offer a series of toolkits covering GO functional annotation, promoter identification, gene expression, and evolution analysis to help characterize features of gene clusters (Fig. 1).

Figure 1. A flowchart illustrating the content and organization of LCGbase.
Using LCGbase, we would like to address several pressing questions: (1) Although mammalian gene order and genome organization are thought to be non-random across the chromosomes, what is the precise number of genes that tend to move around or to form clusters? (2) How are clustered genes conserved across various definable lineages? Are the forming-and-breaking events evolutionarily selected and functionally meaningful? What are the mechanisms, including rearrangement, translocation, inversion, recombination, duplication, and transposon-mediated episodes, that alter clustered genes? (3) Are we able to define a “core clustered set” for different lineages or subgroups? Are there identifiable chromosomal regions whose gene clusters are evolutionarily stable? (4) How are gene clusters related to nucleosome positioning and chromosome folding in the nucleus?20,21 The questions continue, but the answers will be what we need to know about every single gene and its position on the chromosome, not only physically but also functionally.
Functionality
There are several ways to reach the data available in LCGbase. First, one can use the browse option to list all annotated genes in the 53 species, and each gene can be reached through the link on its gene ID. Second, one can take advantage of gene positioning or clustering information by using a gene ID to retrieve its neighbouring genes within and across lineages. In particular, the search is strand-sensitive when used to detect strand-specific organizational features of gene clusters and their variations. The database also sorts TSS (transcript start site) distances between two adjacent genes into five roughly defined categories: 0–1 kbp, 1–10 kbp, 10–50 kbp, 50–100 kbp, and > 100 kbp. It displays ten genes to the left or right of the core gene cluster and highlights all the genes on screen in different colours to indicate their orthologous groups. Furthermore, it assigns random group numbers to order all groups (Fig. 2). Genes that are not assigned to groups are labelled with “X”. Users can click on the hyperlink for each gene to check detailed annotations (eg, location, structure, ontology, and family). Third, the result page also displays gene orders from different species according to taxonomic and lineage definitions, such as mammals (primates, rodents, afrosoricida, carnivora, chiroptera, lagomorpha), birds (galliformes and passeriformes), reptiles (squamata), amphibians (anura), fishes (beloniformes, tetraodontiformes, cypriniformes, gasterosteiformes), insects (diptera), chordata (enterogona), nematoda (rhabditida), and fungi (saccharomycetales). The information helps to reveal lineage-specific dynamic patterns or rules of gene clusters in lineage groups and sub-groups. In particular, the database provides three kinds of downloadable files (xls, csv and html) containing the information that appears on the search result page, including species, gene ID, strand category, and group number. Fourth, we also count species number, strand-specificity, and orthologous genes. Fifth, the database also provides BLAST tools22 (ie, to match cDNA sequences with blastn and protein sequences with blastp or blastx) to help users study their query sequences and associate them with data in LCGbase as well as other databases.
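The five TSS distance categories described above lend themselves to a small lookup. The sketch below is a minimal illustration rather than the database's own code: the function name `tss_distance_category` is hypothetical, and treating each upper bound as inclusive (so that exactly 1,000 bp falls in the first category) is an assumption, since the text describes the categories only as roughly defined.

```python
from bisect import bisect_left

# Upper bounds (in bp) of the first four categories; anything larger
# falls into the open-ended last category.
_UPPER_BOUNDS_BP = [1_000, 10_000, 50_000, 100_000]
_LABELS = ["0-1 kbp", "1-10 kbp", "10-50 kbp", "50-100 kbp", ">100 kbp"]


def tss_distance_category(tss_a: int, tss_b: int) -> str:
    """Return the distance category for two transcript start sites.

    Both positions are genomic coordinates on the same chromosome.
    Assumption: upper bounds are inclusive (1,000 bp -> "0-1 kbp").
    """
    distance = abs(tss_a - tss_b)
    return _LABELS[bisect_left(_UPPER_BOUNDS_BP, distance)]


if __name__ == "__main__":
    print(tss_distance_category(5_000, 5_400))    # 400 bp apart
    print(tss_distance_category(100, 250_100))    # 250 kbp apart
```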

Figure 2. An example of the LCGbase browser (A) and a search result (B). The queried gene is ENSG00000171612.
Due to co-regulation, genes in a cluster may have related functions, share promoters, evolve at a similar rate or in a distinct pattern, and show significantly correlated expression. LCGbase provides several easy-to-use tools to facilitate the analysis of these features. Because the gene ID used in this database is the Ensembl gene ID, an ID Converter tool converts gene IDs from other systems (eg, Entrez Gene ID, Gene Symbol, RefSeq mRNA ID and RefSeq protein ID) into Ensembl gene IDs. The GO Function Classification tool compares a query gene list with all genes in both species and GO terms (with at least 10 genes)23 and performs gene function enrichment analysis to determine whether gene clusters tend to be functionally related. This tool adopts the Fisher Exact Test from the Perl Text-NSP module (http://search.cpan.org/dist/Text-NSP/) combined with four multiple testing correction options (ie, Bonferroni correction, Bonferroni Step-down [Holm] correction, Benjamini-Hochberg False Discovery Rate, and no adjustment).24 Four cut-off values can be chosen: 0.1, 0.05, 0.01, and 0.001. The Promoter Analysis tool compares a query nucleotide sequence with the regions upstream and downstream (from –499 bp to 100 bp, or from –9999 bp to 6000 bp) of experimentally identified transcript start sites (TSSs) in the Eukaryotic Promoter Database (EPD), a promoter sequence collection for model organisms.25 To illustrate the co-expressed genes in a cluster, we introduced co-expression data for seven animals (human, mouse, rat, chicken, zebrafish, fly, and nematode) from COXPRESdb (Gene Coexpression Database).26 We adopted the R package “BioNet” to draw a network,27 when a query gene has correlated expression with other query genes. The Evolution Analysis tool includes the KaKs_Calculator 2.0 toolkit,28 which adopts multiple algorithms and alternative codon tables to compute nonsynonymous (Ka) and synonymous (Ks) substitution rates. The ratio of Ka to Ks is a popular statistical measure of selection on one or multiple pairs of protein-coding genes, and one may want to know whether several genes in a cluster evolve simultaneously.
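The enrichment step can be sketched as follows. LCGbase itself relies on the Perl Text-NSP module; the Python below is a stand-in written for illustration, and all function names are hypothetical. It assumes a one-sided (over-representation) Fisher exact test, computed as the upper tail of the hypergeometric distribution, and implements the three named corrections in their standard textbook forms.

```python
from math import comb


def enrichment_p_value(hits: int, query_size: int,
                       annotated: int, background: int) -> float:
    """One-sided Fisher exact test: P(X >= hits) when `query_size` genes
    are drawn from `background` genes, `annotated` of which carry the term."""
    upper = min(query_size, annotated)
    favorable = sum(
        comb(annotated, i) * comb(background - annotated, query_size - i)
        for i in range(hits, upper + 1)
    )
    return favorable / comb(background, query_size)


def bonferroni(pvals):
    """Multiply every p-value by the number of tests, capped at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def holm(pvals):
    """Bonferroni step-down: the i-th smallest p-value is scaled by the
    number of hypotheses not yet rejected, keeping the result monotone."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, (m - rank) * pvals[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted


def benjamini_hochberg(pvals):
    """False discovery rate adjustment, computed from the largest
    p-value downward so adjusted values stay monotone."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):
        i = order[rank]
        running_min = min(running_min, pvals[i] * m / (rank + 1))
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03, 0.005]
    print(bonferroni(raw))
    print(holm(raw))
    print(benjamini_hochberg(raw))
```

A GO term would then be reported as enriched when its adjusted p-value falls below the chosen cut-off (0.1, 0.05, 0.01, or 0.001).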
In the statistics section, we draw two types of figures to describe the TSS distance and the minimal distance for the three cluster classes: CDPGs, CPGs, and DPGs. The minimal distance is defined as (1) the difference between the 5'-end of the downstream transcript and the 3'-end of the upstream transcript for CDPGs, (2) the difference between the 3'-end of the downstream transcript and the 3'-end of the upstream transcript for CPGs, and (3) the difference between the 5'-end of the downstream transcript and the 5'-end of the upstream transcript for DPGs. On the download page, we also provide the characterized features of gene pairs (“–>–>”, “–><–” and “<– –>” represent CDPGs, CPGs and DPGs, respectively), including gene pair ID, order class, TSS distance, minimal distance, chromosome, gene ID, transcript ID, protein ID, and strand, as well as the transcription start site and transcription end site of both genes.
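The two distance measures can be illustrated with a short sketch. It is not the LCGbase implementation: the `Transcript` class and function names are hypothetical, and two interpretive assumptions are made. For co-directional pairs, "upstream" and "downstream" are read in the transcriptional sense (on the minus strand the right-hand gene is upstream), and the absolute value is taken, so overlapping transcripts are not distinguished from separated ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    start: int   # leftmost genomic coordinate
    end: int     # rightmost genomic coordinate
    strand: str  # "+" or "-"

    @property
    def five_prime(self) -> int:
        """Transcription start site (TSS)."""
        return self.start if self.strand == "+" else self.end

    @property
    def three_prime(self) -> int:
        """Transcription end site."""
        return self.end if self.strand == "+" else self.start


def tss_distance(a: Transcript, b: Transcript) -> int:
    return abs(a.five_prime - b.five_prime)


def minimal_distance(left: Transcript, right: Transcript) -> int:
    """Minimal distance for an adjacent pair; `left` has the smaller coordinates."""
    if left.strand == right.strand:                      # CDPG
        # Assumption: upstream/downstream follow the direction of transcription.
        upstream, downstream = (left, right) if left.strand == "+" else (right, left)
        return abs(downstream.five_prime - upstream.three_prime)
    if left.strand == "+":                               # CPG: -> <-
        return abs(right.three_prime - left.three_prime)
    return abs(right.five_prime - left.five_prime)       # DPG: <- ->


if __name__ == "__main__":
    dpg = (Transcript(1000, 2000, "-"), Transcript(2500, 4000, "+"))
    cpg = (Transcript(1000, 2000, "+"), Transcript(2500, 4000, "-"))
    print(tss_distance(*dpg), minimal_distance(*dpg))
    print(tss_distance(*cpg), minimal_distance(*cpg))
```

Under these assumptions the minimal distance is always the gap between the facing ends of the two transcripts, and for DPGs it coincides with the TSS distance.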
Case Study
Data Collection
We collected positions of genes, transcripts, and proteins as well as other annotation information (eg, Gene Ontology and gene family classification) for 53 species across broad lineages (including vertebrates, insects, nematode, and fungi) from Ensembl/BioMart version 62 (www.ensembl.org).29 We selected only the transcript with the longest coding sequence to represent each gene or gene locus. Gene orthology relationships were also retrieved from this database, and we defined orthologs between human and the other 52 species as well as paralogs within human. In detail, we assumed that homology is transitive, so we merged paired homologs into groups until the number of groups became stable (converged). Based on this evolutionary principle and the phylogenetic relationships, we classified all genes into homologous groups.
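Merging pairwise homologs under a transitivity assumption amounts to finding connected components. The sketch below uses a union-find (disjoint-set) structure, which reaches the same converged grouping as repeated merging; it is one possible way to do this and not necessarily how LCGbase was built. The function name `group_homologs` and the gene identifiers in the example are hypothetical.

```python
def group_homologs(pairs):
    """Merge pairwise homolog relations into groups, treating homology
    as transitive (if A~B and B~C, then A, B, and C share a group)."""
    parent = {}

    def find(gene):
        # Register unseen genes as their own root, then walk to the root,
        # shortening the path along the way (path halving).
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # union the two groups

    groups = {}
    for gene in parent:
        groups.setdefault(find(gene), set()).add(gene)
    return list(groups.values())


if __name__ == "__main__":
    # Hypothetical identifiers: human (h), mouse (m), and rat (r) genes.
    pairs = [("h1", "m1"), ("m1", "r1"), ("h2", "h3")]
    for group in group_homologs(pairs):
        print(sorted(group))
```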
Implementation
This database is built on a GNU/Linux LAMP web-database framework (OS: Linux; web server: Apache; database management program: MySQL; server-side scripting language: PHP). On the server side, PHP calls Perl scripts and R functions and uses the GD module through its API (application programming interface) to generate 2D graphs. On the browser side, we use HTML, JavaScript, and CSS to give users a more convenient interface. We also chose SQL scripts and an appropriate storage engine for MySQL to optimize database performance, with three heavily loaded record tables (gene, orthologous group, and gene annotation) holding the information of the 53 species. To speed up searches and time-consuming tasks, we created full-text indexes for key fields in the database and applied query optimization for high-performance matching in MySQL, together with optimization of SQL syntax.
Future Work
First, we plan to update the database whenever new species are sequenced and new assemblies are released. We will focus on insect or arthropod genomes for comparative analysis with vertebrate genomes. Furthermore, with the I5K initiative (to sequence 5,000 insect genomes in the next five years), a large number of insect genomes may soon be available. Our preliminary analysis of the two dozen or so sequenced plant genomes also revealed clustering features, but due to the lack of contiguity within the genome assemblies, we are not able to include the data in our database at present. In the future, however, we will bring plant genomes into the database to study gene clustering/ordering and distinct gene organizational parameters, such as large genes with small intergenic regions in animals and small genes with larger intergenic regions in plants.30 We will also curate new annotations when they are published, including regulatory elements and new genes, such as those from ENCODE (The Encyclopedia of DNA Elements) and similar projects.31,32 Second, we will increase the complexity of our curation. For instance, our current organization of genes and their clusters is basically linear. We should be able to incorporate chromosomal structures and organizational information in a tempo-spatial fashion, such as early and late replicated/transcribed genes. We should also be able to map nucleosome positioning and packaging information.33 Third, we can extend the concept of “co-expression” or “co-regulation” from genes within clusters to neighboring clusters and to clusters on chromosomes and chromosome regions (such as subtelomeric and subcentromeric regions). These new additions will lead to a network of genes and their relationships, a path toward systems biology. Finally, we hope to reveal regulatory mechanisms and their related genes that control lineage-specific or species-specific characteristics over evolutionary time scales.
Competing Interests
The authors declare that they have no competing interests.
Supplementary Materials
Supplementary figures S1, S2, and S3 are available from 8540 Supplementary Files.zip
