Evolutionary processes shape the genomic landscape in both stochastic and predictable ways. The molecular patterns observed in protein-coding regions (CDS) across multiple species can inform us about the relative roles of mutation, selection, and drift at the level of individual codons. With high-quality genome assemblies and expert annotations, the common fruit fly, Drosophila melanogaster, has become an indispensable genomic tool for comparative and functional analysis.
We provide flyDIVaS as an updatable evolutionary genomics resource of Drosophila divergence and selection, based on current FlyBase gene models and OrthoDB orthology calls. flyDIVaS provides D. melanogaster-centric orthology-paralogy calls, gene/protein alignments, and dN/dS statistics, including model-based tests of positive selection. In this version, flyDIVaS_v[V], of the 13,920 protein-coding genes in FlyBase R6.05, ~80% have at least one ortholog in the closely related species D. simulans, and ~50% have at least one ortholog in all 12 species. Which species-group dataset to choose depends on the question being addressed. flyDIVaS provides data via two portals: 1) a gene-specific summary for D. melanogaster-based orthologs, and 2) complete datasets for download.
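Model-based tests of positive selection are commonly framed as likelihood-ratio tests between nested codon models. The sketch below illustrates that arithmetic only. It assumes a comparison with two degrees of freedom (as in the PAML site-model pairs M1a vs. M2a or M7 vs. M8); which comparisons flyDIVaS reports is not stated here, and the function name and log-likelihood values are hypothetical.

```python
import math
from typing import Tuple


def lrt_df2(lnl_null: float, lnl_alt: float) -> Tuple[float, float]:
    """Likelihood-ratio test for two nested models differing by 2 parameters.

    Returns the test statistic 2 * (lnL_alt - lnL_null) and its p-value.
    For a chi-square distribution with 2 degrees of freedom the survival
    function is exactly exp(-x / 2), so no statistics library is needed.
    """
    # Numerical noise can make the alternative fit marginally worse; clamp at 0.
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    p_value = math.exp(-stat / 2.0)
    return stat, p_value


# Hypothetical log-likelihoods for one gene (illustrative values only).
stat, p_value = lrt_df2(lnl_null=-1234.56, lnl_alt=-1230.12)
print(f"2*deltaLnL = {stat:.2f}, p = {p_value:.4f}")
```

When many genes are tested in this way, the resulting p-values would normally be corrected for multiple testing before any gene is called positively selected.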
Briefly, flyDIVaS extracts protein-coding genes from the latest FlyBase release and identifies 1:1 orthologs across the species of the 12 Drosophila Genomes project (Clark et al. 2007) using OrthoDB (Waterhouse et al. 2011). Protein sequences in FASTA format are aligned with MUSCLE (Edgar 2004), back-translated to CDS alignments, and then filtered and masked. The CDS alignments are then analyzed under different evolutionary models in PAML (Yang 1997). This pipeline is repeated across several taxonomic groups to provide robust datasets for a variety of questions. More details can be found in the manuscript (submitted).
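The back-translation step can be illustrated with a minimal sketch: each residue in an aligned protein is replaced by its original codon, and each alignment gap by three gap characters. This is not the flyDIVaS implementation; the function name, the handling of a trailing stop codon, and the toy sequences are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def backtranslate(aligned_protein: str, cds: str, gap: str = "-") -> str:
    """Thread an unaligned CDS onto an aligned protein, one codon per residue."""
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    n_residues = sum(1 for aa in aligned_protein if aa != gap)

    # Tolerate a trailing stop codon that has no residue in the protein.
    if len(codons) == n_residues + 1 and codons[-1].upper() in STOP_CODONS:
        codons = codons[:-1]

    if len(cds) % 3 != 0 or len(codons) != n_residues:
        raise ValueError("CDS length does not match the protein sequence")

    codon_iter = iter(codons)
    # Gap columns become three gap characters; residues consume the next codon.
    return "".join(gap * 3 if aa == gap else next(codon_iter)
                   for aa in aligned_protein)


# Toy two-species example (made-up sequences, not real gene models).
protein_alignment = {"dmel": "MK-L", "dsim": "MKAL"}
cds_sequences = {"dmel": "ATGAAACTGTAA", "dsim": "ATGAAGGCTCTG"}

codon_alignment = {
    species: backtranslate(protein_alignment[species], cds_sequences[species])
    for species in protein_alignment
}
for species, seq in codon_alignment.items():
    print(species, seq)
```

Aligning at the protein level and then threading the codons back keeps gaps in multiples of three, so the reading frame is preserved in the codon alignment that is passed to the codon models.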
This genomics resource provides an update of the now-defunct AAA (Assembly/Alignment/Annotation) site generated by the Drosophila community almost a decade ago for the 12 Drosophila Genomes project (Clark et al. 2007). We would like to take this opportunity to thank those who first generated this important resource for the community, particularly members of the Andrew Clark and Michael Eisen labs. If you use data from this database, please cite both Clark et al. (2007) and this manuscript (currently being submitted) once it is accepted. Special thanks also go to Keith Davis, Gonen Shoham, and Steven Weaver for website development. Finally, we wish to dedicate flyDIVaS to the late Bill Gelbart, who was instrumental in organizing and supporting the initial 12 Drosophila genome initiative and whose selfless devotion to providing data to the community inspired the development of this database.
In the future, we would like to include educational curricula to train students and researchers in how to handle, analyze, and visualize these large datasets. In the meantime, we suggest that budding coders and informaticians learn to work with such data using online resources such as our "playbigdata.org" Linux-based training modules.