This is a thread about how anyone can use publicly accessible sequence databases and free software to discover new things about #genomes, #biology, and life on earth.
Genomes are absolutely loaded with complex information, most of which we still don’t understand.
Complete and near complete genome sequences are now available for many organisms.
However, decoding the information in these genomes remains a slow and difficult process.
A large proportion of most published genome sequences consists of DNA whose evolutionary origins and functional significance are incompletely understood.
Sequence similarity searches, such as those implemented in the *basic local alignment search tool* (BLAST) program, are extremely useful devices for investigating this ‘dark genome’ in silico.
Similarity searches can be used to investigate the properties of a given sequence via comparison to a reference database of annotated sequences.
Similar sequences are (potentially) evolutionarily related and may therefore be expected to have related properties.
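To make "similarity" concrete, here is a minimal sketch of the simplest possible measure: percent identity between two sequences that have already been aligned. The two protein fragments are made up for illustration, and this is not how BLAST scores alignments (BLAST uses substitution matrices and gap penalties); it only shows the basic idea of counting matching positions.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue  # skip gapped columns
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Hypothetical aligned protein fragments (invented for this example).
query = "LPQGFKNSPAIFQ"
subject = "LPQGMKNSP-LFQ"

print(f"{percent_identity(query, subject):.1f}% identical")
```

Two sequences that score highly on a measure like this are candidates for being homologous, which is the starting point for everything that follows.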
While BLAST is widely used in molecular biology, it is often used only in a confirmatory way.
For example, during my PhD I often used BLAST searches to confirm that amplicons obtained via PCR represented the intended target sequence.
However, based on the principle that sequence similarity reflects homology (i.e. evolutionary relatedness), similarity searches can be used to explore genome sequences beyond the constraints of existing maps.
There are many inventive ways in which sequence similarity search tools can be deployed.
For example, PSI-BLAST uses an iterated search strategy to increase sensitivity, so that remotely homologous sequences can be detected.
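The core idea behind an iterated search is that hits from one round are turned into a position-specific scoring matrix (PSSM), which is then used as the query for the next round. The sketch below illustrates only that one step, under loud assumptions: a uniform amino acid background, a flat pseudocount, ungapped input, and invented sequences. Real PSI-BLAST uses sequence weighting and more careful statistics, so treat this as a toy, not as the program's actual method.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Build a log-odds matrix: one {residue: score} dict per alignment column."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must have equal length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for col in range(length):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seq in aligned:
            if seq[col] in counts:  # ignore gaps and unknown residues
                counts[seq[col]] += 1
        total = sum(counts.values())
        pssm.append(
            {aa: math.log2((counts[aa] / total) / background) for aa in AMINO_ACIDS}
        )
    return pssm


def score(pssm, seq):
    """Sum the per-column scores; unknown residues contribute 0."""
    return sum(col.get(aa, 0.0) for col, aa in zip(pssm, seq))


# Hypothetical first-round hits (invented 4-residue fragments).
round_one_hits = ["YMDD", "YVDD", "YIDD"]
pssm = build_pssm(round_one_hits)

# A candidate never seen in round one still scores well, because the
# matrix has learned which positions are conserved and which vary.
print(score(pssm, "YLDD") > score(pssm, "AGKP"))
```

Because the matrix tolerates variation at the positions that varied among the first-round hits, it can pick up relatives that a single query sequence would miss.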
The procedure involved an expensive and laborious cloning step.
However, by late 1999 large sections of the human genome had been sequenced.
This meant that novel ERV diversity could be sampled without any labwork, using only sequence similarity search tools such as BLAST.
In 2000 this led to Mike's landmark paper on human endogenous retrovirus (HERV) diversity.
He coined the term 'phylogenetic screening' to describe his approach.
It combines BLAST searches with phylogenetic analysis, as I will demonstrate below.
The reverse transcriptase (RT) gene of retroviruses is relatively refractory to mutation and hence evolves quite slowly.
This means it can reliably be used to identify ERVs in tBLASTn searches of whole genome sequences, as shown here:
In the search shown above I screened the genome of the Sunda pangolin (Manis javanica) using tBLASTn and the simian retrovirus 1 (SRV-1) RT protein sequence as a query.
The top hit spans the length of the RT query and also contains the highly conserved active site (highlighted).
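What makes tBLASTn different from a plain nucleotide search is that the genome is translated in all six reading frames and compared to a protein query. The sketch below shows just that translation step, followed by a crude scan for a Y-x-D-D pattern, which is my assumption for the conserved active-site residues (the YXDD motif generally described for RT); the thread itself does not say which residues are highlighted. The "contig" is invented, and a regular expression is a stand-in for BLAST's alignment scoring, not a substitute for it.

```python
import re
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

RT_MOTIF = re.compile(r"Y.DD")  # assumption: YXDD active-site pattern


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unrecognised codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return {frame label: protein} for frames +1..+3 and -1..-3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def find_motif(dna):
    """List (frame, protein position, matched residues) for every motif hit."""
    hits = []
    for frame, protein in six_frame_translations(dna).items():
        for match in RT_MOTIF.finditer(protein):
            hits.append((frame, match.start(), match.group()))
    return hits


# Hypothetical contig: a fragment encoding Q-Y-M-D-D-I, placed on the
# minus strand so that only a six-frame search would find it.
contig = reverse_complement("CAATATATGGATGACATT")
print(find_motif(contig))
```

The point of the example is that a protein-level signal can sit on either strand and in any frame, which is why a translated search is the natural tool for screening raw genome assemblies.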
The level of sequence conservation in RT means that 'hits' identified via tBLASTn can be phylogenetically analysed. Hence 'phylogenetic screening'.
This makes it possible not only to find new ERV sequences, but also to examine their evolutionary relationships.
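As a rough illustration of the "phylogenetic" half of the approach, the sketch below takes a hit that has been aligned to a set of reference RT fragments and computes simple p-distances (the proportion of differing sites) to each one. All sequences and names here are invented, and assigning a hit to its nearest reference is only a crude proxy: a real analysis would build a multiple sequence alignment and infer a tree with dedicated phylogenetics software.

```python
def p_distance(a, b):
    """Proportion of differing sites, ignoring columns with a gap in either sequence."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    diffs = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x != y:
            diffs += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return diffs / compared


def nearest_reference(hit, references):
    """Return the name of the reference with the smallest p-distance to the hit."""
    return min(references, key=lambda name: p_distance(hit, references[name]))


# Hypothetical aligned fragments (names and sequences invented).
references = {
    "refA": "LPQGFKNSPAIFQ",
    "refB": "LPEGWKDSPTLFA",
}
hit = "LPQGFKNSP-IFQ"

for name, seq in references.items():
    print(name, round(p_distance(hit, seq), 3))
print("closest:", nearest_reference(hit, references))
```

A matrix of distances like this is the raw material for distance-based tree building, which is where the evolutionary relationships among hits start to become visible.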
The same approach can also be applied to other conserved regions of the retrovirus genome, such as the transmembrane (TM) region of the envelope gene (phylogeny pictured).
The chronograms - as viewable on NextStrain (left) - give the impression of structure but the phylograms (right) show very little genetic diversity within the cattle-associated clade.
Because there is so little structure and statistical support for branching relationships, I think it's difficult to conclude much about what is going on in terms of cross-species transmission.
*Tamanaviruses* - a primer (medium thread)
(1) Tamana bat virus (TABV) is a highly divergent #flavivirus that was isolated in 1974 from a bat trapped at the exit of Tamana cave, on the north slope of Trinidad's Mount Tamana.
(2) TABV was identified at the Trinidad Regional #Virus Laboratory (TRVL) as part of a post-war virus discovery effort led by the Rockefeller Foundation.
The #Tamana cave system comprises several large limestone caverns and is a roost for several species of bat.
(3) Millions of bats fly out of Tamana cave exits to feed at dusk, making for an impressive spectacle.
Researchers at TRVL trapped #bats at cave exits, took samples from these bats and attempted to cultivate viruses.
Jingmenviruses are a recently identified group of segmented RNA viruses that are phylogenetically linked to unsegmented flaviviruses. They appear to infect a wide range of animal hosts, including humans.
Jingmenviruses (JMVs) contain two #flavivirus-related segments, as well as additional segments of unknown origin. It is thought that JMVs evolved from unsegmented flaviviruses. doi.org/10.1073/pnas.1…
The evolutionary significance of this shift to a segmented genome is unclear, but comparative studies of JMVs and flaviviruses may be illuminating.