This is a thread about how anyone can use publicly accessible sequence databases and free software to discover new things about #genomes, #biology, and life on earth.
#Bioinformatics #Genomics #Evolution #Science #MolecularBiology #DNA #Research #DataScience #LifeSciences
Genomes are absolutely loaded with complex information, most of which we still don’t understand.
Complete and near complete genome sequences are now available for many organisms.
However, decoding the information in these genomes remains a slow and difficult process.
A large proportion of most published genome sequences consists of DNA that is incompletely understood in terms of its evolutionary origins and functional significance.
Sequence similarity searches, such as those implemented in the *basic local alignment search tool* (BLAST) program, are extremely useful devices for investigating this ‘dark genome’ in silico.
blast.ncbi.nlm.nih.gov
Similarity searches can be used to investigate the properties of a given sequence via comparison to a reference database of annotated sequences.
Similar sequences are (potentially) evolutionarily related and therefore may be expected to have related properties.
While BLAST is widely used in molecular biology, it is often used only in a confirmatory way.
For example, during my PhD I often used BLAST searches to confirm that amplicons obtained via PCR represented the intended target sequence.
However, based on the principle that sequence similarity reflects homology (i.e. evolutionary relatedness), similarity searches can be used to explore genome sequences beyond the constraints of existing maps.
There are many inventive ways in which sequence similarity search tools can be deployed.
For example, PSI-BLAST uses iterated search strategies to increase sensitivity, so that remotely homologous sequences may be detected.
ncbi.nlm.nih.gov/books/NBK2590/
However, even standard BLAST searches can reveal new information about genome features.
For example, most endogenous viral elements (EVEs) have been identified via naive, BLAST-based screens.
I'll demonstrate these approaches as part of this thread.
#Paleovirology #DataMining
First a quick bit of background as to how I became involved with in silico genome mining.
I got my PhD in Mike Tristem's lab at Imperial College, researching endogenous retroviruses (ERVs).
We used PCR to amplify miscellaneous ERV sequences from genomic DNA.
journals.asm.org/doi/10.1128/JV…
The procedure involved an expensive and laborious cloning step.
However, by late 1999 large sections of the human genome had been sequenced.
This meant that novel ERV diversity could be sampled without any labwork, using only sequence similarity search tools such as BLAST.
In 2000 this led to Mike's landmark paper on human endogenous retrovirus (HERV) diversity.
doi.org/10.1128/JVI.74…
He coined the term 'phylogenetic screening' to describe his approach.
It combines BLAST searches with phylogenetic analysis, as I will demonstrate below.
The reverse transcriptase (RT) gene of retroviruses is relatively refractory to mutation and hence evolves quite slowly.
This means it can reliably be used to identify ERVs in tBLASTn searches of whole genome sequences, as shown here:
In the search shown above I screened the genome of the Sunda pangolin (Manis javanica) using tBLASTn and the simian retrovirus 1 (SRV-1) RT protein sequence as a query.
The top hit spans the length of the RT query and also contains the highly conserved active site (highlighted).
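To make the idea behind tBLASTn more concrete: it compares a protein query against a nucleotide database translated in all six reading frames. The following is a minimal Python sketch of that six-frame translation step only, not of BLAST's alignment algorithm. The function names and the toy DNA string are made up for illustration, and the motif YMDD is used simply as one example of the YXDD pattern found at the RT active site.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Map frame labels ('+1'..'+3', '-1'..'-3') to translated proteins."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def find_motif(dna, motif):
    """List (frame, protein position) for each frame containing the motif."""
    return [
        (frame, protein.find(motif))
        for frame, protein in six_frame_translations(dna).items()
        if motif in protein
    ]
```

For instance, `find_motif("ACTGTATATGGATGACATT", "YMDD")` returns `[("+2", 1)]`, because the motif only appears when the toy sequence is read in the second forward frame. A real search scores inexact matches as well, which is what lets it pick up diverged ERV copies.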
The level of sequence conservation in RT means that 'hits' identified via tBLASTn can be phylogenetically analysed. Hence 'phylogenetic screening'.
This makes it possible not only to find new ERV sequences, but also to examine their evolutionary relationships.
The same approach can also be applied to other conserved regions of the retrovirus genome, such as the transmembrane (TM) region of the envelope gene (phylogeny pictured).
doi.org/10.1128/JVI.75…
The attached video shows RT-based phylogenetic screening for ERVs in more detail.
The top match from the pangolin search described above is extracted and examined phylogenetically.
It groups with contemporary betaretroviruses, suggesting it has been acquired quite recently.
The steps shown are as follows:
Retrieve the hit in FASTA format.
Add to an alignment of reference RT sequences (here I manually aligned the sequence in Se-Al).
Construct bootstrapped phylogeny (RAxML).
View phylogeny in FigTree.
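The first of these steps can also be scripted. The sketch below is only one possible way to do it, and it assumes the search was run with command-line BLAST+ using tabular output (`-outfmt 6`, with its default 12 columns) rather than through the web interface shown in the video. The helper names, the `contigs` dictionary, and the toy data in the usage note are hypothetical. The alignment, RAxML and FigTree steps are not covered.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    subject_id: str
    s_start: int
    s_end: int
    evalue: float
    bitscore: float


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def parse_tabular(text):
    """Parse default 12-column BLAST tabular output into Hit records."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.split("\t")
        hits.append(Hit(f[1], int(f[8]), int(f[9]), float(f[10]), float(f[11])))
    return hits


def to_fasta(name, seq, width=60):
    """Format one sequence as a FASTA record with wrapped lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return f">{name}\n" + "\n".join(lines) + "\n"


def best_hit_fasta(tabular_text, contigs):
    """Extract the highest-scoring hit region from `contigs` as FASTA.

    Subject coordinates are 1-based and inclusive; a start greater than
    the end indicates a hit on the minus strand.
    """
    best = max(parse_tabular(tabular_text), key=lambda h: h.bitscore)
    lo, hi = sorted((best.s_start, best.s_end))
    region = contigs[best.subject_id][lo - 1:hi].upper()
    strand = "+"
    if best.s_start > best.s_end:
        region, strand = reverse_complement(region), "-"
    return to_fasta(f"{best.subject_id}:{lo}-{hi}({strand})", region)
```

With a toy contig `{"contig_1": "AAACCCGTGTTT"}` and a best hit at subject coordinates 9 to 4, the function returns a record headed `>contig_1:4-9(-)` containing `CACGGG`, i.e. the reverse complement of the hit region. The resulting FASTA record is what would then be added to the reference RT alignment.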
This approach is not limited to retroviruses - it can be used to investigate any sufficiently conserved genome feature.
e.g. in a 2020 paper with @SystemsVirology we used phylogenetic screening to investigate the diversity and evolution of APOBEC genes.
doi.org/10.1073/pnas.1…
BLAST can also be used to help discern the structure of genome features and loci.
e.g. the attached video shows a trick for finding the long terminal repeat (LTR) sequences that flank endogenous retrovirus (ERV) genomes.
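The sketch below is not the method from the video. It only illustrates the underlying idea: the two LTRs of a provirus are direct repeats of each other, so comparing a locus against itself should reveal a matching pair of segments at either end. As a simplifying assumption, it looks for exact repeats seeded by k-mers, whereas real LTR pairs accumulate mutations and need a tolerant comparison such as a BLAST alignment of the locus against itself. The function name and the value of `k` are arbitrary choices.

```python
def longest_direct_repeat(seq, k=8):
    """Find the longest exact, non-overlapping direct repeat in `seq`.

    Returns (start_of_first_copy, start_of_second_copy, length) using
    0-based positions, or (0, 0, 0) if no repeat of at least k bases exists.
    """
    seq = seq.upper()
    first_seen = {}  # k-mer -> position of its first occurrence
    best = (0, 0, 0)
    for j in range(len(seq) - k + 1):
        kmer = seq[j:j + k]
        i = first_seen.setdefault(kmer, j)
        if j - i < k:
            continue  # first occurrence, or the two copies would overlap
        length = k
        # Extend the match to the right while the bases agree and the
        # first copy stays entirely upstream of the second copy.
        while (
            j + length < len(seq)
            and i + length < j
            and seq[i + length] == seq[j + length]
        ):
            length += 1
        if length > best[2]:
            best = (i, j, length)
    return best
```

On a toy locus built as flank + repeat + internal region + repeat + flank, the function reports the start of each copy of the repeat and its length, which mirrors the LTR, internal region, LTR layout of an ERV.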