Robert Gifford
Sep 19, 2022 · 21 tweets · 10 min read
This is a thread about how anyone can use publicly accessible sequence databases and free software to discover new things about #genomes, #biology, and life on earth.

#Bioinformatics #Genomics #Evolution #Science #MolecularBiology #DNA #Research #DataScience #LifeSciences
Genomes are absolutely loaded with complex information, most of which we still don’t understand.

Complete and near complete genome sequences are now available for many organisms.

However, decoding the information in these genomes remains a slow and difficult process.
A large proportion of most published genome sequences consists of DNA that is incompletely understood in terms of its evolutionary origins and functional significance.
Sequence similarity searches, such as those implemented in the *basic local alignment search tool* (BLAST) program, are extremely useful devices for investigating this ‘dark genome’ in silico.

blast.ncbi.nlm.nih.gov
Similarity searches can be used to investigate the properties of a given sequence via comparison to a reference database of annotated sequences.

Similar sequences are (potentially) evolutionarily related and may therefore be expected to have related properties.
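To make the idea of annotation by similarity concrete, here is a minimal sketch that ranks a tiny set of annotated reference sequences by how many short words (k-mers) they share with a query. This is only a crude stand-in for the word-matching idea behind BLAST, not BLAST itself, and the sequences, names and annotations are all made up for illustration.

```python
def kmers(seq, k=3):
    """Return the set of all overlapping words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def rank_by_shared_kmers(query, references, k=3):
    """Rank annotated references by the fraction of query k-mers they contain."""
    query_words = kmers(query, k)
    if not query_words:
        raise ValueError("query is shorter than k")
    ranked = []
    for name, (annotation, seq) in references.items():
        shared = len(query_words & kmers(seq, k))
        ranked.append((shared / len(query_words), name, annotation))
    # Highest fraction of shared words first.
    return sorted(ranked, reverse=True)


# Hypothetical toy data: names, annotations and sequences are invented.
TOY_REFERENCES = {
    "ref_A": ("toy polymerase-like protein", "MKTLLVDDYMDDLLIASQ"),
    "ref_B": ("toy capsid-like protein", "MGSSHHPQRWEEVNKACT"),
}
TOY_QUERY = "KTLLVDDYMDDILIAS"

ranked = rank_by_shared_kmers(TOY_QUERY, TOY_REFERENCES)
best_score, best_name, best_annotation = ranked[0]
print(f"best match: {best_name} ({best_annotation}), shared words: {best_score:.2f}")
```

The best-scoring reference suggests a tentative annotation for the query. A real search would use BLAST, which extends word matches into scored local alignments and reports statistical significance.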
While BLAST is widely used in molecular biology, it is often used only in a confirmatory way.

For example, during my PhD I often used BLAST searches to confirm that amplicons obtained via PCR represented the intended target sequence.
However, based on the principle that sequence similarity reflects homology (i.e. evolutionary relatedness), similarity searches can be used to explore genome sequences beyond the constraints of existing maps.
There are many inventive ways in which sequence similarity search tools can be deployed.

For example, PSI-BLAST uses iterated search strategies to increase sensitivity, so that remotely homologous sequences may be detected.

ncbi.nlm.nih.gov/books/NBK2590/
However, even standard BLAST searches can reveal new information about genome features.

For example, most endogenous viral elements (EVEs) have been identified via naive, BLAST-based screens.

I'll demonstrate these approaches as part of this thread.

#Paleovirology #DataMining
First a quick bit of background as to how I became involved with in silico genome mining.
I got my PhD in Mike Tristem's lab at Imperial College, researching endogenous retroviruses (ERVs).

We used PCR to amplify miscellaneous ERV sequences from genomic DNA.

journals.asm.org/doi/10.1128/JV…

The procedure involved an expensive and laborious cloning step.
However, by late 1999 large sections of the human genome had been sequenced.

This meant that novel ERV diversity could be sampled without any labwork, using only sequence similarity search tools such as BLAST.
In 2000 this led to Mike's landmark paper on human endogenous retrovirus (HERV) diversity.

doi.org/10.1128/JVI.74…

He coined the term 'phylogenetic screening' to describe his approach.

It combines BLAST searches with phylogenetic analysis, as I will demonstrate below.
The reverse transcriptase (RT) gene of retroviruses is relatively refractory to mutation and hence evolves quite slowly.

This means it can reliably be used to identify ERVs in tBLASTn searches of whole genome sequences, as shown here:
In the search shown above I screened the genome of the Sunda pangolin (Manis javanica) using tBLASTn and the simian retrovirus 1 (SRV-1) RT protein sequence as a query.

The top hit spans the length of the RT query and also contains the highly conserved active site (highlighted).
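What makes tBLASTn suitable here is that it compares a protein query against a nucleotide database translated in all six reading frames. The sketch below shows that translation step on a toy DNA string and then looks for a short protein motif in each frame. It is not how BLAST is implemented, and the `Y.DD` pattern (the YXDD motif commonly described for the RT active site) is my assumption, since the thread does not name the motif.

```python
import re
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate DNA from its first base; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return translations of all six reading frames, keyed '+1'..'-3'."""
    dna = dna.upper()
    frames = {}
    for sign, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{sign}{offset + 1}"] = translate(strand[offset:])
    return frames


def find_motif_frames(dna, pattern):
    """List the frames whose translation matches a protein regex."""
    return [name for name, prot in six_frames(dna).items() if re.search(pattern, prot)]


# Toy sequence (invented): encodes Y-M-D-D in frame +3.
toy_dna = "GGTATATGGATGACA"
print(six_frames(toy_dna))
print(find_motif_frames(toy_dna, r"Y.DD"))
print(find_motif_frames(reverse_complement(toy_dna), r"Y.DD"))
```

Searching both strands matters because an ERV can be integrated in either orientation relative to the assembled contig.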
The level of sequence conservation in RT means that 'hits' identified via tBLASTn can be phylogenetically analysed. Hence 'phylogenetic screening'.

This makes it possible not only to find new ERV sequences, but also to examine their evolutionary relationships.
The same approach can also be applied to other conserved regions of the retrovirus genome, such as the transmembrane (TM) region of the envelope gene (phylogeny pictured).

doi.org/10.1128/JVI.75…
The attached video shows RT-based phylogenetic screening for ERVs in more detail.

The top match from the pangolin search described above is extracted and examined phylogenetically.

It groups with contemporary betaretroviruses, suggesting it has been acquired quite recently.
The steps shown are as follows:

Retrieve the hit in FASTA format.

Add to an alignment of reference RT sequences
(here I manually aligned the sequence in Se-Al).

Construct bootstrapped phylogeny (RAxML).

View phylogeny in FigTree.
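The first step above (retrieving the hit in FASTA format) can also be scripted. This is a minimal sketch, assuming the search was saved in BLAST's default 12-column tabular format (`-outfmt 6`) and that the contig sequence is already in memory. The contig, coordinates and scores are invented, and in the video I did this step by hand rather than with code like this.

```python
# Default column order of BLAST tabular output (-outfmt 6).
COLUMNS = (
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
)
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def parse_tabular(text):
    """Parse tabular BLAST output into a list of dicts (one per hit)."""
    hits = []
    for line in text.strip().splitlines():
        if not line or line.startswith("#"):
            continue
        hit = dict(zip(COLUMNS, line.split("\t")))
        # Only the fields used below are converted; the rest stay as strings.
        for key in ("sstart", "send"):
            hit[key] = int(hit[key])
        hit["bitscore"] = float(hit["bitscore"])
        hits.append(hit)
    return hits


def hit_to_fasta(hit, contigs):
    """Cut the hit region out of its contig and return it as a FASTA record."""
    start, end = sorted((hit["sstart"], hit["send"]))
    seq = contigs[hit["sseqid"]][start - 1:end]  # BLAST coordinates are 1-based
    strand = "+"
    if hit["sstart"] > hit["send"]:  # minus-strand hit
        seq = reverse_complement(seq)
        strand = "-"
    return f">{hit['sseqid']}:{start}-{end}({strand})\n{seq}\n"


# Hypothetical toy inputs, invented for illustration.
toy_contigs = {"contig_1": "ACGTACGTTTGACCA"}
toy_output = (
    "SRV1_RT\tcontig_1\t62.5\t2\t0\t0\t1\t2\t10\t5\t1e-5\t50.0\n"
    "SRV1_RT\tcontig_1\t40.0\t2\t0\t0\t1\t2\t1\t6\t0.01\t30.0\n"
)

top_hit = max(parse_tabular(toy_output), key=lambda h: h["bitscore"])
fasta_record = hit_to_fasta(top_hit, toy_contigs)
print(fasta_record, end="")
```

The resulting record can then be added to the reference RT alignment before building the tree.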
This approach is not limited to retroviruses - it can be used to investigate any sufficiently conserved genome feature.

e.g. in a 2020 paper with @SystemsVirology we used phylogenetic screening to investigate the diversity and evolution of APOBEC genes.

doi.org/10.1073/pnas.1…
BLAST can also be used to help discern the structure of genome features and loci.

e.g. the attached video shows a trick for finding the long terminal repeat (LTR) sequences that flank endogenous retrovirus (ERV) genomes.
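The video is not reproduced here, so the following is not the trick it shows. It is only a sketch of the underlying principle: the two LTRs of a provirus are direct repeats, so comparing the 5' half of a locus with its 3' half should reveal them. The toy locus is invented, and real LTRs diverge over time, so an exact-match search like this would be replaced by a similarity search (e.g. BLAST) in practice.

```python
def longest_shared_substring(a, b):
    """Return the longest exact substring present in both a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the match that ended at the previous pair of positions.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def find_terminal_repeat(locus):
    """Compare the two halves of a locus and return their longest shared repeat."""
    mid = len(locus) // 2
    return longest_shared_substring(locus[:mid], locus[mid:])


# Hypothetical toy locus: flank + LTR + internal region + LTR + flank.
toy_ltr = "TGTAGTCTCA"
toy_locus = "ACAC" + toy_ltr + "GGGCCCAAATTTGGG" + toy_ltr + "ATAT"

repeat = find_terminal_repeat(toy_locus)
print(repeat, toy_locus.find(repeat), toy_locus.rfind(repeat))
```

The two positions printed mark where the repeat starts at each end of the toy locus, which is the information needed to delimit the provirus.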
