Recently I learned something about DNA that blew my mind, and in this thread, I'll attempt to blow your mind as well. Behold: Chargaff's 2nd Parity Rule for DNA N-Grams.
If you are into cryptography or reverse engineering, you should love this.
Thread:
DNA consists of four different 'bases', A, C, G and T. These bases have specific meaning within our biology. Specifically, within the 'coding part' of a gene, a triplet of bases encodes an amino acid.
Most DNA is stored redundantly, in two connected strands. Wherever there is an A on one strand, you'll find a T on the other one. And similarly for C and G:
T G T C A G T
A C A G T C A
(note how the other strand is upside down - this matters!)
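For the programmers: the pairing above can be sketched in a few lines of Python. This is a minimal illustration, assuming strands are plain uppercase strings over A, C, G and T; the names `complement` and `reverse_complement` are my own, not from any particular library.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def complement(strand: str) -> str:
    """Return the base-by-base partner of a strand, in the same order."""
    return strand.translate(COMPLEMENT)

def reverse_complement(strand: str) -> str:
    """Return the other strand as it reads in its own direction."""
    return complement(strand)[::-1]

top = "TGTCAGT"
print(complement(top))          # ACAGTCA (the second row shown above)
print(reverse_complement(top))  # ACTGACA (that row read back to front)
```

The reversal is the 'upside down' part: the other strand is read in the opposite direction, so what you actually find on it is the reverse complement.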
If you take all the DNA of an organism (both strands), you will find equal numbers of A's and T's, as well as equal numbers of C's and G's. This is true by definition.
This is called Chargaff's 1st parity rule. en.wikipedia.org/wiki/Chargaff%…
Strangely enough, this rule also holds per strand! So even if you take away the redundancy, the numbers of A's and T's, and of C's and G's, are still about 99% equal *on each strand*. And we don't really know why.
This is called Chargaff's 2nd parity rule.
Lots of people have advanced theories for why the number of C's and G's should match up, but as yet no slam dunk explanation has been reported. But, hold on, things are about to get even weirder! academic.oup.com/bib/advance-ar…
It turns out the rule also holds for N-grams of bases! That is, as long as you both 'complement' and 'reverse' them. So for N=1, %C and %G are equal.
For N=2, this says that the percentage of CC (%CC) and %GG are also equal, as are %AG and %CT (complemented AND reversed), etc.
You can compare this to turning a book upside down and reading it back to front, and finding that all three-letter words occur with equal frequency before and after turning over the book.
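If you want to check this yourself, here is a minimal sketch of the N-gram comparison on a single strand. It assumes the strand is an uppercase string of A, C, G and T and counts overlapping N-grams; the function names are hypothetical, and this is not the code behind the plots in this thread.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(gram: str) -> str:
    """Complement every base, then reverse the order."""
    return gram.translate(COMPLEMENT)[::-1]

def ngram_frequencies(strand: str, n: int) -> dict:
    """Relative frequency of every overlapping n-gram on one strand."""
    counts = Counter(strand[i:i + n] for i in range(len(strand) - n + 1))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}

def parity_gaps(strand: str, n: int) -> dict:
    """Absolute frequency difference between each n-gram and its
    reverse complement, keyed by the alphabetically first of the pair."""
    freqs = ngram_frequencies(strand, n)
    gaps = {}
    for gram in freqs:
        rc = reverse_complement(gram)
        first, second = min(gram, rc), max(gram, rc)
        gaps[first] = abs(freqs.get(first, 0.0) - freqs.get(second, 0.0))
    return gaps

# Toy strand, not real genome data: AG and CT each occur once.
print(parity_gaps("AGCT", 2))  # {'AG': 0.0, 'GC': 0.0}
```

On a real chromosome you would read one strand from a FASTA file and look at the largest value in `parity_gaps(strand, 3)`. Chargaff's 2nd parity rule says that value should be tiny.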
For DNA triplets like 'AAA', it looks like this. On the left, in blue, is the frequency of 'AAA'; the orange bar on the right shows its reverse complement, 'TTT'. And so on for the other 31 triplet pairs (the 64 possible triplets form 32 reverse-complement pairs). The correspondence is stunning:
And here are the tiny tiny differences for each triplet, all smaller than 0.2%. Note that this plot shows data for _all_ known bacterial chromosomes:
So why is this the case? There are lots and lots of theories, but there is no consensus yet. And that is what makes it so super interesting!
At the very core of life hides a mystery, a mystery that is easy to research from a computer. And I hope that one day soon we'll know for sure what is going on!
/ends
• • •
This is huge news, but easy to miss. We don't all have the same DNA, but many genes exist in different versions. For example, we have a blood type because there are 3 different ABO genes around. Yet up to now the "downloadable human genome" was static, w/a single blood type! 1/
The currently downloadable human genome also appears in significant part to come from @JCVenter, who has done awesome things, but his DNA can't represent us all. Enter the pan-genome - a file format that can encode multiple variations for each point. 2/ berthub.eu/articles/posts…
In the modestly titled "A Draft Human Pangenome Reference", the @HumanPangenome consortium & many of the leading lights in DNA and bioinformatics software development, have published the DNA of 47 diverse individuals, all in a file format that is not a "string" but a graph! 3/
A fun decryption story! In 1914, The Netherlands sent a peace mission to Albania (I did not know this either). The mission commander, Major Lodewijk Thomson, was killed in battle under circumstances that are still unclear. And we'd love to know! en.wikipedia.org/wiki/Lodewijk_…
Recently (2009), an encrypted Albanian telegram from that time was found in Dutch military archives. Could this perhaps shed some light on the situation? Intriguingly, no one had ever been able to decrypt the message.
Dutch researcher Florentijn van Kampen, affiliated with Radboud University's iHub, decided to give it a try using modern cryptographic techniques. I mean, 1914 encryption, how hard could it be?! ecp.ep.liu.se/index.php/hist…
The SARS-CoV-2 genome has several genes (or ORFs).
Note in green the famous S spike protein. This is what all the vaccines contain or make in your cells. The green proteins are all "structural", so they end up as part of a new virus particle.
Source: chemistryworld.com/the-coronaviru… 2/
A long time ago, we thought one gene would always deliver one protein. Viruses are acknowledged MASTERS at efficiency, so they don't quite work like that. Note the orange '1a' and '1b' genes above, which are ORF1a and ORF1b below. Source: journals.plos.org/plospathogens/… 3/
Brief thread on how Molnupiravir works. This is the promising COVID-19 antiviral that appears to prevent 50% of hospitalizations/deaths, and maybe 100% of deaths, when given very early to high risk COVID-19 patients. merck.com/news/merck-and…
In general, many many things will stop a virus or a disease, as explained in this @xkcd comic. But that is not what we are looking for. We want something that stops a virus dead, but keeps us alive.
Some very good medicines do succeed in stopping a disease, but can't help but also impact us. This is the case for many antibiotics that are lethal to bacteria, but do gum up some of our own works, for example.
If you are into reverse engineering, the EU Galileo navigation satellites are currently transmitting a new signal that enables centimeter level accurate positioning. But! No description of this format has been released yet, even though the data is there & unencrypted. 1/2
Let me know if you want a dump of many hours of data. The data likely includes a distance vector that describes a correction to a satellite's position, plus a velocity vector, plus a time offset correcting the atomic clock, plus administrative details ('issue of data number') 2/2
I love the EU (honestly!), but I also love the Internet. Through the NIS 2 Directive, the EU is attempting to regulate each and every root server operator (RSO), even those outside of the EU. Doing so will have bad consequences. 1/6 berthub.eu/articles/posts…
There are 12 RSOs. There are over 1300 active root servers. None of these RSOs are 'providers of essential services' individually. Up to 11 of them could fail, and nobody would notice. In 40 years, "the root" has never gone down. 2/6
By attempting to regulate the core of the Internet, the EU risks opening up a Pandora's box: many other governments would like to follow suit. The EU itself has advocated for the current multi-stakeholder "governance" model, in lieu of government action. 3/6