333 - Genetic Coding

Wouldn't it be wonderful if we could read and write in God, the code for biology? It would be good enough if we could read it, and we're getting steadily closer.

This is an experimental entry, to see if there is enough material to justify a whole essay number....

DNA is coded in ACGT, RNA in ACGU, swapping uracil for thymine.

Since there are four relevant bases, and they can be read in pairs or in triples, the possible combinations number 4 × 4 = 16 for pairs and 4 × 4 × 4 = 64 for triples. Three of the 64 triplets have been identified as 'stop codons', equivalent to End in programming (these days called coding, a word I wish to avoid here, along with unnecessary confusion). Being me, I'd immediately translate the 16 pairings into hexadecimal, from aa=0 to tt=F. However, the triplet thinking works best left in base four or in base 64.
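The counting is easy to check. What follows is a minimal sketch in Python (my choice of language, nothing to do with the sources); the ordering a, c, g, t and the names `BASES` and `pair_to_hex` are assumptions, picked so that aa comes out as 0 and tt as F.

```python
from itertools import product

BASES = "acgt"  # assumed ordering, so that aa = 0 and tt = F

# Every two-letter and three-letter combination of the four bases.
pairs = ["".join(p) for p in product(BASES, repeat=2)]
triplets = ["".join(t) for t in product(BASES, repeat=3)]


def pair_to_hex(pair):
    """Read a two-letter group as a two-digit base-4 number, written as one hex digit."""
    value = BASES.index(pair[0]) * 4 + BASES.index(pair[1])
    return format(value, "X")


print(len(pairs), len(triplets))             # 16 64
print(pair_to_hex("aa"), pair_to_hex("tt"))  # 0 F
```

A triplet is a three-digit base-4 number in the same way, which is why it sits more naturally in base four or base 64 than in hexadecimal.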

Of these 64 triplets, we now recognise, in mRNA, AUG as a start codon and UAA, UAG and UGA as stop codons. In DNA, swap the U for a T: ATG, TAA, TAG and TGA respectively.
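As a sketch of what reading between a start and a stop might look like, assuming nothing cleverer than letter substitution and a scan in steps of three: the function names and the test string below are invented for illustration, and real transcription is a good deal more involved than replacing T with U.

```python
START = "AUG"
STOPS = {"UAA", "UAG", "UGA"}


def dna_to_mrna(dna):
    """Rewrite a DNA string in the RNA alphabet (T becomes U)."""
    return dna.upper().replace("T", "U")


def mrna_to_dna(mrna):
    """The reverse substitution (U becomes T)."""
    return mrna.upper().replace("U", "T")


def read_frame(mrna):
    """Triplets from the first AUG up to and including the first stop in that frame."""
    begin = mrna.find(START)
    if begin == -1:
        return []
    codons = []
    for i in range(begin, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        codons.append(codon)
        if codon in STOPS:
            break
    return codons


# A made-up 13-letter string, not a real sequence.
print(read_frame(dna_to_mrna("ccATGgcaTAAgg")))  # ['AUG', 'GCA', 'UAA']
```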

I looked to see what is available (not at all the same as what is known) on just coronavirus; [5] shows there is a large amount of data. Really large: processing it requires specialist high-volume tools, not the odd spreadsheet. The Sequence Read Archive (SRA) alone has 160k runs, and there are 47k nucleotide records.

For example, the reference at [6] on the new variant, which I think has the label B.1.1.7, indicates (I think it does this) the degree to which the coronavirus mutates (Fig 2), but not in a unit I understand. In terms of nucleotide changes within the genetic code, we might expect one or two per month (but surely that assumes a relatively constant size of infected population?).

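1  Adenine, Cytosine, Guanine, Thymine or Uracil, generally called bases (nucleobases); a nucleotide is a base attached to a sugar and a phosphate. The difference between T and U is that thymine carries a methyl group where uracil has a hydrogen atom, on the right of the two red graphics.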

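Of the 64 possible ACGT triplets, 61 specify an amino acid and the other 3 are the stop codons; we might say that that's how we identify these triplets as amino acids. Several triplets specify the same amino acid, so the 61 cover only 20 standard amino acids. The figure of 24 that I used in essay 332 looks to be wrong.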
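A quick tally of the standard genetic code shows where the 61 comes from. This is a sketch only: the 64-letter string is the usual compact form of the standard table (one-letter amino acid codes, `*` for stop, bases taken in the order T, C, A, G), and the variable names are my own.

```python
from itertools import product

BASES = "TCAG"  # conventional ordering for the compact standard table
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each DNA triplet to its one-letter amino acid code ('*' marks a stop).
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

stops = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
sense = [c for c, aa in CODON_TABLE.items() if aa != "*"]
amino_acids = {aa for aa in CODON_TABLE.values() if aa != "*"}

print(len(CODON_TABLE), len(sense), len(amino_acids), stops)
# 64 61 20 ['TAA', 'TAG', 'TGA']
```

So 61 triplets share 20 amino acids between them, which is the redundancy usually described as the code being degenerate.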

In DNA, the pairings are AT/TA and CG/GC only; these are called base pairs, and a gene might be tens of thousands of base pairs long.
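Because the pairing is fixed, one strand determines the other. A minimal sketch, with function names of my own choosing; the reversed form is there on the assumption that the partner strand is conventionally read in the opposite direction.

```python
# A pairs with T and C pairs with G, in both directions.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand):
    """The partner base for each position, in the same order."""
    return strand.upper().translate(COMPLEMENT)


def reverse_complement(strand):
    """The partner strand as it would be read from its own starting end."""
    return complement(strand)[::-1]


print(complement("ATGC"))          # TACG
print(reverse_complement("ATGC"))  # GCAT
```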

From essay 332:

I have been wondering further about biological coding. There are four bases and they come in pairs, which suggests pairs might be as many as 16, and so could be represented as hexadecimal. Do all pairings occur? [Apparently not, only 24 of the possible 4-character groups occur as proteins, so this could be packed far more densely in base 12 or 24 or in three letter groups. Suppose we used capital letters A-Z less the I and O? Would one leave these available as some sort of error-catching? How long a series of characters remains the same in a copy and what is the variability in copying? If a virus copies itself a lot, what sort of proportion is a mutation, or is what matters the proportion that is mutation that copies and propagates? How many characters differ for this to be called the same virus, or is the chain length the determining factor? Can I find the genetic code published? Yes, published in ten char groups in sets of six, such as tctaaacttt aagaatggca gttgcttatg cagacaagcc taatcacttt atcaattttc, which would reduce to DC07F 083A4 BE7CE 48425 C347F 343FD, using capitals so as not to confuse a&g with A&G. I looked at the alternative of coding in groups of three but almost immediately discovered that these first 60 characters of a cattle coronavirus don't fit the 24-character set. More than that, I didn't find any run of four characters that belongs in the 24-character set. Does that tell us something? Small sample, of course?]
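The hexadecimal reduction can be done mechanically. This is a sketch under the same assumption as before (a=0, c=1, g=2, t=3, so aa=0 and tt=F); `to_hex_groups` is a name I have made up. Note that ct works out as 1 × 4 + 3 = 7, so the first group comes out as DC07F.

```python
BASES = "acgt"  # assumed ordering, so that aa = 0 and tt = F


def to_hex_groups(text):
    """Encode each space-separated group two letters at a time, one hex digit per pair."""
    groups = []
    for group in text.split():
        digits = []
        # Step through the group in pairs; a trailing odd letter would be ignored.
        for i in range(0, len(group) - 1, 2):
            value = BASES.index(group[i]) * 4 + BASES.index(group[i + 1])
            digits.append(format(value, "X"))
        groups.append("".join(digits))
    return " ".join(groups)


sample = "tctaaacttt aagaatggca gttgcttatg cagacaagcc taatcacttt atcaattttc"
print(to_hex_groups(sample))
# DC07F 083A4 BE7CE 48425 C347F 343FD
```

Sixty letters become thirty hex digits, which is the halving one would expect from packing two base-4 digits into each base-16 digit.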

If you thought ACGT would make a good acronym, you've long been beaten to it. See here.

