Wednesday, November 15, 2006

The specification of proteins - part 2

For part 1, see below.

I said below that, in considering how specified Cytochrome C is, “we need to determine what the probability is of a random sequence of DNA coding for Cytochrome C, rather than what the probability is of a random polypeptide being Cytochrome C”. So let's start with Cytochrome C for a horse – the first line in the table referenced before. The sequence of amino acids starts:

GDVEKGKKIFVQKCA ...

and ends:

... KKTEREDLIAYLKKATNE[Stop]

Each amino acid can be coded by one or more DNA codons. (Incidentally, I will show my ignorance by pointing out that I am working on the basis that Cytochrome C is encoded in normal nuclear genes, rather than mitochondrial genes. I understand this to be the case – see here. However, even if Cyt C were encoded in the mitochondria, the principles discussed here could be reworked to apply to this.) Given that there are 64 codons and 21 different items encoded (20 amino acids and the stop signal), there is an average of about 3 codons per item encoded. But, with a table of the genetic code, we can be more precise than this. Four codons can encode the first G (glycine) in the polypeptide. Two can then encode the next D (aspartic acid) and so on, through to the gene termination, which can be one of three codons.

The entire gene for horse Cytochrome C, then, is encoded by a sequence of 104 codons. There are 64^104 possible sequences of codons – that is, about 7 x 10^187 – but by multiplying together the numbers of possible codons that would encode the given amino acid in each position, we discover that there are about 2 x 10^45 codon sequences that would encode exactly this sequence of amino acids. So the probability of any given sequence of 104 codons encoding precisely the sequence for horse Cytochrome C is the second number divided by the first – that is, about 3.5 x 10^-143.
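For anyone who wants to check this sort of arithmetic, here is a minimal Python sketch of the calculation. It assumes the standard genetic code, and the names (`DEGENERACY`, `encodings`, `probability`) are mine, made up for illustration. Since only the two ends of the horse sequence are quoted above, the sketch runs on the opening fragment rather than the full gene, so it does not reproduce the 2 x 10^45 figure.

```python
from collections import Counter
from fractions import Fraction

# Standard genetic code, codons in TTT, TTC, TTA, TTG, TCT, ... order
# (bases ordered T, C, A, G); '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Number of codons that encode each amino acid (or the stop signal).
DEGENERACY = Counter(AMINO)

def encodings(peptide):
    """Count the codon sequences that encode exactly this peptide."""
    total = 1
    for residue in peptide:
        total *= DEGENERACY[residue]
    return total

def probability(peptide):
    """Chance that a random codon sequence of the same length encodes it."""
    return Fraction(encodings(peptide), 64 ** len(peptide))

# Opening fragment of horse Cytochrome C, as quoted in the post.
start = "GDVEKGKKIFVQKCA"
print(DEGENERACY["G"], DEGENERACY["D"], DEGENERACY["*"])  # 4 2 3
print(encodings(start))
print(float(probability(start)))

# Size of the whole search space for 104 codons.
print(f"{64 ** 104:.1e}")
```

Feeding in the complete 104-residue sequence (plus `*` for the stop, if you count it as a position) would give the figures for the whole gene.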

It is worth noticing that this is 13 orders of magnitude less probable than that a random sequence of 103 amino acids would turn out to be horse Cytochrome C, and it would be interesting to know whether this is generally the case (that is, whether the amino acids used in proteins are more often those encoded by fewer than the average number of codons).

However, this isn't the whole story (“But that is not all, no that is not all”). We have 113 different versions of Cytochrome C, and we now need to consider what effect these other versions have on what we can say about the specification of this protein. We can continue with the amino acid sequence for a zebra – sequence number 25 in the table. This differs in just one position from that of the horse – at position 47, it has serine (S) rather than threonine (T). Here is where assumptions start to become important. If we assume that this is a neutral substitution, and is simply evidence of evolutionary divergence, then we can conclude that we could now have any one of ten codons at position 47 – one of the six for serine, or one of the four for threonine. This single change more than doubles the probability of a random sequence of codons encoding Cytochrome C (ten acceptable codons where there were four, a factor of 2.5) – albeit only to the unhopeful-looking probability of about 8 x 10^-143, but we can start to see the direction this will move in.
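The effect of allowing an alternative amino acid at one position can be sketched in a few lines of Python. This is only an illustration under the neutral-substitution assumption made above; the function name `position_gain` is hypothetical, and the codon counts for serine and threonine are those of the standard genetic code.

```python
from fractions import Fraction

# Codons per amino acid in the standard genetic code (only the two needed here).
CODON_COUNTS = {"S": 6, "T": 4}

def position_gain(original, allowed):
    """Factor by which the probability rises when a position that had to be
    `original` may instead be any amino acid in `allowed`."""
    before = CODON_COUNTS[original]
    after = sum(CODON_COUNTS[aa] for aa in set(allowed) | {original})
    return Fraction(after, before)

# Position 47: threonine in the horse, serine in the zebra.
gain = position_gain("T", "ST")
print(gain)  # 5/2

# Applied to the horse-only figure quoted earlier in the post.
print(float(gain) * 3.5e-143)
```

Each further position where the table shows an alternative would contribute its own factor of this kind, and the factors multiply together.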

To be continued ...