Monday, November 13, 2006

The specification of proteins - part 1

Part of the problem in the scientific debate between ID proponents and opponents is that much of it is conducted in relation to intractable universal problems. The debate then tends towards a Pythonesque “Yes it does”, “No it doesn't” series of contradictions. I am keen to try to apply some of the ideas to more tractable, specific problems, and to use the principles learned from working on these to extend the debate to areas where more definite conclusions can be drawn.

Proteins are specified. That is to say, as we find them today, they aren't simply random sequences of amino acids. The information that they incorporate allows them to express functionality that is of use to a cell or to an organism. Take Cytochrome C, for example. It has a functional specification – a substantial one. Its function can be described in terms of what it achieves for an organism – its Wikipedia entry gives detail about this. It can also be functionally described in terms of the low-level biochemical reactions in which it takes part.

Since Cytochrome C is a protein, it is also made up of a specific sequence of amino acids. Actually, this isn't strictly true. The sequence of amino acids that makes up Cytochrome C is different in different organisms. A table listing different amino acid sequences for Cytochrome C in 113 different species can be found here.

Just how much specified information does Cytochrome C contain in the sequence of its amino acids? If we know how much information it contains, then we can calculate how likely it is that Cytochrome C would appear by chance – the probability that a random polypeptide would happen to be Cytochrome C (or close enough to be useful for natural selection). And if we can determine this, we can say how reasonable the chance hypothesis is as an explanation for the initial appearance of Cytochrome C.

If there were only one sequence of 100 amino acids that was universally used to code for Cytochrome C, the probability of it (20 possible amino acids, 100 places in the chain) appearing as a random polypeptide sequence could be simplistically expressed as 1 in 20^100 – that is, about 10^-130. This probability is low – but it is above Dembski's UPB, and much more significantly for opponents of ID, it substantially overestimates the specification of Cytochrome C even within our understanding.
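As a rough check on the arithmetic, here is a minimal Python sketch of the simplistic calculation. The chain length of 100 and the 20 amino acids come from the paragraph above. The value used for Dembski's UPB (1 in 10^150) is the commonly quoted figure and is my assumption here, since the post doesn't state it. The function and constant names are made up for illustration.

```python
import math

NUM_AMINO_ACIDS = 20   # from the text: 20 possible amino acids
CHAIN_LENGTH = 100     # from the text: 100 places in the chain
UPB_LOG10 = -150       # assumption: commonly quoted value of Dembski's UPB (10^-150)


def log10_single_sequence_probability(alphabet_size: int, length: int) -> float:
    """log10 of the chance that a uniformly random chain matches one fixed sequence."""
    return -length * math.log10(alphabet_size)


def information_bits(alphabet_size: int, length: int) -> float:
    """Information, in bits, needed to pick out one fixed sequence of this length."""
    return length * math.log2(alphabet_size)


log10_p = log10_single_sequence_probability(NUM_AMINO_ACIDS, CHAIN_LENGTH)
bits = information_bits(NUM_AMINO_ACIDS, CHAIN_LENGTH)

print(f"log10(probability) = {log10_p:.1f}")
print(f"information        = {bits:.1f} bits")
print(f"above the assumed UPB: {log10_p > UPB_LOG10}")
```

Working in logarithms avoids underflow and makes the comparison with the UPB a simple comparison of exponents. Note that this treats every amino acid as equally likely at every position and assumes exactly one acceptable sequence, which is the simplification the paragraph above already flags.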

But to make sure that we get our foundations right, it's necessary to see that we have already gone wrong at this point, since we are not properly considering the reference frame. The reference frame is actually not simply the sequence of amino acids that make up Cytochrome C, but the genetic coding of these amino acids. The organism doesn't record Cytochrome C as a polypeptide, but as a DNA sequence. So we need to determine what the probability is of a random sequence of DNA coding for Cytochrome C, rather than what the probability is of a random polypeptide being Cytochrome C.
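To see why the choice of reference frame changes the numbers, here is a small Python sketch comparing the two probabilities for a toy peptide. The codon counts are those of the standard genetic code. The peptide "MKWL" is an arbitrary example of mine, not a fragment taken from the Cytochrome C table, and the function names are hypothetical. The sketch assumes uniformly random nucleotides and a fixed reading frame, and ignores start and stop codons.

```python
from fractions import Fraction

# Number of codons for each amino acid (one-letter codes) in the standard genetic code.
CODON_COUNTS = {
    "L": 6, "S": 6, "R": 6,
    "A": 4, "G": 4, "P": 4, "T": 4, "V": 4,
    "I": 3,
    "F": 2, "Y": 2, "H": 2, "Q": 2, "N": 2, "K": 2, "D": 2, "E": 2, "C": 2,
    "M": 1, "W": 1,
}


def encoding_dna_count(peptide: str) -> int:
    """Number of distinct DNA sequences (3 bases per residue) that encode the peptide."""
    count = 1
    for amino_acid in peptide:
        count *= CODON_COUNTS[amino_acid]
    return count


def random_dna_probability(peptide: str) -> Fraction:
    """Chance that a uniformly random DNA string of the right length encodes the peptide."""
    return Fraction(encoding_dna_count(peptide), 4 ** (3 * len(peptide)))


def random_polypeptide_probability(peptide: str) -> Fraction:
    """Chance that a uniformly random amino acid chain of the same length is the peptide."""
    return Fraction(1, 20 ** len(peptide))


toy_peptide = "MKWL"  # arbitrary illustrative peptide, not real Cytochrome C data
p_dna = random_dna_probability(toy_peptide)
p_peptide = random_polypeptide_probability(toy_peptide)

print("DNA sequences encoding the peptide:", encoding_dna_count(toy_peptide))
print("P(random DNA encodes it)      =", float(p_dna))
print("P(random polypeptide is it)   =", float(p_peptide))
```

The two probabilities differ because the genetic code is redundant and uneven: some amino acids have six codons and others only one, so which amino acids a protein uses affects how likely a random DNA sequence is to encode it.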

It's also important to point out at this stage that we have not derived the reference frame. To understand this, consider the fact that the words of this post are an improbable sequence of letters that convey information. But they only convey information given the pre-existing reference frame of the English language (ignoring the additional layers of complexity which are represented by the medium on which this is being read) – they say nothing about how the English language came about in the first place. The information required for Cytochrome C to be present in an organism is the DNA sequence that encodes it in the genes of the organism, but it is also the reference frame which includes the mechanism to convert the DNA sequence into a protein. The task of darwinism – or any “ism” that addresses the issue of origins – isn't only to explain the appearance of Cytochrome C (for example), but also to explain the presence of the reference frame which allows Cytochrome C to be encoded and manufactured on demand.

The question is more subtle even than this. For example, the darwinian presumption is likely to be that the 113 different Cytochrome C sequences enumerated in the table above are functionally identical, and the differences in amino acid sequence simply represent evolutionary divergence. However, it is conceivable that rather than being functionally identical, each version of Cytochrome C is actually specific to the species in which it is found – that the reference frame isn't simply a generic DNA coding and expression framework, but is the specific organism in the case of each protein. This is perhaps unlikely for a relatively simple protein like Cytochrome C, but may be more relevant for complex and specific proteins. This issue is at the heart of the ID objection to many of the co-option scenarios that are proposed to explain the appearance of complex biochemical systems – that it is an unjustified darwinian assumption that proteins can arbitrarily be re-used or re-located within an organism, ignoring the reference frame.

However, these issues can be put aside for now, as long as they don't disappear off the radar indefinitely.

To be continued ...