Friday, November 17, 2006

The specification of proteins - part 3

The table referred to in my earlier posts lists cytochrome c sequences for 113 different species, all aligned as far as possible against horse heart cytochrome c as the reference sequence. With the sequencing and alignment already done, the same process can be applied to every position of every cytochrome c sequence in the table.

Incidentally, I would like to point out that I made one change, in sequence 21 (Ceratotherium simum) – at position 48, there was a series of amino acids that seemed to be incorrectly aligned with the reference sequence. Inserting two empty locations before the sequence DANKNKG, and removing two of the empty locations afterwards, gives better alignment.

So the list of species provided was rematched with the amino acid sequences in the table, and then the amino acid sequences were “exploded” into individual columns, giving a table with 113 rows and 118 columns, each containing either a letter or a hyphen. I then counted the number of each different amino acid (letter) in each column. Each amino acid can, as was discussed in the previous post, be encoded by one or more codons. So, if I know which particular amino acids can be present at a location, and I know how many codons encode these amino acids, then I can work out how many of the 64 available codons could work at each position. Thence, I can determine an estimate of the probability that, given a random sequence of DNA, it will code for a functional cytochrome c protein.
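The counting step can be sketched roughly as follows. This is only an illustration, not the spreadsheet work actually used: the function name and the toy alignment are made up, and the codon counts are those of the standard genetic code (61 sense codons, with the 3 stop codons excluded).

```python
# Number of codons encoding each amino acid in the standard genetic code.
CODONS_PER_AMINO_ACID = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def valid_codons_per_column(aligned_rows):
    """For each alignment column, add up the codons of every distinct
    amino acid observed there. Hyphens (gaps) are ignored, so a column
    containing only gaps yields 0."""
    counts = []
    for column in zip(*aligned_rows):
        residues = set(column) - {"-"}
        counts.append(sum(CODONS_PER_AMINO_ACID[r] for r in residues))
    return counts


# Hypothetical three-row alignment, NOT taken from the real table.
toy_alignment = ["GDVEK", "GDAEK", "GN-EK"]
print(valid_codons_per_column(toy_alignment))  # [4, 4, 8, 2, 2]
```

In the toy example the third column contains V and A (4 codons each), so 8 of the 64 codons would be counted as "working" at that position.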

Discarding the first 9 places in the table – in other words, accepting position 1 on the horse heart cytochrome c as the first significant position – and the last 2 places, which generally aren't part of the amino acid sequences, I can multiply up the total number of valid codons for each position, to give the total number of possible cytochrome c sequences, assuming that any amino acid that works at a location can be put there to make a viable cytochrome c protein. This comes to about 1.5x10^112. There are 106 places under consideration: 64^106 is 2.8x10^191.

The proportion of valid cytochrome c sequences in this domain space is the ratio of these – that is, 1 in 1.9x10^79.
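For anyone who wants to check the arithmetic, here is a minimal sketch that works in base-10 logarithms so the very large numbers stay manageable. The helper name is hypothetical; the figures fed in at the end are simply the ones quoted above.

```python
import math


def log10_proportion(codon_counts):
    """log10 of (product of valid codons per position) / 64**positions."""
    log_valid = sum(math.log10(c) for c in codon_counts)
    log_total = len(codon_counts) * math.log10(64)
    return log_valid - log_total


# Check the quoted figures: 64^106 against roughly 1.5x10^112 valid sequences.
log_total = 106 * math.log10(64)   # about 191.46, i.e. 2.8x10^191
log_valid = math.log10(1.5e112)    # about 112.18
print(round(log_total - log_valid, 1))  # 79.3, i.e. about 1 in 1.9x10^79
```

The function takes a list of per-position codon counts, so it can be applied directly to any such list rather than to the rounded totals.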

Let's unpack the significance of this a little more. I have not assumed that only one cytochrome c sequence is valid – a challenge directed at many ID proponents. Neither have I assumed that only the 113 given cytochrome c sequences are valid. I have assumed that, if an amino acid appears at a given position in any of these cytochrome c sequences, then that is a “possible answer”. This analysis allows me to construct a very large number of possible cytochrome c sequences, only 113 of which happen to constitute the table, and all of which I am assuming would be functional. Despite this, the proportion of valid cytochrome c sequences in the domain space of 106 amino acid polypeptides is of the order of 1 in 10^79. For reference, the total number of atoms in the earth is around 10^50.

The range of species covered by this survey is very large – everything from humans to rice to Saccharomyces. It is likely that additional species would add to the number of conceivable cytochrome c combinations, by showing that different amino acids would work at positions not covered already. However, given how widely the net has been cast with this approach, and the range of species considered, my hunch is that the increase would not be more than a few orders of magnitude.

However, this can be investigated. This process could be carried out omitting several of the sequences, and seeing what effect this has on the ratio. Or, if other candidate sequences of cytochrome c are available, they could be added, again determining the effect that this has.
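One way such a check might be set up is sketched below. It is an assumption-laden illustration rather than an analysis that has been run: the function names, the random sampling scheme, and the choice to skip columns left with only gaps are all my own, and the alignment rows would need to be supplied from the table.

```python
import math
import random

# Codons per amino acid in the standard genetic code.
CODONS = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def log10_valid(rows):
    """log10 of the number of sequences allowed by the observed residues.
    Assumption: columns that contain only gaps are skipped."""
    total = 0.0
    for column in zip(*rows):
        residues = set(column) - {"-"}
        if residues:
            total += math.log10(sum(CODONS[r] for r in residues))
    return total


def leave_out_effect(rows, n_omit, trials=100, seed=0):
    """Mean drop in log10(valid sequences) when n_omit rows are removed
    at random, averaged over a number of trials."""
    rng = random.Random(seed)
    full = log10_valid(rows)
    drops = []
    for _ in range(trials):
        kept = rng.sample(rows, len(rows) - n_omit)
        drops.append(full - log10_valid(kept))
    return sum(drops) / trials
```

A mean drop of, say, 2 would mean that omitting that many sequences shrinks the count of valid sequences by about two orders of magnitude, which gives a feel for how much further species might be expected to add.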

To consider this from a naturalistic perspective, we can assume that, given the key role that cytochrome c has within cells, and given its ubiquity, selection pressures on it would be strong, and that over billions of years of evolutionary history this would have allowed it to arrive at a highly specified form. It is possible to argue that, this being so, all versions of cytochrome c that we observe in the world today are far more specified than would have been necessary in the most primitive organisms. In fact, this analysis is also useful from a naturalistic perspective. By determining how specified (improbable) cytochrome c is today, and by estimating the probabilistic resources available in early stages of evolutionary history, we can calculate how effective evolutionary processes are in improving the specification of proteins. This sort of analysis would be very useful if Darwinists wish to move away from "hand-waving" explanations towards a solid empirical foundation for their beliefs.