Introns, Exons, and So-ons
(Part I)

Summary of the new theory:

We're talking introns-early here. Eukaryote genes formed in the primordial pond from random DNA before prokaryote genes. Since exons in eukaryote DNA are much shorter than an entire gene, exons have a much higher probability of being found in random DNA. The DNA of eukaryotic organisms is random, while the DNA of prokaryote organisms is not -- the introns and some stop codons are absent in prokaryotes, thus making the distribution of stop codons non-random.

From Dr. Senapathy:

"An organism is built and maintained primarily by the actions of proteins coded by genes in the organism's genome. Superficial probabilistic assessments of whether a gene coding for a specific protein could simply occur by chance in the primordial pond have been profoundly discouraging. But these calculations fail to account for several significant characteristics of genes, described in Chapter 7*, that actually make their occurrence highly probable. In fact, these principles of genes cumulatively make it inevitable that a given gene sequence that can code for a specific protein would have been available in the universal sequence pool (USP). Since the expected mean length of the random sequence is the same for any given gene with typical characteristics, almost any gene coding for almost any protein sequence will occur within this expected mean length of the USP."

* amino acid degeneracy in proteins, codon degeneracy in genes, and the ease of finding short exons in random DNA.

"We should note that all genes occurring directly in the USP were split into exons and introns -- typical of eukaryote genes. Finally, the notion that the very first cells must have been complex**, with nuclei -- typical of today's eukaryotic cells -- shows that these cells could have been formed directly from the primordial pond." (page 290)

** "Computer analysis of DNA sequences reveals that the very first genes in the primordial pond were split into coding (exons) and intervening (intron) sequences." (page 230)

Also, see the bold quotations below.

Discussion:

Topics:

split genes, introns and exons
stop codons distributions

[in1-1]

From Keith Robison: (quoting Dr. Senapathy) "In this context it should be noted that there are only three competing theories as to how [split] genes have originated on earth."

I don't really wish to argue this point, but it is curious that Senapathy has left out the other major explanation for split (intron-bearing) genes: that introns were inserted "late", after the divergence from a common ancestor. I don't suppose this would have to do with his statement:

"Finally it explains the absence of any correspondence between the domains of the proteins and the exons of the genes, exactly as shown in the recent study reported in the journal Science by Ford Doolittle's group."

Of course, this is also a prediction of "introns-late"! (but before I am branded a heretic,** I'll shut my mouth :-)

** -- Wally Gilbert is my advisor

[in1-2]

From Steve LaBonne: (quoting Dr. Senapathy) Ford Doolittle and colleagues however now say that introns may have been inserted into contiguously formed genes, and support what is called the introns-late premise. But this premise also is untenable, because there is no logical basis for it...

Steve: What does logic have to do with it? Either it happened or it didn't. Now, it's clear that at least some self-splicing introns are rather old (viz. the Group I intron in the cyanobacterial/chloroplast leucine tRNA [UAA anticodon]). But there is now a perfectly good proposed mechanism for the late appearance of spliceosomal introns: to wit, that they have evolved from Group II introns that moved from organelles to nuclei after the endosymbiotic origin of organelles. I don't say this is proven, though there is now a lot of evidence supporting it, but it is absurd to say there is "no logical basis" for introns-late.

And why Senapathy seems to think that this particular debate is the key to the origin of life is quite beyond me. The aforementioned ancient Group I intron may well go back to the origin of the cyanobacteria; that's indeed impressively old, but it's still a long way from the origin of life.

JM: It is key because it goes to the probability of finding genes in the primordial pond. Long genes (not the watch company) would be nearly impossible to find, but genes that are broken into pieces (exons/introns) would not only be easy to find, but are inevitable. Have you read his book? If not, that is the way to understand his theory.

Steve: In that case, the preponderance of evidence for the late (postsymbiotic) appearance of spliceosomal introns, assuming it holds up (personally I believe it will), is in itself sufficient to torpedo Senapathy's theory. I'm afraid you can't have your cake and eat it too!

Now it is apparent to me why Senapathy had to be so airily dismissive of the evidence for the lateness of (spliceosomal) introns. If firmly established, then on your account introns-late would suffice to refute Senapathy's theory. Which means, of course, that Senapathy needs to address the now widely accepted scenario for the evolution of spliceosomal introns from Group II self-splicing introns that escaped from mitochondrial genomes (Cavalier-Smith, T. [1991] Trends Genet. 7: 145-148). Note that much biochemical evidence, published after Cavalier-Smith's proposal, supports the key contention that the mechanism of splicing in Group II and spliceosomal introns is extremely similar, and that the snRNAs in the spliceosome are equivalent to trans-acting bits of a Group II intron active site (i.e. the spliceosome + intron system is essentially a highly fragmented version of a Group II intron). Furthermore, what could be interpreted as early stages of such a fragmentation process have actually been observed in organelle genomes (Bonen, L. [1993], FASEB J. 7: 40-46). Also, at the time of Cavalier-Smith's original proposal, Group II introns were known only in chloroplasts and mitochondria but not in their respective cyanobacterial and purple-bacterial ancestors; that piece of the puzzle was also subsequently filled in (Ferat, J.-L., and Michel, F. [1993] Nature 364: 358-61).

Since introns-late pulls the rug out from under Senapathy's fundamental argument, I would be interested in seeing his response to this body of work.

[in1-3]

From Periannan Senapathy: If you are interested in the topic concerning the origin of introns and split-genes, an article I have written on this topic has been published in this week's Science (2 June 95). I have made available a copy of this article and two other accompanying articles, a debate concerning the origin of introns and protein-coding genes, in the web page:

    http://www.genome.com/ibo/science.htm

I think that this will answer many of the questions that people have asked recently in s.b.e. regarding the origin of genes. I will soon post some replies to the comments that have appeared here recently on my theory.

Periannan Senapathy
Genome International

[in1-4]

From Keith Robison: In his letter to Science, Senapathy baldly states that eukaryotic exons have "an upper limit of 600 nucleotides (with rare exceptions)"

I have a dataset from GenBank 70 (about 4 years old) with which to check this claim. Caveats: Only the coding regions were included -- the 5' and 3' exons are truncated by the length of the non-coding region. Also, some exons may be misclassified as (5'+3') if the first and/or last exon wasn't recorded in the GenBank entry.

                      size             >600nt
    Exons         N    mean   stdev     N     %
    --------  -----  ------  ------   ---  ----
    All       16525  148.08  240.37   406  2.46
    Internal  10843  161.93  223.01   214  1.97
    5' + 3'    5682  121.65  268.45   192  3.38

Now, "rare exceptions" lacks a quantitative definition, so I will leave it up to the reader to decide whether 2.5% or even 2.0% counts.
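The percentages in the table follow directly from the counts. A minimal Python sketch that recomputes them (the dictionary and function names are made up for illustration; the numbers are the ones in the table above):

```python
# Counts copied from the table: (total exons, exons longer than 600 nt).
EXON_COUNTS = {
    "All": (16525, 406),
    "Internal": (10843, 214),
    "5' + 3'": (5682, 192),
}


def percent_over_600(total, over):
    """Share of exons longer than 600 nt, as a percentage."""
    return 100.0 * over / total


for name, (total, over) in EXON_COUNTS.items():
    print(f"{name:9s} {percent_over_600(total, over):.2f}%")
```

The Internal and 5' + 3' rows also add up to the All row, both in total exons and in exons over 600 nt.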

[in1-5]

From Alix Martin: A large proportion of the human genome consists of introns. These sequence parts do not contribute to the making of the human organism, as no proteins are created from these sequences. Yet human cells spend energy replicating them. There is no obvious short-term usefulness for introns. I will here suggest long-term factors that justify their presence in our genome. One particular aspect is that the existence of non-coding DNA sequences might be necessary to allow macro-mutations during the evolution process.

As I'm no professional biologist, these ideas might have been around before, or might even be totally flawed. However, as mixing different scientific backgrounds is often a useful process in science evolution, I'll throw them in.

The Darwinian evolution theory has done a lot to explain evolution mechanisms. However, some people do not consider it as fully satisfying. See for instance Mark Ludwig's work on virtual evolutionary environments, simulating Darwinian processes in a computer. [Wired 3.02/Computer Viruses, Artificial Life, and Evolution].

Along the evolution process, changes that are not purely incremental appear. For instance, fish grow feet, or humans grow wings. ;-) My intuition is that for such a large change to occur, radically new proteins are needed to drive the animal's morphogenesis. These proteins need to be coded by a DNA sequence. On a Darwinian basis, the appearance of such proteins is linked to mutations in the DNA sequence introducing a new allele in the species' gene pool that corresponds to the new protein. As mutations are rare, they are likely to occur one by one. My intuition is that important evolutionary steps need new proteins that differ from the old ones by more than one amino-acid. (I call this a macro-mutation.) Because mutations are rare, there is a need for an evolutionary pathway between the proteins coded in the non-mutant species' gene pool and the new protein. Each step in this pathway is a one codon mutation in the DNA sequence coding the protein. If the sequence is an exon (a coding one), the mutant corresponding to each step must be a viable one, and the allele needs to survive until the next mutation step occurs. I see this as an evolutionary tunnel until the useful sequence, coding for a useful protein, is attained. For me, it is unlikely that such a tunnel can be crossed without generating a freak at one of the intermediate steps. One argument that can lead to thinking that the tunnels are more than one mutation long is that if a single mutation was sufficient to lead to a useful change, it would happen quickly and take over the entire species. For me, this is what happens when humans grow taller, not when fish start walking. For me, there are two different time scales here.

A common practice in computer programming is to comment pieces of code that were useful at one time but are not any more. Not throw them away, but keep them as comments, even if they do not contribute to the program's function any more, as they might be useful again in another context or at a further step. Perhaps introns are just nature's way of giving genetic code a comment status. In computer programming, special signs delimit informative parts of code, like /* COMMENT */ in C. Similarly, there are specific sequences of DNA that mark the beginning and the end of an intron.

There is no selection pressure on the portions of genetic code that are introns. Whatever mutations affect the DNA sequences contained in introns, they are not expressed as proteins, and therefore do not affect the fitness of the allele. Only when a mutation corrupts the start code of the intron does the mutated sequence become meaningful. Then, it is very likely that the new protein generated will be useless, or even will make the allele non-viable, but from time to time, a macro-mutation that is useful will appear, and such a mutation could not have been reached if the intron mechanism had not existed. Introns allow species to go through evolutionary tunnels without being subject to selection pressure all the way along. Of course, mutations in the start code of the introns are very unlikely, as these sequences are only a few codons long, but this is consistent with the long time scale under which important qualitative mutations affect species.

Thus, introns are not useless, but are a key factor in allowing life to perpetuate itself. It is a real long-term mechanism, just as "normal" mutations are a long-term mechanism, sexual reproduction a medium-term one, and scattered response threshold distributions in cells are a short-term adaptation mechanism (another story).

To verify all this, I would suggest testing that the entropy of non-coding sequences is higher than the one for exons. For this, all that is needed is a databank of a sequence of DNA among different individuals of the same population, the sequence containing both introns and exons. As there is no selection pressure in introns, the entropy should be higher.
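One way to make the proposed test concrete is to take the same region from several individuals, align the copies, and measure Shannon entropy column by column. The Python sketch below is only one possible reading of "entropy" here, not a method anyone in this thread specified; the function names and the idea of averaging over positions are assumptions for illustration.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the bases seen at one aligned position."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mean_entropy(aligned_seqs):
    """Average per-position entropy over equal-length, pre-aligned sequences."""
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")
    # zip(*seqs) walks the alignment one column (position) at a time.
    return sum(column_entropy(col) for col in zip(*aligned_seqs)) / length
```

Under this definition a position where every individual carries the same base scores 0 bits and a position split evenly among all four bases scores 2 bits, so the prediction would be a higher `mean_entropy` for the intron part of the alignment than for the exon part.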

[in1-6]

From Keith Robison (in reply to Alix Martin): Just to clear up some definitions here. Introns are transcribed regions which are spliced out of mRNAs. While the proportion of the genome which is intronic is probably greater than the exonic regions, both are probably grossly dwarfed by the intergenic DNA (between genes).

One contender for an explanation [of the usefulness of introns] is that non-coding DNA generally has NO function -- it is just a "selfish" parasite which can be tolerated.

There are selection pressures [ on the portions of genetic code that are introns]. First, the signals for splicing introns are contained in the introns, and so there is a selection to maintain them. Second, those signals can't be compromised by conflicting signals -- so there is a pressure to avoid generating new splicing signals in inappropriate locations.

Actually you are mixing units ["Of course, mutations in the start code of the introns are very unlikely, as these sequences are only a few codons long,"] -- the term codon has no relevance in terms of intron splicing signals. Also, your hypothesis should include both "start-intron" and "end-intron" signals. Of course, one difficulty with extending an exon into the adjacent intron is that the extended exon must match in frame -- 2/3 of the time an exon-extension event will result in an untranslatable message.

You should define "entropy" precisely and describe how you will attempt to measure it. Also, there are factors which might confound your analysis. In particular, non-coding DNA (both inter- and intragenic) is largely composed of repetitive elements -- segments of DNA found frequently in the genome. Many of these elements are known to be capable of transposition (copying) within the genome.

Also, the databanks used to be heavily skewed towards short introns. With the advent of genomic sequencing, this bias is beginning to lessen but is being replaced by exons and introns which have been predicted by computer but not experimentally verified (not a good base for modeling).

Caveats aside, the field of intron-function is still pretty open. The datasets are only getting better, and so if you're interested in it you should plunge right in!

[in1-7]

From Mark E. J. Newman: Another important point is that non-coding regions are important simply for the physical space they take up. The position in three-dimensional space that different coding regions occupy can have important consequences for transcription regulation, and the presence of non-coding regions can allow the coding ones to take up their proper positions. Thus the actual content of the non-coding regions may be unimportant, but their presence is crucial to proper action of regulatory mechanisms.

This type of mechanism ["comment pieces of code that were useful at one time but are not any more"] is seen in artificial evolution. Pieces of code become inactive and are reactivated to the organism's advantage later on. I should not be surprised to learn that it takes place in nature too, though I can't give you any specific examples.

Actually, some work of this nature has already been done [on "testing if the entropy of non-coding sequences is higher than the one for exons"]. Not on introns in particular, but on non-coding DNA. It was done by H. Eugene Stanley of Boston University and some co-workers whose names escape me, and it was in the last year, but other than that I can't remember where I saw it. The basic idea was to do an information theory analysis of the information content of coding and non-coding DNA as a function of length. The basic result, if I remember it correctly, was that the NON-coding DNA had a "message-like" information content, i.e., increasing linearly with the length of the sequence analyzed, but that coding DNA did not - the information content increased slower than linearly. I'll see if I can find the reference anywhere.

Incidentally, in the particular case of introns which you were talking about above, I also know of at least one case in which an intron has a function, even though it is not translated. In that case, the physical presence of the intron in an mRNA that codes for a growth-promoter slows down the translation of the RNA (which cannot take place until the intron has spliced itself out). If you remove the DNA that codes for the intron from the genome, you still produce the growth-promoter, but you produce it too fast, and tumor-like overproduction of cells can occur.

[in1-8]

From Keith Robison (in reply to Mark Newman): There is a related story of certain developmental factors in Drosophila (e.g. string). These genes have enormous (>100 Kb) messages which take a very long time to transcribe. It turns out that at each cell division incompletely transcribed pre-mRNAs are destroyed. String's transcript is too long to transcribe completely during the first few Drosophila divisions, which occur in quick succession. As a result, a full length string transcript cannot be made until longer-period cell divisions occur. So the sizes of the introns help make the developmental decision of when the string protein is made!

[in1-9]

From Chip Young: I read somewhere, probably Science News, that the supposed non-coding DNA is quite stable. Close similarities from individual to individual.

Presumably, if it was really useless, it would mutate relatively fast since the environment wouldn't be weeding out errors.

Its relative stability suggests it does something, we know not what.

[in1-10]

From Dave Oldridge: Senapathy simply does bad math. His error is identical to that of creationists who state that evolution could not happen because (for example) elephants are highly improbable.

JM: I understand your objection to Senapathy's math in that example. However, there are two important mathematical aspects to his theory: (1) the probability of point mutations creating new genes, and (2) the probability of eukaryote genes forming in the primordial soup. We've already run the discussion out on part 1. Now, tell me what's wrong with his numbers on part 2.

From Keith Robison: The problem is that Senapathy is playing parlor games, not proposing a workable model. Yes, if you stare at random text you can find a message in it -- but only because you know the message (or a message) to find in it. Biology doesn't work that way -- splicing of an mRNA is not guided to produce only useful mRNAs. There are signals within the original transcript which guide the splicing process.

So, what's important is not the likelihood that a message occurs somewhere in a random sequence after splicing it to fit, but whether the pieces of the message occur flanked by the correct splicing signals. To extend the example of "To be or not to be" from the book in a trivial manner, suppose the letter "Q" is both a start-splicing and end-splicing signal. The question is then what is the probability of finding

    toQ
    (sequence without any Q's)
    QbeQ
    (sequence without any Q's)
    QorQ
    (sequence without any Q's)
    QnotQ
    (sequence without any Q's)
    QtoQ
    (sequence without any Q's)
    Qbe

You could, of course, change the signal to anything you want, but remember that both your proposed exons must be flanked by signals and both proposed introns and exons must lack it. I think you will find that the statistics aren't much of an improvement over finding your whole target message in random DNA.
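The "Q" toy model can be sketched in a few lines of Python. This is an illustration of the point being made, not anything taken from the book: the splicing rule looks only at the signals, never at the message. The function names, the 27-letter alphabet, and the filler letters in `designed` are assumptions chosen for the example.

```python
import random
import string


def splice(transcript, signal="Q"):
    """Drop every stretch between a start signal and the next end signal."""
    parts = transcript.split(signal)
    # Pieces alternate exon, intron, exon, ...; with an odd number of
    # signals the last intron is never closed and is dropped as well.
    return " ".join(parts[0::2])


def random_text(length, seed=0, alphabet=string.ascii_lowercase + "Q"):
    """Hypothetical stand-in for a random sequence pool (seeded for repeatability)."""
    rng = random.Random(seed)
    return "".join(rng.choice(alphabet) for _ in range(length))


# A transcript laid out exactly as in the listing above.
designed = "toQxzvQbeQkkQorQwpQnotQjQtoQmmmQbe"
print(splice(designed))          # to be or not to be

# Random text gets the same treatment: the signals pick the "exons".
print(splice(random_text(200)))
```

The designed transcript yields the target sentence only because the signals were placed by hand; in the random text the same rule simply returns whatever happens to lie between signals.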

[in1-11]

JM: Now, tell me what's wrong with his numbers on eukaryote genes.

From Keith Robison: There are signals within the original transcript which guide the splicing process. ... So, what's important is not the likelihood that a message occurs somewhere in a random sequence after splicing it to fit, but whether the pieces of the message occur flanked by the correct splicing signals.

JM: That is not a problem. Dr. Senapathy uses 600 nucleotides as the typical exon length, and so adding a few more (specifically, 9 + 4 = 13) nts won't make much difference. An exon could be "defined" to include the start and end splices and the probabilities computed from that. 600 is still a good average length to use, but use 613 if you want. In fact, for all but the longest exon of a gene, the additional 13 nts won't make any significant difference to the likelihood of finding the complete gene in a random DNA sequence because the chances of finding the complete gene are only dependent on the length of the longest exon in the gene. The math behind this reasoning is discussed at the start of Chapter 7, and text strings are used there as example "eukaryote genes" (pages 222-230).

Keith: Senapathy is completely wrong here, and I'm surprised you have swallowed it. Because of the way splicing works, what is important is the frequency of splicing signals in the random sequence -- you don't form genes just by taking what exons you want. As I noted before, in his "to be or not to be" example, Senapathy shows no method other than finding a predetermined target. He skips over plenty of legitimate English words (exons), some ("awry") longer than the ones he chose.

JM: Also in Chapter 7, Dr. S. discusses the chances of finding long reading frames in random DNA, and he does an extensive analysis of the effect of the frequency of stop codons. So, contrary to your characterization, I don't see that he's playing "parlor games."
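For readers without the book, the textbook version of the stop-codon argument can be sketched as follows. This is not Dr. Senapathy's own analysis; it assumes all four bases are equally likely and independent, so that 3 of the 64 possible codons are stops. The function names are made up for the example.

```python
def prob_open_frame(n_codons, stop_codons=3, total_codons=64):
    """Chance that n consecutive random codons contain no stop codon."""
    return ((total_codons - stop_codons) / total_codons) ** n_codons


def mean_codons_to_stop(stop_codons=3, total_codons=64):
    """Expected number of codons read up to and including the first stop."""
    return total_codons / stop_codons


# A 600-nucleotide exon is 200 codons in a single reading frame.
print(f"P(no stop in 200 codons) = {prob_open_frame(200):.2e}")
print(f"mean codons to first stop = {mean_codons_to_stop():.1f}")
```

Under these assumptions a stop turns up about every 21 codons on average, and a 200-codon stretch stays open roughly 7 times in 100,000.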

Keith: Then keep looking, or spend more time in the parlor ( :-). In every case (Figures 7.1, 7.2, 7.20) Senapathy first selects the message he is looking for, and then scans the random sequence looking for it. Lots of fun, but completely irrelevant to the field of biology.

Again, the way biology really works (and remember, Senapathy is saying that things can't have changed :-), is that the spliceosome moves down the transcript, and when it hits a "start-splicing" signal, that's the end of an exon. It then scans for a "stop-splicing" signal which marks the beginning of the next exon. The spliceosome knows nothing about open reading frames or phases. As I have pointed out before, the odds of successfully getting an ORF from a moderate number of exons is very low, as 2/3 of your splicing events will be out-of-phase.
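The phase argument is easy to put into numbers. The Python sketch below takes the 2/3 figure at face value and adds one assumption that is not stated in the post: that each splicing event lands in phase independently with probability 1/3. The function name is hypothetical.

```python
from fractions import Fraction


def prob_all_in_phase(n_splices, p_in_phase=Fraction(1, 3)):
    """Chance that every one of n independent splicing events keeps the reading frame."""
    return p_in_phase ** n_splices


for n in (1, 2, 5, 10):
    p = prob_all_in_phase(n)
    print(f"{n:2d} splices: {p} (about {float(p):.2e})")
```

With five splicing events the chance of staying in frame throughout is 1 in 243; with ten it is 1 in 59,049.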

...remember that both your proposed exons must be flanked by signals and both proposed introns and exons must lack it.

JM: He does not ignore that requirement. See pages 230-239 and 242-247 plus other places in Chapter 7.

Keith: Senapathy never really deals with this problem. The closest he comes is suggesting that somehow the splicing process can recognize regions densely populated with stop codons (p.245). Again, there is NO evidence for this, and there is evidence to the contrary.

How, in this model, can you explain very short exons?

In the current issue of Nature Genetics, there is a report of a coding-region mutation which causes a genetic disease, yet it does not change the predicted amino acid sequence. However, it turns out it generates a "start-splicing" signal, and hence that exon is prematurely terminated.

Senapathy's model shows no correlation with biological reality; splicing does not know about translation.

And BTW, the distribution of known exon sizes does not fit an exponential distribution (Stoltzfus et al. got the distribution right in their Science rebuttal), and there is no "cutoff" of exon sizes at a location convenient for Senapathy (there are exons which encode 1000's of amino acids). Senapathy can't even get his supporting facts straight.

[in1-12]

From Keith Robison: In every case ... Senapathy first selects the message he is looking for, and then scans the random sequence looking for it.

JM: True. In his English-text example "genes," Dr. Senapathy looks for a few specified sequences (and finds them all), but he does this to illustrate that any sequence will be found. He is not restricting the search to just those sentences. On the contrary, he is encouraging you to search for any sentence you want (with the sole requirement of the limit on longest word). The random, 3-billion character sequence Senapathy used is too long to publish, or even e-mail. However, you are allowed to manufacture your own random sequence, and you can and should use many more than one, as this would represent the abundance of DNA that was available (see below for some numbers).

Once the length of the longest exon in a gene (including the splice sequences and the signals that start and end a gene) is specified, Senapathy shows how to compute the length of the random DNA needed to assure that that gene and any other gene (with one restriction) will be found there. He shows that the amount of DNA so computed will be a reasonable amount (i.e., that amount would be many times less than the total amount of DNA available in the pond). The "one restriction" is that the length of the longest exon in those other genes must not be longer than the longest exon in the specified gene.

Keith: Again, this is utter hogwash. All he is proving is that you can find the sequence in there if you know what you are looking for; that any biological system could extract it is another matter altogether. Senapathy's calculations are hopelessly naive; the real calculation is much more difficult. But, in general, once you blindly transcribe random sequence and splice it at the randomly occurring splice sites, you will basically find it looks like the DNA you started with in terms of the trinucleotide (codon) frequencies -- i.e., this exercise is not a magical solution to finding long genes in random sequence.

As Arlin Stoltzfus has already pointed out, there is no particular reason to expect that the initial genes were particularly long. Genes have probably undergone a lot of fusion & rearrangement, yielding the modern long reading frames and exons -- even Senapathy admits this, because he must explain away the intron-less prokaryotic genomes.

JM: For those of you who do not have his book, Dr. Senapathy uses these numbers after taking into account the degeneracy of codons and amino acids:

    longest exon = 600 nucleotides;
    length of random DNA needed for a high probability of finding that size exon = 10^20 nts;
    length of random DNA needed for a high probability of finding a gene containing that longest exon and one of 400 nts = 10^26 nts;
    Note that most genes have longest exons of only 100-150 nts;
    total DNA available in the pond = 10^30 to 10^35 nts.
    (Reference for above: Chapter 7, pages 286-288)
    Total amount of DNA in a single human being (all cells) = 10^23 nts, 60 grams (from page 566).

Keith: As your calculation shows, Senapathy's pond contains 10^5-10^10 kilograms of high molecular weight, double-stranded DNA. Biological systems are quite capable of generating this; a serious challenge for any abiogenesis scheme is generating the biomolecules (one was just published in Nature). Senapathy says "no problem" -- and then assumes it will be polymerized, double-stranded, and high-mw (or else his calculations croak from "edge effects" -- you can't run a long gene into DNA which doesn't exist). Furthermore, this DNA is being replicated, transcribed, and translated.
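The mass figure can be checked by scaling the book's human number (10^23 nts weighing 60 grams) up to the pond. A minimal Python sketch, with constant and function names invented for the example:

```python
# Figures quoted from the book: 10^23 nucleotides in one human weigh 60 grams.
NTS_PER_HUMAN = 1e23
GRAMS_PER_HUMAN = 60.0


def dna_mass_kg(nucleotides):
    """Mass in kilograms of the given number of nucleotides, by simple scaling."""
    return nucleotides * GRAMS_PER_HUMAN / NTS_PER_HUMAN / 1000.0


for pond_nts in (1e30, 1e35):
    print(f"{pond_nts:.0e} nts -> {dna_mass_kg(pond_nts):.1e} kg")
```

This gives about 6 x 10^5 kg at the low end and about 6 x 10^10 kg at the high end, in line with the 10^5-10^10 kilogram range to within an order of magnitude.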

JM: Perhaps this is what you are getting at: Since two of the sentence examples used on page 229, "God heals, and the doctor takes the fee" and "Love is the wisdom of the fool and folly of the wise," plus many other sentences are all found in the same random text sequence, and since two or more such sentences could overlap in the random text, the actual sentence found might be something like "Love is the doctor the wisdom takes of the fool, the fee." However, unlike these word examples, the longest exon in each of two real genes will not likely be near each other relative to the locations of the shorter exons. That is, the two genes are unlikely to overlap because all of the shorter exons will be found very close to the longest exon. Besides, so what if a few of the specified genes do overlap? We aren't looking for any specific gene -- we take whatever we find and test it for viability. Win or lose, just keep going. Dr. Senapathy is only saying that the odds of finding eukaryote genes (and assembling them into viable genomes) are so high as to make it very possible, not nearly impossible....

Keith: And the point is, he has overestimated these odds grossly. He has led you down the garden path by equating splice signals with stop codons, when in reality what little resemblance there is is probably coincidental (BTW, the consensus for the end of an intron is Yag, where Y=T or C, but C predominates slightly; but, why let the facts get in the way of a cool hypothesis).

JM: (continuing) ...He is not making any statement here about the viability of any particular gene, just that there will be so many genes (viable or not) that a few will "survive" in the pond. He uses the characteristics of known viable genes as a basis for the computations.

Dr. Senapathy does not specifically include the gene start and end signals in his discussion, but I don't see why those signals could not simply be treated as a "null" length exon and included in the search. Since those sequences are short relative to the longest exon, they won't affect the amount of DNA needed to find them. You could argue that since they are so short, they'll be found so often as to goof up your search for a long gene. (Let me know if I'm helping you too much here. :-) Well, they might do that in many cases. But, there is also a reasonable chance that they won't occur. I don't have the numbers on this, but I suppose a couple of weekends of work would produce them.

Keith: You've missed the point -- entirely. English words don't have phasing; mRNA translation does. There is also no real genetic equivalent to spaces -- splice sites are made of the same 4 letters, and their interpretation depends on context (i.e., an "end-splice" signal is irrelevant unless it follows a "begin-splice" signal). So the problem is that when you hit the next random splicing signal, odds are your translation will come to a halt.

He skips over plenty of legitimate English words (exons), some ("awry") longer than the ones he chose.

JM: But that's what makes this work. In the example, he is looking for specific words, but in reality whatever genes occur (along with the other genes forming a genome) may eventually get tested for viability. You didn't want him looking for any specific sequence, so you cannot let yourself do that either, and that includes specific sequences having rogue start/stop signals. Take any gene you find and test it for viability. That random DNA will contain many, many genes. The few viable ones will survive, and they will become more numerous over time.

Now, am I still missing something? If you think so, then can you use some numbers to refute Senapathy's numbers or logic?

Keith: We can divide his theory into two versions,

Senapathy petite: Formation of genes by random splicing
Senapathy grande: Independent Origin of all Species

Senapathy grande is fatally flawed on many levels, for a sampler:

Many organisms' development rules out the "seed cell" hypothesis.
All abiogenesis mixtures are suicidal (hint: which is simpler to form, a fully independent organism able to meet all its needs or a free-loader which slurps the soup?).
Homology; there's so much evidence for common origin.

That leaves Senapathy petite (i.e. Senapathian formation of genes, but conventional organismal evolution). Even if we scale it down to meet my criticisms with regard to the negligible gain in ORF size, it is now an "exons-early" theory, and in general the exons-early boat has its gunwales at about the waterline.

In summary, Senapathy's book is a grossly flawed exercise in self-delusion. There is a great abundance of evidence to refute his big claims, and scaling down his claims doesn't put him in good shape either. Also, as a scientific theory Senapathy grande is utterly, absolutely worthless -- though I haven't quite decided if it is because it makes every prediction or no predictions (either way, useless). This stands in great contrast to evolution via common descent, which is a key theory in understanding biology and an important guide to real experimentation. If he hadn't paid to publish it, I would have assumed it was an elaborate parody of cargo-cult science; instead it just is.

[in1-13]

From Keith Robison (reprise): In summary, Senapathy's book is a grossly flawed exercise in self-delusion..... Also, as a scientific theory Senapathy grande is utterly, absolutely worthless -- though I haven't quite decided if it is because it makes every prediction or no predictions (either way, useless).

JM: and suddenly you are doing an awful lot of ranting and raving, but you offer no numbers or information to support your claims that Dr. Senapathy is crazy. Your whole post was that way.

Keith: Okay, I'll admit I was cranky in that post. But the facts still stand: There is a great abundance of evidence to refute his big claims, and scaling down his claims doesn't put him in good shape either.

JM: What evidence refutes his claims? The evidence that supports macroevolution does not count -- you need to show what evidence there is that does not match Senapathy's theory.

Keith: Senapathy makes a number of claims about the statistical properties of introns and exons, which he says are a natural result of his theory (and therefore evidence for). To wit, Senapathy claims explicitly that exon sizes follow an exponential distribution, and his logic implies that intron sizes should follow a similar distribution.

NEITHER IS TRUE!

Exon sizes follow a much more complex distribution (Stoltzfus et al got it right in their Science rebuttal). Intron sizes aren't exponential either (saw a presentation on this last week) -- they looked sort of normal-ish to me.

Senapathy's calculations are hopelessly naive; the real calculation is much more difficult.

JM: OK, show me.

Keith: Again, the real calculation would have to consider the probability of splicing signals -- i.e., what does the distribution of ORF lengths look like after randomly transcribing and then splicing the mRNAs? The only way I could do it is by simulation, which would be a bit more work than I'm willing to do. Nevertheless, we can make an intelligent prediction of the result (see below).

... this is utter hogwash. All he is proving is that you can find the sequence in there if you know what you are looking for;

JM: No, he's showing that you can find any sequence (of specified limited length) in a large, given amount of random DNA.

Keith: But, in general, once you blindly transcribe random sequence and splice it at the randomly occurring splice sites, you will basically find it looks like the DNA you started with in terms of the trinucleotide (codon) frequencies...

JM: Yes, so what? I don't think Senapathy is saying otherwise, is he? Where?

Keith: Because he is saying that in a biochemical system, you can find genes in random DNA if you splice, but not if you don't splice. In other words, the splicing process somehow adds information content. But it CAN'T! Because a randomly-transcribed+spliced sequence pool has the same trinucleotide composition as the unspliced starting sequence, the splicing operation has done NOTHING to the probability of finding a long ORF. This is why all Senapathy's calculations are just smoke.
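The arithmetic behind this point can be sketched in a few lines. This is only an illustration, assuming uniformly random DNA in which all 64 codons are equally likely and 3 of them are stops (the standard genetic code); the function names are made up for this note and do not come from Senapathy's book. Under that assumption, the chance that a reading frame stays open depends only on codon frequencies, so any operation that leaves those frequencies unchanged leaves this probability unchanged as well.

```python
STOP_CODONS = 3     # TAA, TAG, TGA in the standard genetic code
TOTAL_CODONS = 64   # 4^3 possible trinucleotides

def prob_stop_free(length_nt: int) -> float:
    """Probability that a reading frame of length_nt nucleotides contains
    no stop codon, assuming every codon is equally likely (an assumption)."""
    codons = length_nt // 3
    return ((TOTAL_CODONS - STOP_CODONS) / TOTAL_CODONS) ** codons

def mean_codons_to_first_stop() -> float:
    """Mean number of codons read up to and including the first stop
    (mean of a geometric distribution with p = 3/64)."""
    return TOTAL_CODONS / STOP_CODONS

if __name__ == "__main__":
    for length in (200, 600):
        print(length, "nt stop-free probability:", prob_stop_free(length))
    print("mean codons to first stop:", mean_codons_to_first_stop())
```

The open-frame probability falls off geometrically with length, which is why long ORFs are rare in random sequence. Whether splicing at randomly placed signals changes that is exactly what the two sides dispute here; the sketch does not model splice signals at all.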

JM: I don't read it that way. It seems to me that Senapathy's random DNA looks like eukaryote DNA. For example, in figure 7.4 (page 236) he writes: "The only way a gene longer than 600 nts could originate was to select some short reading frames and splice them together ... by editing out the intervening regions containing many stop codons. Such a splicing resulted in a long reading frame which could then code for a long protein. In today's biology, the short coding pieces which were spliced together are called exons, and the intervening pieces, the introns." That is, he is saying the short pieces and RFs (before splicing) are the exons and the other stuff make introns.

Where does he say you have to splice before you make the eukaryote gene (complete with introns)? The transcription and splicing is being done after the random hunk of DNA is put into a genome.

Keith: ...previously you were quite adamant that the proper way to do such a calculation is to consider only the observed sequences, not the complete spectrum of potentially interchangeable sequences. Have you changed your mind?

JM: No -- we've changed the subject here. Our previous discussion was about point mutations and getting new genes therefrom. This is about finding any gene in a bunch of random DNA. I have not rejected your objections to the point mutation logic, but we reached an impasse there -- ultimately we both said that it didn't really matter (because, from your point of view Senapathy's model is wrong, and from my point of view there is no model). So, forget the point mutation discussion because it does not apply in any way to the main part of Senapathy's theory regarding the DNA in the pond.

Keith: Jeff, you have missed the point. When we discussed point mutations, you argued (and Senapathy used) that the correct question to ask was what was the probability of evolution arriving at the observed sequence, not the possibility of drawing any one of the possible isofunctional sequences. I am now suggesting that you (and Senapathy) remain consistent -- you must calculate the probability under Senapathy's model of drawing each of the observed genomes, not of drawing any one of the possible isofunctional genomes.

The underlying statistical logic is the same in both arguments, but Senapathy chooses the one to fit his purposes (and you have gone on ala-lemming). In other words, to remain consistent you & Dr. S. must calculate the probability of finding all of the current genomes in the soup. Hint: for the human genome it's 4^[-1*(10^6)]; repeat for all remaining genomes, multiplying the probabilities.
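Numbers of the form 4^(-N) are far too small for floating point, so the hint above is easiest to evaluate in logarithms. A minimal sketch, assuming each base is drawn independently and uniformly from four letters; the length 10^6 is simply the exponent quoted in the hint, and the function names are hypothetical:

```python
import math

def log10_prob_exact_sequence(length_nt: int) -> float:
    """log10 of the probability of drawing one specific sequence of
    length_nt bases from uniform random DNA, i.e. log10(4 ** -length_nt)."""
    return -length_nt * math.log10(4)

def log10_prob_all(lengths) -> float:
    """Multiplying independent probabilities is the same as adding their logs."""
    return sum(log10_prob_exact_sequence(n) for n in lengths)

if __name__ == "__main__":
    print(log10_prob_exact_sequence(10**6))
```

For the exponent in the hint this gives a log10 probability of roughly minus six hundred thousand, and every additional genome only adds to the negative total.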

JM: He has done this. With the longest exon specified, the length of random DNA is computed, and within that DNA will probably be found all possible exons of that length. In the word example, all 6-letter words will likely be found in the 3 billion random character sequence.
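The word example can be checked with a rough estimate. The sketch below assumes a 26-letter alphabet with equally likely letters (the alphabet size is an assumption, not stated here) and uses a Poisson approximation that ignores correlations between overlapping positions; the function names are illustrative.

```python
import math

def expected_hits(text_len: int, word_len: int, alphabet: int = 26) -> float:
    """Expected number of occurrences of one specific word in random text."""
    positions = text_len - word_len + 1
    return positions / alphabet ** word_len

def prob_at_least_one(text_len: int, word_len: int, alphabet: int = 26) -> float:
    """Poisson approximation to the chance that the word appears at least once."""
    return 1.0 - math.exp(-expected_hits(text_len, word_len, alphabet))

if __name__ == "__main__":
    n = 3_000_000_000  # the 3 billion random characters of the example
    print(expected_hits(n, 6), prob_at_least_one(n, 6))
```

With these assumptions a given 6-letter word is expected to turn up several times in 3 billion characters. The estimate says nothing about phasing or splice-signal context, which is the objection raised against the analogy.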

Keith: So how do you explain all those big exons (there are many greater than 400 nts, as I have posted -- as big as 7Kb as I recall)?

JM: Senapathy uses 600 nts as a typical longest value, but agrees there are some that are longer. What's wrong with that? If his average longest exon frequencies are wrong, what are the correct numbers?

Keith: As I posted before, there are many longer exons. Calculate under Senapathy's model the number of exons you would expect to find greater than 600 nts in length.

JM: Note that most genes have longest exons of only 100-150 nts; total DNA available in the pond = 10^30 to 10^35 nts.

Keith: (reprise) As your calculation shows, Senapathy's pond contains 10^5-10^10 kilograms of high molecular weight, double-stranded DNA. Biological systems are quite capable of generating this; a serious challenge for any abiogenesis scheme is generating the biomolecules (one was just published in Nature). Senapathy says "no problem" -- and then assumes it will be polymerized, double-stranded, and high-mw (or else his calculations croak from "edge effects" -- you can't run a long gene into DNA which doesn't exist). Furthermore, this DNA is being replicated, transcribed, and translated.
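The conversion from nucleotide count to mass is easy to reproduce. This is a back-of-the-envelope sketch that assumes an average of about 330 g/mol per nucleotide in a DNA strand (an approximate textbook figure, not a number from this discussion) and counts each nucleotide once; the constant and function names are illustrative.

```python
AVOGADRO = 6.02214076e23     # molecules per mole
GRAMS_PER_MOL_NT = 330.0     # assumed average mass of one nucleotide in a strand

def dna_mass_kg(nucleotides: float) -> float:
    """Approximate mass in kilograms of the given number of nucleotides."""
    moles = nucleotides / AVOGADRO
    return moles * GRAMS_PER_MOL_NT / 1000.0

if __name__ == "__main__":
    for nts in (1e30, 1e35):
        print(f"{nts:.0e} nts ~ {dna_mass_kg(nts):.1e} kg")
```

Under that assumption, 10^30 to 10^35 nts lands in the same order-of-magnitude range as the 10^5-10^10 kilograms quoted above.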

WHERE DID ALL THAT DNA COME FROM? How is it all so damn long; Senapathy's calculations assume the DNA as one long strand, or at least each strand is much, much longer than a eukaryotic gene. As I have pointed out before, maintaining DNA of such length is a challenge, as DNA is not structurally very strong and will easily break. Where did the transcription machinery and splicing machinery come from?

JM: Random chance, just like everything else. Once things began to work, the machinery was replicated (more often than the stuff that didn't work).

I don't see why it's necessary that the random DNA all be in one long piece. If it was churning about and various long pieces were formed, even briefly, then broken and formed in another sequence, wouldn't that work, too?

Keith: (reprise) You've missed the point -- entirely. English words don't have phasing; mRNA translation does. There is also no real genetic equivalent to spaces -- splice sites are made of the same 4 letters, and their interpretation depends on context (i.e., an "end-splice" signal is irrelevant unless it follows a "begin-splice" signal). So the problem is that when you hit the next random splicing signal, odds are your translation will come to a halt.

JM: OK, it halts. Keep going and won't it eventually restart?

Keith: Nope. Not usually. In bacterial systems, it does frequently restart if there is a start sequence nearby -- but it starts a new peptide chain!! (stop codons are really "stop translation and release peptide" codons) Very rarely will eukaryotic ribosomes restart, and again, it will be a separate protein.

And anyway, this doesn't really matter. Remember, Senapathy is claiming (except where he needs it) that evolution is impossible -- i.e., under his model each genome looks almost exactly like it did the day it emerged from the soup.

And again, Senapathy is explicitly claiming that splicing is the route to long ORFs. What you are attempting to do is find a way around this -- not disputing that Senapathy is wrong. Senapathy is claiming that splicing builds big ORFs, and he's just plain wrong.

JM: If you think Senapathy is wrong, then please correct that part of his theory and change the numbers and recompute the amount of DNA needed. If you are correct, the number will be huge and the DNA unobtainable. That would be a lot of work and it may not be reasonable for you to do the math, but that is what you must do to show that the amount of DNA is not sufficient to satisfy Senapathy's theory.

Keith: I don't need to -- Senapathy has already done it for us, but mislabeled it. Because the spliced sequences look just like the unspliced sequences at the level of ORFs, we can use his calculation: 10^120 nucleotides (to find 200nt ORFs at high frequency).

JM: Just saying he is wrong (more precisely, you said: "Senapathy's book is a grossly flawed exercise in self-delusion.... Also, as a scientific theory Senapathy grande is utterly, absolutely worthless") is not a very convincing argument. Why can't you quantify your arguments as he has done?

Keith: (reprise) And the point is, he has overestimated [the odds of finding eukaryote genes] grossly. He has led you down the garden path by equating splice signals with stop codons, when in reality what little resemblance there is is probably coincidental.

JM: How should it work? What are your numbers on those odds?

Keith: See above. The point is that his numbers are absolutely meaningless. All those impressive calculations -- irrelevant. Do you start to understand my frustration with this issue? Senapathy snows the readers with all this stuff, when it is completely pointless.

(reprise) Pray tell what is applying the selection??? According to Senapathy's model, no selection occurs until the whole mess is assembled into a "seed cell" (itself a horrifically-flawed concept at odds with much established fact).

JM: What is the established fact that refutes Senapathy's seed cells?

Keith: I've posted this before.

Many metazoans ("multicellular animals") have developmental schemes which require the asymmetric localization of proteins and mRNAs in the ovum. These patterns are laid down by cells in the mother's body. This solves the problem known as "symmetry-breaking" -- how can an apparently symmetric egg generate an asymmetric organism? Senapathy's seed cells would have no such external pattern to impose asymmetric distributions of proteins and RNAs. Furthermore, if a seed cell could develop without them there would be no reason to expect the current requirements for them.

Mammals require that half their DNA be marked (probably by methylation at CpG dinucleotides) in order for proper development to occur. One parent's chromosomes are so marked; the other's are not. Attempts to generate uniparental mice failed because of this fact. Again, Senapathy's pond could not generate such controlled heterogeneity, and we would not expect it to occur if Senapathy's pond-mammals could emerge without it.

(reprise) No, what is most important is that it is unlikely that once you find your gene, that it will have the appropriate constellation of regulatory sites to be expressed in a useful manner.

JM: Isn't this built into Senapathy's DNA? That is, there is a certain probability that you will find the splice signals and so on, and in the proper phase, in the random DNA. If Senapathy is wrong and if you think he's ignored this in his numbers, then please offer some alternative numbers, showing what the result would be if Senapathy had done it "right."

Keith: Again, Senapathy is so good with equations because he picks trivial ones. Doing a good transcriptional signal prediction is HARD -- you must make a lot of assumptions about probabilities (some of these things have very weird locational properties -- still not well understood). I don't have the numbers, and so I won't make pretty-yet-meaningless equations. But, given that there are probably 10^4-10^5 different transcriptional patterns in a human, Senapathy has underestimated things by at least that factor.

(reprise) Again, how is selection acting within the magical pool, when the functions of the genes can't be tested until they exit the pool?

JM: Selection of viable genes is happening outside the pond. Replication is happening inside and outside.

Keith: Selection must be coupled with replication in order for this process (generally called Darwinian selection) to have any effect.

(reprise) All abiogenesis mixtures are suicidal (hint: which is simpler to form, a fully independent organism able to meet all its needs or a free-loader which slurps the soup?)

1. It takes many genes to make complex biomolecules.
2. The genes to utilize complex biomolecules are common to all life.
3. 1 & 2 imply that there are fewer required genes for building a scavenger vs. a synthesizer.
4. The probability of an organism emerging from a pond is proportional to the number of required genes.
5. 3 + 4 imply that scavengers will emerge more frequently than synthesizers.
6. A single mutation in a synthetic pathway can knock it out.
7. Deleterious mutations are frequent.
8. 6 + 7 imply that synthesizers will frequently mutate to scavengers.

Ergo, scavengers will emerge from the soup. Such scavengers will devour the soup, and while doing so kick out enzymes which degrade the components of the soup (this is a fact of life -- dipping your fingertips into Senapathy's pool would be genomicide on a large scale). Such scavengers would consume the soup.

Conclusion: no abiogenetic soup can long survive abiogenesis.

Jeff, I am wearing out. Let me put it this way -- what does Senapathy's theory NOT predict? And again, Senapathy's retrodiction/accommodation of homology is based on his genome recycling theory, which I have pointed out is not compatible with the known properties of DNA. I have also pointed out that Senapathy's soup cannot coexist with decomposers, and such decomposers have been around for millennia.

I try to stay calm, but his book is just so maddening! The reason it looks good (in the book) compared to evolution is that evolutionary theory is real science, with all of its warts and shortcomings exposed, debated, and analyzed. Senapathy presents a glowing picture, failing to present one flaw in the theory. I (and others) have presented many, and they are glaring. Once you clear away all the false statements and "spherical cow" assumptions, there isn't much left of the book.

Put in a fair fight with modern evolutionary theory, Senapathy's theory just doesn't offer any promise.

The discussion is continued in Introns & Exons, part II.
