From Periannan Senapathy
Date: September 12, 1995
I should apologize for being inactive in responding to the internet posts regarding my theory for the past several weeks. I want to do justice to that, and it leads to my replies being long. Hope you will bear with me. I would like to thank those who have participated in the very interesting discussions, both positive and negative to my theory. I am happy that Jeff Mattox has read my book thoroughly and has completely understood the theory. Since he came into the discussion, he has answered most of the questions correctly and effectively. I also want to thank the many who sent personal emails and letters supporting my theory.
1. Why a new theory?
First, let me answer a question: Why do we discuss my theory of the independent birth of organisms in the evolution forum, when we know that those who believe in evolution will be vehemently opposed to it? It is only natural that when people who already believe that evolution is an established fact are told that it is not so, and that there is another theory that can better explain the scenario of life on earth, they will be emotionally opposed to it and will even be angry. I began discussing my theory here not to anger them, but to exchange the scientific details with them and to enable them to see the validity of the basic tenets of the new theory. Many here are staunch supporters of evolution theory, either due to educational training or due to self-conviction. Yet if one can set aside an emotional attachment to evolution theory, and give a few moments of unbiased thinking to the possibility of the new theory, I am confident that one may see the reasonableness of the new theory. When a theory is not ultimately proven, no matter how much it appears to be proven and no matter how convincing it appears to be, there can be another theory that better explains the scenario. The purpose in discussing my theory here is to show that this new theory, which I certainly do not claim as a proven fact, is able to explain the scenario of life on earth at least as well as the evolution theory, and in my view better. This should at least encourage people to stop going in the wrong direction, to take a fresh look at the whole scenario, and to find new answers and explanations.
My theory is like evolution theory in its scientific attitude and basic approaches, except that it explains the origins of genes and genomes, and the connections and especially the distinctions among organisms, in a much better manner. Many here are frustrated and ask why we need a new theory when the existing evolution theory seems adequate to explain the scenario. The existing evolution theory is nowhere near adequate to solve many of the problems posed in understanding the origins and in explaining the connectivities and unconnectivities among organisms. We certainly need to be able to find answers to many of the unanswered questions regarding the origins, and to be able to explain many of the unexplained scenarios of life on earth. That is precisely what my theory attempts to do. So please plunge in, and even be emotionally angry, but please do give some unbiased moments of thought and analysis to the scientific details that we discuss here. Perhaps some of you will become convinced, at least to some extent, that the new theory is possible and plausible after all!
There have been many newcomers to the discussion on this forum. Perhaps it is worthwhile to also include a historical aspect of my theory: how, being a molecular biologist interested in and convinced of evolution theory, I happened to formulate a new theory that is basically opposed to the evolution theory and that is scientifically in no way inferior to it, and why I claim that it is in fact scientifically able to explain the scenario of life on earth better than the evolution theory. While I try to answer the questions, the historical perspective may help to remind the reader why such a new theory is needed in the first place, and what its nuances are.
2. The occurrence of long DNA molecules in the primordial pond.
I am a molecular biologist by training, and a scientist by interest. As I have said before, I have been interested in the question of the origin of life since my college days, as many of us have. My interest has arisen purely out of an instinctive desire for scientific inquiry, and I have no vested interests, religious or otherwise.
As I was researching the molecular details of the origin of life on earth and pondering Darwin's theory in order to find an explanation for the origin of the first primitive cell on earth, I found that the existing theories and details really did not have an answer. Just out of my graduate studies and doing post-doc research at the NIH, I found that the theory and works in the field of chemical evolution were good, but fell short of explaining the origin of the first genes, genome and life of even the most primitive free living cell possible on earth. These theories were in the right direction, but were far from sufficient to explain the origin of even one complete gene. They proposed and tried to explain things piecemeal, and were far from complete enough to address the question of the origin of the first free living cell. So I was pondering and researching to find an answer to these questions in order to understand how the genes of the first cell could have been formed, how the genetic code and the genetic machineries could have evolved, whether these mechanisms evolved before the first cell was formed or after it was formed, and many other such questions. This was right around the early 1980s, just about when the split structure of the eukaryotic genes had been discovered, the origin of which was a puzzle for all of us working in molecular biology.
While researching and assimilating the details in chemical evolution research, I took a molecular biology approach to understand how the DNA, the genes in the DNA, proteins, genetic code, genetic machineries and the cell itself could have originated in the first place. It was the field of chemical evolution that had taught me to think that all the reactions among the earth's elements and chemicals must have been random in nature, and from which the right combinations could be selected. But it also said that things evolved gradually from simple to complex, from simple biochemicals gradually to complex biochemicals and to the first cell which was most primitive and crude, from which organismal descent with modification, that is Darwinian evolution, took over. The sequence of events that were proposed and the explanations given for the formation, or even if we may call it evolution, of the very first cell were very vague and piecemeal, and very incomplete. There was really a big gap between the proposals, details, and explanations, and the reality of even the most primitive free living cell on earth as we know it. First, based on chemical evolution experiments, people had been given to think that only short oligonucleotides could be formed in the primordial soup, which then had to combine and somehow form genes by means of chemical evolution. It was also vague if the genes were formed fully before the first cell was ever formed, or after that within the cell. And in neither case was there a systematic analysis with regard to the probability, statistics, structure or function of the genes or proteins. The explanations were story-telling type, and were in no way scientifically rigorous. The explanation was that during this process, somehow the genetic code should have been formed, somehow the genetic machineries should have been formed, and somehow the proteins could have been formed, and somehow a primitive cell was formed. 
While we cannot blame any one for the lack of knowledge then, this was purely vague and was not scientifically satisfying. We needed more concrete science.
While trying to understand these things and researching with random sequences, I asked what if long DNA molecules were formed in the primordial soup. They could be made by many means: short ones could be made by chemical means, and these could then be recombined by chemical catalysts -- proteinaceous or otherwise (e.g., proteinoids or random peptide mixtures that were not gene-coded) -- to which again random nucleotides could be added. In any case, I could see that there was no reason why long DNA strands could not be formed in the primordial soup. In all the prior proposals concerning chemical evolution, what I saw were arbitrary assumptions and self-imposed limitations in the thinking of the people who proposed them. The original authors who proposed chemical evolution, such as A. I. Oparin, were doing it with the aim of explaining the primordial chemistry for the supposedly very difficult origin of the first cell, which was supposed to be very primitive. Let us not forget that such authors were fully influenced by Darwin's theory, which started its arguments from such an a priori assumed, simple, primitive, single-cellular life on earth.
During my graduate studies I had worked in a DNA chemistry laboratory, and my professor, T.M. Jacob, had worked for a number of years with H. G. Khorana (a Nobel laureate whose group accomplished the first chemical synthesis of a complete gene). My first works were chemical syntheses of oligonucleotides (well before the automated synthesizers had come along). As a consequence, the questions and concepts in the existing field of chemical evolution that constrained the length of the DNA in the primordial soup did not bother me. For me, DNA could be a strong molecule unless there were enzymes to degrade it or it was placed in a hostile environment, and it could be formed in reasonable lengths. Wouldn't any molecule be degraded under conditions conducive to its degradation? So would DNA. But there are many sets of reasonable conditions under which DNA is very stable. Once we have random oligonucleotides that are tens or hundreds of characters long, there is no reason why they should not be linked to form strands thousands, hundreds of thousands, or even millions of nucleotides long. Except for getting emotionally angry at my saying something that is not traditionally accepted, no one can provide a valid scientific reason as to why this could not have happened.
There are millions of species living today, and many millions of individuals for every species. The DNA in each of the chromosomes in each cell is at least tens and up to several hundreds of millions of nucleotides long -- each a very long contiguous molecule indeed. These DNA molecules in each of the trillions of cells in each individual of every organism are perfectly stable. Of course, the reason is that there are other biochemicals, including small molecules and proteins, protecting them. The essence is: Trillions of DNA molecules, each hundreds of millions of nucleotides long, are perfectly strong and stable, just because some molecules are bound to them and protect them in a conducive environment. Not only that, they perform multitudes of coding functions in each cell while making almost no mistakes! In the test tube also, cloned DNA tens of thousands of nucleotides long is stable for years in just a buffer solution. (If anyone has a doubt, I have worked extensively in a cloning laboratory where I had left cloned genes in solution at room temperature for several months, and they were fully intact after that.) Given these plain facts, why should we not think that some similar kind of DNA protection could have existed in the primordial pond, at least to some extent? If some molecules can protect DNA today to this tremendous extent, why should we not envisage that long DNA molecules could have been protected in the primordial pond -- especially if this concept could lead to a clear understanding of the origin of split-genes and the origin of organisms by a new mechanism? So, let us please not restrict ourselves by the constraining forces of traditional thinking, which were purely based on assumptions in the first place, and let us proceed with the possibility of long DNA molecules in a primordial pond.
3. Random DNA sequences available in a primordial pond would have inevitably contained millions of complete split-genes (genes of multicellular animals and plants) in them.
On this basis, I analyzed the formation of the genes that could code for complete proteins. I did this by using the computer, not the test tube. My primary question was: If the DNA sequences were long and were random in sequence, then how did the genes capable of coding for proteins form? I analyzed the lengths of the proteins of prokaryotes and those of eukaryotes, the lengths of the genes of the prokaryotes and the eukaryotes, the distribution of reading-frames in random DNA sequences that I simulated in the computer and in the DNA sequences of the prokaryotes and the eukaryotes, and the mix and match of all these things in the computer. I used the PIR (Protein Information Resource) from the National Biomedical Research Foundation extensively. GenBank was just getting started then. I wrote many computer programs and also got the help of computer scientists at the Division of Computer Research and Technology at the NIH. I am saying these things only to show that I have not simply looked at the sequences and said what I have said, but have done very extensive computational analysis, both by performing simulations and by comparing the DNA, gene, and protein sequences of actual organisms, both prokaryotes and eukaryotes, with those from simulated random DNA sequences. I must also note that such an analysis had never been carried out, simply because no one had ever even looked at the possibility of long DNA sequences having fully-formed genes in them, in the manner I have done now. In fact, as we all know now, evolutionary biologists are totally opposed even to the idea of the existence of long random DNA molecules in the primordial pond, which has prohibited, inhibited, and in a sense proscribed anyone from even taking the route that I have taken now.
In my analysis I found out that given long random DNA sequences, genes capable of coding for complete proteins could indeed simply occur fully-formed not as contiguous genes, but as split-genes, in fact, with quite typical structures that are found in all the eukaryotic genes. I simply asked the question: If DNA sequences were random, then could coding sequences occur contiguously in lengths capable of coding for full-length proteins? I tested the sequences for the distribution of reading-frame lengths in all the three reading frames in random sequences, and found that the reading frames were constrained to an upper length limit of about 600 nucleotides. This I found to be true even if I simulated DNA sequences of length millions of nucleotides. I also found that the reading frame lengths were distributed in a negative exponential manner, which meant that the shortest reading frames were the most frequent and the longer and longer ones became rarer very rapidly, in an exponential manner, and they became almost non-existent after about 600 nucleotides. For every order of magnitude increase in the length of the random DNA sequence that I simulated in the computer, there was an increase of only about 10 nucleotides in the upper length limit. That is, reading-frame lengths up to about 600 nucleotides would be present in a random DNA sequence of about a million characters, up to about 610 would be present in a random sequence of about ten million characters, 620 in 100 million characters, 630 in one billion characters, .... 660 in a random DNA sequence of one trillion characters, 690 in a random DNA sequence of one quadrillion characters, and so on. These are only approximate and statistical and are intended to illustrate the concepts. This means that even if we have very long DNA sequences, long contiguous coding sequences that could code for full length proteins could not simply occur in them. 
Therefore, the long, contiguous genes of prokaryotes that code for long proteins (whose reading-frames go up to many thousands of nucleotides in length) could not have simply occurred even in very long random DNA sequences. However, as I said before, just around 1980 the split structure of the genes of the eukaryotic organisms had been discovered. So, I could correlate this knowledge with the structure of genes that were possible from random DNA sequences. It then became clear that if coding sequences could occur in pieces within the "length-constrained" reading frames of a long random DNA sequence, then they could be chosen in a consecutive manner piecewise, and then could be successively combined together to form a long, contiguous coding sequence.
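The reading-frame analysis described above can be illustrated with a small simulation. What follows is only a minimal sketch and not the original programs: it assumes equal frequencies of the four bases, the three standard stop codons (TAA, TAG, TGA), and the three forward reading frames only. The function names and the chosen sequence lengths are illustrative assumptions, and the numbers the sketch prints depend on those assumptions, so they need not reproduce the figures quoted above.

```python
import random
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard stop codons


def random_dna(length, rng):
    """Random DNA string with equal base frequencies (an assumption)."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def reading_frame_lengths(seq):
    """Lengths, in nucleotides, of stop-free stretches in the three forward frames.

    The stretch left open at the end of each frame is also recorded.
    """
    lengths = []
    for frame in range(3):
        run = 0
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                lengths.append(run)  # a stop codon closes the current stretch
                run = 0
            else:
                run += 3
        lengths.append(run)
    return lengths


def summarize(length, seed=0):
    """Longest stop-free stretch and a histogram in 100-nucleotide bins."""
    rng = random.Random(seed)
    lengths = reading_frame_lengths(random_dna(length, rng))
    histogram = Counter((n // 100) * 100 for n in lengths)
    return max(lengths), dict(sorted(histogram.items()))


if __name__ == "__main__":
    for n in (10_000, 100_000):
        longest, histogram = summarize(n)
        print(n, longest, histogram)
```

Running the script prints, for each simulated length, the longest stop-free stretch found and how many stretches fall into each 100-nucleotide bin, which makes the shape of the length distribution directly visible.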
Once I figured out this possibility, then I did a number of tests and analyses to confirm this concept. In fact, when I tested the reading-frame lengths of actual eukaryotic genes, they looked exactly as predicted from the random sequences. Furthermore, when I tested the lengths of exons, they were all under the upper length limit of the reading-frames. I even graphically tested all the reading frames, and the exons within them. This prediction became verified. (Please note that this is statistically very true. There are some exons which are much longer, but these can be derived by the loss of some introns. This should answer a question that Keith Robison raised about the presence of longer exons.) The idea is that all the exons should be statistically under a length limit of about 600 nucleotides. They would have a tendency to be chosen from the longer reading-frames, which are still within the constrained length-limits, rather than the shorter ones. It means that the exons will be under the upper length limit, but would be more frequently the longer ones. When we normalize this kind of distribution for the frequencies of the lengths of reading frames, then we would see that the exons will start with a lower length limit of about a few characters and will peak around 100-200 characters. This is because the longer reading frames even within the 600 character upper limit are very rare. I have stated that the exons will be chosen within the available long reading-frames, and, as also stated in a commentary on this work in New Scientist then, will be chosen from the best of the coding-pieces from among these length-constrained reading-frames. While the distribution of the reading-frame lengths is negative exponential, the distribution of the exon lengths will not be negative exponential, but will have a normal distribution under the negative exponential curve. I have described this several times before. 
This is what has been noted in a commentary by Stoltzfus et al in a recent issue of Science magazine. Stoltzfus has not said anything new, or anything that would contradict my concepts or data that I have provided before. He has simply shown graphically what I have said many times in descriptive English. It does not contradict me as Keith Robison has incorrectly said in one of his recent SBE posts.
When these fundamental things became clear, I also analyzed the amount of random DNA needed which could contain the genes with a probability that is close to being one. This amount, ~10^26 nucleotides, is in fact very small in terms of physical quantity. Consider that an individual creature of the size of a dog or human contains about 50 grams of DNA, and about 10^23-10^24 nucleotides. This would clarify that the amount of random DNA required in a primordial pond for the set of all genes to occur in it is not very much indeed. I then conducted extensive computer simulation experiments, wherein I simulated random DNA sequences in which I searched for specific genes. I could not simulate the 10^26 characters since we cannot do this in a reasonable time-frame in today's computers. But, I simulated random DNA totalling to many billion characters on a SUN workstation, and searched for genes that were shorter, but using the principles of long split-genes. In fact, I used portions of actual genes. The results proved the concept that I started with, that far less length of random DNA sequence was needed when the features of eukaryotic genes: namely, split-structure, codon degeneracy in genes and amino acid degeneracy in proteins, were used at precisely the expected extents than that traditionally believed for the same length of contiguous genes. No contiguous genes as those in prokaryotes could be obtained by such searches even in a thousand times longer DNA sequence. Another important thing is that no matter what protein sequence is searched for, the gene coding for that protein will occur within exactly the same random sequence. This answers the question that many posters have asked, including Keith Robison. I am not searching for one specific gene sequence. I can search for any specific gene sequence in the same random sequence, and yet we will obtain that particular gene sequence somewhere within the same random sequence. 
This is the power of this approach that shows that almost any gene coding for any protein specifying any biochemical function will occur in the same random sequence of 10^26 characters. In fact, one more interesting thing about this concept is that any new random sequence, as long as its total length is ~10^26 nucleotides, will contain the set of almost all genes.
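The physical quantities mentioned above can be checked with simple arithmetic. The sketch below assumes an average mass of about 330 g/mol per DNA nucleotide, a common rule-of-thumb value that is not taken from the text; the function names are illustrative. On that assumption, 50 grams of DNA corresponds to roughly 10^23 nucleotides, and 10^26 nucleotides corresponds to a mass on the order of tens of kilograms.

```python
AVOGADRO = 6.02214076e23       # particles per mole
NUCLEOTIDE_G_PER_MOL = 330.0   # assumed average mass of one DNA nucleotide


def nucleotides_to_grams(count):
    """Mass in grams of `count` nucleotides under the assumed average mass."""
    return count / AVOGADRO * NUCLEOTIDE_G_PER_MOL


def grams_to_nucleotides(grams):
    """Number of nucleotides in `grams` of DNA under the assumed average mass."""
    return grams / NUCLEOTIDE_G_PER_MOL * AVOGADRO


if __name__ == "__main__":
    print(f"{grams_to_nucleotides(50):.2e} nucleotides in 50 g of DNA")
    print(f"{nucleotides_to_grams(1e26) / 1000:.1f} kg for 1e26 nucleotides")
```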
Another interesting thing about the random DNA available in a primordial pond is that it can contain many distinct genes that can code for essentially the same protein. Short sequences such as the homeobox sequences or enhancer sequences can occur millions of times independently within this amount of primordial DNA, as they are fairly short and exhibit sequence variation. Many genes coding for multifunctional proteins can exist in independent random sequences with many of their parts being similar purely by chance. All these are consistent with what we see today in the genomes of multicellular organisms.
In one of his SBE posts, Keith Robison has asked if the gene for a cytochrome C protein could occur within the 10^30 nucleotides that could be available in a pond. Taking the whole cytochrome C protein sequence, and the contiguous DNA sequence needed to code for the complete protein, he expects that the probability of finding it is 10^-112, meaning that it would take a random DNA sequence of ~10^112 nucleotides for the gene to occur in it once. This approach is exactly what I say is totally wrong. This kind of assumption is what has been making the evolutionists go in exactly the wrong direction -- in the direction exactly opposite to where the truth exists. One of the major purposes of my book is to show that split genes are tremendously more probable in a primordial pond than has been assumed for any gene by such people. I have posted many of the details in SBE before, so I will not go into them here. The details regarding how genes could exist in abundance in a small primordial pond are described in a full chapter ("The Abundant Occurrence of Genes in the Primordial Pond") in my book. Also, as Jeff Mattox responded to Keith's question recently, it is the exons that should be taken into account and not the whole gene. In addition, codon degeneracy and amino acid degeneracy have to be taken into account. When we do these, the probability of a complete gene, no matter how long it is and no matter what sequence it codes for -- as long as the longest exon is around 600 nucleotides -- is close to being 1. Again, let people like Keith not get emotional and say that they know of eukaryotic genes that have exons longer than 600 nucleotides.
I have said many times that, on the one hand, we are dealing with statistically observable details here, and, on the other hand, the longer exons can easily be derived by partial gene-processing (that is, by losing one or more introns through the messenger RNA-reverse transcriptase processing), which I have also explained in my previous publications.
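The effect of codon degeneracy alone on such a probability estimate can be sketched as follows. This is a minimal illustration that assumes the standard genetic code and equal base frequencies; it does not model amino acid degeneracy or the split-gene structure, and the function names and the example peptide are hypothetical.

```python
from math import log10, prod

# Number of codons per amino acid in the standard genetic code (61 sense codons).
CODON_COUNTS = {
    "L": 6, "S": 6, "R": 6,
    "A": 4, "G": 4, "P": 4, "T": 4, "V": 4,
    "I": 3,
    "F": 2, "Y": 2, "H": 2, "Q": 2, "N": 2, "K": 2, "D": 2, "E": 2, "C": 2,
    "M": 1, "W": 1,
}


def encodings(peptide):
    """Number of distinct DNA sequences that encode the peptide exactly."""
    return prod(CODON_COUNTS[aa] for aa in peptide)


def log10_match_probability(peptide):
    """log10 of the chance that one random, contiguous stretch of
    3 * len(peptide) nucleotides encodes the peptide exactly."""
    return sum(log10(CODON_COUNTS[aa] / 64) for aa in peptide)


if __name__ == "__main__":
    peptide = "MKWLSR"  # hypothetical example peptide
    print(encodings(peptide), round(log10_match_probability(peptide), 2))
```

For example, `encodings("MKW")` is 2, because methionine and tryptophan each have a single codon while lysine has two. Comparing `log10_match_probability` with the figure for a single fixed DNA sequence (log10 of 1/4 per nucleotide) shows how much of the difference is due to codon degeneracy alone.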
Thus by the split-gene method I have delineated, not only a gene for cytochrome C, but almost any gene coding for any protein sequence can occur in its full form within the 10^26 nucleotides of primordial random DNA. I've provided enough simulation experiments in my book that would demonstrate this. Such potential of the primordial pond is truly amazing, but yet it is an absolute reality. And it is important for us to understand this potential, for it inevitably enables the multiple origins of genomes and organisms. It is not enough for people to simply say that they cannot believe what I have demonstrated. I have conducted these extensive simulation experiments and they have not! I cannot provide the details of these simulations and graphs that take dozens of pages here, but my book is open to any one who would like to know of the details.
[NOTE: For a demonstration of these concepts, including how easily complete split genes can be found in a random sequence of DNA, try out the interactive Exon/Gene Search Engine. JM]
4. Splice-junctions in eukaryotic genes are perfectly explained by the new theory.
Under my theory on the origin of split-genes, the lengths of exons are constrained by the random distribution of stop-codons in a random DNA sequence. The stop-codons present at the ends of a reading-frame will occur at the exon-intron junctions in such a manner that they can be "spliced-out" along with the introns, so that the exons spliced together will have a contiguous reading-frame. When we analyze the actual genes, it is extremely interesting to note this very presence of stop-codons at the ends of exons, at exactly the place where it is predicted, in almost all the genes in today's living organisms. In fact, they are parts of what are called splice-junction sequences, which appear at every exon-intron and intron-exon junction. This led me to propose that these splice-junction sequences originated from stop-codons, primarily as a means of avoiding the reading-frame length constraint.
In understanding the origin of split genes, some people like Keith Robison and Arlin Stoltzfus have raised some questions regarding the phase of the stop codons in the splice-junction sequences. While Stoltzfus et al have stated that my concept about the origin of splice-junction sequences from the stop-codons is attractive, they say that the problem of "reading-phase" has not been addressed. In fact, I have addressed this in my publications, which neither Arlin Stoltzfus nor Keith Robison seems to have read. The main idea is that initially, when the genes were chosen from random primordial DNA sequences, the stop codons in the genes would have been in one reading frame. Once the mechanism had been chosen in the primordial pond, the sequence of the splice-junction per se took over. It means that, from then on, it did not matter if the stop codons within the splice junctions were in phase with the first exon or not, as long as the spliced exons produced an uninterrupted contiguous coding-sequence. Thus, the stop codons within the splice junctions in the genes that were chosen later will not all be in the first reading-frame. The only explanation for the origin of splice junctions in split-genes is the one that I have provided, and no other theory comes anywhere close to explaining the origin of splice junctions. My theory gives a reason for the reading frames of eukaryotic genes being statistically shorter than about 600 nucleotides, for the exons being statistically shorter than about the same upper length limit of 600 nucleotides, for the splice junctions containing the stop codons precisely where they are expected, for the absence of stop-codons in genes other than those that code for proteins (such as tRNA and rRNA coding genes), and so on. None of the other theories on the origin of split-genes, either introns-early or introns-late, can explain any of these features to even the slightest extent.
An aside here. One may wonder how machinery as complex as the spliceosome could originate in the primordial pond prebiotically. The question is whether a primitive system evolved into this high complexity, or whether the right system was chosen from thousands of random kinds of molecular machineries and was then fine-tuned. My answer to this question would be the latter. Out of many kinds of machineries, the ones that would have meaning for a living cell would have been selected. Of course, there would have been a considerable amount of further fine-tuning of the basic system through molecular evolution, both prebiotically and within the cell. There may have been systems that spliced together only introns, or some other features of DNA sequences, that were not useful to the life of a cell and were not chosen. I have dealt with this question in considerable detail in the book.
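A small helper makes it easy to check a given junction sequence for embedded stop codons. The sketch below is illustrative only: the function name is hypothetical, the frame it reports is counted from the start of the string passed in (not from the start of the gene), and the example inputs assume the commonly cited 5' splice-site consensus GTRAGT at the start of an intron, where R stands for A or G.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")  # standard stop codons


def stop_codon_positions(seq):
    """Return (offset, frame, codon) for every stop-codon triplet in seq.

    `frame` is offset % 3, measured from the first character of seq.
    """
    seq = seq.upper()
    return [
        (i, i % 3, seq[i:i + 3])
        for i in range(len(seq) - 2)
        if seq[i:i + 3] in STOP_CODONS
    ]


if __name__ == "__main__":
    # Assumed consensus variants for the start of an intron (R = A or G).
    for junction in ("GTAAGT", "GTGAGT"):
        print(junction, stop_codon_positions(junction))
```

With these two inputs the helper reports TAA at offset 1 of GTAAGT and TGA at offset 1 of GTGAGT. The same function can be applied to actual junction sequences taken from a sequence database.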
5. Complexity first and simplicity next: Seemingly complex eukaryotic genes and cells are far more probable than the apparently simpler prokaryotic genes and cells.
The above findings are very important not just for understanding the structures of the genes of the eukaryotes, but for our understanding of the whole scenario of life on earth. These concepts and results showed that:
Some SBE posters here have argued that whether it is introns-early or introns-late is immaterial to our discussions concerning the origin of life and organisms. To them, I would like to say that it matters the most. Our understanding of the origin of the split structure of eukaryotic genes is most fundamental to our learning about the origin of life and organisms. Among other things, it has shown us how the eukaryotic genes can directly arise (in fact simply occur fully-formed) from random primordial DNA sequences, how the eukaryotic cell can originate directly from the primordial pond, and how the genome of a multicellular creature can arise directly into its seed-cell and develop into the organism. Thus, it avoids the necessity for the series of assumptions upon assumptions that the bacterial genome and cell had evolved first, and then changed into the eukaryotic cell, and then into a few-celled, supposedly simple multicellular creature, and then into other, more complex multicellular creatures. Each and all of these steps are simply improbable, and for none of them does any scientific evidence or explanation exist.
6. From split-genes (primordial DNA sequences) directly to genomes for complex organisms: Origin of many kinds of similarities directly from the primordial pond, not by organismal evolution.
I must say that the knowledge of the split structure of eukaryotic genes that was unraveled around 1980 was pivotal to me not only for asking questions about their origins, but also for finding answers to the larger questions concerning the origin of life and organisms. The answers to the questions concerning the origin of genes also provided answers to many of the questions on the origin of life and organisms. These answers are: