Exploring COVID19 Genome from scratch - can we discover a gene ?

Aids Virus Infection, Drawing : News Photo

Social Distancing in this troubling time of pandemic is leading people to endless boredom and news feed scrolling. My geeky mind however found a small openning among the monotonicity. In this article we shall explore the reference Genome of COVID19 released by China! 

This article will require some understanding of algorithms and molecular biology (mainly, The central dogma) :-) However, I shall try to be as clear as possible in my explanations. Let's dig into the annoying pesky virus's genetic code and see if we enjoy it !

Before We Start

Let's talk about COVID19 first. It's a coronavirus, one of many different of coronaviruses. Coronaviruses are generally single stranded RNA viruses with largest genomes among RNA viruses.  

It undergoes a two step replication process. Normally viruses have translation initiation site, however, coronaviruses can have as many as ten of them. What happens is that the virus uses host cell ribosome to translate one large gene called Replicase which basically translates into proteins (enzymes) that use the rest of the viral genome as a template to produce a set of smaller m-RNA molecules, which are then translated into proteins, which make the protein body of the virus' new progeny. 
The COVID19 genome we will use comes from this NCBI page. Notice, although the virus is an RNA virus, the genome is written in DNA FASTA format. (which simply means neucleotide T is used instead of U.). So, we shall use the DNA language for the rest of the post.

We shall use Python, Shell and pre-written algorithms in C and Java (I shall give the link to implementations, no worries.). Although I shall not be able to give full description of all the code but I shall explain the main idea wherever necessary.

Start From the Basic : How Many of A,T,G,C 

This is the first question we should ask. Although it seems trivial, you shall immediately see the reason.  Let's load the genome as a string first, turns out it has about 30,000 bases.
Now, let's use the handy Counter method to count the number of each type of neucleotide in the genome.

This code snippet generates a simple bar plot with the vertical axis showing appearence count of each neucleotide. Let's have a look !
A and T are more prevalent than G and C ! Here's an evidence showing that DNA sequences are highly non-random. (we would expect all four of the neucleotides to be of around equal amount if it was a random string of four letters.) As you read on, you shall see, this non-randomness of DNA will help us significantly in our exploration.

Also, you might notice that number of A and T are similar and G and C are similar. This is NOT because of AT and GC base pairing like double stranded DNA! Remember COVID19 has a single stranded RNA as genome :-) I shall leave it for you to wonder why we see this similarity between A and T, and, G and C counts. I would be glad to hear your thoughts in the comment section.

So, Let's Look for a Gene, Can We Do That ?

The genome is a single stranded RNA. Therefore, any gene will have to start with a start-codon (the triplet of neucleotides on an m-RNA, that ribosomes decide to start translation from.), which is ATG.

So, Let's try to find occurences of ATG and they will give us start positions of genes ? How to do that ?  We can simply slide the pattern along the genome from and keep matching by shifting one position until we find a match. If our pattern is of length P and the genome is of length T then it will O(PT) time (i.e. kind of proportional to P multiplied by T). For this virus the Genome size (T) is only 30k and pattern P is here ATG, size 3. So, it should work out pretty smooth. However, if you try this approach for some larger genome (e.g. human, 3 billion neucleotides) it will not cut it, the time needed to finish your search will be too long to be useful. I use a faster pattern matching algorithm (Boyer-Moore algorithm), which I implemented earlier this year. It takes O(T) time, which is the best possible scenario. The details of the algorithm is beyond scope of this article. Anyway, let's see what happens:  

We found 725 occurences of ATG ! We surely cannot  have that many genes within 30k bases. So, most of them are not acting as start codons in reality. Probably many of them are falling in non-functional genomic region or inside a gene. Although, we know some of them are at the start of the genome ! Can we narrow them down ?  

There's More Going on at the Start of a Gene

When RNA starts getting translated, there's a whole bunch of ribosomal proteins that has to bind in the region precedding the start-codon ATG (also, known as UTR or UnTranslated Region). And these proteins do not bind to random sequencnes. if you think about it carefully, you will realise that for all translation initiation sites in the genome there has to be some sequences that appear more than they randomly should and act as signature for translation machinery to bind. These sequences are generally 8-10 neucleotide long. They are called "Motif"s. Note, that a motif can be slightly different from occurence to occurence but they have a overall statistical preference of being a particular sequence. e.g. "TTGATTGA" might be a motif which might occur in many places with slight variations: "TTGCTTGA", "TTGATTGG" , "TTGATTGA" etc. etc. 

Let's do a quick math and see the non-randomness about them. If we assume "TTGATTGA" as a sequence occured by chance with , then the probability of that happening  by chance is (1/4) to the power of 8th which is 1 in 65536, i.e. they might not even occur in a 30k bases long string. But, as you shall see we can find this kind of sequences appearing exactly as "TTGATTGA" two, three or even more times within this 30k size genome. And, if we consider they can vary slightly from occurence to occurence, then they occur even more.

So, what we need is an algorithm that can take multiple possible UTR regions as input and a motif length k (8 to 10 in general, the right number can be determined using slight trial and error). And the alorithm will give possible k length motifs that occur in non-random manner in those possible UTR sequences. Such an algorithm is Motif finding via Gibbs Sampling. It is a randomised optimisation algorithm which should be pretty interesting for people interested in machine learning or probabilistic modelling.

I did a Java implementation of this Randomised Motif Search years ago. Let's call it to action and look at the top 10 motifs discovered by it ! But, before that where do we get the candidate UTR regions ? 

Easy! we already found 725 occurences of ATG ! The UTR regions must be preceeding them. SO, let's take 50 neucleotide long chunks preceeding each of them as our candidate UTR regions. The following code does this job and prints them into a file conforming the input format my Randomised Motif Search Java program.  

Now, let's run the Randomised Motif Search and get our top 10 motifs, with k  = 8 and as we see, we got our 10 motifs which are pretty similar looking actually!    

We can quickly go to this website and create a motif logo using these 10 sequences! The x axis shows the 8 positions of the motif and Y axis shows information content (read that, non-randomness) in each position through an entropy measure. You can experiment different values of k and considering different number of top motifs. You shall discover why k=8 seemed to be reasonable choice for me pretty easily. Let me know in the comments if you do that :-) 

So, we have ended up with roughly two kinds of motifs. One goes like "TGTG.." and another goes like   "GCCG..". Now, these motifs are of length 8 and they won't easily occur but the algorithm chose them because they were somewhat occur more than random chance. 

The following code can be used to search for each of those 10 motifs and see where they occured among our 725 candidate UTR regions ! At this point, let me be a little picky and choose the second highest motif given by the algorithm: "GCCGATCA" and let's see where it occurs 'exactly' among the  UTR candidates. (note that it occurs in slightly varied manner in other candidate UTRs too, that's why the algorithm chose it among the top ones!)

 And here's the output of that :

position of ATG for this occurence:  266

Ok, so, it occurs in one of them exactly as it is. Good ! And believe it or not, we just found the start position of the largest Gene of COVID19. How do I know ? We are not the first one doing this analysis. Researchers across the world went through these tedious steps and many more already, along with experimental verification and the map of the COVID19 refseq is presented here.  You can go and look at what happens at position 266. Or, I can put up a screenshot here:

Look at that ! The Gene "orf1ab" starts from position 266.  We found it using almost nothing but two key points : The start codon is ATG and UTR sequences are non-random. That sounds so cool ! But, how about the other motifs and what happens if we search them ? Well, you can do it and you might find other interesting regions of the genome :-)

We can do many more stuff with this pesky little viral genome. All that I did above is nothing but the surface of the iceberg. However, it gives me tremendous joy to be able to find electron microscopical truths with nothing but a little bit of maths, reasoning and interplay of 0s and 1s. I hope you enjoy it too.

For the enthusiastic ones, here is the full code (including my implementation of Boyer-Moore and MotifSearch): https://github.com/timkartar/ExploreCOVID19 . Feel free to contribute and raise issue if you have any question there.

Will look forward to your comments. Best Wishes and stay safe everyone. We shall get throuhg this pretty soon.

See y'all again !


  1. It's really nice, how you've reached the point of identification of the gene.
    To your question as to why A T and G C seem to occur in similar numbers, the answer is that, even though you have a positive sense single stranded RNA in here, the strand has a secondary structure with arms and loops where A is complementary to T and G is complimentary to C, so the single stranded RNA basically coils on itself to become double stranded at places and only uncoils during the central dogma .
    And this ratio is not just applicable to RNA viruses but to all organisms irrespective of whether the genetic material is RNA or DNA.
    The 75 ATG sequences are also interesting, because ATG not only acts as a start codon but also codes for Methionine. So if you have 75 ATG sequences and only a handful of the 8base preceding sequence, it actually means that the rest either code for a Methionine or they are non-translational.
    Logically even those two can be distinguished by :
    1. The ATG sequences lying in between a start codon ( i.e. those ATG with a upstream 8 base sequence) and a stop codon (there are 3 different ones) code for Methionine.
    2. The ATG sequences which don't meet the above criteria are either start codons( which you've already identified by preceding 8 base sequence) or they are non-translational i.e. they probably don't code for a protein .
    So if you have some more time, maybe you could make that algorithm for fun ! Good work though !

    1. This comment has been removed by the author.

    2. Thanks. You are right. Single strand base pairing is most likely the answer.

      And about the other ATGs, you are right again. A logical extension, Using the stop codon. Originally I thought of filtering those 725 ATGs that way even before going into motifs. (Makes the motif search faster and accurate). But, i skipped over that part as I did not want to make it too lengthy for general audience. but, you are right, ( and it would take just one or two extra line of code to do it) I mentioned what you said (the possibility of most of them being inside a open reading frame or untranslated region) in words and went directly into the next step.
      On hindsight, i could have just added it. If i get another window of opportunity in future, I shall update and add it.

  2. Great Article! I got too much information regards abroad studies and I will follow this tips. Thanks for sharing such a helpful article. Click here to more information about it


Post a Comment