Translating Nucleic Acid Sequences to Protein Sequences with Python
Chapter 1: Understanding the Central Dogma of Molecular Biology
The central dogma of molecular biology states that genetic information in DNA is first transcribed into a temporary intermediate, messenger RNA (mRNA), and then translated into protein. The process resembles translating text from one language to another, such as English to French. In this analogy, the language of nucleotides (A, T, C, and G in DNA) is converted into the language of amino acids, of which there are 20 standard types.
To perform this translation, we need a "Rosetta Stone": a codon table that maps each group of nucleotides to its corresponding amino acid.
Section 1.1: The Concept of Codons
Because there are only four nucleotides but twenty amino acids, a direct one-to-one mapping is impossible. To see how many nucleotides each "word" needs, consider reading them in groups. Reading two nucleotides at a time gives 4 × 4 = 16 possible combinations (AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT), which still isn't sufficient. Reading three at a time gives 4 × 4 × 4 = 64 combinations. This is more than enough to cover the 20 amino acids, and the surplus is why the genetic code is redundant: several triplets map to the same amino acid, which makes the code more tolerant of mutations. A group of three nucleotides is referred to as a "codon."
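The counting argument above can be checked with a few lines of Python. This is a minimal sketch; the helper name `nucleotide_words` is illustrative and not part of the original walkthrough.

```python
from itertools import product

NUCLEOTIDES = "ACGT"


def nucleotide_words(length):
    """Return every possible nucleotide string of the given length."""
    return ["".join(letters) for letters in product(NUCLEOTIDES, repeat=length)]


# Compare the number of possible "words" with the 20 amino acids to encode.
for length in (1, 2, 3):
    count = len(nucleotide_words(length))
    verdict = "enough" if count >= 20 else "not enough"
    print(f"{length} nucleotide(s): {count} combinations ({verdict})")
```

Only the three-letter words reach the 20 needed, with 64 combinations.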
Subsection 1.1.1: Variations in Codon Tables
Section 1.2: The Role of Ribosomes in Translation
In summary, translation is the process of interpreting a nucleic acid sequence (mRNA) to produce an amino acid sequence (protein). This transforms the genetic code into a functional protein. We will explore how to simulate this biological process using Python to convert a nucleic acid sequence into its corresponding amino acids.
Chapter 2: Constructing the Codon Table
To begin, we must establish the codon table, which can be conveniently represented as a dictionary in Python. A dictionary allows us to pair each codon (key) with its respective amino acid (value). This process requires us to input the codon information manually.
# Standard codon table
codon_table = {
"AAA": "K", "AAC": "N", "AAG": "K", "AAT": "N",
"ACA": "T", "ACC": "T", "ACG": "T", "ACT": "T",
"AGA": "R", "AGC": "S", "AGG": "R", "AGT": "S",
"ATA": "I", "ATC": "I", "ATG": "M", "ATT": "I",
"CAA": "Q", "CAC": "H", "CAG": "Q", "CAT": "H",
"CCA": "P", "CCC": "P", "CCG": "P", "CCT": "P",
"CGA": "R", "CGC": "R", "CGG": "R", "CGT": "R",
"CTA": "L", "CTC": "L", "CTG": "L", "CTT": "L",
"GAA": "E", "GAC": "D", "GAG": "E", "GAT": "D",
"GCA": "A", "GCC": "A", "GCG": "A", "GCT": "A",
"GGA": "G", "GGC": "G", "GGG": "G", "GGT": "G",
"GTA": "V", "GTC": "V", "GTG": "V", "GTT": "V",
"TAA": "*", "TAC": "Y", "TAG": "*", "TAT": "Y",
"TCA": "S", "TCC": "S", "TCG": "S", "TCT": "S",
"TGA": "*", "TGC": "C", "TGG": "W", "TGT": "C",
"TTA": "L", "TTC": "F", "TTG": "L", "TTT": "F",
}
As the table shows, the code is redundant: multiple codons can encode the same amino acid. For example, "GTA," "GTC," "GTG," and "GTT" all encode valine (V). Three codons (TAA, TAG, and TGA) do not encode an amino acid but signal the end of translation; these stop codons are represented here by an asterisk (*). This table uses the DNA alphabet; it can be adapted for RNA by replacing every "T" with "U."
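The redundancy and the RNA adaptation described above can be checked directly. So that it runs on its own, the sketch below rebuilds the same standard table compactly from a 64-letter amino acid string in T, C, A, G order (the layout NCBI uses for translation table 1) instead of retyping the dictionary. The names `BASES`, `AMINO_ACIDS`, `codons_by_amino_acid`, and `rna_codon_table` are illustrative assumptions, not part of the original walkthrough.

```python
from collections import defaultdict
from itertools import product

# Assumption: compact encoding of the standard table, bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

# Pair each of the 64 triplets with its amino acid (or "*" for stop).
codon_table = {
    "".join(triplet): amino_acid
    for triplet, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Invert the table: group codons by the amino acid they encode.
codons_by_amino_acid = defaultdict(list)
for codon, amino_acid in sorted(codon_table.items()):
    codons_by_amino_acid[amino_acid].append(codon)

print(codons_by_amino_acid["V"])  # ['GTA', 'GTC', 'GTG', 'GTT']
print(codons_by_amino_acid["*"])  # ['TAA', 'TAG', 'TGA']

# RNA version of the table: replace every T with U in the keys.
rna_codon_table = {
    codon.replace("T", "U"): amino_acid
    for codon, amino_acid in codon_table.items()
}
print(rna_codon_table["AUG"])  # M
```

Only methionine (M) and tryptophan (W) end up with a single codon; every other amino acid has at least two.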
Chapter 3: Extracting Codons from a DNA Sequence
When given a DNA sequence, the next step is to translate it. We achieve this by reading the sequence in triplets, using each trio as a codon to reference our codon table. Let’s define a DNA sequence.
# Specify a DNA sequence
dna = "ATGTATTCAGAGCAGTAA"
To extract the codons from this sequence, we can manually segment it into triplets: ATG TAT TCA GAG CAG TAA. However, we can also automate this process in Python with a loop.
# Loop over each codon in sequence and print it out
for i in range(0, len(dna) - len(dna) % 3, 3):
    codon = dna[i:i+3]
    print(codon)
Expected output:
ATG
TAT
TCA
GAG
CAG
TAA
The expression "len(dna) - len(dna) % 3" sets the stop value of the range. "len(dna)" is the full length of the sequence, and subtracting "len(dna) % 3" (the remainder after dividing by three) drops any trailing nucleotides that cannot form a complete codon. The loop therefore visits only the start index of each full triplet.
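The effect of that stop value is easiest to see on a sequence whose length is not a multiple of three. In this minimal sketch, the helper name `split_codons` and the shortened second sequence are illustrative, not part of the original walkthrough.

```python
def split_codons(dna):
    """Return the complete codons in dna, ignoring trailing leftovers."""
    stop = len(dna) - len(dna) % 3  # last index that still starts a full codon
    return [dna[i:i + 3] for i in range(0, stop, 3)]


# Length 18: divides evenly, so nothing is dropped.
print(split_codons("ATGTATTCAGAGCAGTAA"))
# ['ATG', 'TAT', 'TCA', 'GAG', 'CAG', 'TAA']

# Length 8: 8 % 3 == 2, so the trailing "TC" is left out.
print(split_codons("ATGTATTC"))
# ['ATG', 'TAT']
```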
Chapter 4: Bringing It All Together
Now that we have our codon table and a method for iterating through the DNA sequence, we can proceed to look up the codons using our dictionary. We will concatenate the corresponding amino acids to build the final sequence.
# Loop over each codon in sequence and grow AA sequence
amino_acids = ""
for i in range(0, len(dna) - len(dna) % 3, 3):
    codon = dna[i:i+3]
    amino_acids += codon_table[codon]
# Print out amino acid sequence
print(amino_acids)
Expected output:
MYSEQ*
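The loop above can be wrapped in a reusable function. The following is one possible sketch rather than a definitive implementation. The function name `translate`, the `to_stop` flag, and the use of "X" for codons missing from the table are assumptions made for illustration. To keep the example short and self-contained, it uses a small demonstration table containing only the codons in our sequence; in practice you would pass the full codon table from Chapter 2.

```python
def translate(dna, codon_table, to_stop=False, unknown="X"):
    """Translate dna codon by codon using codon_table.

    to_stop: if True, stop before the first stop codon ("*").
    unknown: placeholder for codons not found in the table (an assumption).
    """
    dna = dna.upper()
    stop = len(dna) - len(dna) % 3  # ignore an incomplete trailing codon
    amino_acids = []
    for i in range(0, stop, 3):
        amino_acid = codon_table.get(dna[i:i + 3], unknown)
        if to_stop and amino_acid == "*":
            break
        amino_acids.append(amino_acid)
    return "".join(amino_acids)


# Demonstration table: only the six codons that occur in the example sequence.
demo_table = {"ATG": "M", "TAT": "Y", "TCA": "S", "GAG": "E", "CAG": "Q", "TAA": "*"}

print(translate("ATGTATTCAGAGCAGTAA", demo_table))                # MYSEQ*
print(translate("ATGTATTCAGAGCAGTAA", demo_table, to_stop=True))  # MYSEQ
print(translate("atgnnntaa", demo_table))                         # MX*
```

Collecting the amino acids in a list and joining them once at the end avoids building a new string on every iteration, which matters little for short sequences but is the more idiomatic pattern for long ones.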
That’s it! While there are simpler methods available, such as using Biopython, understanding this foundational process is crucial for appreciating how these scripts function.
This video titled "Protein translation from RNA sequence using PYTHON | Bioinformatics | Akash Mitra" dives deeper into the process of translating RNA sequences into proteins using Python, providing practical examples and explanations.
The video "Translation from DNA to protein, Python - YouTube" further explores the methodologies for translating DNA sequences into protein sequences, emphasizing Python programming techniques.