whalebeings.com

Translating Nucleic Acid Sequences to Protein Sequences with Python

Written on

Chapter 1: Understanding the Central Dogma of Molecular Biology

The foundational principle of molecular biology, known as the central dogma, asserts that genetic information in DNA is initially transcribed into a temporary form, known as messenger RNA (mRNA), and subsequently translated into proteins. This process is similar to language translation, where we convert text from one language to another, like English to French. In this analogy, nucleic acids (A, T, C, and G) are transformed into the language of amino acids, of which there are 20 different types.

To facilitate this translation, we need a "Rosetta Stone"—a codon table that acts as a key to decode nucleic acids into their corresponding amino acids.

Section 1.1: The Concept of Codons

Due to the fact that we have four nucleic acids and twenty amino acids, there isn't a direct one-to-one mapping. To understand how many nucleic acids we need for effective translation, let's consider reading them in groups. If we analyze two nucleic acids at once, we have 16 possible combinations (AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT), which still isn't sufficient. However, reading three nucleic acids simultaneously opens up 64 combinations. This is more than enough to cover our amino acids, thanks to the redundancy in the genetic code, which enhances its resilience against mutations. A group of three nucleic acids is referred to as a "codon."

Subsection 1.1.1: Variations in Codon Tables

Section 1.2: The Role of Ribosomes in Translation

In summary, translation is the process of interpreting a nucleic acid sequence (mRNA) to produce an amino acid sequence (protein). This transforms the genetic code into a functional protein. We will explore how to simulate this biological process using Python to convert a nucleic acid sequence into its corresponding amino acids.

Chapter 2: Constructing the Codon Table

To begin, we must establish the codon table, which can be conveniently represented as a dictionary in Python. A dictionary allows us to pair each codon (key) with its respective amino acid (value). This process requires us to input the codon information manually.

# Standard codon table

codon_table = {

"AAA": "K", "AAC": "N", "AAG": "K", "AAT": "N",

"ACA": "T", "ACC": "T", "ACG": "T", "ACT": "T",

"AGA": "R", "AGC": "S", "AGG": "R", "AGT": "S",

"ATA": "I", "ATC": "I", "ATG": "M", "ATT": "I",

"CAA": "Q", "CAC": "H", "CAG": "Q", "CAT": "H",

"CCA": "P", "CCC": "P", "CCG": "P", "CCT": "P",

"CGA": "R", "CGC": "R", "CGG": "R", "CGT": "R",

"CTA": "L", "CTC": "L", "CTG": "L", "CTT": "L",

"GAA": "E", "GAC": "D", "GAG": "E", "GAT": "D",

"GCA": "A", "GCC": "A", "GCG": "A", "GCT": "A",

"GGA": "G", "GGC": "G", "GGG": "G", "GGT": "G",

"GTA": "V", "GTC": "V", "GTG": "V", "GTT": "V",

"TAA": "*", "TAC": "Y", "TAG": "*", "TAT": "Y",

"TCA": "S", "TCC": "S", "TCG": "S", "TCT": "S",

"TGA": "*", "TGC": "C", "TGG": "W", "TGT": "C",

"TTA": "L", "TTC": "F", "TTG": "L", "TTT": "F",

}

As observed, there is redundancy in the codon table—multiple codons can correspond to the same amino acid, such as "GTA," "GTC," "GTG," and "GTT," all encoding valine (V). Additionally, some codons signal the termination of translation, represented by the stop codon (*). This codon table is formulated using DNA; it can also be adapted for RNA by substituting "T" with "U."

Chapter 3: Extracting Codons from a DNA Sequence

When given a DNA sequence, the next step is to translate it. We achieve this by reading the sequence in triplets, using each trio as a codon to reference our codon table. Let’s define a DNA sequence.

# Specify a DNA sequence

dna = "ATGTATTCAGAGCAGTAA"

To extract the codons from this sequence, we can manually segment it into triplets: ATG TAT TCA GAG CAG TAA. However, we can also automate this process in Python with a loop.

# Loop over each codon in sequence and print it out

for i in range(0, len(dna) - len(dna) % 3, 3):

codon = dna[i:i+3]

print(codon)

Expected output:

ATG

TAT

TCA

GAG

CAG

TAA

The expression "len(dna) - len(dna) % 3" helps determine how many iterations to perform. The first part, "len(dna)," ensures we consider the entire sequence, while the second part excludes any remaining nucleotides that cannot form a complete codon.

Chapter 4: Bringing It All Together

Now that we have our codon table and a method for iterating through the DNA sequence, we can proceed to look up the codons using our dictionary. We will concatenate the corresponding amino acids to build the final sequence.

# Loop over each codon in sequence and grow AA sequence

amino_acids = ""

for i in range(0, len(dna) - len(dna) % 3, 3):

codon = dna[i:i+3]

amino_acids += codon_table[codon]

# Print out amino acid sequence

print(amino_acids)

Expected output:

MYSEQ*

That’s it! While there are simpler methods available, such as using Biopython, understanding this foundational process is crucial for appreciating how these scripts function.

This video titled "Protein translation from RNA sequence using PYTHON | Bioinformatics | Akash Mitra" dives deeper into the process of translating RNA sequences into proteins using Python, providing practical examples and explanations.

The video "Translation from DNA to protein, Python - YouTube" further explores the methodologies for translating DNA sequences into protein sequences, emphasizing Python programming techniques.

Share the page:

Twitter Facebook Reddit LinkIn

-----------------------

Recent Post:

# Finding Beauty in Everyday Life: A Journey of Presence

Discover the beauty of the present moment and the art of mindful living.

Finding Freedom: Overcoming Fear Through Spiritual Growth

Explore how to conquer fear and embrace freedom through self-awareness and spiritual practices.

How I Earned Over $1,100 on Medium in My Third Month

Discover how I made over $1,100 on Medium in just three months and learn valuable lessons to avoid common pitfalls.