Background

Several institutes maintain large databases of DNA information. The GenBank (http://www.ncbi.nlm.nih.gov/genbank/) sequence database is an open access, annotated collection of all publicly available nucleotide sequences and their protein translations. To a biologist, a "sequence" is the order of amino acids in a protein or of nucleotides in a DNA or RNA molecule. For bioinformatics work, a "sequence" is any string of characters over a specified alphabet. The purpose of this assignment is to learn Python by developing a simple program that reads one of the most popular sequence file formats, GenBank.

GenBank files are structured documents for the display of biological information. The file format is strictly defined to allow easy reading by both humans and computer programs. The file “NC000018.gbk” is a GenBank file that contains a selected region of human chromosome 18 believed to be responsible for the Nova Scotia Niemann-Pick disease (NSNPD).

Description of the Problem

Write a program in Python that reads the “NC000018.gbk” file and outputs the protein sequence encoded by the coding DNA of every CDS entry in the input file.
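To make the task concrete, here is a minimal, self-contained sketch of how the coding DNA for each CDS entry could be pulled out of GenBank text. It does not use “genbank.py”; the helper names (parse_origin, cds_locations, coding_dna) and the tiny SAMPLE record are made up for illustration, and the sketch assumes simple locations of the form a..b, join(...), and complement(...).

```python
import re

# Hypothetical miniature record, for illustration only (not NC000018.gbk).
SAMPLE = """LOCUS       TEST
FEATURES             Location/Qualifiers
     CDS             join(1..3,7..9)
                     /gene="x"
     CDS             complement(4..6)
                     /gene="y"
ORIGIN
        1 atggggtaa ccc
//
"""

def parse_origin(text):
    """Return the DNA listed between ORIGIN and // as one uppercase string."""
    dna, in_origin = [], False
    for line in text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            in_origin = False
        elif in_origin:
            # Drop the position numbers and spaces, keep only the letters.
            dna.append(re.sub(r"[^A-Za-z]", "", line))
    return "".join(dna).upper()

def cds_locations(text):
    """Yield (is_complement, [(start, end), ...]) for each CDS feature."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if not line.startswith("     CDS "):
            continue
        loc = line[8:].strip()
        # A long location may wrap onto following lines; qualifiers start with "/".
        j = i + 1
        while (j < len(lines) and lines[j].startswith(" " * 21)
               and not lines[j].strip().startswith("/")):
            loc += lines[j].strip()
            j += 1
        spans = [(int(a), int(b))
                 for a, b in re.findall(r"<?(\d+)\.\.>?(\d+)", loc)]
        yield loc.startswith("complement"), spans

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def coding_dna(dna, is_complement, spans):
    """Join the 1-based inclusive spans; reverse-complement if required."""
    seq = "".join(dna[a - 1:b] for a, b in spans)
    if is_complement:
        seq = seq.translate(COMPLEMENT)[::-1]
    return seq

dna = parse_origin(SAMPLE)
for is_comp, spans in cds_locations(SAMPLE):
    print(coding_dna(dna, is_comp, spans))
```

GenBank positions are 1-based and inclusive, which is why each span is sliced as dna[a - 1:b]. For a CDS on the complementary strand, the exons are joined first and the whole string is then reverse-complemented.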

Hints

To complete this assignment, you will need to learn some Python; read Chapter 6 of "Python for Bioinformatics". The Python file “genbank.py” contains all of the functions needed to complete this assignment. Your program should import the following functions from “genbank.py”: ReadGenbank, ParseDNA, FindKeywordLocs, GeneLocs, GetCodingDNA, Codons, and Codons2Protein.
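The last two names in that list suggest a final step that splits coding DNA into codons and translates them. The signatures in “genbank.py” are not shown here, so the sketch below does not call them; split_codons and translate are hypothetical stand-ins that illustrate the idea with the standard genetic code, stopping at the first stop codon and writing X for any codon it does not recognize.

```python
from itertools import product

# Standard genetic code, listed in T, C, A, G order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def split_codons(dna):
    """Cut a DNA string into 3-letter codons, ignoring any trailing remainder."""
    usable = len(dna) - len(dna) % 3
    return [dna[i:i + 3] for i in range(0, usable, 3)]

def translate(codons):
    """Translate codons to one-letter amino acids up to the first stop codon."""
    protein = []
    for codon in codons:
        aa = CODON_TABLE.get(codon.upper(), "X")  # X marks an unknown codon
        if aa == "*":  # stop codon ends the protein
            break
        protein.append(aa)
    return "".join(protein)

print(translate(split_codons("ATGGCCTAA")))  # MA
```

Whether your output should include a symbol for the stop codon depends on the reference output file, so compare against it before settling on a convention.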

Important Points

  • Document your code clearly! It is particularly important to have a block comment at the beginning of the program that documents what the program is supposed to do. Your documentation should be fairly complete, giving the user full information about how to use the program, and should be clear enough that someone new to bioinformatics can still grasp what it does.
  • The first block comment of the program should contain the following info:
    • Your name and the date.
    • Course number and course title.
    • Assignment # and name of the program.
  • Name your program “genbankParsing_lastName_firstName.py”, where lastName and firstName are your last name and first name, respectively. Submit your program via Blackboard.
  • Your program should output 11 protein sequences. You may compare your output with my output file “NC000018.proteins.txt”.

Bonus Points

  • You may modify the program to read any genbank file.
  • You may provide detailed error messages for unusual conditions, for example:
    • Warning: input file is not provided
    • Warning: input file is not in the genbank format
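
One possible way to produce these warnings is sketched below. The function names (looks_like_genbank, read_input) are hypothetical, the "cannot be read" message is an extra case not listed above, and the format check is only a rough heuristic: it assumes a usable file starts with a LOCUS line and contains an ORIGIN section, since the program needs the DNA.

```python
def looks_like_genbank(text):
    """Rough check: first line starts with LOCUS and an ORIGIN line exists."""
    lines = text.splitlines()
    return (bool(lines) and lines[0].startswith("LOCUS")
            and any(line.startswith("ORIGIN") for line in lines))

def read_input(argv):
    """Return (text, warning); exactly one of the two is None."""
    if len(argv) < 2:
        return None, "Warning: input file is not provided"
    try:
        with open(argv[1]) as handle:
            text = handle.read()
    except OSError:
        # Extra case (an assumption): the path is missing or unreadable.
        return None, "Warning: input file cannot be read"
    if not looks_like_genbank(text):
        return None, "Warning: input file is not in the genbank format"
    return text, None

# Typical use inside a program:
#     import sys
#     text, warning = read_input(sys.argv)
#     if warning:
#         print(warning)
#         sys.exit(1)
```

Taking the file name from the command line also covers the first bonus item, since the same program can then read any GenBank file.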