Deoxyribonucleic acid (DNA) is a molecule that contains the genetic instructions required for the development and functioning of all known living organisms. The basic double-helix structure of DNA was co-discovered by Prof. Francis Crick, a long-time faculty member at UCSD. See image.

The DNA molecule consists of a long sequence of four nucleotide bases: adenine (A), cytosine (C), guanine (G) and thymine (T). Since this molecule contains all the genetic information of a living organism, geneticists are interested in understanding the roles of the various DNA sequence patterns that are continuously being discovered worldwide. One of the most common methods to identify the role of a DNA sequence is to compare it with other DNA sequences whose functionality is already known. The more similar such DNA sequences are, the more likely it is that they will function similarly.

Your task is to write a C program, called dna.c, that reads three DNA sequences from a file called dna_input.dat and prints the results of a comparison between each pair of sequences to the file dna_output.dat. The input file dna_input.dat consists of three lines. Each line is a single sequence of characters from the set {A, C, G, T}, which appear without spaces in some order, terminated by the end-of-line character \n. You can assume that the three lines contain the same number of characters, and that this number is at most 241 (including the character \n). Here is a sample input file:

ACGTTTTAAGGGCTGAGCTAGTCAGTTCATCGCGCGCGTATATCCTCGATCGATCATTCTCTCTAGACGTTTTAAGGGCTGAGCTAGTCAGTTC
ACGTTTTAAGGGCTTAGAGCTTATGCTAATCGCGCGCGTATATCCTCGATCGATCATTCTCTCTAGACGTTTTAAGGGCTAAGGCGCGTAATTA
TCGTTTGAAGGGCTTAGTTAGTTAGTTCATCGGCGGCGTATATCCTCGATCGATCATTCTCTCTAGACGTTTTAAGGGCTGAGCCGGTCAGTTA

Each of the three lines (shown with wrap-around above) consists of 95 characters: the 94 letters from {A, C, G, T} and the character \n (not shown). The output file dna_output.dat must be structured as follows. For each pair of sequences #i and #j, with i, j ∈ {1,2,3} and i > j, you should print:

  • A single line, saying “Comparison between sequence #i and sequence #j:”
  • The entire sequence #i in the first row, and the entire sequence #j in the third row.
  • The comparison between the two sequences in the second (middle) row. This should be printed as follows. For each position, if the two bases are the same in both sequences then the corresponding base letter (one of A, C, G, T) should be printed; otherwise a blank " " should be printed.
  • A single line, saying “The overlap percentage is x%” where x is a floating-point number which measures the percentage of letters that match in the two sequences. This number should be printed with a single digit of precision after the decimal point.
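The comparison rule in the last two items can be sketched as a single pass over both sequences. This is only one possible shape for compare_DNA, not a reference solution: it assumes that seq3[] has room for n + 1 characters so that it can be NUL-terminated, and that a length of zero should yield 0%; neither point is stated in the assignment.

```c
/* Sketch: fills seq3[0..n-1] with the shared base where seq1 and seq2
   agree and with a blank elsewhere, then returns the overlap in percent.
   Assumption: seq3 can hold n + 1 chars (the extra one is the NUL). */
double compare_DNA(char seq1[], char seq2[], char seq3[], int n)
{
    int matches = 0;

    for (int i = 0; i < n; i++) {
        if (seq1[i] == seq2[i]) {
            seq3[i] = seq1[i];   /* same base in both sequences */
            matches++;
        } else {
            seq3[i] = ' ';       /* mismatch is shown as a blank */
        }
    }
    seq3[n] = '\0';

    /* Assumption: an empty sequence counts as 0% overlap. */
    return n > 0 ? 100.0 * matches / n : 0.0;
}
```

With the returned value stored in a double named pct, the percentage line could then be written with fprintf(output, "The overlap percentage is %.1f%%\n", pct); the %.1f conversion gives the single digit after the decimal point, and %% prints a literal percent sign.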

Each line in the output file dna_output.dat should contain at most 61 characters, including the end-of-line character \n. If the DNA sequences are longer than that, then each of the three rows mentioned above should be split across several lines, with the first few lines containing exactly 60 letters, and the last containing the rest of the letters. Here is a sample file dna_output.dat that results from processing the file dna_input.dat above: See image.
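The line-splitting rule can be sketched with the precision form of %s, which prints at most a given number of characters from a string. The layout below is an assumption: it prints the rows in interleaved blocks (up to 60 letters of sequence #i, then of the comparison, then of sequence #j, and so on for the next block) with no blank lines between blocks. The text alone does not settle this, so the sample output image remains the authority.

```c
#include <stdio.h>

#define OUT_LENGTH 60   /* letters per output line, not counting '\n' */

FILE *output;           /* assumed to be opened by the caller */

/* Sketch: prints seq1, seq3 (the comparison) and seq2 as three rows,
   split into blocks of at most OUT_LENGTH letters each.
   Assumption: blocks are interleaved and not separated by blank lines. */
void print_DNA(char seq1[], char seq2[], char seq3[], int n)
{
    for (int start = 0; start < n; start += OUT_LENGTH) {
        int left = n - start;
        int len = left < OUT_LENGTH ? left : OUT_LENGTH;

        /* "%.*s" takes the maximum character count from the argument list */
        fprintf(output, "%.*s\n", len, seq1 + start);
        fprintf(output, "%.*s\n", len, seq3 + start);
        fprintf(output, "%.*s\n", len, seq2 + start);
    }
}
```

Because the block size comes only from OUT_LENGTH, changing that constant changes the line width without touching the loop.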

Notes:

  • As part of the solution, you are required to declare, define, and call the following functions. In these functions, you can assume that input and output are global variables of type FILE*.
    • The function read_DNA(char sequence[]) that reads a DNA sequence from input, stores it in the array sequence[], and returns the number of letters read, as an int.
    • The function compare_DNA(char seq1[], char seq2[], char seq3[], int n) that stores in the array seq3[] the comparison sequence of the two DNA sequences stored in seq1[] and seq2[]. The length of these DNA sequences is assumed to be n. The function returns, as a double, the percentage of overlap between the two DNA sequences.
    • The function print_DNA(char seq1[], char seq2[], char seq3[], int n) that prints to output the DNA sequences stored in seq1[] and seq2[], as well as their comparison sequence stored in seq3[], according to the rules explained above. The length of all these sequences is assumed to be n. The function does not return a value.
  • The numbers 241 and 60, used above, should be defined as symbolic constants MAX_IN_LENGTH and OUT_LENGTH, using the #define compiler directive. The program should keep working correctly if the values of these symbolic constants are changed (within a reasonable range).
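As an illustration of how the symbolic constant and the global file pointer fit together, here is a minimal sketch of read_DNA built on fgets. Some choices are assumptions rather than requirements: the caller's array is taken to hold at least MAX_IN_LENGTH + 1 characters (241 for the line including \n, plus the terminating NUL), a stray carriage return before \n is tolerated, and 0 is returned at end of file.

```c
#include <stdio.h>
#include <string.h>

#define MAX_IN_LENGTH 241   /* longest input line, counting the '\n' */

FILE *input;                /* assumed to be opened by the caller */

/* Sketch: reads one line from input into sequence[], removes the line
   terminator, and returns the number of letters stored.
   Assumption: sequence[] has room for MAX_IN_LENGTH + 1 chars.
   Assumption: end of file or a read error is reported as 0. */
int read_DNA(char sequence[])
{
    if (fgets(sequence, MAX_IN_LENGTH + 1, input) == NULL) {
        sequence[0] = '\0';
        return 0;
    }

    /* Cut the string at the first '\r' or '\n', if there is one. */
    sequence[strcspn(sequence, "\r\n")] = '\0';

    return (int)strlen(sequence);
}
```

Since the buffer size passed to fgets is derived from MAX_IN_LENGTH, the function keeps working if the constant is changed, provided the caller sizes its arrays from the same constant.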