Bioinformatics is the retrieval, analysis, storage and visualization of biological information. A big part of this is genome sequencing, which is becoming more and more common due to dropping prices and advances in bioinformatics. Genome sequencing is the process of determining the order of the nucleotides that make up an organism's DNA. People use sequencing to find family connections, to study paleogenetics, to aid the medical field, to study evolution and more. In the medical field in particular, DNA sequencing is used for genetic testing, understanding the causes of genetic disorders, making diagnoses and more.
An important part of DNA sequencing is annotation. Gene annotation highlights details about the DNA sequence: for example, where a gene starts and ends, which parts of the sequence cause disorders, which parts are modified, and which parts may vary. The annotation process can be automated so that computers find these details. Automated annotation is efficient and fast, but it can miss details that the software isn't designed to recognize. Additionally, automated annotation doesn't go back and update earlier annotations in light of newer ones.
Genome assembly is a big part of DNA sequencing. Because the genome is very big, it is broken up into smaller pieces. The DNA is often extracted from multiple cells or multiple organisms, so there are multiple copies of some of the pieces, and many of the pieces overlap. The overlap is what allows the DNA sequence to be pieced together: pieces that share an overlap are joined, which reveals the order of the sequence. Some parts of DNA can't be isolated easily, don't sequence accurately, or don't clone well. These parts are represented by runs of N (for example, NNNNNNNNNNNNNNNNNN) in the sequence. The process has some difficulties: the DNA may be assembled incorrectly, and the pieces may contain mistakes, be missing information, or appear in multiple copies.
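As a small illustration of the N placeholders described above, here is a minimal sketch that locates each run of N in a sequence string. The function name find_gaps and the example sequence are made up for illustration; they are not from any particular tool.

```python
import re


def find_gaps(sequence):
    """Return (start, end) index pairs for each run of N; end is exclusive."""
    return [match.span() for match in re.finditer(r"N+", sequence)]


# Hypothetical example sequence with two unresolved stretches.
print(find_gaps("ATGNNNNCCGTNNA"))
```

Each pair marks where a stretch that could not be sequenced begins and ends, which is useful to know before trying to assemble or annotate the region around it.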
More information can be gained from the codons of the DNA sequence. For example, a computer program can look at the codons and identify the amino acids they code for, which amino acids are hydrophobic or hydrophilic, which ones are usually attached to sugar groups, whether they fold into helical structures or sheets, and more. The codons can also tell you where a coding region starts and ends. A coding region starts with ATG and ends with TAA, TGA or TAG, so a computer program can identify possible starts and ends of coding regions.
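The start and stop rule above can be sketched in a few lines of Python. This is only a rough illustration under some assumptions: it scans the three reading frames of the forward strand only, it uses the standard genetic code, and the names find_coding_regions and translate are hypothetical.

```python
from itertools import product

# Standard genetic code, listed in T, C, A, G order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
STOP_CODONS = {"TAA", "TGA", "TAG"}


def find_coding_regions(seq):
    """Return (start, end) pairs for ATG...stop stretches; end is exclusive."""
    regions = []
    for frame in range(3):  # three forward reading frames
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first possible start in this frame
            elif start is not None and codon in STOP_CODONS:
                regions.append((start, i + 3))  # include the stop codon
                start = None
    return regions


def translate(region):
    """Translate codons to one-letter amino acids, stopping at a stop codon."""
    protein = []
    for i in range(0, len(region) - 2, 3):
        amino = CODON_TABLE.get(region[i:i + 3], "X")  # X marks an unknown codon
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)


seq = "CCATGAAATAGGG"  # made-up example sequence
for start, end in find_coding_regions(seq):
    print(start, end, translate(seq[start:end]))
```

These are only candidate coding regions. A real gene finder would also check the reverse strand and use more evidence than the start and stop codons alone.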
For this week, I wrote code to assemble DNA pieces. Overlapping pieces were randomly generated from a strand of DNA, and the code pieces them back together to form the original strand.
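To give a feel for the idea, here is a minimal sketch of one way to do this. It is a simple greedy approach, not necessarily how the linked code works. The function names, the example strand, and the piece length and step size are all assumptions chosen for illustration. It works here because the example strand has no repeated stretches; real genomes have repeats, which is a big part of what makes assembly hard.

```python
import random


def make_fragments(strand, length, max_step, rng):
    """Cut overlapping fixed-length pieces from strand and shuffle them.

    Assumes max_step < length <= len(strand), so neighbours always overlap.
    """
    starts = [0]
    while starts[-1] + length < len(strand):
        step = rng.randint(1, max_step)  # random distance to the next piece
        starts.append(min(starts[-1] + step, len(strand) - length))
    fragments = [strand[s:s + length] for s in starts]
    rng.shuffle(fragments)
    return fragments


def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def assemble(fragments, min_overlap=1):
    """Greedily merge the pair of pieces with the largest overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates
    while len(contigs) > 1:
        # Drop any piece that is fully contained in another piece.
        contigs = [c for c in contigs
                   if not any(c != d and c in d for d in contigs)]
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0 or k < min_overlap:
            break  # nothing left that overlaps enough to merge
        merged = contigs[i] + contigs[j][k:]
        contigs = [c for n, c in enumerate(contigs) if n not in (i, j)]
        contigs.append(merged)
    return contigs


strand = "ATGGCGTACCTGAAGTTCGATAG"  # made-up strand with no repeated 4-mers
pieces = make_fragments(strand, length=8, max_step=4, rng=random.Random(0))
print(pieces)
print(assemble(pieces, min_overlap=4))
```

The result is a list because pieces that never overlap enough stay separate. When everything overlaps, the list holds a single string, the rebuilt strand.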
DNA Assembly Code