Compare Sequences Using Sequence Alignment Algorithms

Overview of Example

Determining the similarity between two sequences is a common task in computational biology. Starting with a nucleotide sequence for a human gene, this example uses alignment algorithms to locate and verify a corresponding gene in a model organism.

Find a Model Organism to Study

In this example, you are interested in studying Tay-Sachs disease. Tay-Sachs is an autosomal recessive disease caused by the absence of the enzyme beta-hexosaminidase A (Hex A). This enzyme is responsible for the breakdown of gangliosides (GM2) in brain and nerve cells.

First, research information about Tay-Sachs and the enzyme that is associated with this disease, then find the nucleotide sequence for the human gene that codes for the enzyme, and finally find a corresponding gene in another organism to use as a model for study.

  1. Use the MATLAB® Help browser to explore the Web. In the MATLAB Command window, type

    web('http://www.ncbi.nlm.nih.gov/books/NBK22250/')
    

    The MATLAB Help browser opens with the Tay-Sachs disease page in the Genes and Diseases section of the NCBI web site. This section provides a comprehensive introduction to medical genetics. In particular, this page contains an introduction and pictorial representation of the enzyme Hex A and its role in the metabolism of the lipid GM2 ganglioside.

  2. After completing your research, you have concluded the following:

    The gene HEXA codes for the alpha subunit of the dimer enzyme hexosaminidase A (Hex A), while the gene HEXB codes for the beta subunit of the enzyme. A third gene, GM2A, codes for the activator protein GM2. However, it is a mutation in the gene HEXA that causes Tay-Sachs.

Retrieve Sequence Information from a Public Database

The following procedure illustrates how to find the nucleotide sequence for a human gene in a public database and read the sequence information into the MATLAB environment. Many public databases for nucleotide sequences (for example, GenBank®, EMBL-EBI) are accessible from the Web. The MATLAB Command Window with the MATLAB Help browser provide an integrated environment for searching the Web and bringing sequence information into the MATLAB environment.

After you locate a sequence, you need to move the sequence data into the MATLAB Workspace.

  1. Open the MATLAB Help browser to the NCBI Web site. In the MATLAB Command Widow, type

    web('http://www.ncbi.nlm.nih.gov/')
    

    The MATLAB Help browser window opens with the NCBI home page.

  2. Search for the gene you are interested in studying. For example, from the Search list, select Nucleotide, and in the for box enter Tay-Sachs.

    The search returns entries for the genes that code the alpha and beta subunits of the enzyme hexosaminidase A (Hex A), and the gene that codes the activator enzyme. The NCBI reference for the human gene HEXA has accession number NM_000520.

  3. Get sequence data into the MATLAB environment. For example, to get sequence information for the human gene HEXA, type

    humanHEXA = getgenbank('NM_000520')
    

    Note

    Blank spaces in GenBank accession numbers use the underline character. Entering 'NM 00520' returns the wrong entry.

    The human gene is loaded into the MATLAB Workspace as a structure.

    humanHEXA = 
    
                    LocusName: 'NM_000520'
          LocusSequenceLength: '2255'
         LocusNumberofStrands: ''
                LocusTopology: 'linear'
            LocusMoleculeType: 'mRNA'
         LocusGenBankDivision: 'PRI'
        LocusModificationDate: '13-AUG-2006'
                   Definition: 'Homo sapiens hexosaminidase A (alpha polypeptide) (HEXA), mRNA.'
                    Accession: 'NM_000520'
                      Version: 'NM_000520.2'
                           GI: '13128865'
                      Project: []
                     Keywords: []
                      Segment: []
                       Source: 'Homo sapiens (human)'
               SourceOrganism: [4x65 char]
                    Reference: {1x58 cell}
                      Comment: [15x67 char]
                     Features: [74x74 char]
                          CDS: [1x1 struct]
                     Sequence: [1x2255 char]
                    SearchURL: [1x108 char]
                  RetrieveURL: [1x97 char]

Search a Public Database for Related Genes

The following procedure illustrates how to find the nucleotide sequence for a mouse gene related to a human gene, and read the sequence information into the MATLAB environment. The sequence and function of many genes is conserved during the evolution of species through homologous genes. Homologous genes are genes that have a common ancestor and similar sequences. One goal of searching a public database is to find similar genes. If you are able to locate a sequence in a database that is similar to your unknown gene or protein, it is likely that the function and characteristics of the known and unknown genes are the same.

After finding the nucleotide sequence for a human gene, you can do a BLAST search or search in the genome of another organism for the corresponding gene. This procedure uses the mouse genome as an example.

  1. Open the MATLAB Help browser to the NCBI Web site. In the MATLAB Command window, type

    web('http://www.ncbi.nlm.nih.gov')
    
  2. Search the nucleotide database for the gene or protein you are interested in studying. For example, from the Search list, select Nucleotide, and in the for box enter hexosaminidase A.

    The search returns entries for the mouse and human genomes. The NCBI reference for the mouse gene HEXA has accession number AK080777.

  3. Get sequence information for the mouse gene into the MATLAB environment. Type

    mouseHEXA = getgenbank('AK080777')
    

    The mouse gene sequence is loaded into the MATLAB Workspace as a structure.

    mouseHEXA = 
    
                    LocusName: 'AK080777'
          LocusSequenceLength: '1839'
         LocusNumberofStrands: ''
                LocusTopology: 'linear'
            LocusMoleculeType: 'mRNA'
         LocusGenBankDivision: 'HTC'
        LocusModificationDate: '02-SEP-2005'
                   Definition: [1x150 char]
                    Accession: 'AK080777'
                      Version: 'AK080777.1'
                           GI: '26348756'
                      Project: []
                     Keywords: 'HTC; CAP trapper.'
                      Segment: []
                       Source: 'Mus musculus (house mouse)'
               SourceOrganism: [4x65 char]
                    Reference: {1x8 cell}
                      Comment: [8x66 char]
                     Features: [33x74 char]
                          CDS: [1x1 struct]
                     Sequence: [1x1839 char]
                    SearchURL: [1x107 char]
                  RetrieveURL: [1x97 char]

Locate Protein Coding Sequences

The following procedure illustrates how to convert a sequence from nucleotides to amino acids and identify the open reading frames. A nucleotide sequence includes regulatory sequences before and after the protein coding section. By analyzing this sequence, you can determine the nucleotides that code for the amino acids in the final protein.

After you have a list of genes you are interested in studying, you can determine the protein coding sequences. This procedure uses the human gene HEXA and mouse gene HEXA as an example.

  1. If you did not retrieve gene data from the Web, you can load example data from a MAT-file included with the Bioinformatics Toolbox™ software. In the MATLAB Command window, type

    load hexosaminidase
    

    The structures humanHEXA and mouseHEXA load into the MATLAB Workspace.

  2. Locate open reading frames (ORFs) in the human gene. For example, for the human gene HEXA, type

    humanORFs = seqshoworfs(humanHEXA.Sequence)
    

    seqshoworfs creates the output structure humanORFs. This structure contains the position of the start and stop codons for all open reading frames (ORFs) on each reading frame.

    humanORFs = 
    
    1x3 struct array with fields:
        Start
        Stop
    

    The Help browser opens displaying the three reading frames with the ORFs colored blue, red, and green. Notice that the longest ORF is in the first reading frame.

  3. Locate open reading frames (ORFs) in the mouse gene. Type:

    mouseORFs = seqshoworfs(mouseHEXA.Sequence)
    

    seqshoworfs creates the structure mouseORFS.

    mouseORFs = 
    
    1x3 struct array with fields:
        Start
        Stop
    

    The mouse gene shows the longest ORF on the first reading frame.

Compare Amino Acid Sequences

The following procedure illustrates how to use global and local alignment functions to compare two amino acid sequences. You could use alignment functions to look for similarities between two nucleotide sequences, but alignment functions return more biologically meaningful results when you are using amino acid sequences.

After you have located the open reading frames on your nucleotide sequences, you can convert the protein coding sections of the nucleotide sequences to their corresponding amino acid sequences, and then you can compare them for similarities.

  1. Using the open reading frames identified previously, convert the human and mouse DNA sequences to the amino acid sequences. Because both the human and mouse HEXA genes were in the first reading frames (default), you do not need to indicate which frame. Type

    humanProtein = nt2aa(humanHEXA.Sequence);
    mouseProtein = nt2aa(mouseHEXA.Sequence);
    
  2. Draw a dot plot comparing the human and mouse amino acid sequences. Type

    seqdotplot(mouseProtein,humanProtein,4,3)
    ylabel('Mouse hexosaminidase A (alpha subunit)')
    xlabel('Human hexosaminidase A (alpha subunit)')
    

    Dot plots are one of the easiest ways to look for similarity between sequences. The diagonal line shown below indicates that there may be a good alignment between the two sequences.

  3. Globally align the two amino acid sequences, using the Needleman-Wunsch algorithm. Type

    [GlobalScore, GlobalAlignment] = nwalign(humanProtein,...
                                             mouseProtein);
    showalignment(GlobalAlignment)
    

    showalignment displays the global alignment of the two sequences in the Help browser. Notice that the calculated identity between the two sequences is 60%.

    The alignment is very good between amino acid position 69 and 599, after which the two sequences appear to be unrelated. Notice that there is a stop (*) in the sequence at this point. If you shorten the sequences to include only the amino acids that are in the protein you might get a better alignment. Include the amino acid positions from the first methionine (M) to the first stop (*) that occurs after the first methionine.

  4. Trim the sequence from the first start amino acid (usually M) to the first stop (*) and then try alignment again. Find the indices for the stops in the sequences.

    humanStops = find(humanProtein == '*')
    
    humanStops =
    
        41   599   611   713   722   730
    
    
    mouseStops = find(mouseProtein == '*')
    
    mouseStops =
    
       539   557   574   606
    

    Looking at the amino acid sequence for humanProtein, the first M is at position 70, and the first stop after that position is actually the second stop in the sequence (position 599). Looking at the amino acid sequence for mouseProtein, the first M is at position 11, and the first stop after that position is the first stop in the sequence (position 557).

  5. Truncate the sequences to include only amino acids in the protein and the stop.

    humanProteinORF = humanProtein(70:humanStops(2))
    
    humanProteinORF =
    
    MTSSRLWFSLLLAAAFAGRATALWPWPQNFQTSDQRYVLYPNNFQFQYDV
    SSAAQPGCSVLDEAFQRYRDLLFGSGSWPRPYLTGKRHTLEKNVLVVSVV
    TPGCNQLPTLESVENYTLTINDDQCLLLSETVWGALRGLETFSQLVWKSA
    EGTFFINKTEIEDFPRFPHRGLLLDTSRHYLPLSSILDTLDVMAYNKLNV
    FHWHLVDDPSFPYESFTFPELMRKGSYNPVTHIYTAQDVKEVIEYARLRG
    IRVLAEFDTPGHTLSWGPGIPGLLTPCYSGSEPSGTFGPVNPSLNNTYEF
    MSTFFLEVSSVFPDFYLHLGGDEVDFTCWKSNPEIQDFMRKKGFGEDFKQ
    LESFYIQTLLDIVSSYGKGYVVWQEVFDNKVKIQPDTIIQVWREDIPVNY
    MKELELVTKAGFRALLSAPWYLNRISYGPDWKDFYIVEPLAFEGTPEQKA
    LVIGGEACMWGEYVDNTNLVPRLWPRAGAVAERLWSNKLTSDLTFAYERL
    SHFRCELLRRGVQAQPLNVGFCEQEFEQT*
    
    
    mouseProteinORF = mouseProtein(11:mouseStops(1))
    
    mouseProteinORF =
    
    MAGCRLWVSLLLAAALACLATALWPWPQYIQTYHRRYTLYPNNFQFRYHV
    SSAAQAGCVVLDEAFRRYRNLLFGSGSWPRPSFSNKQQTLGKNILVVSVV
    TAECNEFPNLESVENYTLTINDDQCLLASETVWGALRGLETFSQLVWKSA
    EGTFFINKTKIKDFPRFPHRGVLLDTSRHYLPLSSILDTLDVMAYNKFNV
    FHWHLVDDSSFPYESFTFPELTRKGSFNPVTHIYTAQDVKEVIEYARLRG
    IRVLAEFDTPGHTLSWGPGAPGLLTPCYSGSHLSGTFGPVNPSLNSTYDF
    MSTLFLEISSVFPDFYLHLGGDEVDFTCWKSNPNIQAFMKKKGFTDFKQL
    ESFYIQTLLDIVSDYDKGYVVWQEVFDNKVKVRPDTIIQVWREEMPVEYM
    LEMQDITRAGFRALLSAPWYLNRVKYGPDWKDMYKVEPLAFHGTPEQKAL
    VIGGEACMWGEYVDSTNLVPRLWPRAGAVAERLWSSNLTTNIDFAFKRLS
    HFRCELVRRGIQAQPISVGCCEQEFEQT* 
    
  6. Globally align the trimmed amino acid sequences. Type

    [GlobalScore_trim, GlobalAlignment_trim] = nwalign(humanProteinORF,...
                                 mouseProteinORF);
    showalignment(GlobalAlignment_trim)
    

    showalignment displays the results for the second global alignment. Notice that the percent identity for the untrimmed sequences is 60% and 84% for trimmed sequences.

  7. Another way to truncate an amino acid sequence to only those amino acids in the protein is to first truncate the nucleotide sequence with indices from the seqshoworfs function. Remember that the ORF for the human HEXA gene and the ORF for the mouse HEXA were both on the first reading frame.

    humanORFs = seqshoworfs(humanHEXA.Sequence)
    
    humanORFs = 
    
    1x3 struct array with fields:
        Start
        Stop
    
    
    mouseORFs = seqshoworfs(mouseHEXA.Sequence)
    
    mouseORFs = 
    
    1x3 struct array with fields:
        Start
        Stop
    
    
    humanPORF = nt2aa(humanHEXA.Sequence(humanORFs(1).Start(1):...
                                         humanORFs(1).Stop(1)));
    
    mousePORF = nt2aa(mouseHEXA.Sequence(mouseORFs(1).Start(1):...
                                         mouseORFs(1).Stop(1)));
    
    [GlobalScore2, GlobalAlignment2] = nwalign(humanPORF, mousePORF);
    

    Show the alignment in the Help browser.

    showalignment(GlobalAlignment2)
    

    The result from first truncating a nucleotide sequence before converting it to an amino acid sequence is the same as the result from truncating the amino acid sequence after conversion. See the result in step 6.

    An alternative method to working with subsequences is to use a local alignment function with the nontruncated sequences.

  8. Locally align the two amino acid sequences using a Smith-Waterman algorithm. Type

    [LocalScore, LocalAlignment] = swalign(humanProtein,...
                                           mouseProtein)
    
    LocalScore =
           1057
    
    LocalAlignment =
    
    RGDQR-AMTSSRLWFSLLLAAAFAGRATALWPWPQNFQTSDQRYV . . .
    ||  | ||:: ||| |||||||:|  ||||||||| :||  :||: . . .
    RGAGRWAMAGCRLWVSLLLAAALACLATALWPWPQYIQTYHRRYT . . .
    
  9. Show the alignment in color.

    showalignment(LocalAlignment)