Determining the similarity between two sequences is a common task in computational biology. Starting with a nucleotide sequence for a human gene, this example uses alignment algorithms to locate and verify a corresponding gene in a model organism.
In this example, you are interested in studying Tay-Sachs disease. Tay-Sachs is an autosomal recessive disease caused by the absence of the enzyme beta-hexosaminidase A (Hex A). This enzyme is responsible for the breakdown of gangliosides (GM2) in brain and nerve cells.
First, research information about Tay-Sachs and the enzyme that is associated with this disease, then find the nucleotide sequence for the human gene that codes for the enzyme, and finally find a corresponding gene in another organism to use as a model for study.
Use the MATLAB® Help browser to explore the Web. In the MATLAB Command window, type
web('http://www.ncbi.nlm.nih.gov/books/NBK22250/')
The MATLAB Help browser opens with the Tay-Sachs disease page in the Genes and Diseases section of the NCBI web site. This section provides a comprehensive introduction to medical genetics. In particular, this page contains an introduction and pictorial representation of the enzyme Hex A and its role in the metabolism of the lipid GM2 ganglioside.
After completing your research, you have concluded the following:
The gene HEXA codes for the alpha subunit of the dimer enzyme hexosaminidase A (Hex A), while the gene HEXB codes for the beta subunit of the enzyme. A third gene, GM2A, codes for the activator protein GM2. However, it is a mutation in the gene HEXA that causes Tay-Sachs.
The following procedure illustrates how to find the nucleotide sequence for a human gene in a public database and read the sequence information into the MATLAB environment. Many public databases for nucleotide sequences (for example, GenBank®, EMBL-EBI) are accessible from the Web. The MATLAB Command Window with the MATLAB Help browser provide an integrated environment for searching the Web and bringing sequence information into the MATLAB environment.
After you locate a sequence, you need to move the sequence data into the MATLAB Workspace.
Open the MATLAB Help browser to the NCBI Web site. In the MATLAB Command Widow, type
web('http://www.ncbi.nlm.nih.gov/')
The MATLAB Help browser window opens with the NCBI home page.
Search for the gene you are interested in studying.
For example, from the Search list,
select Nucleotide
, and in the for box
enter Tay-Sachs
.
The search returns entries for the genes that code the alpha
and beta subunits of the enzyme hexosaminidase A (Hex A), and the
gene that codes the activator enzyme. The NCBI reference for the human
gene HEXA has accession number NM_000520
.
Get sequence data into the MATLAB environment. For example, to get sequence information for the human gene HEXA, type
humanHEXA = getgenbank('NM_000520')
Note
Blank spaces in GenBank accession numbers use the underline
character. Entering 'NM 00520'
returns the wrong
entry.
The human gene is loaded into the MATLAB Workspace as a structure.
humanHEXA = LocusName: 'NM_000520' LocusSequenceLength: '2255' LocusNumberofStrands: '' LocusTopology: 'linear' LocusMoleculeType: 'mRNA' LocusGenBankDivision: 'PRI' LocusModificationDate: '13-AUG-2006' Definition: 'Homo sapiens hexosaminidase A (alpha polypeptide) (HEXA), mRNA.' Accession: 'NM_000520' Version: 'NM_000520.2' GI: '13128865' Project: [] Keywords: [] Segment: [] Source: 'Homo sapiens (human)' SourceOrganism: [4x65 char] Reference: {1x58 cell} Comment: [15x67 char] Features: [74x74 char] CDS: [1x1 struct] Sequence: [1x2255 char] SearchURL: [1x108 char] RetrieveURL: [1x97 char]
The following procedure illustrates how to find the nucleotide sequence for a mouse gene related to a human gene, and read the sequence information into the MATLAB environment. The sequence and function of many genes is conserved during the evolution of species through homologous genes. Homologous genes are genes that have a common ancestor and similar sequences. One goal of searching a public database is to find similar genes. If you are able to locate a sequence in a database that is similar to your unknown gene or protein, it is likely that the function and characteristics of the known and unknown genes are the same.
After finding the nucleotide sequence for a human gene, you can do a BLAST search or search in the genome of another organism for the corresponding gene. This procedure uses the mouse genome as an example.
Open the MATLAB Help browser to the NCBI Web site. In the MATLAB Command window, type
web('http://www.ncbi.nlm.nih.gov')
Search the nucleotide database for the gene or protein
you are interested in studying. For example, from the Search list, select Nucleotide
,
and in the for box enter hexosaminidase
A
.
The search returns entries for the mouse and human genomes.
The NCBI reference for the mouse gene HEXA has accession number AK080777
.
Get sequence information for the mouse gene into the MATLAB environment. Type
mouseHEXA = getgenbank('AK080777')
The mouse gene sequence is loaded into the MATLAB Workspace as a structure.
mouseHEXA = LocusName: 'AK080777' LocusSequenceLength: '1839' LocusNumberofStrands: '' LocusTopology: 'linear' LocusMoleculeType: 'mRNA' LocusGenBankDivision: 'HTC' LocusModificationDate: '02-SEP-2005' Definition: [1x150 char] Accession: 'AK080777' Version: 'AK080777.1' GI: '26348756' Project: [] Keywords: 'HTC; CAP trapper.' Segment: [] Source: 'Mus musculus (house mouse)' SourceOrganism: [4x65 char] Reference: {1x8 cell} Comment: [8x66 char] Features: [33x74 char] CDS: [1x1 struct] Sequence: [1x1839 char] SearchURL: [1x107 char] RetrieveURL: [1x97 char]
The following procedure illustrates how to convert a sequence from nucleotides to amino acids and identify the open reading frames. A nucleotide sequence includes regulatory sequences before and after the protein coding section. By analyzing this sequence, you can determine the nucleotides that code for the amino acids in the final protein.
After you have a list of genes you are interested in studying, you can determine the protein coding sequences. This procedure uses the human gene HEXA and mouse gene HEXA as an example.
If you did not retrieve gene data from the Web, you can load example data from a MAT-file included with the Bioinformatics Toolbox™ software. In the MATLAB Command window, type
load hexosaminidase
The structures humanHEXA
and mouseHEXA
load
into the MATLAB Workspace.
Locate open reading frames (ORFs) in the human gene. For example, for the human gene HEXA, type
humanORFs = seqshoworfs(humanHEXA.Sequence)
seqshoworfs
creates the output structure humanORFs
.
This structure contains the position of the start and stop codons
for all open reading frames (ORFs) on each reading frame.
humanORFs = 1x3 struct array with fields: Start Stop
The Help browser opens displaying the three reading frames with the ORFs colored blue, red, and green. Notice that the longest ORF is in the first reading frame.
Locate open reading frames (ORFs) in the mouse gene. Type:
mouseORFs = seqshoworfs(mouseHEXA.Sequence)
seqshoworfs
creates the structure mouseORFS
.
mouseORFs = 1x3 struct array with fields: Start Stop
The mouse gene shows the longest ORF on the first reading frame.
The following procedure illustrates how to use global and local alignment functions to compare two amino acid sequences. You could use alignment functions to look for similarities between two nucleotide sequences, but alignment functions return more biologically meaningful results when you are using amino acid sequences.
After you have located the open reading frames on your nucleotide sequences, you can convert the protein coding sections of the nucleotide sequences to their corresponding amino acid sequences, and then you can compare them for similarities.
Using the open reading frames identified previously, convert the human and mouse DNA sequences to the amino acid sequences. Because both the human and mouse HEXA genes were in the first reading frames (default), you do not need to indicate which frame. Type
humanProtein = nt2aa(humanHEXA.Sequence); mouseProtein = nt2aa(mouseHEXA.Sequence);
Draw a dot plot comparing the human and mouse amino acid sequences. Type
seqdotplot(mouseProtein,humanProtein,4,3) ylabel('Mouse hexosaminidase A (alpha subunit)') xlabel('Human hexosaminidase A (alpha subunit)')
Dot plots are one of the easiest ways to look for similarity between sequences. The diagonal line shown below indicates that there may be a good alignment between the two sequences.
Globally align the two amino acid sequences, using the Needleman-Wunsch algorithm. Type
[GlobalScore, GlobalAlignment] = nwalign(humanProtein,... mouseProtein); showalignment(GlobalAlignment)
showalignment
displays the global alignment
of the two sequences in the Help browser. Notice that the calculated
identity between the two sequences is 60%
.
The alignment is very good between amino acid position 69 and
599, after which the two sequences appear to be unrelated. Notice
that there is a stop (*
) in the sequence at this
point. If you shorten the sequences to include only the amino acids
that are in the protein you might get a better alignment. Include
the amino acid positions from the first methionine (M
)
to the first stop (*
) that occurs after the first
methionine.
Trim the sequence from the first start amino acid
(usually M
) to the first stop (*
)
and then try alignment again. Find the indices for the stops in the
sequences.
humanStops = find(humanProtein == '*') humanStops = 41 599 611 713 722 730 mouseStops = find(mouseProtein == '*') mouseStops = 539 557 574 606
Looking at the amino acid sequence for humanProtein
,
the first M
is at position 70, and the first stop
after that position is actually the second stop in the sequence (position
599). Looking at the amino acid sequence for mouseProtein
,
the first M
is at position 11, and the first stop
after that position is the first stop in the sequence (position 557).
Truncate the sequences to include only amino acids in the protein and the stop.
humanProteinORF = humanProtein(70:humanStops(2)) humanProteinORF = MTSSRLWFSLLLAAAFAGRATALWPWPQNFQTSDQRYVLYPNNFQFQYDV SSAAQPGCSVLDEAFQRYRDLLFGSGSWPRPYLTGKRHTLEKNVLVVSVV TPGCNQLPTLESVENYTLTINDDQCLLLSETVWGALRGLETFSQLVWKSA EGTFFINKTEIEDFPRFPHRGLLLDTSRHYLPLSSILDTLDVMAYNKLNV FHWHLVDDPSFPYESFTFPELMRKGSYNPVTHIYTAQDVKEVIEYARLRG IRVLAEFDTPGHTLSWGPGIPGLLTPCYSGSEPSGTFGPVNPSLNNTYEF MSTFFLEVSSVFPDFYLHLGGDEVDFTCWKSNPEIQDFMRKKGFGEDFKQ LESFYIQTLLDIVSSYGKGYVVWQEVFDNKVKIQPDTIIQVWREDIPVNY MKELELVTKAGFRALLSAPWYLNRISYGPDWKDFYIVEPLAFEGTPEQKA LVIGGEACMWGEYVDNTNLVPRLWPRAGAVAERLWSNKLTSDLTFAYERL SHFRCELLRRGVQAQPLNVGFCEQEFEQT* mouseProteinORF = mouseProtein(11:mouseStops(1)) mouseProteinORF = MAGCRLWVSLLLAAALACLATALWPWPQYIQTYHRRYTLYPNNFQFRYHV SSAAQAGCVVLDEAFRRYRNLLFGSGSWPRPSFSNKQQTLGKNILVVSVV TAECNEFPNLESVENYTLTINDDQCLLASETVWGALRGLETFSQLVWKSA EGTFFINKTKIKDFPRFPHRGVLLDTSRHYLPLSSILDTLDVMAYNKFNV FHWHLVDDSSFPYESFTFPELTRKGSFNPVTHIYTAQDVKEVIEYARLRG IRVLAEFDTPGHTLSWGPGAPGLLTPCYSGSHLSGTFGPVNPSLNSTYDF MSTLFLEISSVFPDFYLHLGGDEVDFTCWKSNPNIQAFMKKKGFTDFKQL ESFYIQTLLDIVSDYDKGYVVWQEVFDNKVKVRPDTIIQVWREEMPVEYM LEMQDITRAGFRALLSAPWYLNRVKYGPDWKDMYKVEPLAFHGTPEQKAL VIGGEACMWGEYVDSTNLVPRLWPRAGAVAERLWSSNLTTNIDFAFKRLS HFRCELVRRGIQAQPISVGCCEQEFEQT*
Globally align the trimmed amino acid sequences. Type
[GlobalScore_trim, GlobalAlignment_trim] = nwalign(humanProteinORF,... mouseProteinORF); showalignment(GlobalAlignment_trim)
showalignment
displays the results for the
second global alignment. Notice that the percent identity for the
untrimmed sequences is 60%
and 84%
for
trimmed sequences.
Another way to truncate an amino acid sequence to
only those amino acids in the protein is to first truncate the nucleotide
sequence with indices from the seqshoworfs
function.
Remember that the ORF for the human HEXA gene and the ORF for the
mouse HEXA were both on the first reading frame.
humanORFs = seqshoworfs(humanHEXA.Sequence) humanORFs = 1x3 struct array with fields: Start Stop mouseORFs = seqshoworfs(mouseHEXA.Sequence) mouseORFs = 1x3 struct array with fields: Start Stop humanPORF = nt2aa(humanHEXA.Sequence(humanORFs(1).Start(1):... humanORFs(1).Stop(1))); mousePORF = nt2aa(mouseHEXA.Sequence(mouseORFs(1).Start(1):... mouseORFs(1).Stop(1))); [GlobalScore2, GlobalAlignment2] = nwalign(humanPORF, mousePORF);
Show the alignment in the Help browser.
showalignment(GlobalAlignment2)
The result from first truncating a nucleotide sequence before converting it to an amino acid sequence is the same as the result from truncating the amino acid sequence after conversion. See the result in step 6.
An alternative method to working with subsequences is to use a local alignment function with the nontruncated sequences.
Locally align the two amino acid sequences using a Smith-Waterman algorithm. Type
[LocalScore, LocalAlignment] = swalign(humanProtein,... mouseProtein) LocalScore = 1057 LocalAlignment = RGDQR-AMTSSRLWFSLLLAAAFAGRATALWPWPQNFQTSDQRYV . . . || | ||:: ||| |||||||:| ||||||||| :|| :||: . . . RGAGRWAMAGCRLWVSLLLAAALACLATALWPWPQYIQTYHRRYT . . .
Show the alignment in color.
showalignment(LocalAlignment)