Parse features from GenBank, GenPept, or EMBL data
FeatStruct = featureparse(Features)
FeatStruct = featureparse(Features, ...'Feature', FeatureValue, ...)
FeatStruct = featureparse(Features, ...'Sequence', SequenceValue, ...)
Argument | Description |
---|---|
Features | Any of the following: a character vector or string containing GenBank, GenPept, or EMBL features; a MATLAB character array including text describing GenBank, GenPept, or EMBL features; or a MATLAB structure with fields corresponding to GenBank, GenPept, or EMBL data, such as those returned by genbankread, genpeptread, emblread, getgenbank, getgenpept, or getembl. |
FeatureValue | Name of a feature contained in Features. When specified, featureparse returns only the substructure that corresponds to this feature. If there are multiple features with the same FeatureValue, then FeatStruct is an array of structures. |
SequenceValue | Property to control the extraction, when possible, of the sequence corresponding to each feature. featureparse joins and complements pieces of the source sequence and stores the result in the Sequence field of the returned structure, FeatStruct. When extracting the sequence from an incomplete CDS feature, featureparse uses the codon_start qualifier to adjust the frame of the sequence. Choices are true or false (default). |
FeatStruct | Output structure containing a field for every database feature. Each field name in FeatStruct matches the corresponding feature name in the GenBank, GenPept, or EMBL database, with the exceptions listed in the table below. Fields in FeatStruct contain substructures with feature qualifiers as fields. In the GenBank, GenPept, and EMBL databases, for each feature, the only mandatory qualifier is its location, which featureparse translates to the field Location. When possible, featureparse also translates this location to numeric indices, creating an Indices field. Note: If you use the Indices field to extract sequence information, you may need to complement the sequences. |
FeatStruct = featureparse(Features) parses the features from Features, which contains GenBank, GenPept, or EMBL features. Features can be a:
Character vector or string containing GenBank, GenPept, or EMBL features
MATLAB character array including text describing GenBank, GenPept, or EMBL features
MATLAB structure with fields corresponding to GenBank, GenPept, or EMBL data, such as those returned by genbankread, genpeptread, emblread, getgenbank, getgenpept, or getembl
FeatStruct is the output structure containing a field for every database feature. Each field name in FeatStruct matches the corresponding feature name in the GenBank, GenPept, or EMBL database, with the following exceptions.
Feature Name in GenBank, GenPept, or EMBL Database | Field Name in MATLAB Structure |
---|---|
-10_signal | minus_10_signal |
-35_signal | minus_35_signal |
3'UTR | three_prime_UTR |
3'clip | three_prime_clip |
5'UTR | five_prime_UTR |
5'clip | five_prime_clip |
D-loop | D_loop |
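The renaming in the table above matters when you look up a feature by its database name. The following is a minimal sketch, not part of featureparse itself: the variable names (nameMap, dbName, fieldName) and the stand-in features structure are hypothetical, and only the name pairs are taken from the table.

```matlab
% Exceptions copied from the table above: database name -> MATLAB field name.
dbNames    = {'-10_signal', '-35_signal', '3''UTR', '3''clip', ...
              '5''UTR', '5''clip', 'D-loop'};
fieldNames = {'minus_10_signal', 'minus_35_signal', 'three_prime_UTR', ...
              'three_prime_clip', 'five_prime_UTR', 'five_prime_clip', 'D_loop'};
nameMap = containers.Map(dbNames, fieldNames);

% Stand-in for featureparse output (hypothetical content, illustration only).
features = struct('source', struct('Location', '1..500'), ...
                  'five_prime_UTR', struct('Location', '1..42'));

% Translate the database name, falling back to the name itself.
dbName = '5''UTR';
if isKey(nameMap, dbName)
    fieldName = nameMap(dbName);
else
    fieldName = dbName;
end

% Dynamic field access with the translated name.
if isfield(features, fieldName)
    utr = features.(fieldName);
    disp(utr.Location)
end
```

With real data, replace the stand-in structure with the output of featureparse.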
Fields in FeatStruct contain substructures with feature qualifiers as fields. In the GenBank, GenPept, and EMBL databases, for each feature, the only mandatory qualifier is its location, which featureparse translates to the field Location. When possible, featureparse also translates this location to numeric indices, creating an Indices field.
Note: If you use the Indices field to extract sequence information, you may need to complement the sequences.
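As a rough illustration of the note above, the sketch below slices a sequence with a pair of indices and reverse-complements the piece when the Location text contains complement(...), which is how GenBank and EMBL mark reverse-strand features. The sequence and the feature values are hypothetical stand-ins, the exact layout of the Indices field is an assumption (the code uses min and max so that either ordering works), and only a single-segment location is handled. seqrcomplement requires Bioinformatics Toolbox.

```matlab
% Stand-in values (hypothetical); in practice seq is the record's sequence
% and feat is one element of the structure returned by featureparse.
seq  = 'AAACATTTGGGCCC';
feat = struct('Location', 'complement(4..9)', 'Indices', [9 4]);

% Assumption: Indices holds one start/stop pair. min/max make the slice
% independent of the order in which the pair is stored.
lo = min(feat.Indices);
hi = max(feat.Indices);
piece = seq(lo:hi);

% Reverse-strand features are written as complement(...) in the location.
if ~isempty(strfind(feat.Location, 'complement'))
    piece = seqrcomplement(piece);   % Bioinformatics Toolbox
end
disp(piece)
```

Locations built with join(...) span several segments and would need this handling per segment; setting the 'Sequence' property to true lets featureparse do the joining and complementing for you.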
FeatStruct = featureparse(Features, ...'PropertyName', PropertyValue, ...) calls featureparse with optional properties that use property name/property value pairs. You can specify one or more properties in any order. Each PropertyName must be enclosed in single quotation marks and is case insensitive. These property name/property value pairs are as follows:
FeatStruct = featureparse(Features, ...'Feature', FeatureValue, ...) returns only the substructure that corresponds to FeatureValue, the name of a feature contained in Features. If there are multiple features with the same FeatureValue, then FeatStruct is an array of structures.
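Because the result can be either a single structure or an array of structures, code that consumes it is safer when it loops over numel. The sketch below assumes the GenBank file nm175642.txt used elsewhere on this page is available on the MATLAB path; the variable names are illustrative.

```matlab
% Read a GenBank-formatted file and keep only its CDS features.
gbkStruct = genbankread('nm175642.txt');
cds = featureparse(gbkStruct, 'feature', 'CDS');  % property name is case insensitive

% numel is 1 for a single structure and N for a 1xN struct array,
% so the same loop covers both cases.
for k = 1:numel(cds)
    fprintf('CDS %d: %s\n', k, cds(k).Location);
end
```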
FeatStruct = featureparse(Features, ...'Sequence', SequenceValue, ...) controls the extraction, when possible, of the sequence corresponding to each feature, joining and complementing pieces of the source sequence and storing them in the field Sequence. When extracting the sequence from an incomplete CDS feature, featureparse uses the codon_start qualifier to adjust the frame of the sequence. Choices are true or false (default).
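To make the frame adjustment concrete: in the GenBank and EMBL feature tables, codon_start (1, 2, or 3) gives the position of the first base of the first complete codon, relative to the start of the feature. The sketch below illustrates that idea on a made-up sequence. It is not the internal implementation of featureparse, and all names and values in it are hypothetical.

```matlab
% Hypothetical partial CDS whose first base belongs to a truncated codon.
raw = 'GATGGCCAAATAA';
codonStart = 2;                      % first complete codon starts at base 2

% Drop the bases before the first complete codon.
inFrame = raw(codonStart:end);

% Trim any trailing bases that do not form a whole codon.
nCodons = floor(numel(inFrame) / 3);
inFrame = inFrame(1:3*nCodons);

% Split into codons: each column of the 3-by-N matrix is one codon.
codons = cellstr(reshape(inFrame, 3, [])');
disp(codons')
```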
The following example obtains all the features stored in the GenBank file nm175642.txt:
gbkStruct = genbankread('nm175642.txt');
features = featureparse(gbkStruct)

features =
    source: [1x1 struct]
      gene: [1x1 struct]
       CDS: [1x1 struct]
The following example obtains only the coding sequence (CDS) features of the Caenorhabditis elegans cosmid record (accession number Z92777) from the GenBank database:
worm = getgenbank('Z92777');
CDS = featureparse(worm,'feature','cds')

CDS =
1x12 struct array with fields:
    Location
    Indices
    locus_tag
    standard_name
    note
    codon_start
    product
    protein_id
    db_xref
    translation
Retrieve two nucleotide sequences from the GenBank database for the neuraminidase (NA) protein of two strains of the Influenza A virus (H5N1).
hk01 = getgenbank('AF509094');
vt04 = getgenbank('DQ094287');
Extract the sequence of the coding region for the neuraminidase (NA) protein from the two nucleotide sequences. The sequences of the coding regions are stored in the Sequence fields of the returned structures, hk01_cds and vt04_cds.
hk01_cds = featureparse(hk01,'feature','CDS','Sequence',true);
vt04_cds = featureparse(vt04,'feature','CDS','Sequence',true);
Once you have extracted the nucleotide sequences, you can use the nt2aa and nwalign functions to align the amino acid sequences translated from the nucleotide sequences.
[sc,al]=nwalign(nt2aa(hk01_cds),nt2aa(vt04_cds),'extendgap',1);
Then you can use the seqinsertgaps function to copy the gaps from the aligned amino acid sequences to their corresponding nucleotide sequences, thus codon-aligning them.
hk01_aligned = seqinsertgaps(hk01_cds,al(1,:))
vt04_aligned = seqinsertgaps(vt04_cds,al(3,:))
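Before passing codon-aligned sequences to downstream functions, a quick sanity check can catch alignment problems early. The sketch below uses two short made-up sequences in place of hk01_aligned and vt04_aligned. The checks themselves (equal length, length divisible by three, gaps only in whole-codon blocks) are what codon alignment is expected to produce, not a documented requirement taken from this page.

```matlab
% Hypothetical stand-ins for two codon-aligned nucleotide sequences.
a = 'ATG---GCCTAA';
b = 'ATGAAAGCCTAA';

% Both sequences should have the same length, in whole codons.
sameLength  = numel(a) == numel(b);
wholeCodons = mod(numel(a), 3) == 0;

% Each codon of a should be either gap-free or entirely gaps.
codonsA     = reshape(a, 3, [])';          % one codon per row
gapCount    = sum(codonsA == '-', 2);      % gaps per codon
gapsAligned = all(gapCount == 0 | gapCount == 3);

fprintf('same length: %d, whole codons: %d, gaps in codon blocks: %d\n', ...
        sameLength, wholeCodons, gapsAligned);
```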
Once you have codon-aligned the two sequences, you can use them as input to other functions such as dnds, which calculates the synonymous and nonsynonymous substitution rates of the codon-aligned nucleotide sequences. By setting Verbose to true, you can also display the codons considered in the computations and their amino acid translations.
[dn,ds] = dnds(hk01_aligned,vt04_aligned,'verbose',true)
emblread | genbankread | genpeptread | getgenbank | getgenpept