featureparse

Parse features from GenBank, GenPept, or EMBL data

Syntax

FeatStruct = featureparse(Features)
FeatStruct = featureparse(Features, ...'Feature', FeatureValue, ...)
FeatStruct = featureparse(Features, ...'Sequence', SequenceValue, ...)

Input Arguments

FeaturesAny of the following:
  • MATLAB® structure with fields corresponding to GenBank®, GenPept, or EMBL data, such as those returned by genbankread, genpeptread, emblread, getgenbank, getgenpept, or getembl

  • Character vector or character array containing the text from the Features section of a GenBank, GenPept, or EMBL-formatted file

FeatureValueName of a feature contained in Features. When specified, featureparse returns only the substructure that corresponds to this feature. If there are multiple features with the same FeatureValue, then FeatStruct is an array of structures.
SequenceValueProperty to control the extraction, when possible, of the sequences respective to each feature, joining and complementing pieces of the source sequence and storing them in the Sequence field of the returned structure, FeatStruct. When extracting the sequence from an incomplete CDS feature, featureparse uses the codon_start qualifier to adjust the frame of the sequence. Choices are true or false (default).

Output Arguments

FeatStructOutput structure containing a field for every database feature. Each field name in FeatStruct matches the corresponding feature name in the GenBank, GenPept, or EMBL database, with the exceptions listed in the table below. Fields in FeatStruct contain substructures with feature qualifiers as fields. In the GenBank, GenPept, and EMBL databases, for each feature, the only mandatory qualifier is its location, which featureparse translates to the field Location. When possible, featureparse also translates this location to numeric indices, creating an Indices field.

Note

If you use the Indices field to extract sequence information, you may need to complement the sequences.

Description

FeatStruct = featureparse(Features) parses the features from Features, which contains GenBank, GenPept, or EMBL features. Features can be a:

  • Character vector or string containing GenBank, GenPept, or EMBL features

  • MATLAB character array including text describing GenBank, GenPept, or EMBL features

  • MATLAB structure with fields corresponding to GenBank, GenPept, or EMBL data, such as those returned by genbankread, genpeptread, emblread, getgenbank, getgenpept, or getembl

FeatStruct is the output structure containing a field for every database feature. Each field name in FeatStruct matches the corresponding feature name in the GenBank, GenPept, or EMBL database, with the following exceptions.

Feature Name in GenBank, GenPept, or EMBL DatabaseField Name in MATLAB Structure
-10_signalminus_10_signal
-35_signalminus_35_signal
3'UTRthree_prime_UTR
3'clip three_prime_clip
5'UTR five_prime_UTR
5'clip five_prime_clip
D-loop D_loop

Fields in FeatStruct contain substructures with feature qualifiers as fields. In the GenBank, GenPept, and EMBL databases, for each feature, the only mandatory qualifier is its location, which featureparse translates to the field Location. When possible, featureparse also translates this location to numeric indices, creating an Indices field.

Note

If you use the Indices field to extract sequence information, you may need to complement the sequences.

FeatStruct = featureparse (Features, ...'PropertyName', PropertyValue, ...) calls featureparse with optional properties that use property name/property value pairs. You can specify one or more properties in any order. Each PropertyName must be enclosed in single quotation marks and is case insensitive. These property name/property value pairs are as follows:

FeatStruct = featureparse(Features, ...'Feature', FeatureValue, ...) returns only the substructure that corresponds to FeatureValue, the name of a feature contained in Features. If there are multiple features with the same FeatureValue, then FeatStruct is an array of structures.

FeatStruct = featureparse(Features, ...'Sequence', SequenceValue, ...) controls the extraction, when possible, of the sequences respective to each feature, joining and complementing pieces of the source sequence and storing them in the field Sequence. When extracting the sequence from an incomplete CDS feature, featureparse uses the codon_start qualifier to adjust the frame of the sequence. Choices are true or false (default).

Examples

Example 16. Obtaining All Features from a GenBank File

The following example obtains all the features stored in the GenBank file nm175642.txt:

gbkStruct = genbankread('nm175642.txt');
features = featureparse(gbkStruct)

features = 

    source: [1x1 struct]
      gene: [1x1 struct]
       CDS: [1x1 struct]

Example 17. Obtaining a Subset of Features from a GenBank Record

The following example obtains only the coding sequences (CDS) feature of the Caenorhabditis elegans cosmid record (accession number Z92777) from the GenBank database:

worm = getgenbank('Z92777');
CDS = featureparse(worm,'feature','cds')

CDS = 

1x12 struct array with fields:
    Location
    Indices
    locus_tag
    standard_name
    note
    codon_start
    product
    protein_id
    db_xref
    translation

Example 18. Extracting Sequences for Each Feature
  1. Retrieve two nucleotide sequences from the GenBank database for the neuraminidase (NA) protein of two strains of the Influenza A virus (H5N1).

     hk01 = getgenbank('AF509094');
     vt04 = getgenbank('DQ094287');
    
  2. Extract the sequence of the coding region for the neuraminidase (NA) protein from the two nucleotide sequences. The sequences of the coding regions are stored in the Sequence fields of the returned structures, hk01_cds and vt04_cds.

    hk01_cds = featureparse(hk01,'feature','CDS','Sequence',true);
    vt04_cds = featureparse(vt04,'feature','CDS','Sequence',true);
    
  3. Once you have extracted the nucleotide sequences, you can use the nt2aa and nwalign functions to align the amino acids sequences converted from the nucleotide sequences.

     [sc,al]=nwalign(nt2aa(hk01_cds),nt2aa(vt04_cds),'extendgap',1);
    
  4. Then you can use the seqinsertgaps function to copy the gaps from the aligned amino acid sequences to their corresponding nucleotide sequences, thus codon-aligning them.

     hk01_aligned = seqinsertgaps(hk01_cds,al(1,:))
     vt04_aligned = seqinsertgaps(vt04_cds,al(3,:))
    
  5. Once you have code aligned the two sequences, you can use them as input to other functions such as dnds, which calculates the synonymous and nonsynonymous substitutions rates of the codon-aligned nucleotide sequences. By setting Verbose to true, you can also display the codons considered in the computations and their amino acid translations.

    [dn,ds] = dnds(hk01_aligned,vt04_aligned,'verbose',true)
    
Introduced in R2006b