High-throughput sequencing instruments produce large amounts of sequence read data that can be challenging to store and manage. Using objects to contain this data lets you easily access, manipulate, and filter the data.
Bioinformatics Toolbox™ includes two objects for working with sequence read data.
Object | Contains This Information | Construct from One of These |
---|---|---|
BioRead
|
| |
BioMap
|
|
A BioRead
object represents a collection of sequence reads. Each
element in the object is associated with a sequence, sequence header, and sequence quality
information.
Construct a BioRead
object in one of two ways:
Indexed — The data remains in the source
file. Constructing the object and accessing its contents is memory efficient. However,
you cannot modify object properties, other than the Name
property.
This is the default method if you construct a BioRead
object from
a FASTQ- or SAM-formatted file.
In Memory — The data is read into memory.
Constructing the object and accessing its contents is limited by the amount of
available memory. However, you can modify object properties. When you construct a
BioRead
object from a FASTQ structure or cell arrays, the data
is read into memory. When you construct a BioRead
object from a
FASTQ- or SAM-formatted file, use the InMemory
name-value pair
argument to read the data into memory.
Note
This example constructs a BioRead
object from a FASTQ-formatted
file. Use similar steps to construct a BioRead
object from a
SAM-formatted file.
Use the BioRead
constructor function to construct a
BioRead
object from a FASTQ-formatted file and set the
Name
property:
BRObj1 = BioRead('SRR005164_1_50.fastq', 'Name', 'MyObject')
BRObj1 = BioRead with properties: Quality: [50x1 File indexed property] Sequence: [50x1 File indexed property] Header: [50x1 File indexed property] NSeqs: 50 Name: 'MyObject'
The constructor function construct a BioRead
object and, if an
index file does not already exist, it also creates an index file with the same file name,
but with an .IDX extension. This index file, by default, is stored in the same location as
the source file.
Caution
Your source file and index file must always be in sync.
After constructing a BioRead
object, do not modify the
index file, or you can get invalid results when using the existing object or
constructing new objects.
If you modify the source file, delete the index file, so the object constructor creates a new index file when constructing new objects.
Note
Because you constructed this BioRead
object from a source file,
you cannot modify the properties (except for Name
) of the
BioRead
object.
A BioMap
object represents a collection of sequence reads that
map against a single reference sequence. Each element in the object is associated with a
read sequence, sequence header, sequence quality information, and alignment/mapping
information.
When constructing a BioMap
object from a BAM file, the maximum
size of the file is limited by your operating system and available memory.
Construct a BioMap
object in one of two ways:
Indexed — The data remains in the source
file. Constructing the object and accessing its contents is memory efficient. However,
you cannot modify object properties, other than the Name
property.
This is the default method if you construct a BioMap
object from
a SAM- or BAM-formatted file.
In Memory — The data is read into memory.
Constructing the object and accessing its contents is limited by the amount of
available memory. However, you can modify object properties. When you construct a
BioMap
object from a structure, the data stays in memory. When
you construct a BioMap
object from a SAM- or BAM-formatted file,
use the InMemory
name-value pair argument to read the data into
memory.
Note
This example constructs a BioMap
object from a SAM-formatted
file. Use similar steps to construct a BioMap
object from a
BAM-formatted file.
If you do not know the number and names of the reference sequences in your source
file, determine them using the saminfo
or baminfo
function and the
ScanDictionary
name-value pair argument.
samstruct = saminfo('ex2.sam', 'ScanDictionary', true); samstruct.ScannedDictionary
ans = 'seq1' 'seq2'
Tip
The previous syntax scans the entire SAM file, which is time consuming. If you
are confident that the Header information of the SAM file is correct, omit the
ScanDictionary
name-value pair argument, and inspect the
SequenceDictionary
field instead.
Use the BioMap
constructor function to construct a
BioMap
object from the SAM file and set the
Name
property. Because the SAM-formatted file in this example,
ex2.sam
, contains multiple reference sequences, use the
SelectRef
name-value pair argument to specify one reference
sequence, seq1
:
BMObj2 = BioMap('ex2.sam', 'SelectRef', 'seq1', 'Name', 'MyObject')
BMObj2 = BioMap with properties: SequenceDictionary: 'seq1' Reference: [1501x1 File indexed property] Signature: [1501x1 File indexed property] Start: [1501x1 File indexed property] MappingQuality: [1501x1 File indexed property] Flag: [1501x1 File indexed property] MatePosition: [1501x1 File indexed property] Quality: [1501x1 File indexed property] Sequence: [1501x1 File indexed property] Header: [1501x1 File indexed property] NSeqs: 1501 Name: 'MyObject'
The constructor function constructs a BioMap
object and, if index
files do not already exist, it also creates one or two index files:
If constructing from a SAM-formatted file, it creates one index file that has the same file name as the source file, but with an .IDX extension. This index file, by default, is stored in the same location as the source file.
If constructing from a BAM-formatted file, it creates two index files that have the same file name as the source file, but one with a .BAI extension and one with a .LINEARINDEX extension. These index files, by default, are stored in the same location as the source file.
Caution
Your source file and index files must always be in sync.
After constructing a BioMap
object, do not modify the index
files, or you can get invalid results when using the existing object or constructing
new objects.
If you modify the source file, delete the index files, so the object constructor creates new index files when constructing new objects.
Note
Because you constructed this BioMap
object from a source file,
you cannot modify the properties (except for Name
and
Reference
) of the BioMap
object.
Note
This example constructs a BioMap
object from a SAM structure
using samread
. Use similar steps to construct a
BioMap
object from a BAM structure using
bamread
.
Use the samread
function to create a SAM
structure from a SAM-formatted file:
SAMStruct = samread('ex2.sam');
To construct a valid BioMap
object from a SAM-formatted file,
the file must contain only one reference sequence. Determine the number and names of
the reference sequences in your SAM-formatted file using the unique
function to find unique names in the
ReferenceName
field of the structure:
unique({SAMStruct.ReferenceName})
ans = 'seq1' 'seq2'
Use the BioMap
constructor function to construct a
BioMap
object from a SAM structure. Because the SAM structure
contains multiple reference sequences, use the SelectRef
name-value
pair argument to specify one reference sequence, seq1
:
BMObj1 = BioMap(SAMStruct, 'SelectRef', 'seq1')
BMObj1 = BioMap with properties: SequenceDictionary: {'seq1'} Reference: {1501x1 cell} Signature: {1501x1 cell} Start: [1501x1 uint32] MappingQuality: [1501x1 uint8] Flag: [1501x1 uint16] MatePosition: [1501x1 uint32] Quality: {1501x1 cell} Sequence: {1501x1 cell} Header: {1501x1 cell} NSeqs: 1501 Name: ''
You can retrieve all or a subset of information from a BioRead
or
BioMap
object.
You can retrieve a specific property from elements in a BioRead
or BioMap
object.
For example, to retrieve all headers from a BioRead
object, use
the Header
property as follows:
allHeaders = BRObj1.Header;
This syntax returns a cell array containing the headers for all elements in the
BioRead
object.
Similarly, to retrieve all start positions of aligned read sequences from a
BioMap
object, use the Start
property of the
object:
allStarts = BMObj1.Start;
This syntax returns a vector containing the start positions of aligned read sequences
with respect to the position numbers in the reference sequence in a
BioMap
object.
You can retrieve multiple properties from a BioRead
or
BioMap
object in a single command using the get
method. For example, to retrieve both start positions and headers
information of a BioMap
object, use the get
method as
follows:
multiProp = get(BMObj1, {'Start', 'Header'});
This syntax returns a cell array containing all start positions and headers
information of a BioMap
object.
Use specialized get
methods with a numeric vector, logical
vector, or cell array of headers to retrieve a subset of information from an object. For
example, to retrieve the first 10 elements from a BioRead
object, use
the getSubset
method:
newBRObj = getSubset(BRObj1, [1:10]);
This syntax returns a new BioRead
object containing the first 10
elements in the original BioRead
object.
For example, to retrieve the first 12 positions of sequences with headers SRR005164.1,
SRR005164.7, and SRR005164.16, use the getSubsequence
method:
subSeqs = getSubsequence(BRObj1, ... {'SRR005164.1', 'SRR005164.7', 'SRR005164.16'}, [1:12]')
subSeqs = 'TGGCTTTAAAGC' 'CCCGAAAGCTAG' 'AATTTTGCGGCT'
For example, to retrieve information about the third element in a
BioMap
object, use the getInfo
method:
Info_3 = getInfo(BMObj1, 3);
This syntax returns a tab-delimited character vector containing this information for the third element:
Sequence header
SAM flags for the sequence
Start position of the aligned read sequence with respect to the reference sequence
Mapping quality score for the sequence
Signature (CIGAR-formatted character vector) for the sequence
Sequence
Quality scores for sequence positions
To modify properties (other than Name
and
Reference
) of a BioRead
or
BioMap
object, the data must be in memory, and not indexed. To
ensure the data is in memory, do one of the following:
Construct the object from a structure as described in Construct a BioMap Object from a SAM or BAM Structure.
Construct the object from a source file using the InMemory
name-value pair argument.
First, create an object with the data in memory:
BRObj1 = BioRead('SRR005164_1_50.fastq','InMemory',true);
To provide custom headers for sequences of interest (in this case sequences 1 to 5), do the following:
BRObj1.Header(1:5) = {'H1', 'H2', 'H3', 'H4', 'H5'};
Alternatively, you can use the setHeader
method:
BRObj1 = setHeader(BRObj1, {'H1', 'H2', 'H3', 'H4', 'H5'}, [1:5]);
Several other specialized set
methods let you set the properties
of a subset of elements in a BioRead
or BioMap
object.
When working with a BioMap
object, you can determine the number of
read sequences that:
Align within a specific region of the reference sequence
Align to each position within a specific region of the reference sequence
For example, you can compute the number, indices, and start positions of the read
sequences that align within the first 25 positions of the reference sequence. To do so, use
the getCounts
, getIndex
, and getStart
methods:
Cov = getCounts(BMObj1, 1, 25)
Cov = 12
Indices = getIndex(BMObj1, 1, 25)
Indices = 1 2 3 4 5 6 7 8 9 10 11 12
startPos = getStart(BMObj1, Indices)
startPos = 1 3 5 6 9 13 13 15 18 22 22 24
The first two syntaxes return the number and indices of the read sequences that align within the specified region of the reference sequence. The last syntax returns a vector containing the start position of each aligned read sequence, corresponding to the position numbers of the reference sequence.
For example, you can also compute the number of the read sequences that align to
each of the first 10 positions of the reference sequence. For this
computation, use the getBaseCoverage
method:
Cov = getBaseCoverage(BMObj1, 1, 10)
Cov = 1 1 2 2 3 4 4 4 5 5
It is useful to construct and view the alignment of the read sequences that align to a
specific region of the reference sequence. It is also helpful to know which read sequences
align to this region in a BioMap
object.
For example, to retrieve the alignment of read sequences to the first 12 positions of
the reference sequence in a BioMap
object, use the getAlignment
method:
[Alignment_1_12, Indices] = getAlignment(BMObj2, 1, 12)
Alignment_1_12 = CACTAGTGGCTC CTAGTGGCTC AGTGGCTC GTGGCTC GCTC Indices = 1 2 3 4 5
Return the headers of the read sequences that align to a specific region of the reference sequence:
alignedHeaders = getHeader(BMObj2, Indices)
alignedHeaders = 'B7_591:4:96:693:509' 'EAS54_65:7:152:368:113' 'EAS51_64:8:5:734:57' 'B7_591:1:289:587:906' 'EAS56_59:8:38:671:758'
SAM- and BAM-formatted files include the status of 11 binary flags for each read
sequence. These flags describe different sequencing and alignment aspects of a read
sequence. For more information on the flags, see the SAM Format
Specification. The filterByFlag
method lets you filter the read
sequences in a BioMap
object by using these flags.
Construct a BioMap
object from a SAM-formatted file.
BMObj2 = BioMap('ex1.sam');
Use the filterByFlag
method to create a logical
vector indicating the read sequences in a BioMap
object that are
mapped.
LogicalVec_mapped = filterByFlag(BMObj2, 'unmappedQuery', false);
Use this logical vector and the getSubset
method to create a new BioMap
object
containing only the mapped read sequences.
filteredBMObj_1 = getSubset(BMObj2, LogicalVec_mapped);
Construct a BioMap
object from a SAM-formatted file.
BMObj2 = BioMap('ex1.sam');
Use the filterByFlag
method to create a logical
vector indicating the read sequences in a BioMap
object that are
mapped in a proper pair, that is, both the read sequence and its mate are mapped to
the reference sequence.
LogicalVec_paired = filterByFlag(BMObj2, 'pairedInMap', true);
Use this logical vector and the getSubset
method to create a new BioMap
object
containing only the read sequences that are mapped in a proper pair.
filteredBMObj_2 = getSubset(BMObj2, LogicalVec_paired);