bamread

Read data from BAM file

Syntax

BAMStruct = bamread(File,RefSeq,Range)
[BAMStruct,HeaderStruct] = bamread(File,RefSeq,Range)
... = bamread(File,RefSeq,Range,Name,Value)

Description

BAMStruct = bamread(File,RefSeq,Range) reads the alignment records in File, a BAM-formatted file, that align to RefSeq, a reference sequence, in the range specified by Range. It returns the alignment data in BAMStruct, a MATLAB® array of structures.

[BAMStruct,HeaderStruct] = bamread(File,RefSeq,Range) also returns the header information in HeaderStruct, a MATLAB structure.

... = bamread(File,RefSeq,Range,Name,Value) reads the alignment records with additional options specified by one or more Name,Value pair arguments.

Input Arguments

File

Character vector or string specifying a file name or path and file name of a BAM-formatted file. If you specify only a file name, that file must be on the MATLAB search path or in the Current Folder.

Note

The function requires the BAM file to be ordered, except when returning reads that are not mapped to any reference.

RefSeq

Either of the following:

  • Character vector or string specifying the name of a reference sequence in the BAM file.

  • Positive integer specifying the index of a reference sequence in the BAM file. This number is also the index of the reference sequence in the Reference field of the InfoStruct structure returned by baminfo.
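The two forms of RefSeq can be sketched as follows. This assumes that ex1.bam (the sample file used in the Examples section) is on the search path and that seq1 is the first reference sequence in that file; check both for your own data.

```matlab
% Assumption: 'seq1' is the first reference sequence in ex1.bam.
byName  = bamread('ex1.bam', 'seq1', [100 200]);   % reference given by name
byIndex = bamread('ex1.bam', 1, [100 200]);        % reference given by index

% If the assumption holds, both calls return the same number of records.
sameCount = numel(byName) == numel(byIndex)
```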

Range

Two-element vector specifying the begin and end positions of the range on the reference sequence, RefSeq. Both values must be positive and are one-based. The second value must be greater than or equal to the first value.
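As a minimal sketch, the constraints on Range can be checked before calling bamread. The isValidRange helper below is hypothetical and is not part of bamread or the toolbox.

```matlab
% Hypothetical helper: true when r is a two-element, one-based,
% non-decreasing range as described above.
isValidRange = @(r) isnumeric(r) && numel(r) == 2 && ...
                    all(r >= 1) && r(2) >= r(1);

isValidRange([100 200])   % valid range
isValidRange([200 100])   % invalid: end precedes begin
isValidRange([0 50])      % invalid: positions are one-based
```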

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

'Full'

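Logical controlling whether bamread returns only the alignment records that are fully contained within the range specified by Range. When false, records that only partially overlap the range are also returned. Choices are true or false (default).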

Default: false
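To illustrate what "fully contained" means, the sketch below derives the last reference position of one record from its Position and CigarString values and compares both ends with the range. The record values are hypothetical, and this shows one way to reason about containment, not necessarily how bamread implements the 'Full' option.

```matlab
% Hypothetical record values for illustration (not read from a file).
position = uint32(120);        % one-based leftmost aligned position
cigar    = '30M2I10M1D5M';     % CIGAR string of the record
range    = [100 200];

% Sum the lengths of the operations that consume reference bases
% (M, D, N, =, X). Insertions and clipping do not advance the reference.
tokens  = regexp(cigar, '(\d+)([MIDNSHP=X])', 'tokens');
refSpan = 0;
for k = 1:numel(tokens)
    if any(tokens{k}{2} == 'MDN=X')
        refSpan = refSpan + str2double(tokens{k}{1});
    end
end

% Last reference position covered by the alignment.
lastPos = double(position) + refSpan - 1;

% Fully contained only if both ends fall inside the range.
isFullyContained = position >= range(1) && lastPos <= range(2)
```

Here the alignment covers 46 reference bases (positions 120 through 165), so it lies entirely inside [100 200].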

'Tags'

Controls the reading of the optional tags in addition to the first 11 fields for each alignment in the BAM-formatted file. Choices are true (default) or false.

Default: true

'ToFile'

Character vector or string specifying a nonexisting file name or a path and file name for saving the alignment records in the specified range of a specific reference sequence. The ToFile name-value pair argument creates a SAM-formatted file. If you specify only a file name, the file is saved to the MATLAB Current Folder.

The SAM-formatted file is always one-based, even if you set the ZeroBased name-value pair argument to true. You can use the SAM-formatted file as input when creating a BioMap object.

'ZeroBased'

Logical specifying whether bamread returns zero-based or one-based positions in the Position and MatePosition fields of BAMStruct. Choices are true (zero-based positions) or false (default, one-based positions).

This name-value pair argument affects the Position and MatePosition fields of BAMStruct. It does not affect the Range input argument or the SAM file created when using the ToFile name-value pair argument. SAM files are always one-based.

Caution

If you plan to use the BAMStruct output argument to construct a BioMap object, make sure the ZeroBased name-value pair argument is false.

Default: false

Output Arguments

BAMStruct

An N-by-1 array of structures containing sequence alignment and mapping information from a BAM-formatted file, where N is the number of alignment records stored in the specified range. Each structure contains the following fields.

QueryName

Name of the read sequence (if unpaired) or the name of the sequence pair (if paired).

Flag

Integer indicating the bit-wise information that specifies the status of each of 11 flags described by the SAM format specification.

Tip

You can use the bitget function to determine the status of a specific SAM flag.

ReferenceIndex

Index of the reference sequence.

Tip

To convert this index to a reference name, see the Reference field in the HeaderStruct output argument.

Position

Position of the forward reference sequence where the leftmost base of the alignment of the read sequence starts. This position is zero-based or one-based, depending on the ZeroBased name-value pair argument.

MappingQuality

Integer specifying the mapping quality score for the read sequence.

CigarString

CIGAR-formatted string representing how the read sequence aligns with the reference sequence.

MateReferenceIndex

Index of the reference sequence associated with the mate. If there is no mate, then this value is 0.

MatePosition

Position of the forward reference sequence where the leftmost base of the alignment of the mate of the read sequence starts. This position is zero-based or one-based, depending on the ZeroBased name-value pair argument.

InsertSize

The number of base positions between the read sequence and its mate, when both are mapped to the same reference sequence. Otherwise, this value is 0.

Sequence

Character vector containing the letter representations of the read sequence. It is the reverse complement if the read sequence aligns to the reverse strand of the reference sequence.

Quality

Character vector containing the ASCII representation of the per-base quality score for the read sequence. The quality score is reversed if the read sequence aligns to the reverse strand of the reference sequence.

Tags

List of applicable SAM tags and their values.
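The Flag field can be decoded with bitget, as the tip in the Flag description suggests. The sketch below uses a hypothetical flag value of 99 and descriptive flag names chosen for illustration; the bit order follows the SAM format specification, with bit 1 corresponding to 0x1.

```matlab
flag = uint16(99);   % hypothetical value, for example data(1).Flag

% Illustrative names for the 11 SAM flags, in bit order (bit 1 = 0x1).
names = {'paired', 'properPair', 'unmapped', 'mateUnmapped', ...
         'reverseStrand', 'mateReverseStrand', 'firstInPair', ...
         'secondInPair', 'secondary', 'failedQC', 'duplicate'};

% bitget returns the value of each requested bit (bit 1 is least significant).
bits = logical(bitget(flag, 1:numel(names)));

setFlags  = names(bits)   % names of the flags that are set
isReverse = bits(5)       % true if the read aligns to the reverse strand
```

For a flag of 99 (binary 1100011), the set flags are paired, properPair, mateReverseStrand, and firstInPair, and isReverse is false.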

HeaderStruct

MATLAB structure containing header information for the BAM-formatted file in the following fields.

NRefs

Number of reference sequences in the BAM-formatted file.
Reference

1-by-NRefs array of structures containing these fields:

  • Name — Name of the reference sequence.

  • Length — Length of the reference sequence.

Header*

Structure containing the file format version, sort order, and group order.
SequenceDictionary*

Structure containing the:

  • Sequence name

  • Sequence length

  • Genome assembly identifier

  • MD5 checksum of sequence

  • URI of sequence

  • Species

ReadGroup*

Structure containing the:

  • Read group identifier

  • Sample

  • Library

  • Description

  • Platform unit

  • Predicted median insert size

  • Sequencing center

  • Date

  • Platform

Program*

Structure containing the:

  • Program name

  • Version

  • Command line

* These structures and their fields appear in the output structure only if they are present in the BAM file. The information in these structures depends on the information present in the BAM file.
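As a sketch of how the header can be used, the following lists the reference sequences in HeaderStruct and looks up the reference name for the first returned record. It assumes that ex1.bam is on the search path and that ReferenceIndex is a one-based index into the Reference array; the guard skips the lookup if that assumption does not hold.

```matlab
[data, header] = bamread('ex1.bam', 'seq1', [100 200]);

% List each reference sequence with its length.
for k = 1:header.NRefs
    fprintf('%d: %s (%d bp)\n', k, header.Reference(k).Name, ...
            header.Reference(k).Length);
end

% Assumption: ReferenceIndex is a one-based index into header.Reference.
idx = double(data(1).ReferenceIndex);
refName = '';
if idx >= 1 && idx <= header.NRefs
    refName = header.Reference(idx).Name;
end
refName
```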

Examples

Read multiple alignment records from the ex1.bam file that align to two different reference sequences.

data1 = bamread('ex1.bam', 'seq1', [100 200])
data1=59×1 struct array with fields:
    QueryName
    Flag
    Position
    MappingQuality
    CigarString
    MatePosition
    InsertSize
    Sequence
    Quality
    Tags
    ReferenceIndex
    MateReferenceIndex

data2 = bamread('ex1.bam', 'seq2', [100 200])
data2=79×1 struct array with fields:
    QueryName
    Flag
    Position
    MappingQuality
    CigarString
    MatePosition
    InsertSize
    Sequence
    Quality
    Tags
    ReferenceIndex
    MateReferenceIndex

Read alignments from the ex1.bam file that are fully contained in the 100 to 200 bp range of the seq1 reference sequence.

data3 = bamread('ex1.bam', 'seq1', [100 200], 'full', true)
data3=30×1 struct array with fields:
    QueryName
    Flag
    Position
    MappingQuality
    CigarString
    MatePosition
    InsertSize
    Sequence
    Quality
    Tags
    ReferenceIndex
    MateReferenceIndex

Read alignments from the ex1.bam file that align to the 100 to 300 bp range of the seq1 reference sequence. Read the same alignments using zero-based indexing. Compare the position of the 27th record in the two outputs.

data_one = bamread('ex1.bam','seq1', [100 300]);
data_zero = bamread('ex1.bam','seq1', [100 300], 'zerobased', true);
data_one(27).Position
ans = uint32
    135
data_zero(27).Position
ans = uint32
    134

Tips

  • The bamread function requires a BAM file.

  • Use the baminfo function to investigate the size and content, including reference sequence names, of a BAM-formatted file before using the bamread function to read the file contents into a MATLAB array of structures.

  • If your BAM-formatted file is too large to read using available memory, try either of the following:

    • Use a smaller range.

    • Use bamread without specifying outputs, and use the ToFile name-value pair argument to create a SAM-formatted file. You can then use samread with the BlockRead name-value pair argument to read the SAM-formatted file in blocks. Alternatively, pass the SAM-formatted file to the BioIndexedFile constructor function to construct a BioIndexedFile object, which you can use to create a BioMap object.

  • Use the BAMStruct output argument that bamread returns to construct a BioMap object, which lets you explore, access, filter, and manipulate all or a subset of the data, before doing subsequent analyses or viewing the data.
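The large-file workflow from the tips above can be sketched as follows. The output file name and the block size are hypothetical choices, and the example assumes that ex1.bam is on the search path and that the Current Folder is writable.

```matlab
% Hypothetical output name. ToFile requires a file that does not yet exist,
% so the write step is skipped if the file is already present.
samFile = 'ex1_seq1_100_200.sam';

if exist(samFile, 'file') ~= 2
    % No output arguments: the records go to the SAM file instead of memory.
    bamread('ex1.bam', 'seq1', [100 200], 'ToFile', samFile);
end

% Read only the first 10 records of the SAM file (block size is arbitrary).
block = samread(samFile, 'BlockRead', [1 10]);
numel(block)
```

Reading further blocks, for example [11 20], keeps memory use bounded by the block size rather than by the size of the file.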

Introduced in R2010b