cigar2align

Convert unaligned sequences to aligned sequences using signatures in CIGAR format

Syntax

Alignment = cigar2align(Seqs,Cigars)
[GapSeq, Indices] = cigar2align(Seqs,Cigars)
... = cigar2align(Seqs,Cigars,Name,Value)

Description

Alignment = cigar2align(Seqs,Cigars) converts unaligned sequences in Seqs, a cell array of character vectors or string vector, into Alignment, a matrix of aligned sequences, using the information stored in Cigars, a cell array of CIGAR–formatted character vectors or string vector.

[GapSeq, Indices] = cigar2align(Seqs,Cigars) converts unaligned sequences in Seqs, a cell array of character vectors or string vector, into GapSeq, a cell array of character vectors of aligned sequences, and also returns Indices, a vector of numeric indices, using the information stored in Cigars, a cell array of CIGAR–formatted character vectors or string vector. When an alignment has many columns, this syntax uses less memory and is faster.

... = cigar2align(Seqs,Cigars,Name,Value) converts unaligned sequences in Seqs, a cell array of character vectors or string vector, into Alignment, a matrix of aligned sequences, using the information stored in Cigars, a cell array of CIGAR–formatted character vectors or string vector, with additional options specified by one or more Name,Value pair arguments.

Input Arguments

Seqs

Cell array of character vectors or string vector containing unaligned sequences. Seqs must contain the same number of elements as Cigars.

Cigars

Cell array of valid CIGAR–formatted character vectors or string vector. Cigars must contain the same number of elements as Seqs.

Name-Value Pair Arguments

Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.

'Start'

Vector of positive integers specifying the reference sequence position at which each aligned sequence starts. By default, each aligned sequence starts at position 1 of the reference sequence.

'GapsInRef'

Logical specifying whether to display positions in the aligned sequences that correspond to gaps in the reference sequence. Choices are true (1) or false (0). If your reference sequence has gaps and you set GapsInRef to false (0), and then later use Alignment as input to align2cigar, the returned CIGAR–formatted character vectors will not match the original ones.

Default: false (0)

'SoftClipping'

Logical specifying whether to include characters in the aligned read sequences corresponding to soft clipping ends. Choices are true (1) or false (0).

Default: false (0)

'OffsetPad'

Logical specifying whether to add padding blanks to the left of each aligned read sequence to represent the offset of the start position from the first position of the reference sequence. Choices are true (1) or false (0). When false, the matrix of aligned sequences starts at the start position of the leftmost aligned read sequence.

Default: false (0)

Output Arguments

Alignment

Matrix of aligned sequences, in which the number of rows equals the number of character vectors in Seqs.

GapSeq

Cell array of character vectors of aligned sequences, in which the number character vectors equals the number of character vectors in Seqs.

Indices

Vector of numeric indices indicating the starting column for each aligned sequence in Alignment. These indices are not necessarily the same as the start positions in the reference sequence for each aligned sequence. This is because either of the following:

  • The reference sequence can be extended to account for insertions.

  • An aligned sequence can have leading soft clippings, padding, or insertion characters.

Examples

Create a cell array of character vectors containing unaligned sequences, create a cell array of corresponding CIGAR–formatted character vectors associated with a reference sequence of ACGTATGC, and then reconstruct the alignment:

r = {'ACGACTGC', 'ACGTTGC', 'AGGTATC'}; % unaligned sequences
c = {'3M1D1M1I3M', '4M1D1P3M', '5M1P1M1D1M'}; % cigar-formatted
aln1 = cigar2align(r, c)
aln1 =

ACG-ATGC
ACGT-TGC
AGGTAT-C

Reconstruct the same alignment to display positions in the aligned sequences that correspond to gaps in the reference sequence:

aln2 = cigar2align(r, c,'GapsInRef',true)
aln2 =

ACG-ACTGC
ACGT--TGC
AGGTA-T-C

Reconstruct the alignment adding an offset padding of 5:

aln3 = cigar2align(r, c, 'start', [5 5 5], 'OffsetPad', true)
aln3 =

    ACG-ATGC
    ACGT-TGC
    AGGTAT-C

Algorithms

When cigar2align reconstructs the alignment, it does not display hard clipped positions (H) or soft clipped positions (S). Also, it does not consider soft clipped positions as start positions for aligned sequences.

Alternatives

If your CIGAR information is captured in the Signature property of a BioMap object, you can use the getAlignment method to construct the alignment.

Introduced in R2010b