Trim sequences based on specified criterion
seqtrim(fastqFile) trims the sequences in fastqFile and saves the trimmed sequences in new FASTQ files. By default, the trimmed sequences are saved under file names with the suffix '_trimmed' appended. If you do not specify a trimming criterion, the function uses the default criterion, 'MaxNumberLowQualityBases' with a 'Threshold' of [0 10].
seqtrim(fastqFile,Name,Value) uses additional options specified by one or more Name,Value pair arguments.
[outFiles,nSeqTrimmed,nSeqUntrimmed] = seqtrim(___) returns a cell array outFiles with the names of the output files. nSeqTrimmed and nSeqUntrimmed are the numbers of sequences trimmed and left untrimmed in each input file, respectively.
Trim each sequence when the number of bases with quality below 20 is greater than 3 within a sliding window of size 25.
[outFile,nt,unt] = seqtrim('SRR005164_1_50.fastq', 'Method', 'MaxNumberLowQualityBases', ...
    'Threshold', [3 20], 'WindowSize', 25);
Check the number of sequences that were trimmed.
nt
nt = 36
Check the number of sequences that were left untrimmed.
unt
unt = 14
Trim the first 10 bases of each sequence.
[outfile,nt] = seqtrim('SRR005164_1_50.fastq','Method','Termini', ...
    'Threshold',[10 0]);
Trim the last 5 bases of each sequence.
[outfile,nt] = seqtrim('SRR005164_1_50.fastq','Method','Termini', ...
    'Threshold',[0 5]);
Trim each sequence at position 50, keeping bases 1 through 50.
[outfile,nt] = seqtrim('SRR005164_1_50.fastq','Method','BasePositions', ...
    'Threshold',[1 50]);
Trim each sequence when the running average base quality becomes less than 20.
[outFile,nt,unt] = seqtrim('SRR005164_1_50.fastq','Method','MeanQuality', ...
    'Threshold',20)
Trim each sequence when the percentage of bases with quality below 10 is more than 15.
[outFile,nt,unt] = seqtrim('SRR005164_1_50.fastq','Method','MaxPercentLowQualityBases', ...
    'Threshold',[15 10])
fastqFile — Names of FASTQ files with sequence and quality information
Names of FASTQ-formatted files with sequence and quality information, specified as a character vector, string, string vector, or cell array of character vectors.
Example: 'SRR005164_1_50.fastq'
Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.
Example: 'Method','MaxNumberLowQualityBases','Threshold',[3 20] specifies to trim each sequence when the number of bases with quality below 20 is greater than 3.
'Method' — Criterion to trim sequences
'MaxNumberLowQualityBases' (default) | 'MaxPercentLowQualityBases' | 'MeanQuality' | 'BasePositions' | 'Termini'
Criterion to trim sequences, specified as one of the following options. Specify only one trimming criterion per function call.
'MaxNumberLowQualityBases' – applies a maximum threshold on the number of low-quality bases allowed before trimming a sequence starting at the 5' end.
'MaxPercentLowQualityBases' – applies a maximum threshold on the percentage of low-quality bases allowed before trimming a sequence starting at the 5' end.
'MeanQuality' – applies a minimum threshold on the running average base quality allowed before trimming a sequence starting at the 5' end.
'BasePositions' – trims each sequence according to the base positions (first base and last base) starting at the 5' end.
'Termini' – trims each sequence from either the 5' or 3' end or from both ends.
Use this name-value pair argument together with 'Threshold' to specify the appropriate threshold value. Depending on the trimming criterion, the corresponding value for 'Threshold' varies. See the 'Threshold' option for the default values.
Note
Sequences that become empty after trimming are saved in the output files as empty sequences. To remove empty sequences from files, use the seqfilter function with the 'MinLength' option set to 1.
Example: 'Method','MaxNumberLowQualityBases','Threshold',[5 15]
'Threshold' — Threshold value for trimming criterion
Threshold value for the trimming criterion specified by 'Method', specified as a scalar or two-element vector, depending on the criterion. If you do not specify 'Threshold', the function uses the default threshold value of the corresponding method. For each trimming criterion, the function interprets base qualities using the encoding format specified by the 'Encoding' name-value pair argument.
'Method' | 'Threshold' | Default 'Threshold' value |
---|---|---|
'MaxNumberLowQualityBases' | Two-element vector [V1 V2]. V1 is a nonnegative integer that specifies the maximum number of low-quality bases allowed before trimming. V2 specifies the minimum base quality. Any base with quality less than V2 is considered a low-quality base. | [0 10] |
'MaxPercentLowQualityBases' | Two-element vector [V1 V2]. V1 is a scalar between 0 and 100 that specifies the maximum percentage of low-quality bases allowed before trimming. V2 specifies the minimum base quality. Any base with quality less than V2 is considered a low-quality base. | [0 10] |
'MeanQuality' | Positive scalar that specifies the minimum threshold on the running average base quality allowed before trimming a sequence starting at the 5' end. | 0 |
'BasePositions' | Two-element vector [V1 V2]. V1 and V2 are the positions of the first and last bases to keep, counted from the 5' end. To trim only the 5' end, set V2 to Inf. To trim only the 3' end, set V1 to 1. | [1 Inf], that is, each sequence is left untrimmed. |
'Termini' | Two-element vector [V1 V2]. V1 and V2 are the numbers of bases to trim from the 5' and 3' ends, respectively. To trim V1 bases at the 5' end only, set V2 to 0. To trim V2 bases at the 3' end only, set V1 to 0. | [0 0], that is, each sequence is left untrimmed. |
Example: 'Method','MaxPercentLowQualityBases','Threshold',[10 20]
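As a rough illustration of how the two position-based criteria relate, the following base MATLAB sketch applies both to a made-up read using plain indexing. It assumes that 'BasePositions' keeps bases V1 through V2 and that 'Termini' removes V1 bases from the 5' end and V2 bases from the 3' end. The read and the variable names are hypothetical, and this is not the toolbox implementation.

```matlab
% Illustration only: plain indexing on a made-up read, not a call to seqtrim.
seq = 'ACGTACGTACGT';                  % hypothetical 12-base read

% Assumed meaning of 'BasePositions',[3 10]: keep bases 3 through 10.
pos = [3 10];
lastBase = min(pos(2), numel(seq));    % an Inf upper bound means "to the end"
byPosition = seq(pos(1):lastBase);

% Assumed meaning of 'Termini',[2 2]: drop 2 bases from each end.
ends = [2 2];
byTermini = seq(ends(1)+1 : end-ends(2));

disp(byPosition)
disp(byTermini)
```

Under these assumptions both settings keep the same eight bases, so either criterion can express a fixed-position trim when all reads have the same length.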
'WindowSize' — Size of sliding window to apply trimming criterion to sequence
Inf (default) | positive integer
Size of the sliding window to apply the trimming criterion to a sequence, specified as a positive integer. The size of the window corresponds to the number of bases that the function uses at one time to apply the criterion. Any given sequence is trimmed before the first base of the window that violates the given criterion.
The sliding window can be applied to the following methods: 'MaxNumberLowQualityBases', 'MaxPercentLowQualityBases', and 'MeanQuality'.
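To make the window rule concrete, here is a minimal base MATLAB sketch of a sliding-window count in the style of 'MaxNumberLowQualityBases'. It is an illustration under stated assumptions, not the seqtrim implementation: the quality scores, the variable names, and the choice to slide the window one base at a time are all hypothetical.

```matlab
% Illustration only: a base MATLAB sketch of a sliding-window count,
% not the seqtrim implementation.
q = [30 32 31 28 12 9 25 8 7 30 5 6];   % hypothetical Phred scores, 5' to 3'
windowSize = 5;     % plays the role of 'WindowSize'
maxLowQual = 2;     % V1 in 'Threshold',[V1 V2]
minQuality = 20;    % V2 in 'Threshold',[V1 V2]

if numel(q) < windowSize
    keepLength = 0;                      % read shorter than the window
else
    isLow = double(q < minQuality);      % 1 marks a low-quality base
    % Low-quality count per window; element k is the window starting at base k.
    lowPerWindow = conv(isLow, ones(1, windowSize), 'valid');
    firstBad = find(lowPerWindow > maxLowQual, 1);
    if isempty(firstBad)
        keepLength = numel(q);           % no window violates the criterion
    else
        keepLength = firstBad - 1;       % keep bases before that window
    end
end
fprintf('Bases kept: %d of %d\n', keepLength, numel(q));
```

In this made-up read, the window starting at base 4 is the first to contain more than two low-quality bases, so the sketch keeps bases 1 through 3.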
Note
Sequences shorter than the size of the window are saved in the output file as empty sequences. To remove empty sequences from files, use the seqfilter function with the 'MinLength' option set to 1.
Example: 'WindowSize',10
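The notes above recommend seqfilter for removing empty reads. A minimal sketch of that two-step workflow follows. It assumes that seqfilter accepts 'Method','MinLength' with a scalar 'Threshold' and returns outputs analogous to those of seqtrim, so check the seqfilter reference page before relying on these names. The variable names are illustrative.

```matlab
% Step 1: trim with a sliding window. Reads shorter than 25 bases
% are written to the output file as empty sequences.
trimmedFile = seqtrim('SRR005164_1_50.fastq', 'Method', 'MeanQuality', ...
    'Threshold', 20, 'WindowSize', 25);

% Step 2 (assumed seqfilter syntax): keep only reads with at least 1 base.
[filteredFile, nFiltered, nUnfiltered] = seqfilter(trimmedFile, ...
    'Method', 'MinLength', 'Threshold', 1);
```

Both calls write their output files to the current folder because 'OutputDir' is not set.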
'Encoding' — Base quality encoding format
'Illumina18' (default) | 'Sanger' | 'Solexa' | 'Illumina13' | 'Illumina15'
Base quality encoding format, specified as a character vector or string.
Example: 'Encoding','Sanger'
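For context, FASTQ quality characters are ASCII-encoded scores, and the encoding determines the offset used to decode them. The base MATLAB sketch below decodes a made-up quality string with the offset of 33 used by the Sanger and Illumina 1.8+ conventions. Illumina 1.3 and 1.5 use an offset of 64, and Solexa uses 64 with a different score scale. That the toolbox option names map onto these conventions is an assumption here.

```matlab
% Illustration only: decode a hypothetical quality string by hand.
qualityLine = 'II5+!';                  % made-up quality characters
offset = 33;                            % Sanger / Illumina 1.8+ convention
phred = double(qualityLine) - offset;   % ASCII code minus offset
disp(phred)
```

Choosing the wrong encoding shifts every score by the difference in offsets, which changes which bases fall below a quality threshold such as V2.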
'OutputDir' — Relative or absolute path to output file directory
Relative or absolute path to the output file directory, specified as a character vector or string. The default is the current directory.
Example: 'OutputDir','F:\results'
'OutputSuffix' — Suffix to use in output file name
'_trimmed' (default) | character vector | string
Suffix to use in the output file name, specified as a character vector or string. The suffix is inserted after the input file name and before the file extension. The default is '_trimmed'.
Example: 'OutputSuffix','_WindowSize10_trimmed'
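The naming rule described for 'OutputDir' and 'OutputSuffix' can be sketched in base MATLAB as follows. This only mirrors the description above. Whether seqtrim returns full paths or bare file names in outFiles is not stated here, and the variable names are hypothetical.

```matlab
% Illustration only: build the output name that the description implies.
inputFile = 'SRR005164_1_50.fastq';
outputDir = pwd;             % default 'OutputDir' is the current directory
suffix = '_trimmed';         % default 'OutputSuffix'

[~, name, ext] = fileparts(inputFile);
expectedOutput = fullfile(outputDir, [name suffix ext]);
disp(expectedOutput)
```

The file name part of the result is SRR005164_1_50_trimmed.fastq, with the suffix placed between the base name and the extension.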
'UseParallel' — Boolean indicating whether to perform computation in parallel
false (default) | true
Boolean indicating whether to perform computation in parallel, specified as true or false.
For parallel computing, you must have Parallel Computing Toolbox™. If a parallel pool does not exist, one is created automatically when the auto-creation option is enabled in your parallel preferences. Otherwise, computation runs in serial mode.
Note
There is a cost associated with sharing large input files across workers in a distributed environment. In some cases, running in parallel may not be beneficial in terms of performance.
During parallel computations, the work is divided by files, not by sequences, so running in parallel gives no speedup for a single large file.
Example: 'UseParallel',true
outFiles — Output file names
Output file names, returned as a cell array of character vectors.
nSeqTrimmed — Number of sequences trimmed from each input file
Number of sequences trimmed from each input file, returned as a scalar or an n-by-1 vector, where n is the number of input files. If there are multiple input files, the order within nSeqTrimmed corresponds to the order of the input files.
nSeqUntrimmed — Number of sequences left untrimmed in each input file
Number of sequences left untrimmed in each input file, returned as a scalar or an n-by-1 vector, where n is the number of input files. If there are multiple input files, the order within nSeqUntrimmed corresponds to the order of the input files.
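As a sketch of how the per-file outputs line up with several inputs, the following example writes two tiny FASTQ files with made-up reads to a temporary folder and trims them in one call. The file names, reads, and variable names are hypothetical. The useParallel flag is left false so the sketch runs without Parallel Computing Toolbox; set it to true if that toolbox is installed.

```matlab
% Write two tiny FASTQ files with made-up reads so the sketch is self-contained.
workDir = tempname;
mkdir(workDir);
files = {fullfile(workDir, 'sampleA.fastq'), fullfile(workDir, 'sampleB.fastq')};
for k = 1:numel(files)
    fid = fopen(files{k}, 'w');
    fprintf(fid, '@read1\nACGTACGTAC\n+\nIIIIIIII##\n');
    fprintf(fid, '@read2\nTTGGCCAATT\n+\nIIIIIIIIII\n');
    fclose(fid);
end

useParallel = false;   % set to true if Parallel Computing Toolbox is installed

% Trim 2 bases from the 5' end of every read in both files.
[outFiles, nTrimmed, nUntrimmed] = seqtrim(files, ...
    'Method', 'Termini', 'Threshold', [2 0], ...
    'OutputDir', workDir, 'UseParallel', useParallel);

% The counts are reported per input file, in input order.
for k = 1:numel(files)
    fprintf('%s: %d trimmed, %d untrimmed\n', files{k}, nTrimmed(k), nUntrimmed(k));
end
```

Because the work is divided by files, a parallel run of this call can use at most as many workers as there are input files.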
To run in parallel, set 'UseParallel' to true. For more information, see the 'UseParallel' name-value pair argument.