Assemble transcriptome from aligned reads
cufflinks(
assembles a
transcriptome from aligned reads in alignmentFiles
)alignmentFile
and quantifies the
level of expression for each transcript [1]. By default, the
function writes the results to a GTF file named transcripts.gtf
in the
current directory.
cufflinks
requires the Cufflinks Support Package for the Bioinformatics Toolbox™. If the support package is not installed, then the function provides a download
link. For details, see Bioinformatics Toolbox Software Support Packages.
Note
cufflinks
is supported on the Mac and UNIX® platforms only.
cufflinks(
uses additional options specified by alignmentFiles
,cufflinksOptions
)cufflinksOptions
.
cufflinks(
uses additional options specified by one or more name-value pair arguments. For example,
alignmentFiles
,Name,Value
)cufflinks(alignmentFile,'TrimCoverageThreshold',5)
specifies the
minimum average coverage for 3' end trimming.
[
returns the file names of the assembled transcriptome using any of the input argument
combinations from the previous syntaxes. By default, the function saves all files to the
current directory.transcripts
,isoforms
,genes
,skippedTranscripts
] = cufflinks(___)
Create a CufflinksOptions
object to define cufflinks options, such
as the number of parallel threads and the output directory to store the results.
cflOpt = CufflinksOptions;
cflOpt.NumThreads = 8;
cflOpt.OutputDirectory = "./cufflinksOut";
The SAM files provided for this example contain aligned reads for Mycoplasma
pneumoniae from two samples with three replicates each. The reads are
simulated 100bp-reads for two genes (gyrA
and
gyrB
) located next to each other on the genome. All the reads are
sorted by reference position, as required by cufflinks
.
sams = ["Myco_1_1.sam","Myco_1_2.sam","Myco_1_3.sam",... "Myco_2_1.sam", "Myco_2_2.sam", "Myco_2_3.sam"];
Assemble the transcriptome from the aligned reads.
[gtfs,isofpkm,genes,skipped] = cufflinks(sams,cflOpt);
gtfs
is a list of GTF files that contain assembled isoforms.
Compare the assembled isoforms using cuffcompare
.
stats = cuffcompare(gtfs);
Merge the assembled transcripts using cuffmerge
.
mergedGTF = cuffmerge(gtfs,'OutputDirectory','./cuffMergeOutput');
mergedGTF
reports only one transcript. This is because the two
genes of interest are located next to each other, and cuffmerge
cannot distinguish two distinct genes. To guide cuffmerge
, use a
reference GTF (gyrAB.gtf
) containing information about these two
genes. If the file is not located in the same directory that you run
cuffmerge
from, you must also specify the file path.
gyrAB = which('gyrAB.gtf'); mergedGTF2 = cuffmerge(gtfs,'OutputDirectory','./cuffMergeOutput2',... 'ReferenceGTF',gyrAB);
Calculate abundances (expression levels) from aligned reads for each sample.
abundances1 = cuffquant(mergedGTF2,["Myco_1_1.sam","Myco_1_2.sam","Myco_1_3.sam"],... 'OutputDirectory','./cuffquantOutput1'); abundances2 = cuffquant(mergedGTF2,["Myco_2_1.sam", "Myco_2_2.sam", "Myco_2_3.sam"],... 'OutputDirectory','./cuffquantOutput2');
Assess the significance of changes in expression for genes and transcripts between
conditions by performing the differential testing using cuffdiff
.
The cuffdiff
function operates in two distinct steps: the function
first estimates abundances from aligned reads, and then performs the statistical
analysis. In some cases (for example, distributing computing load across multiple
workers), performing the two steps separately is desirable. After performing the first
step with cuffquant
, you can then use the binary CXB output file as
an input to cuffdiff
to perform statistical analysis. Because
cuffdiff
returns several files, specify the output directory is
recommended.
isoformDiff = cuffdiff(mergedGTF2,[abundances1,abundances2],... 'OutputDirectory','./cuffdiffOutput');
Display a table containing the differential expression test results for the two genes
gyrB
and gyrA
.
readtable(isoformDiff,'FileType','text')
ans = 2×14 table test_id gene_id gene locus sample_1 sample_2 status value_1 value_2 log2_fold_change_ test_stat p_value q_value significant ________________ _____________ ______ _______________________ ________ ________ ______ __________ __________ _________________ _________ _______ _______ ___________ 'TCONS_00000001' 'XLOC_000001' 'gyrB' 'NC_000912.1:2868-7340' 'q1' 'q2' 'OK' 1.0913e+05 4.2228e+05 1.9522 7.8886 5e-05 5e-05 'yes' 'TCONS_00000002' 'XLOC_000001' 'gyrA' 'NC_000912.1:2868-7340' 'q1' 'q2' 'OK' 3.5158e+05 1.1546e+05 -1.6064 -7.3811 5e-05 5e-05 'yes'
You can use cuffnorm
to generate normalized expression tables for
further analyses. cuffnorm
results are useful when you have many
samples and you want to cluster them or plot expression levels for genes that are
important in your study. Note that you cannot perform differential expression analysis
using cuffnorm
.
Specify a cell array, where each element is a string vector containing file names for a single sample with replicates.
alignmentFiles = {["Myco_1_1.sam","Myco_1_2.sam","Myco_1_3.sam"],... ["Myco_2_1.sam", "Myco_2_2.sam", "Myco_2_3.sam"]} isoformNorm = cuffnorm(mergedGTF2, alignmentFiles,... 'OutputDirectory', './cuffnormOutput');
Display a table containing the normalized expression levels for each transcript.
readtable(isoformNorm,'FileType','text')
ans = 2×7 table tracking_id q1_0 q1_2 q1_1 q2_1 q2_0 q2_2 ________________ __________ __________ __________ __________ __________ __________ 'TCONS_00000001' 1.0913e+05 78628 1.2132e+05 4.3639e+05 4.2228e+05 4.2814e+05 'TCONS_00000002' 3.5158e+05 3.7458e+05 3.4238e+05 1.0483e+05 1.1546e+05 1.1105e+05
Column names starting with q have the format: conditionX_N, indicating that the column contains values for replicate N of conditionX.
alignmentFiles
— Names of SAM or BAM filesNames of SAM or BAM files, specified as a string, string vector, character vector, or cell array of character vectors. The input files must be sorted by reference position.
Example: 'Myco_1_1.sam'
Data Types: char
| string
cufflinksOptions
— Cufflinks optionsCufflinksOptions
object | character vector | stringCufflinks options, specified as a CufflinksOptions
object,
character vector, or string. The character vector or string must be in the cufflinks
option syntax (prefixed by one or two dashes) [1].
Specify optional
comma-separated pairs of Name,Value
arguments. Name
is
the argument name and Value
is the corresponding value.
Name
must appear inside quotes. You can specify several name and value
pair arguments in any order as
Name1,Value1,...,NameN,ValueN
.
cufflinks(alignmentFile,'TrimCoverageThreshold',5,'FragmentLengthMean',180)
'EffectiveLengthCorrection'
— Flag to normalize fragment countstrue
(default) | false
Flag to normalize fragment counts to fragments per kilobase per million mapped reads (FPKM), specified as true
or false
.
Example:
'EffectiveLengthCorrection',false
Data Types: logical
'ExtraCommand'
— Additional commands""
(default) | string | character vectorAdditional commands, specified as a string or character vector.
The commands must be in the native syntax (prefixed by one or two dashes). Use this option to apply undocumented flags and flags without corresponding MATLAB properties.
Example: 'ExtraCommand','--library-type
fr-secondstrand'
Data Types: char
| string
'FauxReadTiling'
— Flag to include reference transcripts in assembled outputtrue
(default) | false
Flag to include reference transcripts in the assembled output
as faux-reads during RABT (advanced reference annotation based transcript) assembly, specified
as true
or false
.
Note
The function only performs the RABT assembly if you specify
GTFGuide
. Otherwise, FauxReadTiling
,
regardless of being true
or false
, has no effect
on the assembled transcript.
Example: 'FauxReadTiling',false
Data Types: logical
'FragmentBiasCorrection'
— Name of FASTA file with reference transcripts to detect biasName of the FASTA file with reference transcripts to detect bias in fragment counts, specified as a string or character vector. Library preparation can introduce sequence-specific bias into RNA-Seq experiments. Providing reference transcripts improves the accuracy of the transcript abundance estimates.
Example: 'FragmentBiasCorrection','ref.fasta'
Data Types: char
| string
'FragmentLengthMean'
— Expected mean fragment length200
(default) | positive integerExpected mean fragment length, specified as a positive integer.
The default value is 200
base pairs. The function can learn the fragment
length mean for each SAM file. Using this option is not recommended for paired-end reads.
Example: 'FragmentLengthMean',100
Data Types: double
'FragmentLengthSD'
— Expected standard deviation for fragment length distribution80
(default) | positive scalarExpected standard deviation for the fragment length
distribution, specified as a positive scalar. The default value is 80
base
pairs. The function can learn the fragment length standard deviation for each SAM file. Using
this option is not recommended for paired-end reads.
Example: 'FragmentLengthSTD',70
Data Types: double
'GTFGuide'
— Name of GTF file to guide RABT assemblyName of a GTF file to guide the RABT assembly, specified as a string or character vector.
Example: 'GTFGuide','tr.gtf'
Data Types: char
| string
'IncludeAll'
— Flag to apply all available optionsfalse
(default) | trueFlag to include all available options with the corresponding default values when
converting to the original options syntax, specified as true
or
false
.
The original (native) syntax is prefixed by one or two dashes.
By default, the function converts only the specified options. If the value is
true
, the software converts all available options, with default values
for unspecified options, to the original syntax.
Note
If you set IncludeAll
to true
, the software
translates all available properties, with default values for unspecified properties. The
only exception is that when the default value of a property is NaN
,
Inf
, []
, ''
, or
""
, then the software does not translate the corresponding
property.
Example: 'IncludeAll',true
Data Types: logical
'RABTOverhangTolerance'
— Number of base pairs allowed to overlap with transcript intron8
(default) | positive integerNumber of base pairs from a read allowed to overlap with a
transcript intron when determining if a read is mappable to another transcript during the RABT
assembly, specified as a positive integer. The default value is 8
.
Note
The function only performs the RABT assembly if you specify
GTFGuide
. Otherwise, RABTOverhangTolerance
has
no effect on the assembled transcript.
Example: 'IntronOverhangTolerance',10
Data Types: double
'JunctionAlpha'
— Alpha value in binomial test to filter false-positive alignments0.001
(default) | scalar between 0
and 1
Alpha value in the binomial test to filter false-positive
alignments, specified as a scalar between 0
and 1
.
Example: 'JunctionAlpha',0.005
Data Types: double
'LengthCorrection'
— Flag to correct by transcript lengthtrue
(default) | false
Flag to correct by the transcript length, specified as
true
or false
. Set this value to
false
only when the fragment count is independent of the feature size,
such as for small RNA libraries with no fragmentation and for 3' end sequencing, where all
fragments have the same length.
Example: 'LengthCorrection',false
Data Types: logical
'MaskFile'
— Name of GTF or GFF file containing transcripts to ignoreName of the GTF or GFF file containing transcripts to ignore during analysis, specified as a string or character vector. Some examples of transcripts to ignore include annotated rRNA transcripts, mitochondrial transcripts, and other abundant transcripts. Ignoring these transcripts improves the robustness of the abundance estimates.
Example: 'MaskFile','excludes.gtf'
Data Types: char
| string
'MaxBundleFrags'
— Maximum number of fragments to include for each locus before skipping500000
(default) | positive integerMaximum number of fragments to include for each locus before
skipping new fragments, specified as a positive integer. Skipped fragments are marked with the
status HIDATA
in the file skipped.gtf
.
Example: 'MaxBundleFrags',400000
Data Types: double
'MaxBundleLength'
— Maximum genomic length in base pairs for bundle3500000
(default) | positive integerMaximum genomic length in base pairs for a bundle, specified as a positive integer.
Example: 'MaxBundleLength',3400000
Data Types: double
'MaxFragAlignments'
— Maximum number of aligned reads to include for each fragmentInf
(default) | positive integerMaximum number of aligned reads to include for each fragment
before skipping new reads, specified as a positive integer. Inf
, the default
value, sets no limit on the maximum number of aligned reads.
Example: 'MaxFragAlignments',1000
Data Types: double
'MaxIntronLength'
— Maximum number of bases in intron300000
(default) | positive integerMaximum number of bases in an intron to report, specified as a positive integer. cufflinks
also ignores SAM alignments with REF_SKIP CIGAR operations longer than this property value.
Example: 'MaxIntronLength',350000
Data Types: double
'MaxMLEIterations'
— Maximum number of iterations for maximum likelihood estimation5000
(default) | positive integerMaximum number of iterations for the maximum likelihood estimation of abundances, specified as a positive integer.
Example: 'MaxMLEIterations',4000
Data Types: double
'MinFragsPerTransfrag'
— Minimum number of aligned RNA-Seq fragments to report10
(default) | positive integerMinimum number of aligned RNA-Seq fragments to report on an assembled transfrag, specified as a positive integer.
Example: 'MinFragsPerTransfrag',15
Data Types: double
'MinIntronLength'
— Minimum number of base pairs for intron in genome50
(default) | positive integerMinimum number of base pairs for an intron in the genome, specified as a positive integer.
Example: 'MinIntronLength',50
Data Types: double
'MinIsoformFraction'
— Cuffoff value to report abundance of isoform0.1
(default) | scalar between 0
and 1
Cuffoff value to report the abundance of a particular isoform
as a fraction of the most abundant isoform (major isoform), specified as a scalar between
0
and 1
. The function filters out transcripts with
abundances below the specified value because isoforms expressed at low levels often cannot be
assembled reliably. The default value is 0.1, or 10%, of the major isoform of the gene.
Example: 'MinIsoformFraction',0.20
Data Types: double
'MultiReadCorrection'
— Flag to improve abundance estimation using rescue methodfalse
(default) | true
Flag to improve abundance estimation for reads mapped to
multiple genomic positions using the rescue method, specified as true
or
false
. If the value is false
, the function divides
multimapped reads uniformly to all mapped positions. If the value is true
,
the function uses additional information, including gene abundance estimation, inferred fragment
length, and fragment bias, to improve transcript abundance estimation.
The rescue method is described in [2].
Example: true
Data Types: logical
'NormalizeCompatibleHits'
— Flag to use only fragments compatible with reference transcript to calculate FPKM valuesfalse
(default) | true
Flag to use only fragments compatible with a reference
transcript to calculate FPKM values, specified as true
or
false
.
Example: 'NormalizeCompatibleHits',false
Data Types: logical
'NormalizeTotalHits'
— Flag to include all fragments to calculate FPKM valuesfalse
(default) | true
Flag to include all fragments to calculate FPKM values,
specified as true
or false
. If the value is
true
, the function includes all fragments, including fragments without a
compatible reference.
Example: 'NormalizeTotalHits',true
Data Types: logical
'NumFragAssignmentDraws'
— Number of fragment assignments to perform on each transcript50
(default) | positive integerNumber of fragment assignments to perform on each transcript, specified as a positive integer. For each fragment drawn from a transcript, the function performs the specified number of assignments probabilistically to determine the transcript assignment uncertainty and to estimate the variance-covariance matrix for the assigned fragment counts.
Example: 'NumFragAssignmentSamples',40
Data Types: double
'NumFragDraws'
— Number of draws from negative binomial random number generator100
(default) | positive integerNumber of draws from the negative binomial random number generator for each transcript, specified as a positive integer. Each draw is a number of fragments that the function probabilistically assigns to transcripts in the transcriptome to determine the assignment uncertainty and to estimate the variance-covariance matrix for assigned fragment counts.
Example: 'NumFragSamples',90
Data Types: double
'NumThreads'
— Number of parallel threads to use1
(default) | positive integerNumber of parallel threads to use, specified as a positive integer. Threads are run on separate processors or cores. Increasing the number of threads generally improves the runtime significantly, but increases the memory footprint.
Example: 'NumThreads',4
Data Types: double
'OutputDirectory'
— Directory to store analysis results"./"
) (default) | string | character vectorDirectory to store analysis results, specified as a string or character vector.
Example: 'OutputDirectory',"./AnalysisResults/"
Data Types: char
| string
'OverhangTolerance'
— Number of base pairs of overlap with intron8
(default) | positive integerNumber of base pairs of overlap with an intron that the function allows when determining if the read is compatible with another transcript, specified as a positive integer.
Example: 'OverhangTolerance',5
Data Types: double
'RABTOverhangTolerance3'
— Number of base pairs allowed to overhang 3' end of reference transcript600
(default) | positive integerNumber of base pairs allowed to overhang the 3' end of each reference transcript during the RABT assembly, specified as a positive integer. The function uses this property when deciding if an assembled transcript is novel or should be merged with the reference.
Note
The function only performs the RABT assembly if you specify
GTFGuide
. Otherwise, RABTOverhangTolerance3
has no effect on the assembled transcript.
Example: 'OverhangTolerance3',500
Data Types: double
'OverlapRadius'
— Distance between transfrags50
(default) | positive integerDistance between transfrags, specified as a positive integer.
If the distance is below the specified value, the function merges the transfrags. The default
value is 50
base pairs.
Example: 'OverlapRadius',40
Data Types: double
'PreMRNAFraction'
— Threshold to include alignments in intronic intervals0.15
(default) | scalar between 0
and 1
Threshold to include alignments in intronic intervals in the
assembly, specified as a scalar between 0
and 1
. The
function ignores the intronic alignments if the minimum depth of coverage divided by the number
of spliced reads is below the specified value. Use this property to filter reads originating
from incompletely spliced transcripts.
Example: 'PreMRNAFraction',0.10
Data Types: double
'ReferenceGTF'
— Name of GTF or GFF file containing reference annotation Name of a GTF or GFF file containing reference annotation used
to estimate isoform expression, specified as a string or character vector. If you provide a
ReferenceGTF
file, the function does not assemble any novel transcripts
and ignores any alignments incompatible with the reference transcripts.
Example: 'ReferenceGTF',"isoest.gtf"
Data Types: char
| string
'Seed'
— Seed for random number generator0
(default) | nonnegative integerSeed for the random number generator, specified as a nonnegative integer. Setting a seed value ensures the reproducibility of the analysis results.
Example: 'Seed',10
Data Types: double
'SmallAnchorFraction'
— Minimum percentage of alignment on each side of splice junction0.09
(default) | scalar between 0
and 1
Minimum percentage of alignment on each side of a splice
junction, specified as a scalar between 0
and 1
. The
function filters alignments with a percentage smaller than this property value prior to
assembly.
Example: 'SmallAnchorFraction',0.1
Data Types: double
'TranscriptPrefix'
— Prefix for reported transfrags in output GTF file"CUFF"
(default) | string | character vectorPrefix for the reported transfrags in the output GTF file, specified as a string or character vector. This option must be a string or character vector with a non-zero length.
Example: 'TranscriptPrefix',"tfrags"
Data Types: char
| string
'TrimCoverageThreshold'
— Minimum average coverage needed for 3' trimming10
(default) | positive integerMinimum average coverage for 3' trimming, specified as a positive integer.
Example: 'TrimCoverageThreshold',8
Data Types: double
'TrimDropoffFraction'
— Minimum percentage of average coverage0.1
(default) | scalar between 0
and 1
Minimum percentage of average coverage for trimming the 3' end
of assembled transcripts, specified as a scalar between 0
and
1
.
Example: 'TrimDropoffFraction',0.15
Data Types: double
transcripts
— Transcript file name"./transcripts.gtf"
(default)Transcript file name, returned as a string. The name of the file is
"transcripts.gtf"
. The file contains the assembled isoforms, along
with attributes describing the abundance of reads originating from each
transcript.
The output string also includes the directory information defined by
OutputDirectory
. The default is the current directory. If you set
OutputDirectory
to "/local/tmp/"
, the output
becomes "/local/tmp/transcripts.gtf"
.
isoforms
— Estimated isoform-level expression file name"./isoforms.fpkm_tracking"
(default)Estimated isoform-level expression file name, returned as a string. By default, the
name of the file is "isoforms.fpkm_tracking"
. The file contains
estimates for isoform-level expression in cufflinks
FPKM tracking
format.
The output string also includes the directory information defined by
OutputDirectory
. The default is the current directory. If you set
OutputDirectory
to "/local/tmp/"
, the output
becomes "/local/tmp/isoforms.fpkm_tracking"
.
genes
— Estimated gene-level expression file name"./genes.fpkm_tracking"
(default)Estimated gene-level expression file name, returned as a string. By default, the
name of the file is "genes.fpkm_tracking"
. The file contains
estimates for gene-level expression in cufflinks
FPKM tracking
format.
The output string also includes the directory information defined by
OutputDirectory
. The default is the current directory. If you set
OutputDirectory
to "/local/tmp/"
, the output
becomes "/local/tmp/genes.fpkm_tracking"
.
skippedTranscripts
— Name of file containing skipped transcripts"./skipped.gtf"
(default)Name of the file containing skipped transcripts when processing a locus, returned as
a string. By default, the name of the file is "skipped.gtf"
. The
'MaxBundleFrags'
option specifies the maximum number of transcripts
(fragments) to include for each locus. After reaching the threshold, the function puts
the skipped fragments in this file.
The output string also includes the directory information defined by
OutputDirectory
. The default is the current directory. If you set
OutputDirectory
to "/local/tmp/"
, the output
becomes "/local/tmp/skipped.gtf"
.
[1] Trapnell, Cole, Brian A Williams, Geo Pertea, Ali Mortazavi, Gordon Kwan, Marijke J van Baren, Steven L Salzberg, Barbara J Wold, and Lior Pachter. “Transcript Assembly and Quantification by RNA-Seq Reveals Unannotated Transcripts and Isoform Switching during Cell Differentiation.” Nature Biotechnology 28, no. 5 (May 2010): 511–15.
[2] Mortazavi, Ali, Brian A Williams, Kenneth McCue, Lorian Schaeffer, and Barbara Wold. “Mapping and Quantifying Mammalian Transcriptomes by RNA-Seq.” Nature Methods 5, no. 7 (July 2008): 621–28. https://doi.org/10.1038/nmeth.1226.
bowtie2
| bwamem
| cuffcompare
| cuffdiff
| cuffgffread
| cuffgtf2sam
| CufflinksOptions
| cuffmerge
| cuffnorm
| cuffquant