Merge RNA-seq assemblies into a master transcriptome
merges assembled transcriptome from two or more GTF files [1]. Merging GTF files is a
required step to perform the downstream differential analysis with
mergedGTF
= cuffmerge(gtfFiles
)cuffdiff
.
cuffmerge
requires Python® 2 installed in your system.
cuffmerge
requires the Cufflinks Support Package for the Bioinformatics Toolbox™. If the support package is not installed, then the function provides a download
link. For details, see Bioinformatics Toolbox Software Support Packages.
Note
cuffmerge
is supported on the Mac and UNIX® platforms only.
uses additional options specified by one or more name-value pair arguments. For example,
mergedGTF
= cuffmerge(gtfFiles
,Name,Value
)cuffmerge(["Myco_1_1.transcripts.gtf","Myco_1_2.transcripts.gtf"],'NumThreads',5)
specifies to use five parallel threads.
Create a CufflinksOptions
object to define cufflinks options, such
as the number of parallel threads and the output directory to store the results.
cflOpt = CufflinksOptions;
cflOpt.NumThreads = 8;
cflOpt.OutputDirectory = "./cufflinksOut";
The SAM files provided for this example contain aligned reads for Mycoplasma
pneumoniae from two samples with three replicates each. The reads are
simulated 100bp-reads for two genes (gyrA
and
gyrB
) located next to each other on the genome. All the reads are
sorted by reference position, as required by cufflinks
.
sams = ["Myco_1_1.sam","Myco_1_2.sam","Myco_1_3.sam",... "Myco_2_1.sam", "Myco_2_2.sam", "Myco_2_3.sam"];
Assemble the transcriptome from the aligned reads.
[gtfs,isofpkm,genes,skipped] = cufflinks(sams,cflOpt);
gtfs
is a list of GTF files that contain assembled isoforms.
Compare the assembled isoforms using cuffcompare
.
stats = cuffcompare(gtfs);
Merge the assembled transcripts using cuffmerge
.
mergedGTF = cuffmerge(gtfs,'OutputDirectory','./cuffMergeOutput');
mergedGTF
reports only one transcript. This is because the two
genes of interest are located next to each other, and cuffmerge
cannot distinguish two distinct genes. To guide cuffmerge
, use a
reference GTF (gyrAB.gtf
) containing information about these two
genes. If the file is not located in the same directory that you run
cuffmerge
from, you must also specify the file path.
gyrAB = which('gyrAB.gtf'); mergedGTF2 = cuffmerge(gtfs,'OutputDirectory','./cuffMergeOutput2',... 'ReferenceGTF',gyrAB);
Calculate abundances (expression levels) from aligned reads for each sample.
abundances1 = cuffquant(mergedGTF2,["Myco_1_1.sam","Myco_1_2.sam","Myco_1_3.sam"],... 'OutputDirectory','./cuffquantOutput1'); abundances2 = cuffquant(mergedGTF2,["Myco_2_1.sam", "Myco_2_2.sam", "Myco_2_3.sam"],... 'OutputDirectory','./cuffquantOutput2');
Assess the significance of changes in expression for genes and transcripts between
conditions by performing the differential testing using cuffdiff
.
The cuffdiff
function operates in two distinct steps: the function
first estimates abundances from aligned reads, and then performs the statistical
analysis. In some cases (for example, distributing computing load across multiple
workers), performing the two steps separately is desirable. After performing the first
step with cuffquant
, you can then use the binary CXB output file as
an input to cuffdiff
to perform statistical analysis. Because
cuffdiff
returns several files, specify the output directory is
recommended.
isoformDiff = cuffdiff(mergedGTF2,[abundances1,abundances2],... 'OutputDirectory','./cuffdiffOutput');
Display a table containing the differential expression test results for the two genes
gyrB
and gyrA
.
readtable(isoformDiff,'FileType','text')
ans = 2×14 table test_id gene_id gene locus sample_1 sample_2 status value_1 value_2 log2_fold_change_ test_stat p_value q_value significant ________________ _____________ ______ _______________________ ________ ________ ______ __________ __________ _________________ _________ _______ _______ ___________ 'TCONS_00000001' 'XLOC_000001' 'gyrB' 'NC_000912.1:2868-7340' 'q1' 'q2' 'OK' 1.0913e+05 4.2228e+05 1.9522 7.8886 5e-05 5e-05 'yes' 'TCONS_00000002' 'XLOC_000001' 'gyrA' 'NC_000912.1:2868-7340' 'q1' 'q2' 'OK' 3.5158e+05 1.1546e+05 -1.6064 -7.3811 5e-05 5e-05 'yes'
You can use cuffnorm
to generate normalized expression tables for
further analyses. cuffnorm
results are useful when you have many
samples and you want to cluster them or plot expression levels for genes that are
important in your study. Note that you cannot perform differential expression analysis
using cuffnorm
.
Specify a cell array, where each element is a string vector containing file names for a single sample with replicates.
alignmentFiles = {["Myco_1_1.sam","Myco_1_2.sam","Myco_1_3.sam"],... ["Myco_2_1.sam", "Myco_2_2.sam", "Myco_2_3.sam"]} isoformNorm = cuffnorm(mergedGTF2, alignmentFiles,... 'OutputDirectory', './cuffnormOutput');
Display a table containing the normalized expression levels for each transcript.
readtable(isoformNorm,'FileType','text')
ans = 2×7 table tracking_id q1_0 q1_2 q1_1 q2_1 q2_0 q2_2 ________________ __________ __________ __________ __________ __________ __________ 'TCONS_00000001' 1.0913e+05 78628 1.2132e+05 4.3639e+05 4.2228e+05 4.2814e+05 'TCONS_00000002' 3.5158e+05 3.7458e+05 3.4238e+05 1.0483e+05 1.1546e+05 1.1105e+05
Column names starting with q have the format: conditionX_N, indicating that the column contains values for replicate N of conditionX.
gtfFiles
— Names of GTF filesNames of GTF files, specified as a string vector or cell array of character vectors.
Example: ["Myco_1_1.transcripts.gtf",
"Myco_1_2.transcripts.gtf"]
Data Types: string
| cell
opt
— cuffgffread
optionsCuffMergeOptions
object | string | character vectorcuffgffread
options, specified as a
CuffMergeOptions
object, string, or character vector. The string or
character vector must be in the original cuffmerge
option syntax
(prefixed by one or two dashes) [1].
Specify optional
comma-separated pairs of Name,Value
arguments. Name
is
the argument name and Value
is the corresponding value.
Name
must appear inside quotes. You can specify several name and value
pair arguments in any order as
Name1,Value1,...,NameN,ValueN
.
cuffmerge(["Myco_1_1.transcripts.gtf","Myco_1_2.transcripts.gtf"],'NumThreads',5)
'ExtraCommand'
— Additional commands""
(default) | string | character vectorThe commands must be in the native syntax (prefixed by one or two dashes). Use this option to apply undocumented flags and flags without corresponding MATLAB properties.
Example: 'ExtraCommand','--library-type
fr-secondstrand'
Data Types: char
| string
'IncludeAll'
— Flag to apply all available optionsfalse
(default) | true
The original (native) syntax is prefixed by one or two dashes.
By default, the function converts only the specified options. If the value is
true
, the software converts all available options, with default values
for unspecified options, to the original syntax.
Note
If you set IncludeAll
to true
, the software
translates all available properties, with default values for unspecified properties. The
only exception is that when the default value of a property is NaN
,
Inf
, []
, ''
, or
""
, then the software does not translate the corresponding
property.
Example: 'IncludeAll',true
Data Types: logical
'MinIsoformFraction'
— Minimum abundance of isoform to be included in merged assembly0.5
(default) | scalar between 0
and 1
Minimum abundance of an isoform to be included in the merged assembly, specified as a scalar between 0
and 1
. This value is expressed as a percentage of the most abundant (major) isoform.
Example:
'MinIsoformFraction',0.4
Data Types: double
'NumThreads'
— Number of parallel threads to use1
(default) | positive integerNumber of parallel threads to use, specified as a positive integer. Threads are run on separate processors or cores. Increasing the number of threads generally improves the runtime significantly, but increases the memory footprint.
Example: 'NumThreads',4
Data Types: double
'OutputDirectory'
— Directory to store analysis results"./"
) (default) | string | character vectorDirectory to store analysis results, specified as a string or character vector.
Example: 'OutputDirectory',"./AnalysisResults/"
Data Types: char
| string
'ReferenceGTF'
— Name of optional reference annotation GTF fileName of an optional reference annotation GTF file to be included in the combined assembly, specified as a string or character vector.
Example: 'ReferenceGTF',"ref.gtf"
Data Types: char
| string
'ReferenceSequence'
— Name of directory or FASTA file containing genomic sequencesName of a directory or FASTA file containing genomic DNA sequences for the reference, specified as a string or character vector.
If you specify a directory, it must contain one FASTA file per contig. In other
words, the directory must contain one FASTA file per reference chromosome, and each
file must be named after the chromosome and have a .fa
or
.fasta
extension.
If you specify a FASTA file, it must contain all the reference sequences.
The function uses the provided sequences to improve transfrag classification and exclude artifacts.
Example:
'ReferenceSequence',"allrefs.fasta"
Data Types: char
| string
mergedGTF
— Name of output GTF file"./merged_asm/merged.gtf"
Name of the output GTF file containing the merged transcriptome, returned as a string.
The output string also includes the directory information defined by
OutputDirectory
. By default, the function
Creates the merged_asm subfolder in the current directory
and saves the output file (merged.gtf
) in that folder.
Creates a subfolder named logs inside merged_asm folder and saves a log file.
If you set OutputDirectory
to "/local/tmp/"
,
mergedGTF
becomes "/local/tmp/merged.gtf"
. The
function also creates the logs folder inside the specified output
directory.
[1] Trapnell, Cole, Brian A Williams, Geo Pertea, Ali Mortazavi, Gordon Kwan, Marijke J van Baren, Steven L Salzberg, Barbara J Wold, and Lior Pachter. “Transcript Assembly and Quantification by RNA-Seq Reveals Unannotated Transcripts and Isoform Switching during Cell Differentiation.” Nature Biotechnology 28, no. 5 (May 2010): 511–15.