Filter and convert GFF and GTF files
cuffgffread(input,output) reads the input GFF or GTF file and writes the mandatory columns to the output GFF file [1]. The function can also return a GTF-format file when you set the 'GTFOutput' option.
cuffgffread requires the Cufflinks Support Package for the Bioinformatics Toolbox™. If the support package is not installed, then the function provides a download link. For details, see Bioinformatics Toolbox Software Support Packages.
Note
cuffgffread is supported on the Mac and UNIX® platforms only.
cuffgffread(input,output,Name,Value) uses additional options specified by one or more name-value pair arguments. For example, cuffgffread('gyrAB.gtf','gyrAB.gff','PreserveAttributes',true) retains all attributes in the output file.
Convert a GTF file to a GFF file while retaining all attributes.
cuffgffread('gyrAB.gtf','gyrABOut.gff','PreserveAttributes',true)
You can also set the options using an object. For instance, specify the output to be in the GTF format.
opt = CuffGFFReadOptions;
opt.GTFOutput = true;
opt.PreserveAttributes = true;
cuffgffread('gyrAB.gtf','gyrABOut.gtf',opt);
Once you have the options object, you can retrieve the equivalent original options for all object properties using getOptionsTable.
getOptionsTable(opt)
ans = 33×3 table

                               PropertyName                  FlagName           FlagShortName
                               ___________________________   ________________   _____________
    AppendDescription          'AppendDescription'           '-A'               ''
    CheckOppositeStrand        'CheckOppositeStrand'         '-B'               ''
    CheckPhase                 'CheckPhase'                  '-H'               ''
    Cluster                    'Cluster'                     '--cluster-only'   ''
    CodingOnly                 'CodingOnly'                  '-C'               ''
    CollapseContainer          'CollapseContainer'           '-K'               ''
    CollapseFull               'CollapseFull'                '-Q'               ''
    CoordinateRange            'CoordinateRange'             '-r'               ''
    DiscardInvalidCDS          'DiscardInvalidCDS'           '-J'               ''
    DiscardNonCanonicalSplice  'DiscardNonCanonicalSplice'   '-N'               ''
    DiscardSingleExon          'DiscardSingleExon'           '-U'               ''
    DiscardTerminatedCDS       'DiscardTerminatedCDS'        '-V'               ''
    FastaCDSFile               'FastaCDSFile'                '-x'               ''
    FastaExonsFile             'FastaExonsFile'              '-w'               ''
    FastaProteinFile           'FastaProteinFile'            '-y'               ''
    FirstExonOnly              'FirstExonOnly'               '-G'               ''
    ForceExons                 'ForceExons'                  '--force-exons'    ''
    FullyContained             'FullyContained'              '-R'               ''
    GTFOutput                  'GTFOutput'                   '-T'               ''
    MaxIntronLength            'MaxIntronLength'             '-i'               ''
    Merge                      'Merge'                       '--merge'          '-M'
    MergeCloseExons            'MergeCloseExons'             '-Z'               ''
    MergeInfoFile              'MergeInfoFile'               '-d'               ''
    PreserveAttributes         'PreserveAttributes'          '-F'               ''
    Pseudo                     'Pseudo'                      '--no-pseudo'      ''
    ReplacementTable           'ReplacementTable'            '-m'               ''
    SequenceFile               'SequenceFile'                '-g'               ''
    SequenceInfo               'SequenceInfo'                '-s'               ''
    UrlDecode                  'UrlDecode'                   '-D'               ''
    UseEnsemblConversion       'UseEnsemblConversion'        '-L'               ''
    UseNonTranscript           'UseNonTranscript'            '-O'               ''
    UseTrackName               'UseTrackName'                '-t'               ''
    WriteCoordinates           'WriteCoordinates'            '-W'               ''
input — Input file name
Input file name, specified as a string or character vector. The file can be a GTF or GFF file.
Example: 'gyrAB.gtf'
Data Types: char | string
output — Output file name
Output file name, specified as a string or character vector. By default, the output is a GFF file. Set 'GTFOutput' to true to get a GTF output file.
Example: 'gyrAB.gff'
Data Types: char | string
opt — cuffgffread options
CuffGFFReadOptions object | string | character vector
cuffgffread options, specified as a CuffGFFReadOptions object, string, or character vector. The string or character vector must be in the original gffread option syntax (prefixed by one or two dashes) [1].
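As a minimal sketch of the string form of opt, the call below passes native gffread flags directly instead of an options object. The short flags -T and -F are the ones that getOptionsTable lists for GTFOutput and PreserveAttributes. The output file name is illustrative, and the example assumes that gyrAB.gtf is available and that the support package is installed.

```matlab
% Native gffread flags, as listed by getOptionsTable:
%   -T  corresponds to GTFOutput
%   -F  corresponds to PreserveAttributes
nativeOpts = "-T -F";

% Assumption: gyrAB.gtf is on the path; gyrABNative.gtf is an illustrative name.
cuffgffread('gyrAB.gtf','gyrABNative.gtf',nativeOpts);
```

This should be equivalent to setting GTFOutput and PreserveAttributes to true on a CuffGFFReadOptions object and passing that object instead.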
Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.
Example: cuffgffread('gyrAB.gtf','gyrAB.gff','CoordinateRange','+NC_000912.1:4821..7340')
'AppendDescription' — Flag to add file descriptions to descr attribute
false (default) | true
Flag to add file descriptions from sequence files to the descr attribute of the output GFF record, specified as true or false. Specify the sequence files using the SequenceInfo option.
Example: 'AppendDescription',true
Data Types: logical
'CheckOppositeStrand' — Flag to check opposite strand when checking for in-frame stop codons
false (default) | true
Flag to check the opposite strand when checking for in-frame stop codons, specified as true or false.
Example: 'CheckOppositeStrand',true
Data Types: logical
'CheckPhase' — Flag to adjust coding sequence phase
false (default) | true
Flag to adjust the coding sequence phase when checking for in-frame stop codons, specified as true or false.
Example: 'CheckPhase',true
Data Types: logical
'Cluster' — Flag to cluster input transcripts into loci
true (default) | false
Flag to cluster the input transcripts into loci, specified as true or false. This option is the same as the Merge property, except that it does not collapse fully contained transcripts with identical introns.
Example: 'Cluster',false
Data Types: logical
'CodingOnly' — Flag to discard transcripts with no coding sequence
false (default) | true
Flag to discard transcripts with no coding sequence (CDS) feature, specified as true or false.
Example: 'CodingOnly',true
Data Types: logical
'CollapseContainer' — Flag to collapse fully contained transcripts
false (default) | true
Flag to collapse fully contained transcripts that are shorter and have fewer introns than the container, specified as true or false. This property applies only when you set Merge to true.
Example: 'CollapseContainer',true
Data Types: logical
'CollapseFull' — Flag to collapse shorter transcripts overlapping at least 80% with another exon
false (default) | true
Flag to collapse shorter transcripts overlapping at least 80% with another single exon transcript, specified as true or false. This property applies only when you set Merge to true.
Example: 'CollapseFull',true
Data Types: logical
'CoordinateRange' — Genomic range to filter transcripts
Genomic range to filter transcripts, specified as a string or character vector. The format must be "[[<strand>]<chr>:]<start>..<end>", where start and end are genomic positions, chr is an optional chromosome or contig name, and strand is an optional strand ('+' or '-').
Example: 'CoordinateRange',"+NC_000912.1:4821..7340"
Data Types: char | string
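The range format above can be assembled programmatically when the coordinates come from variables. The following is a sketch only: the variable names are hypothetical, the values reuse the example range from this page, and the output file name is illustrative.

```matlab
% Hypothetical variables; substitute your own contig and coordinates.
strand   = "+";
chr      = "NC_000912.1";
startPos = 4821;
endPos   = 7340;

% Assemble "[[<strand>]<chr>:]<start>..<end>"
range = sprintf("%s%s:%d..%d", strand, chr, startPos, endPos);

% Keep only the transcripts that fall in the requested range.
cuffgffread('gyrAB.gtf','gyrABRange.gff','CoordinateRange',range);
```

To drop transcripts that only partially overlap the range, the same call can also set 'FullyContained' to true.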
'DiscardInvalidCDS' — Flag to ignore mRNA transcripts either lacking start or stop codon or having in-frame stop codon
false (default) | true
Flag to ignore mRNA transcripts either lacking a start or stop codon or having an in-frame stop codon, specified as true or false.
Example: 'DiscardInvalidCDS',true
Data Types: logical
'DiscardNonCanonicalSplice' — Flag to ignore multiexon mRNA transcripts that have intron with noncanonical splice sequence
false (default) | true
Flag to ignore multiexon mRNA transcripts that have an intron with a noncanonical splice sequence, specified as true or false. A noncanonical splice sequence is any splice sequence other than "GT-AG", "GC-AG", or "AT-AC".
Example: 'DiscardNonCanonicalSplice',true
Data Types: logical
'DiscardSingleExon' — Flag to ignore transcripts spanning single exon
false (default) | true
Flag to ignore transcripts spanning a single exon, specified as true or false.
Example: 'DiscardSingleExon',true
Data Types: logical
'DiscardTerminatedCDS' — Flag to ignore transcripts with in-frame stop codon
false (default) | true
Flag to ignore transcripts with an in-frame stop codon, specified as true or false.
Example: 'DiscardTerminatedCDS',true
Data Types: logical
'ExtraCommand' — Additional commands
"" (default) | character vector | string
Additional commands, specified as a character vector or string. The commands must be in the native syntax (prefixed by one or two dashes). Use this option to apply undocumented flags and flags without corresponding MATLAB properties.
Example: 'ExtraCommand',"-E"
Data Types: char | string
'FastaCDSFile' — Name of file to save spliced coding sequences
Name of a file to save the spliced coding sequences in the FASTA format, specified as a string or character vector.
Example: 'FastaCDSFile',"splicedCoding.FASTA"
Data Types: char | string
'FastaExonsFile' — Name of file to save spliced exons
Name of a file to save the spliced exons in the FASTA format, specified as a string or character vector.
Example: 'FastaExonsFile',"splicedExon.FASTA"
Data Types: char | string
'FastaProteinFile' — Name of file to save protein translation of coding sequences
Name of a file to save the protein translation of coding sequences in the FASTA format, specified as a string or character vector.
Example: 'FastaProteinFile',"translated.FASTA"
Data Types: char | string
'FirstExonOnly' — Flag to parse additional attributes only from first exon
false (default) | true
Flag to parse additional attributes only from the first exon, specified as true or false.
Example: 'FirstExonOnly',true
Data Types: logical
'ForceExons' — Flag to list lowest-level GFF features as exon features
false (default) | true
Flag to list the lowest-level GFF features as exon features in the output file, specified as true or false.
Example: 'ForceExons',true
Data Types: logical
'FullyContained' — Flag to discard transcripts not contained fully
false (default) | true
Flag to discard transcripts not fully contained within the range, specified as true or false. Specify the range using the CoordinateRange option.
Example: 'FullyContained',true
Data Types: logical
'GTFOutput' — Flag to output GTF-format transcript files
false (default) | true
Flag to output GTF-format transcript files, specified as true or false.
Example: 'GTFOutput',true
Data Types: logical
'IncludeAll' — Flag to apply all available options
false (default) | true
Flag to apply all available options, specified as true or false. The original (native) syntax is prefixed by one or two dashes. By default, the function converts only the specified options. If the value is true, the software converts all available options, with default values for unspecified options, to the original syntax.
Note
If you set IncludeAll to true, the software translates all available properties, with default values for unspecified properties. The only exception is that when the default value of a property is NaN, Inf, [], '', or "", then the software does not translate the corresponding property.
Example: 'IncludeAll',true
Data Types: logical
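One way to see the effect of IncludeAll is to compare the native option string before and after setting it. The sketch below assumes that the CuffGFFReadOptions object provides a getCommand function that returns the translated options. That function is not described on this page, so treat the call as an assumption and check the object's reference page before relying on it.

```matlab
opt = CuffGFFReadOptions;
opt.PreserveAttributes = true;

% Assumption: getCommand returns the options in the native gffread syntax.
% With IncludeAll at its default (false), only the property set above
% is expected to be translated.
cmdSpecified = getCommand(opt);

% With IncludeAll set to true, properties left at their defaults are also
% translated, except those whose default is NaN, Inf, [], '', or "".
opt.IncludeAll = true;
cmdAll = getCommand(opt);

disp(cmdSpecified)
disp(cmdAll)
```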
'MaxIntronLength' — Maximum intron length for transcript to include in output
Inf (default) | positive integer
Maximum intron length for a transcript to include in the output file, specified as a positive integer. Inf, the default value, sets no limit on the intron length.
Example: 'MaxIntronLength',500
Data Types: double
'Merge' — Flag to merge transcripts to loci
false (default) | true
Flag to merge transcripts into loci by collapsing transcripts with identical introns, specified as true or false.
Example: 'Merge',true
Data Types: logical
'MergeCloseExons' — Flag to merge exons into single exon
false (default) | true
Flag to merge exons into a single exon when they are separated by introns shorter than 4 base pairs, specified as true or false.
Example: 'MergeCloseExons',true
Data Types: logical
'MergeInfoFile' — Name of file to save information on duplicates when merging
Name of a file to save information on duplicates when merging, specified as a string or character vector. This property applies only when you set Merge to true.
Example: 'MergeInfoFile',"duplicates.txt"
Data Types: char | string
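Several of the options above depend on Merge, so they are typically set together. The following sketch combines them in one call. The output file name is illustrative, and the example assumes that gyrAB.gtf is available.

```matlab
% Merge transcripts into loci, also collapse fully contained transcripts,
% and record which transcripts were treated as duplicates.
% CollapseContainer and MergeInfoFile apply only because Merge is true.
cuffgffread('gyrAB.gtf','gyrABMerged.gff', ...
    'Merge',true, ...
    'CollapseContainer',true, ...
    'MergeInfoFile',"duplicates.txt");
```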
'PreserveAttributes' — Flag to retain all attributes in output
false (default) | true
Flag to retain all attributes in the output file, specified as true or false.
Example: 'PreserveAttributes',true
Data Types: logical
'Pseudo' — Flag to filter out records containing "pseudo"
true (default) | false
Flag to filter out records containing the word "pseudo," specified as true or false.
Example: 'Pseudo',false
Data Types: logical
'ReplacementTable' — Name of file containing replacement table
Name of a file containing a replacement table, specified as a string or character vector. The table must have two columns, where the first column contains the original transcript IDs and the second column contains the new transcript IDs. An example table follows.
origTranscript1 | newTranscript1 |
origTranscript2 | newTranscript2 |
origTranscript3 | newTranscript3 |
If you provide a replacement table, the function replaces the transcript IDs found in the first column with the new transcript IDs from the second column and filters out the transcripts that are not found in the table.
Example: 'ReplacementTable',"replaceTbl.txt"
Data Types: char | string
'SequenceFile' — Name of FASTA-format file containing genomic sequences
Name of a FASTA-format file containing genomic sequences for all input mappings, specified as a string or character vector.
Example: 'SequenceFile',"seqs.fasta"
Data Types: char | string
'SequenceInfo' — Name of tab-delimited file with additional information on input sequence
Name of a tab-delimited file with additional information on each input sequence, specified as a string or character vector. This file must have three columns: a sequence name column, a sequence length column, and a sequence description column. If AppendDescription is true, the sequence description is included as an attribute in the output GFF file.
Example: 'SequenceInfo',"seqinfo.txt"
Data Types: char | string
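The three-column file described above can be written from MATLAB before calling the function. This is a sketch under stated assumptions: the sequence name, length, and description are placeholder values chosen for illustration, not values taken from gyrAB.gtf, and the output file name is illustrative.

```matlab
% Placeholder values (hypothetical); replace with your own sequence metadata.
seqName = 'NC_000912.1';
seqLen  = 816394;
seqDesc = 'example reference sequence';

% Write one tab-delimited record: name, length, description.
fid = fopen('seqinfo.txt','w');
if fid == -1
    error('Could not open seqinfo.txt for writing.');
end
fprintf(fid,'%s\t%d\t%s\n',seqName,seqLen,seqDesc);
fclose(fid);

% Add the description to the descr attribute of the output records.
cuffgffread('gyrAB.gtf','gyrABDescr.gff', ...
    'SequenceInfo',"seqinfo.txt", ...
    'AppendDescription',true);
```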
'UrlDecode' — Flag to decode URL-encoded characters in attribute names
false (default) | true
Flag to decode URL-encoded characters in attribute names, specified as true or false. For instance, "transcript%20description" is decoded to "transcript description".
Example: 'UrlDecode',true
Data Types: logical
'UseEnsemblConversion' — Flag to use GTF-to-GFF3 conversion method from Ensembl
false (default) | true
Flag to use the GTF-to-GFF3 conversion method from Ensembl, specified as true or false.
Example: 'UseEnsemblConversion',true
Data Types: logical
'UseNonTranscript' — Flag to include nontranscript GFF records in output file
false (default) | true
Flag to include nontranscript GFF records in the output file, specified as true or false.
Example: 'UseNonTranscript',true
Data Types: logical
'UseTrackName' — Flag to use track name in second column of GFF output line
false (default) | true
Flag to use the track name in the second column of the GFF output line, specified as true or false.
Example: 'UseTrackName',true
Data Types: logical
'WriteCoordinates' — Flag to write exon coordinates projected onto spliced sequence
false (default) | true
Flag to write the exon coordinates projected onto the spliced sequence, specified as true or false. This property applies only when FastaExonsFile or FastaCDSFile is specified.
Example: 'WriteCoordinates',true
Data Types: logical
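Because WriteCoordinates takes effect only together with a FASTA output, a typical call sets the related options at once. The sketch below assumes that a genomic FASTA file named seqs.fasta (a hypothetical file name, reused from the SequenceFile example) contains the sequences referenced by gyrAB.gtf. Whether SequenceFile is strictly required for FASTA output is not stated on this page, so it is supplied here as a precaution.

```matlab
% Assumption: seqs.fasta holds the genomic sequences for the input mappings.
cuffgffread('gyrAB.gtf','gyrABOut.gff', ...
    'SequenceFile',"seqs.fasta", ...          % genomic sequences to splice from
    'FastaExonsFile',"splicedExon.FASTA", ... % spliced exons are saved here
    'WriteCoordinates',true);                 % add projected exon coordinates
```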
[1] Trapnell, Cole, Brian A Williams, Geo Pertea, Ali Mortazavi, Gordon Kwan, Marijke J van Baren, Steven L Salzberg, Barbara J Wold, and Lior Pachter. “Transcript Assembly and Quantification by RNA-Seq Reveals Unannotated Transcripts and Isoform Switching during Cell Differentiation.” Nature Biotechnology 28, no. 5 (May 2010): 511–15.