fastaread

Read data from FASTA file

Syntax

FASTAData = fastaread(File)
[Header, Sequence] = fastaread(File)
... = fastaread(File, ...'IgnoreGaps', IgnoreGapsValue, ...)
... = fastaread(File, ...'Blockread', BlockreadValue, ...)
... = fastaread(File, ...'TrimHeaders', TrimHeadersValue, ...)

Input Arguments

File

Either of the following:

  • Character vector or string specifying a file name, a path and file name, or a URL pointing to a file. The referenced file is a FASTA-formatted file (ASCII text file). If you specify only a file name, that file must be on the MATLAB® search path or in the MATLAB Current Folder.

  • MATLAB character array that contains the text of a FASTA-formatted file.
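The second form of File can be sketched as follows. This is a minimal illustration, assuming the Bioinformatics Toolbox is installed; the entry name demo1 and its sequence are made up for the example.

```matlab
% Hypothetical FASTA text held in memory; sprintf expands \n into newlines.
fastaText = sprintf('>demo1 made-up entry\nACGTTGCA\nGGCC\n');

% Pass the text itself in place of a file name.
demo = fastaread(fastaText);

% The sequence lines of one entry are returned as a single character vector.
disp(demo.Header)
disp(demo.Sequence)
```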

IgnoreGapsValue

Controls the removal of gap symbols. Choices are true or false (default).

BlockreadValue

Scalar or vector that controls the reading of a single sequence entry or block of sequence entries from a FASTA-formatted file containing multiple sequences. Enter a scalar N to read the Nth entry in the file. Enter a 1-by-2 vector [M1, M2] to read the block of entries starting at the M1 entry and ending at the M2 entry. To read all remaining entries in the file starting at the M1 entry, enter a positive value for M1 and enter Inf for M2.

TrimHeadersValue

Specifies whether to trim the header after the first white space character. White space characters include a space (char(32)) and a tab (char(9)). Choices are true or false (default).

Output Arguments

FASTAData

MATLAB structure with the fields Header and Sequence.

Description

fastaread reads data from a FASTA-formatted file into a MATLAB structure with the following fields.

Field       Description
Header      Header information.
Sequence    Single letter-code representation of a nucleotide or amino acid sequence.

Each entry in a FASTA-formatted file begins with a right angle bracket (>) followed by a single-line description. Following this description is the sequence as a series of lines with fewer than 80 characters. Sequences must use the standard IUB/IUPAC amino acid and nucleotide letter codes.

For a list of codes, see aminolookup and baselookup.

FASTAData = fastaread(File) reads a FASTA-formatted file and returns the data in a structure. FASTAData.Header is the header information, while FASTAData.Sequence is the sequence stored as a character vector or string.

[Header, Sequence] = fastaread(File) reads data from a file into separate variables. If the file contains multiple sequences, then Header and Sequence are cell arrays of header and sequence information.
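The two calling forms above can be compared on a small file. This is a sketch, assuming the Bioinformatics Toolbox is installed; the temporary file and the entries seq1 and seq2 are hypothetical and exist only for this illustration.

```matlab
% Write a hypothetical two-entry FASTA file to a temporary location.
fname = [tempname '.fa'];
fid = fopen(fname, 'w');
fprintf(fid, '>seq1 first example entry\nACGTACGT\n');
fprintf(fid, '>seq2 second example entry\nTTGACA\n');
fclose(fid);

% Single output: a structure array with one element per entry.
data = fastaread(fname);
numEntries = numel(data);
firstSeq = data(1).Sequence;

% Two outputs: cell arrays, because the file holds more than one entry.
[headers, seqs] = fastaread(fname);

% Remove the temporary file.
delete(fname);
```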

... = fastaread(File, ...'PropertyName', PropertyValue, ...) calls fastaread with optional properties that use property name/property value pairs. You can specify one or more properties in any order. Each PropertyName must be enclosed in single quotation marks and is case insensitive. The property name/value pairs can be in any format supported by the function set (for example, name-value pairs and structures). These property name/property value pairs are as follows:

... = fastaread(File, ...'IgnoreGaps', IgnoreGapsValue, ...), when IgnoreGapsValue is true, removes any gap symbol ('-' or '.') from the sequences. Default is false.
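As an illustration of the IgnoreGaps property, the sketch below reads the same entry with and without gap removal. It assumes the Bioinformatics Toolbox is installed; the temporary file and the entry named aligned1 are hypothetical.

```matlab
% Write a hypothetical gapped entry to a temporary FASTA file.
fname = [tempname '.fa'];
fid = fopen(fname, 'w');
fprintf(fid, '>aligned1\nAC-GT..A\n');
fclose(fid);

% Default behavior: gap symbols are kept.
withGaps = fastaread(fname);

% With IgnoreGaps set to true, '-' and '.' are removed from the sequence.
noGaps = fastaread(fname, 'IgnoreGaps', true);

% Remove the temporary file.
delete(fname);
```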

... = fastaread(File, ...'Blockread', BlockreadValue, ...) lets you read in a single sequence entry or block of sequence entries from a file containing multiple sequences. If BlockreadValue is a scalar N, then fastaread reads the Nth entry in the file. If BlockreadValue is a 1-by-2 vector [M1, M2], then fastaread reads the block of entries starting at the M1 entry and ending at the M2 entry. To read all remaining entries in the file starting at the M1 entry, enter a positive value for M1 and enter Inf for M2.
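The three forms of BlockreadValue can be sketched on a small multi-entry file. This assumes the Bioinformatics Toolbox is installed; the temporary file and the entries seq1 through seq4 are hypothetical.

```matlab
% Write a hypothetical four-entry FASTA file to a temporary location.
fname = [tempname '.fa'];
fid = fopen(fname, 'w');
fprintf(fid, '>seq1\nAAAA\n>seq2\nCCCC\n>seq3\nGGGG\n>seq4\nTTTT\n');
fclose(fid);

% Scalar N: read only the Nth entry (here, the second).
second = fastaread(fname, 'Blockread', 2);

% Vector [M1, M2]: read entries M1 through M2 (here, the second and third).
middle = fastaread(fname, 'Blockread', [2 3]);

% Inf as M2: read from entry M1 to the end of the file.
tail = fastaread(fname, 'Blockread', [3 Inf]);

% Remove the temporary file.
delete(fname);
```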

... = fastaread(File, ...'TrimHeaders', TrimHeadersValue, ...) specifies whether to trim the header after the first white space character (a space or a tab). Default is false.
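The effect of TrimHeaders can be sketched as follows, assuming the Bioinformatics Toolbox is installed. The temporary file and the header text are hypothetical.

```matlab
% Write a hypothetical entry whose header contains a space-separated description.
fname = [tempname '.fa'];
fid = fopen(fname, 'w');
fprintf(fid, '>id123 example description text\nACGT\n');
fclose(fid);

% Default behavior: the full header line is returned (without the leading '>').
full = fastaread(fname);

% With TrimHeaders set to true, only the text before the first white space is kept.
trimmed = fastaread(fname, 'TrimHeaders', true);

% Remove the temporary file.
delete(fname);
```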

Examples

Read the nucleotide sequence information of the human p53 tumor gene.

p53nt = fastaread('p53nt.txt')
p53nt = struct with fields:
      Header: 'gi|8400737|ref|NM_000546.2| Homo sapiens tumor protein p53 (Li-Fraumeni syndrome) (TP53), mRNA'
    Sequence: 'ACTTGTCATGGCGACTGTCCAGCTTTGTGCCAGGAGCCTCGCAGGGGTTGATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAGTTTTGAGCTTCTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACACTTTGCGTTCGGGCTGGGAGCGTGCTTTCCACGACGGTGACACGCTTCCCTGGATTGGCAGCCAGACTGCCTTCCGGGTCACTGCCATGGAGGAGCCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAACATTTTCAGACCTATGGAAACTACTTCCTGAAAACAACGTTCTGTCCCCCTTGCCGTCCCAAGCAATGGATGATTTGATGCTGTCCCCGGACGATATTGAACAATGGTTCACTGAAGACCCAGGTCCAGATGAAGCTCCCAGAATGCCAGAGGCTGCTCCCCGCGTGGCCCCTGCACCAGCAGCTCCTACACCGGCGGCCCCTGCACCAGCCCCCTCCTGGCCCCTGTCATCTTCTGTCCCTTCCCAGAAAACCTACCAGGGCAGCTACGGTTTCCGTCTGGGCTTCTTGCATTCTGGGACAGCCAAGTCTGTGACTTGCACGTACTCCCCTGCCCTCAACAAGATGTTTTGCCAACTGGCCAAGACCTGCCCTGTGCAGCTGTGGGTTGATTCCACACCCCCGCCCGGCACCCGCGTCCGCGCCATGGCCATCTACAAGCAGTCACAGCACATGACGGAGGTTGTGAGGCGCTGCCCCCACCATGAGCGCTGCTCAGATAGCGATGGTCTGGCCCCTCCTCAGCATCTTATCCGAGTGGAAGGAAATTTGCGTGTGGAGTATTTGGATGACAGAAACACTTTTCGACATAGTGTGGTGGTGCCCTATGAGCCGCCTGAGGTTGGCTCTGACTGTACCACCATCCACTACAACTACATGTGTAACAGTTCCTGCATGGGCGGCATGAACCGGAGGCCCATCCTCACCATCATCACACTGGAAGACTCCAGTGGTAATCTACTGGGACGGAACAGCTTTGAGGTGCGTGTTTGTGCCTGTCCTGGGAGAGACCGGCGCACAGAGGAAGAGAATCTCCGCAAGAAAGGGGAGCCTCACCACGAGCTGCCCCCAGGGAGCACTAAGCGAGCACTGCCCAACAACACCAGCTCCTCTCCCCAGCCAAAGAAGAAACCACTGGATGGAGAATATTTCACCCTTCAGATCCGTGGGCGTGAGCGCTTCGAGATGTTCCGAGAGCTGAATGAGGCCTTGGAACTCAAGGATGCCCAGGCTGGGAAGGAGCCAGGGGGGAGCAGGGCTCACTCCAGCCACCTGAAGTCCAAAAAGGGTCAGTCTACCTCCCGCCATAAAAAACTCATGTTCAAGACAGAAGGGCCTGACTCAGACTGACATTCTCCACTTCTTGTTCCCCACTGACAGCCTCCCACCCCCATCTCTCCCTCCCCTGCCATTTTGGGTTTTGGGTCTTTGAACCCTTGCTTGCAATAGGTGTGCGTCAGAAGCACCCAGGACTTCCATTTGCTTTGTCCCGGGGCTCCACTGAACAAGTTGGCCTGCACTGGTGTTTTGTTGTGGGGAGGAGGATGGGGAGTAGGACATACCAGCTTAGATTTTAAGGTTTTTACTGTGAGGGATGTTTGGGAGATGTAAGAAATGTTCTTGCAGTTAAGGGTTAGTTTACAATCAGCCACATTCTAGGTAGGTAGGGGCCCACTTCACCGTACTAACCAGGGAAGCTGTCCCTCATGTTGAATTTTCTCTAACTTCAAGGCCCATATCTGTGAAATGCTGGCATTTGCACCTACCTCACAGAGTGCATTGTGAGGGTTAATGAAATAATGTACATCTGGCCTTGAAACCACCTTTTATTACATGGGGTCTAAAACTTGACCCCCTTGAGGGTGCCTGTTCCCTCTCCCTCTCCCTGTTGGCTGGTGGGTT
GGTAGTTTCTACAGTTGGGCAGCTGGTTAGGTAGAGGGAGTTGTCAAGTCTTGCTGGCCCAGCCAAACCCTGTCTGACAACCTCTTGGTCGACCTTAGTACCTAAAAGGAAATCTCACCCCATCCCACACCCTGGAGGATTTCATCTCTTGTATATGATGATCTGGATCCACCAAGACTTGTTTTATGCTCAGGGTCAATTTCTTTTTTCTTTTTTTTTTTTTTTTTTCTTTTTCTTTGAGACTGGGTCTCGCTTTGTTGCCCAGGCTGGAGTGGAGTGGCGTGATCTTGGCTTACTGCAGCCTTTGCCTCCCCGGCTCGAGCAGTCCTGCCTCAGCCTCCGGAGTAGCTGGGACCACAGGTTCATGCCACCATGGCCAGCCAACTTTTGCATGTTTTGTAGAGATGGGGTCTCACAGTGTTGCCCAGGCTGGTCTCAAACTCCTGGGCTCAGGCGATCCACCTGTCTCAGCCTCCCAGAGTGCTGGGATTACAATTGTGAGCCACCACGTGGAGCTGGAAGGGTCAACATCTTTTACATTCTGCAAGCACATCTGCATTTTCACCCCACCCTTCCCCTCCTTCTCCCTTTTTATATCCCATTTTTATATCGATCTCTTATTTTACAATAAAACTTTGCTGCCA'

Read the amino acid sequence information of p53 protein.

p53aa = fastaread('p53aa.txt')
p53aa = struct with fields:
      Header: 'gi|8400738|ref|NP_000537.2| tumor protein p53 [Homo sapiens]'
    Sequence: 'MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPRVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD'

Read a block of entries from a FASTA file.

pf2_5_10 = fastaread('pf00002.fa', 'blockread', [5 10], ...
                     'ignoregaps',true)
pf2_5_10 = 6×1 struct array with fields:
    Header
    Sequence

Introduced before R2006a