Bioinformatics Toolbox™ includes several get functions that retrieve information from various Web databases. Additionally, with some basic MATLAB® programming skills, you can create your own get function to retrieve information from a specific Web database.
The following procedure illustrates how to create a function to retrieve information from the NCBI PubMed database and read the information into a MATLAB structure. The NCBI PubMed database contains biomedical literature citations and abstracts.
The following procedure shows you how to create a function named getpubmed using the MATLAB Editor. This function will retrieve citation and abstract information from PubMed literature searches and write the data to a MATLAB structure.
Specifically, this function will take one or more search terms, submit them to the PubMed database for a search, then return a MATLAB structure or structure array, with each structure containing information for an article found by the search. The returned information will include a PubMed identifier, publication date, title, abstract, authors, and citation.
The function will also include property name-value pairs that let the user of the function limit the search by publication date and limit the number of records returned. The steps below build the function from scratch. To see the completed file, type edit getpubmed.m.
From MATLAB, open the MATLAB Editor by selecting File > New > Function.
Define the getpubmed function, its input arguments, and return values by typing:
function pmstruct = getpubmed(searchterm,varargin)
% GETPUBMED Search PubMed database & write results to MATLAB structure
Add code to do some basic error checking for the required input SEARCHTERM.
% Error checking for required input SEARCHTERM
if(nargin<1)
    error(message('bioinfo:getpubmed:NotEnoughInputArguments'));
end
Create variables for the two property name-value pairs, and set their default values.
% Set default settings for property name/value pairs,
% 'NUMBEROFRECORDS' and 'DATEOFPUBLICATION'
maxnum = 50;   % NUMBEROFRECORDS default is 50
pubdate = '';  % DATEOFPUBLICATION default is an empty string
Add code to parse the two property name-value pairs if provided as input.
% Parsing the property name/value pairs
num_argin = numel(varargin);
for n = 1:2:num_argin
    arg = varargin{n};
    switch lower(arg)
        % If NUMBEROFRECORDS is passed, set MAXNUM
        case 'numberofrecords'
            maxnum = varargin{n+1};
        % If DATEOFPUBLICATION is passed, set PUBDATE
        case 'dateofpublication'
            pubdate = varargin{n+1};
    end
end
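To see the parsing loop in isolation, the following sketch runs it on a hypothetical argument list (the variable args stands in for varargin). The check for an odd number of arguments and the otherwise branch are additions that are not part of the steps above; they are one possible way to fail early on malformed input, and the error identifiers are invented for illustration.

```matlab
% Hypothetical argument list standing in for varargin
args = {'NumberOfRecords', 10, 'DateOfPublication', '2008'};

% Defaults, as set in the previous step
maxnum = 50;
pubdate = '';

% Added check (assumption, not in the original steps):
% every property name must be followed by a value
if mod(numel(args), 2) ~= 0
    error('getpubmed:IncompleteParameterPair', ...
        'Each property name must be followed by a value.');
end

for n = 1:2:numel(args)
    switch lower(args{n})
        case 'numberofrecords'
            maxnum = args{n+1};
        case 'dateofpublication'
            pubdate = args{n+1};
        otherwise
            % Added branch (assumption): reject unknown property names
            error('getpubmed:UnknownParameterName', ...
                'Unknown property name: %s', args{n});
    end
end
```

Because the switch compares lower(args{n}), property names are matched without regard to case.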
You access the PubMed database through a search URL, which submits a search term and options, and then returns the search results in a specified format. This search URL consists of a base URL and defined parameters. Create a variable containing the base URL of the PubMed database on the NCBI Web site.
% Create base URL for PubMed db site
baseSearchURL = 'https://www.ncbi.nlm.nih.gov/sites/entrez?cmd=search';
Create variables to contain five defined parameters that the getpubmed function will use, namely, db (database), term (search term), report (report type, such as MEDLINE®), format (format type, such as text), and dispmax (maximum number of records to display).
% Set db parameter to pubmed
dbOpt = '&db=pubmed';
% Set term parameter to SEARCHTERM and PUBDATE
% (Default PUBDATE is '')
termOpt = ['&term=',searchterm,'+AND+',pubdate];
% Set report parameter to medline
reportOpt = '&report=medline';
% Set format parameter to text
formatOpt = '&format=text';
% Set dispmax to MAXNUM
% (Default MAXNUM is 50)
maxOpt = ['&dispmax=',num2str(maxnum)];
Create a variable containing the search URL from the variables created in the previous steps.
% Create search URL
searchURL = [baseSearchURL,dbOpt,termOpt,reportOpt,formatOpt,maxOpt];
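As a quick check of the URL assembly, the sketch below builds the search URL for a hypothetical search term and publication date. Two details are assumptions that go beyond the steps above: spaces in the term are replaced with + using strrep, and the +AND+ separator is appended only when a publication date is supplied, so that the default empty date does not leave a dangling operator at the end of the term.

```matlab
searchterm = 'avian influenza';  % hypothetical search term
pubdate = '2008';                % hypothetical publication date
maxnum = 5;

baseSearchURL = 'https://www.ncbi.nlm.nih.gov/sites/entrez?cmd=search';
dbOpt = '&db=pubmed';

% Replace spaces so the term can be placed in a URL (a simplification;
% other special characters are not handled here)
termOpt = ['&term=', strrep(searchterm, ' ', '+')];
if ~isempty(pubdate)
    % Append the date restriction only when one was given
    termOpt = [termOpt, '+AND+', pubdate];
end

reportOpt = '&report=medline';
formatOpt = '&format=text';
maxOpt = ['&dispmax=', num2str(maxnum)];

searchURL = [baseSearchURL, dbOpt, termOpt, reportOpt, formatOpt, maxOpt];
disp(searchURL)
```

Displaying the URL before submitting it makes it easy to paste into a browser and inspect the raw response.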
Use the urlread function to submit the search URL, retrieve the search results, and return the results (as text in the MEDLINE report type) in medlineText, a character array.
medlineText = urlread(searchURL);
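MathWorks documentation for newer releases lists urlread as not recommended and points to webread instead. A minimal sketch of an equivalent call follows, assuming a release that includes webread and weboptions. The 30-second timeout is an arbitrary choice, and fetchText is a hypothetical helper name, not part of the toolbox.

```matlab
% Request the response as plain text so that it is returned as a
% character array rather than being converted based on content type
opts = weboptions('ContentType', 'text', 'Timeout', 30);

% Hypothetical helper wrapping the call; nothing is downloaded
% until the function handle is invoked
fetchText = @(url) webread(url, opts);

% Inside getpubmed, the urlread call could then be written as:
% medlineText = fetchText(searchURL);
```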
Use the MATLAB regexp function and regular expressions to parse and extract the information in medlineText into hits, a cell array, where each cell contains the MEDLINE-formatted text for one article. The first input is the character array to search. The second input is a search expression, which tells the regexp function to find all records that start with PMID-. The third input, 'match', tells the regexp function to return the actual records, rather than the positions of the records.
hits = regexp(medlineText,'PMID-.*?(?=PMID|</pre>$)','match');
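The behavior of this expression is easier to see on a small input. The sketch below applies it to a hypothetical two-record response. The record contents are invented for illustration, and the surrounding pre element is assumed from the closing </pre> that the expression looks for.

```matlab
% Hypothetical response text with two MEDLINE-style records
medlineText = sprintf(['<pre>\n' ...
    'PMID- 11111111\nTI  - First hypothetical title\n\n' ...
    'PMID- 22222222\nTI  - Second hypothetical title\n' ...
    '</pre>']);

% Lazy match from each 'PMID-' up to, but not including, the next
% 'PMID' or the closing </pre> at the end of the text
hits = regexp(medlineText, 'PMID-.*?(?=PMID|</pre>$)', 'match');

fprintf('Found %d records\n', numel(hits));
```

Because the lookahead stops at any occurrence of PMID, a record whose own text mentions PMID would be cut short. Treat this as a limitation of the simple pattern rather than a general MEDLINE parser.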
Instantiate the pmstruct structure returned by getpubmed to contain six fields.
pmstruct = struct('PubMedID','','PublicationDate','','Title','',...
                  'Abstract','','Authors','','Citation','');
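Because pmstruct starts as a 1-by-1 structure with empty fields, it keeps that form when a search finds no records, and it grows into a structure array as records are assigned by index. A small sketch of this behavior, using only two of the six fields and hypothetical identifiers:

```matlab
s = struct('PubMedID', '', 'Title', '');

% Before any record is stored, the first element has empty fields
noResults = isempty(s(1).PubMedID);

ids = {'11111111', '22222222', '33333333'};  % hypothetical identifiers
for n = 1:numel(ids)
    s(n).PubMedID = ids{n};  % indexing past the end grows the array
end

disp(size(s))
```

A caller can therefore test isempty on the PubMedID field of the first element to detect an empty search result.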
Use the MATLAB regexp function and regular expressions to loop through each article in hits and extract the PubMed ID, publication date, title, abstract, authors, and citation. Place this information in the pmstruct structure array.
% MEDLINE field tags are padded to four characters before the hyphen,
% so two-letter tags are followed by two spaces (for example, 'DP  - ')
for n = 1:numel(hits)
    pmstruct(n).PubMedID = regexp(hits{n},'(?<=PMID- ).*?(?=\n)','match', 'once');
    pmstruct(n).PublicationDate = regexp(hits{n},'(?<=DP  - ).*?(?=\n)','match', 'once');
    pmstruct(n).Title = regexp(hits{n},'(?<=TI  - ).*?(?=PG  -|AB  -)','match', 'once');
    pmstruct(n).Abstract = regexp(hits{n},'(?<=AB  - ).*?(?=AD  -)','match', 'once');
    pmstruct(n).Authors = regexp(hits{n},'(?<=AU  - ).*?(?=\n)','match');
    pmstruct(n).Citation = regexp(hits{n},'(?<=SO  - ).*?(?=\n)','match', 'once');
end
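The field extraction can be tried on a single hypothetical record without contacting PubMed. The sketch assumes the MEDLINE layout in which field tags are padded to four characters before the hyphen, so two-letter tags are followed by two spaces. All record values are invented, and the strtrim calls are an addition to the steps above.

```matlab
% Hypothetical MEDLINE-style record; all values are invented
rec = sprintf(['PMID- 12345678\n' ...
    'DP  - 2008 Jan\n' ...
    'TI  - A hypothetical title\n' ...
    'AB  - A hypothetical abstract.\n' ...
    'AD  - Hypothetical Institute\n' ...
    'AU  - Doe J\n' ...
    'AU  - Roe R\n' ...
    'SO  - J Hypothet. 2008 Jan;1(1):1-2.\n']);

% Single-line fields: match from just after the tag to the end of the line
id      = regexp(rec, '(?<=PMID- ).*?(?=\n)', 'match', 'once');
pubDate = regexp(rec, '(?<=DP  - ).*?(?=\n)', 'match', 'once');

% Title and abstract matches run up to the next tag, so they keep a
% trailing newline; strtrim removes it
articleTitle = strtrim(regexp(rec, '(?<=TI  - ).*?(?=PG  -|AB  -)', 'match', 'once'));
abstractText = strtrim(regexp(rec, '(?<=AB  - ).*?(?=AD  -)', 'match', 'once'));

% Without 'once', regexp returns every match as a cell array
authors = regexp(rec, '(?<=AU  - ).*?(?=\n)', 'match');
```

Omitting 'once' for the author field is what allows an article with several AU lines to yield a cell array of names, while the other fields come back as character arrays.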
Select File > Save As.
When you are done, your file should look similar to the getpubmed.m file included with the Bioinformatics Toolbox software. The file is located at:
matlabroot\toolbox\bioinfo\biodemos\getpubmed.m
Note: matlabroot denotes the MATLAB root directory, which is the directory where the MATLAB software is installed on your system.