Get Information from Web Database

What Are get Functions?

Bioinformatics Toolbox™ includes several get functions that retrieve information from various Web databases. Additionally, with some basic MATLAB® programming skills, you can create your own get function to retrieve information from a specific Web database.

The following procedure illustrates how to create a function to retrieve information from the NCBI PubMed database and read the information into a MATLAB structure. The NCBI PubMed database contains biomedical literature citations and abstracts.

Creating the getpubmed Function

The following procedure shows you how to create a function named getpubmed using the MATLAB Editor. This function will retrieve citation and abstract information from PubMed literature searches and write the data to a MATLAB structure.

Specifically, this function will take one or more search terms, submit them to the PubMed database for a search, then return a MATLAB structure or structure array, with each structure containing information for an article found by the search. The returned information will include a PubMed identifier, publication date, title, abstract, authors, and citation.

The function will also accept two optional property name-value pairs that let the caller limit the search by publication date and limit the number of records returned. The steps below build the function from the beginning. To see the completed file, type edit getpubmed.m.

  1. From MATLAB, open the MATLAB Editor by selecting File > New > Function.

  2. Define the getpubmed function, its input arguments, and return values by typing:

    function pmstruct = getpubmed(searchterm,varargin)
    % GETPUBMED Search PubMed database & write results to MATLAB structure

  3. Add code to do some basic error checking for the required input SEARCHTERM.

    % Error checking for required input SEARCHTERM
    if(nargin<1)
        error(message('bioinfo:getpubmed:NotEnoughInputArguments'));
    end

  4. Create variables for the two property name-value pairs, and set their default values.

    % Set default settings for property name/value pairs, 
    % 'NUMBEROFRECORDS' and 'DATEOFPUBLICATION'
    maxnum = 50; % NUMBEROFRECORDS default is 50
    pubdate = ''; % DATEOFPUBLICATION default is an empty string

  5. Add code to parse the two property name-value pairs if provided as input.

    % Parsing the property name/value pairs 
    num_argin = numel(varargin);
    for n = 1:2:num_argin
        arg = varargin{n};
        switch lower(arg)
            
            % If NUMBEROFRECORDS is passed, set MAXNUM
            case 'numberofrecords'
                maxnum = varargin{n+1};
            
            % If DATEOFPUBLICATION is passed, set PUBDATE
            case 'dateofpublication'
                pubdate = varargin{n+1};          
                
        end     
    end
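The parsing loop above ignores unrecognized property names and assumes that every name is followed by a value. A stricter variant might look like the following sketch. The variable args is a hypothetical stand-in for varargin so that the snippet runs on its own, and the error identifiers are illustrative rather than taken from the shipped getpubmed.m.

```matlab
% Hypothetical stand-in for the varargin cell array received by getpubmed
args = {'NumberOfRecords', 5, 'DateOfPublication', '2009'};

% Defaults, as in step 4
maxnum = 50;
pubdate = '';

% Names and values must arrive in pairs
if mod(numel(args), 2) ~= 0
    error('getpubmed:IncorrectNumberOfArguments', ...
          'Property names and values must be supplied in pairs.');
end

for n = 1:2:numel(args)
    name = args{n};
    value = args{n+1};
    if ~ischar(name)
        error('getpubmed:InvalidPropertyName', ...
              'Property names must be character arrays.');
    end
    switch lower(name)
        case 'numberofrecords'
            % Accept only a positive integer scalar
            if ~(isnumeric(value) && isscalar(value) && ...
                    value > 0 && value == fix(value))
                error('getpubmed:InvalidNumberOfRecords', ...
                      'NUMBEROFRECORDS must be a positive integer.');
            end
            maxnum = value;
        case 'dateofpublication'
            if ~ischar(value)
                error('getpubmed:InvalidDateOfPublication', ...
                      'DATEOFPUBLICATION must be a character array.');
            end
            pubdate = value;
        otherwise
            % Report unknown names instead of silently skipping them
            error('getpubmed:UnknownProperty', ...
                  'Unknown property name: %s.', name);
    end
end
```

With the sample input, maxnum becomes 5 and pubdate becomes '2009'.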

  6. You access the PubMed database through a search URL, which submits a search term and options and returns the search results in a specified format. This search URL consists of a base URL and defined parameters. Create a variable containing the base URL of the PubMed database on the NCBI Web site.

    % Create base URL for PubMed db site
    baseSearchURL = 'https://www.ncbi.nlm.nih.gov/sites/entrez?cmd=search';

  7. Create variables to contain five defined parameters that the getpubmed function will use, namely, db (database), term (search term), report (report type, such as MEDLINE®), format (format type, such as text), and dispmax (maximum number of records to display).

    % Set db parameter to pubmed
    dbOpt = '&db=pubmed';
    
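    % Set term parameter to SEARCHTERM, adding PUBDATE only when one
    % was supplied (default PUBDATE is ''), so that the term never
    % ends in a dangling '+AND+'
    if isempty(pubdate)
        termOpt = ['&term=',searchterm];
    else
        termOpt = ['&term=',searchterm,'+AND+',pubdate];
    end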
    
    % Set report parameter to medline
    reportOpt = '&report=medline';
    
    % Set format parameter to text
    formatOpt = '&format=text';
    
    % Set dispmax to MAXNUM 
    % (Default MAXNUM is 50)
    maxOpt = ['&dispmax=',num2str(maxnum)];
    
  8. Create a variable containing the search URL from the variables created in the previous steps.

    % Create search URL
    searchURL = [baseSearchURL,dbOpt,termOpt,reportOpt,formatOpt,maxOpt];
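To see what the concatenation produces, the following sketch assembles the search URL outside the function. The search term 'flu+epidemic', the date '2009', and the record limit 5 are sample values chosen for illustration. Because the function does not encode the search term, multiple words are assumed here to be joined with + by the caller.

```matlab
% Sample inputs (illustrative values, not from the shipped file)
searchterm = 'flu+epidemic';
pubdate = '2009';
maxnum = 5;

% Same building blocks as in steps 6 and 7
baseSearchURL = 'https://www.ncbi.nlm.nih.gov/sites/entrez?cmd=search';
dbOpt = '&db=pubmed';
termOpt = ['&term=',searchterm,'+AND+',pubdate];
reportOpt = '&report=medline';
formatOpt = '&format=text';
maxOpt = ['&dispmax=',num2str(maxnum)];

% Character arrays concatenate left to right into one URL
searchURL = [baseSearchURL,dbOpt,termOpt,reportOpt,formatOpt,maxOpt];
disp(searchURL)
% https://www.ncbi.nlm.nih.gov/sites/entrez?cmd=search&db=pubmed&term=flu+epidemic+AND+2009&report=medline&format=text&dispmax=5
```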

  9. Use the urlread function to submit the search URL and retrieve the search results. The results are returned in medlineText, a character array containing text in the MEDLINE report type.

    medlineText = urlread(searchURL);
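Recent MATLAB documentation lists urlread as not recommended and points to webread instead. The following is a minimal sketch of the same step using webread, assuming a release that provides webread and weboptions (R2014b or later). The sample URL values, the 30-second timeout, and the warning identifier are illustrative. Whether the server still answers this URL with MEDLINE text is not verified here.

```matlab
% Search URL with sample values filled in (illustrative)
searchURL = ['https://www.ncbi.nlm.nih.gov/sites/entrez?cmd=search', ...
             '&db=pubmed&term=flu+epidemic+AND+2009', ...
             '&report=medline&format=text&dispmax=5'];

% Ask for the response body as plain text and cap the wait time
options = weboptions('ContentType','text','Timeout',30);

try
    medlineText = webread(searchURL, options);
catch err
    % Fall back to an empty result if the request cannot be completed
    warning('getpubmed:RequestFailed', 'Request failed: %s', err.message);
    medlineText = '';
end
```

Falling back to an empty character array lets the later parsing steps run without error when the request fails. They then find no records.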
    
  10. Use the MATLAB regexp function and regular expressions to parse medlineText into hits, a cell array in which each cell contains the MEDLINE-formatted text for one article. The first input is the character array to search. The second input is the search expression, which tells the regexp function to find all records that start with PMID-. The third input, 'match', tells the regexp function to return the matched records themselves, rather than their positions.

    hits = regexp(medlineText,'PMID-.*?(?=PMID|</pre>$)','match');
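The effect of this expression can be checked offline. The sketch below applies it to a small made-up block of text that imitates the layout the pattern expects. The two records are invented for illustration and are not real PubMed output.

```matlab
% Made-up text imitating two MEDLINE records wrapped in <pre> tags
sampleText = sprintf(['<pre>\nPMID- 111\nTI  - First title\n\n', ...
                      'PMID- 222\nTI  - Second title\n</pre>']);

% Lazy match from each 'PMID-' up to the next 'PMID' or the final </pre>
hits = regexp(sampleText,'PMID-.*?(?=PMID|</pre>$)','match');

% One cell per record
fprintf('%d records found\n', numel(hits));
```

Because the lookahead does not consume text, each match stops just before the next PMID tag. The closing </pre> is excluded from the last record.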
    
  11. Initialize pmstruct, the structure that getpubmed returns, with six fields.

    pmstruct = struct('PubMedID','','PublicationDate','','Title','',...
                 'Abstract','','Authors','','Citation','');
    
  12. Loop through each article in hits, using the MATLAB regexp function and regular expressions to extract the PubMed ID, publication date, title, abstract, authors, and citation. Place this information in the pmstruct structure array.

    for n = 1:numel(hits)
        pmstruct(n).PubMedID = regexp(hits{n},'(?<=PMID- ).*?(?=\n)','match', 'once');
        pmstruct(n).PublicationDate = regexp(hits{n},'(?<=DP  - ).*?(?=\n)','match', 'once');
        pmstruct(n).Title = regexp(hits{n},'(?<=TI  - ).*?(?=PG  -|AB  -)','match', 'once');
        pmstruct(n).Abstract = regexp(hits{n},'(?<=AB  - ).*?(?=AD  -)','match', 'once');
        pmstruct(n).Authors = regexp(hits{n},'(?<=AU  - ).*?(?=\n)','match');
        pmstruct(n).Citation = regexp(hits{n},'(?<=SO  - ).*?(?=\n)','match', 'once');
    end
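The field patterns can also be tried on a single record without a network connection. The sketch below uses a made-up record whose tags follow the layout assumed by the expressions in step 12. All values are invented for illustration.

```matlab
% Made-up record in the tag layout that the step 12 patterns expect
record = sprintf(['PMID- 12345\n', ...
                  'DP  - 2009 Jan\n', ...
                  'TI  - A made-up title\n', ...
                  'AB  - A made-up abstract.\n', ...
                  'AD  - Some Institute\n', ...
                  'AU  - Doe J\n', ...
                  'AU  - Roe R\n', ...
                  'SO  - J Example. 2009 Jan;1(1):1-2.\n']);

% 'once' returns the first match as a character array
id           = regexp(record,'(?<=PMID- ).*?(?=\n)','match','once');
pubDate      = regexp(record,'(?<=DP  - ).*?(?=\n)','match','once');
articleTitle = regexp(record,'(?<=TI  - ).*?(?=PG  -|AB  -)','match','once');
abstractText = regexp(record,'(?<=AB  - ).*?(?=AD  -)','match','once');
citation     = regexp(record,'(?<=SO  - ).*?(?=\n)','match','once');

% Without 'once', every AU line is returned in a cell array
authors = regexp(record,'(?<=AU  - ).*?(?=\n)','match');

% Title and abstract run up to the next tag, so they keep a trailing newline
articleTitle = strtrim(articleTitle);
abstractText = strtrim(abstractText);
```

Because the title and abstract patterns stop at a following tag rather than at a line break, a record that lacks that tag (for example, one with no AD line) yields an empty abstract.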

  13. Select File > Save As, and save the file as getpubmed.m.

    When you are done, your file should look similar to the getpubmed.m file included with the Bioinformatics Toolbox software. The file is located at:

    matlabroot\toolbox\bioinfo\biodemos\getpubmed.m

    Note

    The notation matlabroot is the MATLAB root directory, which is the directory where the MATLAB software is installed on your system.
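Once getpubmed.m is saved in a folder on the MATLAB path, a call might look like the following sketch. The search term, record limit, and date are sample values. The call needs network access, so it is wrapped in try/catch. Whether the server still returns MEDLINE text for this URL is not verified here.

```matlab
% Start from an empty result so the variable exists even if the search fails
pm = struct('PubMedID',{});

try
    % Property names are matched without regard to case
    pm = getpubmed('flu+epidemic','NumberOfRecords',5, ...
                   'DateOfPublication','2009');
    fprintf('%d record(s) returned.\n', numel(pm));
catch err
    fprintf('Search did not complete: %s\n', err.message);
end
```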