Exploring Microarray Gene Expression Data

This example shows how to identify differentially expressed genes from microarray data and uses Gene Ontology to determine significant biological functions that are associated to the down- and up-regulated genes.

Introduction

Microarrays contain oligonucleotide or cDNA probes for comparing the expression profile of genes on a genomic scale. Determining if changes in gene expression are statistically significant between different conditions, e.g. two different tumor types, and determining the biological function of the differentially expressed genes, are important aims in a microarray experiment.

A publicly available dataset containing gene expression data of 42 tumor tissues of the embryonal central nervous system (CNS) [1] is used for this example. The CEL files can be downloaded here. The samples were hybridized on Affymetrix® HuGeneFL GeneChip® arrays. The raw dataset was preprocessed with the Robust Multi-array Average (RMA) and GC Robust Multi-array Average (GCRMA) procedures. For further information on Affymetrix oligonucleotide microarray preprocessing, see Preprocessing Affymetrix® Microarray Data at the Probe Level.

You will use the t-test and false discovery rate to detect differentially expressed genes between two tumor types. Additionally, you will look at Gene Ontology terms related to the significantly up-regulated genes.

Loading the Expression Data

Load the MAT file cnsexpressiondata containing three DataMatrix objects associated with the gene expression values preprocessed using RMA (expr_cns_rma), GCRMA with Maximum Likelihood Estimate (expr_cns_gcrma_mle), and GCRMA with Empirical-Bayes estimate (expr_cns_gcrma_eb).

load cnsexpressiondata

In each DataMatrix object, each row corresponds to a probe set on the array, and each column corresponds to a sample. The DataMatrix object expr_cns_gcrma_eb will be used in this example, but data from either one of the other two expression variables can be used as well.

Retrieve the properties of the DataMatrix object expr_cns_gcrma_eb using the get command.

get(expr_cns_gcrma_eb)
            Name: ''
        RowNames: {7129×1 cell}
        ColNames: {1×42 cell}
           NRows: 7129
           NCols: 42
           NDims: 2
    ElementClass: 'single'

Determine the number of genes and number of samples by accessing the number of rows and number of columns of the DataMatrix object respectively.

nGenes = expr_cns_gcrma_eb.NRows
nSamples = expr_cns_gcrma_eb.NCols
nGenes =

        7129


nSamples =

    42

A mapping between the probe set ID and the corresponding gene symbol is provided as Map object in the MAT file HuGeneFL_GeneSymbol_Map.

load HuGeneFL_GeneSymbol_Map

Annotate the expression values in expr_cns_gcrma_eb with the corresponding gene symbols by creating a cell array of gene symbols from the Map object and setting the row names of the Data Matrix object.

huGenes = values(hu6800GeneSymbolMap, expr_cns_gcrma_eb.RowNames);
expr_cns_gcrma_eb = rownames(expr_cns_gcrma_eb, ':', huGenes);

Filtering the Expression Data

Many probe sets in this example are not annotated, not expressed or have a small variability across samples. Use the following techniques to filter out these genes.

Remove gene expression data with empty gene symbols (in this example, the empty symbols are labeled as '---').

expr_cns_gcrma_eb('---', :) = [];

Use genelowvalfilter to filter out genes with very low absolute expression values.

[~, expr_cns_gcrma_eb] = genelowvalfilter(expr_cns_gcrma_eb);

Use genevarfilter to filter out genes with a small variance across samples.

[~, expr_cns_gcrma_eb] = genevarfilter(expr_cns_gcrma_eb);

Determine the number of genes after filtering.

nGenes = expr_cns_gcrma_eb.NRows
nGenes =

        5669

Identifying Differential Gene Expression

You can now compare the gene expression values between two groups of data: CNS medulloblastomas (MD) and non-neuronal origin malignant gliomas (Mglio) tumor.

From the expression data of all 42 samples in the dataset, extract the data of the 10 MD samples and the 10 Mglio samples.

MDs = strncmp(expr_cns_gcrma_eb.ColNames,'Brain_MD', 8);
Mglios = strncmp(expr_cns_gcrma_eb.ColNames,'Brain_MGlio', 11);

MDData = expr_cns_gcrma_eb(:, MDs);
get(MDData)

MglioData = expr_cns_gcrma_eb(:, Mglios);
get(MglioData)
            Name: ''
        RowNames: {5669×1 cell}
        ColNames: {1×10 cell}
           NRows: 5669
           NCols: 10
           NDims: 2
    ElementClass: 'single'

            Name: ''
        RowNames: {5669×1 cell}
        ColNames: {1×10 cell}
           NRows: 5669
           NCols: 10
           NDims: 2
    ElementClass: 'single'

Conduct a t-test for each gene to identify significant changes in expression values between the MD samples and Mglio samples. You can inspect the test results from the normal quantile plot of t-scores and the histograms of t-scores and p-values of the t-tests.

[pvalues, tscores] = mattest(MDData, MglioData,...
                                'Showhist', true', 'Showplot', true);

In any test situation, two types of errors can occur, a false positive by declaring that a gene is differentially expressed when it is not, and a false negative when the test fails to identify a truly differentially expressed gene. In multiple hypothesis testing, which simultaneously tests the null hypothesis of thousands of genes, each test has a specific false positive rate, or a false discovery rate (FDR). False discovery rate is defined as the expected ratio of the number of false positives to the total number of positive calls in a differential expression analysis between two groups of samples [2].

In this example, you will compute the FDR using the Storey-Tibshirani procedure [2]. The procedure also computes the q-value of a test, which measures the minimum FDR that occurs when calling the test significant. The estimation of FDR depends on the truly null distribution of the multiple tests, which is unknown. Permutation methods can be used to estimate the truly null distribution of the test statistics by permuting the columns of the gene expression data matrix [2][3]. Depending on the sample size, it may not be feasible to consider all possible permutations. Usually a random subset of permutations are considered in the case of large sample size. Use the nchoosek function in Statistics and Machine Learning Toolbox™ to find out the number of all possible permutations of the samples in this example.

all_possible_perms = nchoosek(1:MDData.NCols+MglioData.NCols, MDData.NCols);
size(all_possible_perms, 1)
ans =

      184756

Perform a permutation t-test using mattest and the PERMUTE option to compute the p-values of 10,000 permutations by permuting the columns of the gene expression data matrix of MDData and MglioData [3].

pvaluesCorr = mattest(MDData, MglioData, 'Permute', 10000);

Determine the number of genes considered to have statistical significance at the p-value cutoff of 0.05. Note: You may get a different number of genes due to the permutation test outcome.

cutoff = 0.05;
sum(pvaluesCorr < cutoff)
ans =

        2121

Estimate the FDR and q-values for each test using mafdr. The quantity pi0 is the overall proportion of true null hypotheses in the study. It is estimated from the simulated null distribution via bootstrap or the cubic polynomial fit. Note: You can also manually set the value of lambda for estimating pi0.

figure;
[pFDR, qvalues] = mafdr(pvaluesCorr, 'showplot', true);

Determine the number of genes that have q-values less than the cutoff value. Note: You may get a different number of genes due to the permutation test and the bootstrap outcomes.

sum(qvalues < cutoff)
ans =

        2173

Many genes with low FDR implies that the two groups, MD and Mglio, are biologically distinct.

You can also empirically estimate the FDR adjusted p-values using the Benjamini-Hochberg (BH) procedure [4] by setting the mafdr input parameter BHFDR to true.

pvaluesBH = mafdr(pvaluesCorr, 'BHFDR', true);
sum(pvaluesBH < cutoff)
ans =

        1374

You can store the t-scores, p-values, pFDRs, q-values and BH FDR corrected p-values together as a DataMatrix object.

testResults = [tscores pvaluesCorr pFDR qvalues pvaluesBH];

Update the column name for BH FDR corrected p-values using the colnames method of DataMatrix object.

testResults = colnames(testResults, 5, {'FDR_BH'});

You can sort by p-values pvaluesCorr using the sortrows mathod.

testResults = sortrows(testResults, 2);

Display the first 20 genes in testResults. Note: Your results may be different from those shown below due to the permutation test and the bootstrap outcomes.

testResults(1:20, :)
ans = 

                t-scores    p-values      FDR           q-values      FDR_BH    
    PLEC1       -9.6223     6.7194e-09    1.3675e-05     7.171e-06    1.9974e-05
    HNRPA1        9.359      1.382e-08    1.4063e-05     7.171e-06    1.9974e-05
    FCGR2A      -9.3548      1.394e-08     9.457e-06     7.171e-06    1.9974e-05
    PLEC1       -9.3495     1.4094e-08     7.171e-06     7.171e-06    1.9974e-05
    FBL          9.1518     1.9875e-08    8.0899e-06    7.1728e-06     1.998e-05
    KIAA0367     -8.996     2.4324e-08    8.2509e-06    7.1728e-06     1.998e-05
    ID2B        -8.9285     2.6667e-08    7.7533e-06    7.1728e-06     1.998e-05
    RBMX         8.8905     2.8195e-08    7.1728e-06    7.1728e-06     1.998e-05
    PAFAH1B3     8.7561     3.5317e-08    7.9864e-06    7.9864e-06    2.2246e-05
    H3F3A        8.6512     4.5191e-08    9.1973e-06    8.5559e-06    2.3832e-05
    LRP1        -8.6465     4.6243e-08    8.5559e-06    8.5559e-06    2.3832e-05
    PEA15       -8.3256     1.1419e-07    1.9367e-05    1.9367e-05    5.3947e-05
    ID2B        -8.1183     1.7041e-07    2.6679e-05    2.4793e-05    6.9059e-05
    SFRS3        8.1166     1.7055e-07    2.4793e-05    2.4793e-05    6.9059e-05
    HLA-DPA1    -7.8546     2.4004e-07    3.2569e-05    3.2569e-05     9.072e-05
    C5orf13      7.7195     2.9229e-07    3.7179e-05    3.3475e-05    9.3243e-05
    PTMA         7.7013     2.9658e-07    3.5506e-05    3.3475e-05    9.3243e-05
    NAP1L1        7.674     3.0489e-07    3.4474e-05    3.3475e-05    9.3243e-05
    HMGB2        7.6532     3.1251e-07    3.3475e-05    3.3475e-05    9.3243e-05
    RAB31       -13.664      3.308e-07    3.3662e-05    3.3662e-05    9.3766e-05

A gene is considered to be differentially expressed between the two groups of samples if it shows both statistical and biological significance. This example compares the gene expression ratio of MD over Mglio tumor samples. Therefore an up-regulated gene in this example has higher expression in MD, and down-regulated gene has higher expression in Mglio.

Plot the -log10 of p-values against the biological effect in a volcano plot. Note: From the volcano plot UI, you can interactively change the p-value cutoff and fold change limit, and export differentially expressed genes.

diffStruct = mavolcanoplot(MDData, MglioData, pvaluesCorr)
diffStruct = 

  struct with fields:

           Name: 'Differentially Expressed'
       PVCutoff: 0.0500
    FCThreshold: 2
     GeneLabels: {327×1 cell}
        PValues: [327×1 bioma.data.DataMatrix]
    FoldChanges: [327×1 bioma.data.DataMatrix]

Ctrl-click genes in the gene lists to label the genes in the plot. As seen in the volcano plot, genes specific for neuronal based cerebella granule cells, such as ZIC and NEUROD, are found in the up-regulated gene list, while genes typical of the astrocytic and oligodendrocytic lineage and cell differentiation, such as SOX2, PEA15, and ID2B, are found in the down-regulated list.

Determine the number of differentially expressed genes.

nDiffGenes = diffStruct.PValues.NRows
nDiffGenes =

   327

In particular, determine the list of up-regulated genes and the list of down-regulated genes for MD compared to Mglio.

up_geneidx = find(diffStruct.FoldChanges > 0);
nUpGenes = length(up_geneidx)

down_geneidx = find(diffStruct.FoldChanges < 0);
nDownGenes = length(down_geneidx)
nUpGenes =

   225


nDownGenes =

   102

Annotating Up-Regulated Genes Using Gene Ontology

You can use Gene Ontology (GO) information to annotate the differentially expressed genes identified above. The GO annotation file (GAF) for human can be downloaded from Gene Ontology Current Annotations. For convenience, a map between the gene symbols and associated GO IDs relatively to the aspect field Function is included in the MAT file goa_human.

load goa_human

Alternatively, you can run the code below to download the Gene Ontology database with the latest annotations, read the downloaded Homo sapiens annotation file (name the file as goa_human.gaf), and create a mapping between the gene symbols and the associated GO terms.

% GO = geneont('live',true);
% HGann = goannotread('goa_human.gaf',...
%    'Aspect','F','Fields',{'DB_Object_Symbol','GOid'});
% HGmap = containers.Map();
% for i = 1:numel(HGann)
%     key = HGann(i).DB_Object_Symbol;
%     if isKey(HGmap,key)
%         HGmap(key) = [HGmap(key) HGann(i).GOid];
%     else
%         HGmap(key) = HGann(i).GOid;
%     end
% end

Find the indices of the up-regulated genes for Gene Ontology analysis.

up_genes = rownames(diffStruct.FoldChanges, up_geneidx);
huGenes = rownames(expr_cns_gcrma_eb);
for i = 1:nUpGenes
    up_geneidx(i) = find(strncmpi(huGenes, up_genes{i}, length(up_genes{i})), 1);
end

Not all the genes on the HuGeneFL chip are annotated. For every gene on the chip, see if it is annotated by comparing its gene symbol to the list of gene symbols from GO. Track the number of annotated genes and the number of up-regulated genes associated with each GO term. Note that data in public repositories is frequently curated and updated; therefore the results of this example might be slightly different when you use up-to-date datasets. It is also possible that you get warnings about invalid or obsolete IDs due to an updated Homo sapiens gene annotation file.

m = GO.Terms(end).id;        % gets the last term id
chipgenesCount = zeros(m,1); % a vector of GO term counts for the entire chip.
upgenesCount  = zeros(m,1);  % a vector of GO term counts for up-regulated genes.
for i = 1:length(huGenes)
    if isKey(HGmap,huGenes{i})
        goid = getrelatives(GO,HGmap(huGenes{i}));
        chipgenesCount(goid) = chipgenesCount(goid) + 1;
        if (any(i == up_geneidx))
            upgenesCount(goid) = upgenesCount(goid) + 1;
        end
    end
end

Determine the statistically significant GO terms using the hypergeometric probability distribution. For each GO term, a p-value is calculated representing the probability that the number of annotated genes associated with it could have been found by chance.

gopvalues = hygepdf(upgenesCount,max(chipgenesCount),...
                        max(upgenesCount),chipgenesCount);
[dummy, idx] = sort(gopvalues);

Report the top ten most significant GO terms as follows.

report = sprintf('GO Term     p-value     counts      definition\n');
for i = 1:10
    term = idx(i);
    report = sprintf('%s%s\t%-1.5f\t%3d / %3d\t%s...\n',...
             report, char(num2goid(term)), gopvalues(term),...
             upgenesCount(term), chipgenesCount(term),...
            GO(term).Term.definition(2:min(50,end)));
end
disp(report);
GO Term     p-value     counts      definition
GO:0005515	0.00000	131 / 3459	Interacting selectively and non-covalently with a...
GO:0044822	0.00000	 94 / 514	Interacting non-covalently with a poly(A) RNA, a ...
GO:0003723	0.00000	 95 / 611	Interacting selectively and non-covalently with a...
GO:0003729	0.00000	 82 / 460	Interacting selectively and non-covalently with m...
GO:0003735	0.00000	 54 / 159	The action of a molecule that contributes to the ...
GO:0019843	0.00000	 48 / 186	Interacting selectively and non-covalently with r...
GO:0008135	0.00000	 50 / 208	Functions during translation by interacting selec...
GO:0000049	0.00000	 47 / 188	Interacting selectively and non-covalently with t...
GO:0000498	0.00000	 46 / 179	Interacting selectively and non-covalently with r...
GO:0001069	0.00000	 46 / 179	Interacting selectively and non-covalently with a...

Select the GO terms related to specific molecule functions and build a sub-ontology that includes the ancestors of the terms. Visualize this ontology using the biograph function. You can color the graphs nodes according to their significance. In this example, the red nodes are the most significant, while the blue nodes are the least significant gene ontology terms. Note: The GO terms returned may differ from those shown due to the frequent update to the Homo sapiens gene annotation file.

fcnAncestors = GO(getancestors(GO,idx(1:5)));
[cm,acc,rels] = getmatrix(fcnAncestors);
BG = biograph(cm,get(fcnAncestors.Terms,'name'));

for i = 1:numel(acc)
    pval = gopvalues(acc(i));
    color = [(1-pval).^(1) pval.^(1/8) pval.^(1/8)];
    BG.Nodes(i).Color = color;
end
view(BG)

Finding the Differentially Expressed Genes in Pathways

You can query the pathway information of the differentially expressed genes from the KEGG pathway database through KEGG's Web Service.

Following are a few pathway maps with the genes in the up-regulated gene list highlighted:

Cell Cycle

Hedgehog Signaling pathway

mTor Signaling pathway

References

[1] Pomeroy, S.L., et al., "Prediction of central nervous system embryonal tumour outcome based on gene expression". Nature, 415(6870):436-42, 2001.

[2] Storey, J.D., and Tibshirani, R., "Statistical significance for genomewide studies", PNAS, 100(16):9440-5, 2003.

[3] Dudoit, S., Shaffer, J.P., and Boldrick, J.C., "Multiple hypothesis testing in microarray experiment", Statistical Science, 18(1):71-103, 2003.

[4] Benjamini, Y., and Hochberg, Y., "Controlling the false discovery rate: a practical and powerful approach to multiple testing", Journal of the Royal Statistical Society, Series B, 57(1):289-300, 1995.