Clinical Chemistry 47: 1350-1352, 2001;
© 2001 American Association for Clinical Chemistry, Inc.
Analysis Issues for Gene Expression Array Data
Jae K. Lee
Division of Biostatistics and Epidemiology, Department of Health Evaluation Sciences, University of Virginia School of Medicine, PO Box 800717, Charlottesville, VA 22908. Fax 804-924-8437; e-mail jaeklee@virginia.edu.
Introduction
Gene expression array technologies are rapidly emerging for use in various genome-wide studies in biology and medicine (1)(2)(3). Results from the recently completed sequencing of the human genome predict that the number of genes in the human genome is much smaller (approximately 33 000 genes) than anticipated (4). This implies that the major complexity in biological mechanisms in humans lies in the synergistic effects among various genes. Therefore, to decipher the secrets of life and, ultimately, to find cures for many human diseases, biologists must deal with multiple genes and their interacting transcripts simultaneously. High-throughput biotechnologies, including gene chip approaches, will play an important role in these studies (5).
However, quality control over thousands of gene expression values and full utilization of the information in these high-throughput data are extremely difficult, and important issues in quality control and bioinformatic approaches have not been resolved (6)(7). A series of careful analyses of the variability of array instrumentation and of the statistical evaluation of gene expression intensities is required for reliable and consistent inference from gene chip data. Specifically, the sources of error in these high-throughput measurements, and the confidence that can be placed in them, need to be better understood because some errors can significantly alter our inferences and conclusions (8). The purpose of this report is to dispel three common misconceptions about array experiments.
Myth 1: A Replicated Gene Chip Experiment Is Needed Only for Confirming Reproducibility
It is often believed that the reproducibility of gene expression data is high enough to perform bioinformatic discovery without a replicated chip experiment, and that duplicated array data are necessary only to confirm such a discovery, which may help publication in high-standard journals. Investigators who have successfully obtained resources for high-cost gene chip experiments must optimize their experimental strategy for their study goals within limited resources. Therefore, for an initial study, single-chip experiments for each condition often are considered most beneficial. This is, in fact, not true for several reasons, most notably because replicated experiments have dramatically higher statistical power of discovery.
It is true that, as long as a good quality-control procedure is implemented, the accuracy of gene expression on glass-based chips is extremely high, even compared with laborious Northern blot and other RNA expression techniques (9). Despite such accuracy in individual observations, array data are much more prone to numerous false-positive findings, fundamentally because of (a) the extremely large number of observations and (b) the very wide dynamic range of gene expression values obtained from gene chip experiments (8).
The advantages of replicated chip experiments are illustrated by the following example. Using a 13 000-oligonucleotide chip, a gene expression study on a murine model for type I diabetes has recently been performed to compare the effects of cytokine mediation and treatment with the antiinflammatory agent lisofylline (LSF) in autoimmune diabetes (10). In this experiment, there were four biological conditions for comparison: (a) no treatment, (b) cytokine mediation, (c) LSF treatment, and (d) cytokine mediation and LSF treatment. There are then six pairwise combinations for comparison of the four biological conditions in terms of fold-change findings; an investigator who is not interested in all comparisons may need to investigate only a smaller number of them. Table 1 shows the statistical power and the experimental costs for typical single and replicate experiments in this case.
Table 1. Costs for both array and laboratory confirmation studies in single and replicate chip experiments and the corresponding significance of fold-change discoveries on gene expression data.
If the single- and duplicate-chip experiments in Table 1 are compared, the chip cost for a single-chip experiment is one-half that of a duplicate-chip experiment, but in the single-chip experiment the laboratory confirmation of genes with twofold changes has to be performed on >1250 genes, 730 of which are false positives. We note that false-positive error rates can be observed from duplicate array data obtained under the same biological and experimental conditions; the array study illustrated above was, in fact, done with duplicate chips for each condition, from which these error rates could be evaluated for each comparison. In the duplicate-chip experiment, by contrast, a statistical cross-validation is possible from the replicates, producing a much smaller false-positive error rate (25 of 210 findings) for the follow-up laboratory confirmation.
Another important advantage of replicated experiments is the increased sensitivity of a bioinformatic discovery. For example, in a triplicate-chip experiment (Table 1), we can achieve a statistical power similar to that of a duplicate-chip experiment for the discovery of low-fold-change genes (1.6-fold change), which might be more relevant to the core biological mechanisms of certain diseases than genes with large fold changes. The current standard for array experiments at leading pharmaceutical companies is 5-8 replicates for each condition (11).
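The effect of replication on false-positive calls can be illustrated with a small simulation. The sketch below is not the analysis behind Table 1. It assumes 13 000 genes, none of which truly changes, and independent Gaussian noise on each log2 ratio with a hypothetical standard deviation of 0.5; the function name and all parameter values are illustrative assumptions.

```python
import random


def count_twofold_calls(n_genes=13000, noise_sd=0.5, n_reps=2, seed=1):
    """Count apparent twofold changes among genes that do not change at all.

    Every gene has a true log2 ratio of 0. Each chip adds independent
    Gaussian noise with standard deviation noise_sd (an assumed value).
    """
    rng = random.Random(seed)
    single_hits = 0      # called twofold on the first chip alone
    replicated_hits = 0  # called twofold on every replicate, same direction
    for _ in range(n_genes):
        ratios = [rng.gauss(0.0, noise_sd) for _ in range(n_reps)]
        # |log2 ratio| >= 1 corresponds to a twofold change or more.
        if abs(ratios[0]) >= 1.0:
            single_hits += 1
        if all(r >= 1.0 for r in ratios) or all(r <= -1.0 for r in ratios):
            replicated_hits += 1
    return single_hits, replicated_hits


single, duplicate = count_twofold_calls()
print(f"false twofold calls, single chip: {single}")
print(f"false twofold calls, duplicate chips in agreement: {duplicate}")
```

Because every simulated gene is unchanged, each call counted here is a false positive. Requiring agreement between replicates can only remove calls made on the first chip, so the replicated count never exceeds the single-chip count; how large the reduction is depends entirely on the assumed noise level.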
Myth 2: Chip Experiments Can Be Done without a Statistical Design
As in the above example, many current array applications try to compare expression changes within a small number of biological factors. In these cases, an experimental design is often considered trivial, not requiring any careful assignment of error factors. This may not be true because many instrumental and experimental error sources are inevitably involved in array experiments. For example, in a gene chip experiment there are at least five different sources of variability in gene expression data: (a) the gene, (b) variety (types of sample, treatment, time, and so forth), (c) individual sample preparation, (d) dye, and (e) chip hybridization. Some uninteresting error variances, such as those from sample preparation, dye, and chip hybridization, can be confounded with the biological variability of gene expression that is of interest, which may lead to misleading data interpretation.
To estimate the variability from these different sources separately, one should perform an array experiment that deliberately provides relevant statistical observations, i.e., replicates for each error component. Without an optimal experimental design strategy, this can easily increase the cost of an array experiment unrealistically. This can, however, be circumvented by explicitly factoring out these uninteresting error components and by estimating only the biological variability of gene expression. In other words, careful assignment of the error factors is needed, based on blocking of the uninteresting error (12).
An optimal design for a small array study can be derived relatively easily. For example, suppose one tried to compare a treatment (two treated individuals, I1 and I2) with a control (Ref) at two different time points (T1 and T2). Then the following design can effectively provide the statistical power for the two factors of interest, treatment and time point. As shown in Table 2, treatment and baseline (Ref) samples can be alternately assigned to different dyes within each individual and between the two individuals; likewise, the two time points can be assigned deliberately so that they are not confounded with the dye and individual effects.
Table 2. A cDNA microarray experimental design for two biological factors of interest: treatment [treated (I1, I2) and untreated (Ref) individual RNA samples] and time (T1 and T2).
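Whether a proposed layout keeps the factors of interest free of dye effects can be checked mechanically before any chip is hybridized. The layout below is a hypothetical four-array example written for this sketch, not a reproduction of Table 2; the dye names (Cy3, Cy5), the array numbering, and the helper functions are all assumptions.

```python
from collections import Counter
from itertools import product

# Hypothetical layout: one row per labeled sample.
# Columns: array number, dye, RNA sample, time point.
DESIGN = [
    (1, "Cy3", "I1", "T1"), (1, "Cy5", "Ref", "T1"),
    (2, "Cy3", "Ref", "T1"), (2, "Cy5", "I2", "T1"),
    (3, "Cy3", "Ref", "T2"), (3, "Cy5", "I1", "T2"),
    (4, "Cy3", "I2", "T2"), (4, "Cy5", "Ref", "T2"),
]


def group(sample):
    """Map an RNA sample label to its treatment group."""
    return "reference" if sample == "Ref" else "treated"


def is_balanced(rows, factor, nuisance):
    """True if every factor level occurs equally often with every nuisance level."""
    pairs = [(factor(r), nuisance(r)) for r in rows]
    counts = Counter(pairs)
    factor_levels = {f for f, _ in pairs}
    nuisance_levels = {n for _, n in pairs}
    cells = [counts[(f, n)] for f, n in product(factor_levels, nuisance_levels)]
    return len(set(cells)) == 1


# A deliberately poor alternative: every treated sample labeled with Cy5.
CONFOUNDED = [(a, "Cy3" if s == "Ref" else "Cy5", s, t) for a, _, s, t in DESIGN]

print("treatment vs dye balanced:", is_balanced(DESIGN, lambda r: group(r[2]), lambda r: r[1]))
print("time vs dye balanced:", is_balanced(DESIGN, lambda r: r[3], lambda r: r[1]))
print("confounded layout balanced:", is_balanced(CONFOUNDED, lambda r: group(r[2]), lambda r: r[1]))
```

A balanced cross-tabulation is a necessary check rather than a complete design analysis: it shows only that the dye effect cannot masquerade as a treatment or time effect in this layout.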
Myth 3: Experimental Confirmation Is the Only Way to Validate Findings
Various approaches to analysis have been applied to gene expression data. Most notably, clustering approaches have been found to be efficient in summarizing thousands of gene expression values based on their associations and in identifying patterns of coexpression. The success of this method is attributed to several factors. In particular, a clustering algorithm effectively reduces extremely high-dimensional expression data to a two-dimensional space, organizing the data in a dendrogram. Furthermore, color-coded, side-by-side image maps that enable us to screen thousands of gene expression values simultaneously are useful for discerning interesting patterns of coexpression (1)(3).
However, after obtaining certain findings from these bioinformatic tools, researchers often are tempted to proceed directly to a laboratory confirmation study for all the findings. For reasons similar to those described in Myth 1, without a statistical validation technique, an enormous number of false-positive findings occur. Therefore, to avoid an exhaustive search for real biological findings, a statistical confirmation strategy must be sought beforehand in all data exploration and analysis approaches. The statistical significance of each finding can be evaluated as long as a valid set of replicates is available, although different statistical inference strategies may be required for various array study goals. For example, validation by statistical resampling can eliminate many uninteresting expression patterns or clusters that can be observed by random chance in a single set of large screening data (13). While simultaneously reducing the false-positive and false-negative error rates of these statistical explorations, one must discover and validate important observations from array data with high statistical significance for further biological investigation (14)(15).
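Resampling-based validation can be sketched with a permutation test. The example below is a simpler stand-in for the bootstrap approach to clusters cited above (13): it asks whether the correlation between two expression profiles is larger than would be expected by chance when one profile is randomly shuffled. The expression values, function names, and number of permutations are hypothetical.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Share of random shuffles of y whose |r| is at least the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(x, shuffled)) >= observed:
            extreme += 1
    # Adding 1 to both counts keeps the estimate away from exactly zero.
    return (extreme + 1) / (n_perm + 1)


# Hypothetical log expression values for three genes across eight samples.
gene_a = [0.1, 0.8, 1.5, 2.1, 2.9, 3.6, 4.2, 5.0]
gene_b = [0.3, 0.9, 1.4, 2.3, 2.7, 3.9, 4.0, 5.2]
gene_c = [2.0, 0.5, 3.1, 1.2, 2.8, 0.9, 2.2, 1.7]

p_ab = permutation_p_value(gene_a, gene_b)
p_ac = permutation_p_value(gene_a, gene_c)
print(f"gene_a vs gene_b: permutation p = {p_ab:.4f}")
print(f"gene_a vs gene_c: permutation p = {p_ac:.4f}")
```

A coexpression pattern that shuffled data reproduce frequently is a weak candidate for laboratory follow-up. When many gene pairs or clusters are screened, the resulting p-values still need adjustment for the number of comparisons.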
Expectations of deciphering many complex biological mechanisms in humans and other living organisms are rising with advances in modern biotechnology. Many researchers in genome sciences realize that the bottleneck in bioinformatic investigation is not in the biotechnology itself, but mainly in the computational and statistical exploration of enormous genome-wide data. An intensive computational approach, searching all possible combinatorial cases, often fails in these kinds of genome-wide investigations, simply because of the seemingly infinite possibilities in biology (5). Therefore, careful statistical data exploration, incorporating both biological and mathematical information, is needed to rigorously investigate high-throughput gene expression data. The above considerations in gene chip studies indicate that replicated chip experiments are extremely useful not only for confirming reproducibility but also for bioinformatic discovery. The ideal replicates come from independent sample collections and RNA preparations, which ensure that our findings are reproducible in the presence of biological variability.
There are certain cases in which a replicated array experiment is not possible. In those cases, researchers need to identify the biological factor with the highest variability and the biological conditions that can serve as a control or reference sample, if possible with other experimental techniques. These will play an important role in discovering and validating novel bioinformatic findings. Note that the cDNA microarray approach, a two-color chip hybridization technology, has a great advantage in that a reference sample can be placed on each chip with the test sample. However, this requires more careful experimental design than the one-color chip technology because independent sources of error may arise from the presence of two or more RNA samples on each chip. In current genome studies, it is apparent that the challenges of gene array data pertain to both the biological and the statistical sciences, with the difficulties of the latter often described as the pitfalls of "multiple comparisons" and the "curse of high dimensionality" (16).
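The scale of the multiple-comparisons problem is easy to work out. The arithmetic below uses the 13 000-gene chip size from the example in Myth 1; the per-gene significance level of 0.05, the function names, and the assumption that the tests are independent are illustrative choices, not values taken from the studies cited.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of false positives when no gene truly changes."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests, alpha):
    """Family-wise error rate, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha):
    """Per-test threshold that holds the family-wise error rate at alpha."""
    return alpha / n_tests


n_genes = 13000  # chip size from the example in Myth 1
alpha = 0.05     # assumed per-gene significance level

print(f"expected false positives: {expected_false_positives(n_genes, alpha):.0f}")
print(f"P(at least one false positive): {prob_at_least_one_false_positive(n_genes, alpha):.6f}")
print(f"Bonferroni per-gene threshold: {bonferroni_threshold(n_genes, alpha):.2e}")
```

Under these assumptions, a per-gene threshold that would be reasonable for a single test yields hundreds of chance findings across a whole chip, which is why a statistical confirmation strategy has to be settled before the data are explored. The Bonferroni correction shown here is only one of several possible adjustments and is known to be conservative.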
References
1. Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A 1998;95:14863-14868.
2. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999;286:531-537.
3. Scherf U, Ross DT, Waltham M, Smith LH, Lee JK, Kohn KW, et al. A cDNA microarray gene expression database for the molecular pharmacology of cancer. Nat Genet 2000;24:236-244.
4. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, et al. The sequence of the human genome. Science 2001;291:1304-1351.
5. Bittner M, Meltzer P, Trent J. Data analysis and integration: of steps and arrows. Nat Genet 1999;22:213-215.
6. Chen Y, Dougherty ER, Bittner ML. Ratio-based decisions and the quantitative analysis of cDNA microarray images. J Biomed Optics 1997;2:364-374.
7. Newton MA, Kendziorski CM, Richmond CS, Blattner FR, Tsui KW. On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. J Comput Biol 2001;8:37-52.
8. Bassett DE, Eisen MB, Boguski MS. Gene expression informatics - it's all in your mine. Nat Genet 1999;21:51-55.
9. Schena M, Shalon D, Davis RW, Brown PO. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 1995;270:467-470.
10. Chen M, Wu R, Wilmot B, Yang Z, Lee JK, Fox JW, Nadler JL. Cytokine- and lysofylline-induced gene profile in insulin-secreting β-TC3 cells using GeneChip technology. Diabetes 2001; in press.
11. Tatsuoka K, Clark S, Ruan J, Gruben D, Brawner M, Knowlton R, et al. High throughput quality control and statistical analysis for microarrays [Presentation]. IPAM Functional Genomics Workshop, October 11-15, 2000, UCLA, Los Angeles, CA.
12. Milliken GA, Johnson DE. Analysis of messy data, Vol. I. New York: Van Nostrand Reinhold, 1984:473pp.
13. Kerr K, Churchill G. Bootstrapping cluster analysis: assessing the reliability of conclusions from microarray experiments [Technical Report]. Bar Harbor, ME: The Jackson Laboratory, 2001.
14. Searle SR, Casella G, McCulloch CE. Variance components. New York: John Wiley & Sons, 1992:501pp.
15. Lee MT, Kuo FC, Whitmore GA, Sklar J. Importance of replication in microarray expression studies: statistical methods and evidence from repetitive cDNA hybridizations. Proc Natl Acad Sci U S A 2000;97:9834-9839.
16. Weinstein JN. Fishing expeditions. Science 1998;282:628-629.