Abstracts of Oak Ridge Posters
1 Mayo Clinic, Rochester, MN; 2 Invitrogen Corporation, Carlsbad, CA; 3 Roche Diagnostics Asia Pacific Pte Ltd., Singapore
(Address correspondence to this author at: Mayo Clinic, 315 Stabile Building, Rochester, MN 55904; fax 507-266-5193; e-mail klee.eric{at}mayo.edu)
A major research objective of the NIH is to discover novel biomarkers that can improve cancer detection assays. High-throughput genomic and proteomic experiments have identified many biomarker candidates with apparent differential expression, and these have been reported as candidates for novel cancer detection assays. Despite these many discoveries, few biomarkers have been validated and successfully translated into clinical tests, a situation that may be attributable to the extensive and costly experimental evaluation required to fully develop a candidate biomarker into a clinically useful assay. The use of information in addition to disease-state expression data may help to focus biomarker assay development projects more selectively and facilitate successful assay development and translation. We describe a bioinformatics and data mining method for evaluating diagnostic serum biomarker candidates by selecting genes and gene products that possess favorable intrinsic protein localization and tissue expression properties.
The bioinformatics methods we describe are based on the assumption that candidate markers with diagnostic value have been identified previously. We defined diagnostic value as differential up-regulation or high expression of the biomarker in cancer tissue compared with benign tissue. Our experience (unpublished) has shown that detection of differentially down-regulated serum biomarker candidates is problematic because these candidates are usually present at low concentrations, at which it is difficult to detect the absence of signal against the normal tissue background. Regardless of the definition of diagnostic value, the bioinformatics methods we describe are not contingent on discovering or confirming the informative value of the candidate biomarkers for differentiating healthy vs diseased states. These methods are designed to select candidate biomarkers with properties that make them more readily detectable in the biologically complex serum environment, in which secreted or extracellular proteins would be present at higher concentrations than proteins generally found inside cells. Candidate biomarker sets are rapidly screened with in silico algorithms to predict which genes encode proteins that are secreted from the cell and thus are likely to be detectable in serum. Tissue-specific profiles of individual genes, systematically generated from transcriptomic databases, are used to select markers with expression patterns specific for the target diagnostic tissue type or to exclude markers without such expression patterns. Thus, candidate markers with high signal-to-background expression distinguishable in serum are identified and provide a focused list for experimental validation.
The in silico protein localization prediction process uses 4 publicly available (web-based) programs that identify secreted and membrane protein products and differentiate them from products that are localized to other subcellular compartments. The use of multiple prediction methods improves prediction of secretory proteins that undergo signal peptide-mediated cotranslational translocation (CTT proteins) in the endoplasmic reticulum and are then transported to the cell surface for membrane incorporation or secretion (1). We used SignalP 3.0 (2) to specifically predict secretory proteins and TargetP 1.1 (3) to differentiate mitochondrial proteins from secretory proteins. The SignalP 3.0 D-score and the TargetP predictor were combined in a consensus prediction to select proteins for further processing. We then classified the selected CTT proteins as extracellular or membrane proteins with TMHMM 2.0 (transmembrane hidden Markov model) (4), a transmembrane (TM)-domain prediction program. CTT proteins having no predicted TM domains were classified as extracellular, and CTT proteins having 2 or more predicted TM domains were classified as membrane-associated proteins. We analyzed CTT proteins predicted to have a single TM domain with Phobius (5), a combined TM topology and signal-peptide prediction program. This program distinguishes proteins with N-terminal signal anchors (membrane-associated) from proteins with N-terminal signal peptides (extracellular). We combined the secreted protein prediction methods into a batch-processing pipeline for rapid screening of all candidate biomarkers.
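The decision logic described in the preceding paragraph can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the `Predictions` record, its field names, and the rule that SignalP and TargetP must both agree are hypothetical, because the text does not state the D-score cutoff or how the consensus was formed. The sketch assumes each tool has already been run and its output parsed.

```python
from dataclasses import dataclass


@dataclass
class Predictions:
    """Parsed per-protein tool outputs (hypothetical field names)."""
    signalp_positive: bool        # SignalP D-score call; cutoff not given in the text
    targetp_secretory: bool       # True if TargetP assigns the secretory pathway
    tm_domains: int               # number of TM domains predicted by TMHMM
    phobius_signal_peptide: bool  # Phobius call; only consulted when tm_domains == 1


def classify(p: Predictions) -> str:
    # Assumption: "consensus" means both predictors must agree.
    if not (p.signalp_positive and p.targetp_secretory):
        return "non-CTT"
    if p.tm_domains == 0:
        return "extracellular"
    if p.tm_domains >= 2:
        return "membrane"
    # Exactly one predicted TM domain: it may be a misread signal peptide
    # (extracellular) or a true signal anchor (membrane-associated).
    return "extracellular" if p.phobius_signal_peptide else "membrane"


if __name__ == "__main__":
    print(classify(Predictions(True, True, 1, True)))
```

Keeping the tool calls outside `classify` means the same rule can be applied to any batch of parsed results, which matches the batch-screening use described above.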
We evaluated the ab initio secreted protein analysis pipeline on a set of 643 genes identified from the literature as possessing diagnostic value for prostate cancer (6). We performed secreted protein prediction on all National Center for Biotechnology Information (NCBI) Reference Sequence (7) transcript variants associated with the candidates to capture genes encoding multiple protein products with differing localizations. Of the 643 putative biomarkers, 176 (27%) were predicted to encode secreted proteins. To evaluate the accuracy of the prediction methods, we obtained protein records for the candidate genes from the SwissProt protein sequence database (8) and abstracted cellular localization annotations from the comment field. SwissProt entries with cellular localization annotations were identified for 456 (71%) of the 643 putative biomarkers; for 114 genes (18% of the total), at least 1 protein variant was annotated as secreted, extracellular, or soluble. The prediction method and database annotation were concordant for 104 (91%) of these 114 proteins. Of the 10 discordant predictions, 2 proteins were annotated as secreted independently of the CTT pathway and consequently were not detectable by our prediction methods. The 72 biomarkers predicted to be secreted, but not annotated as secreted, consisted of 36 proteins lacking any SwissProt annotation, 24 proteins annotated as membrane proteins, and 12 proteins annotated as other nonmembrane CTT proteins (including Golgi, ER, lysosomal, and glycosylphosphatidylinositol-anchor proteins).
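The counts reported above follow from simple set arithmetic, which the sketch below reproduces. The gene identifiers are placeholders invented for illustration (only the set sizes come from the text), and the `concordance` helper is a hypothetical name, not part of the authors' software.

```python
def concordance(predicted: set, annotated: set) -> dict:
    """Compare genes predicted as secreted with genes annotated as secreted."""
    agree = predicted & annotated
    return {
        "concordant": len(agree),
        "annotated_not_predicted": len(annotated - predicted),
        "predicted_not_annotated": len(predicted - annotated),
        # Fraction of annotated secreted genes recovered by the prediction.
        "fraction_recovered": len(agree) / len(annotated) if annotated else float("nan"),
    }


if __name__ == "__main__":
    # Placeholder IDs sized to match the text: 176 predicted, 114 annotated,
    # 104 in common.
    predicted = {f"gene{i}" for i in range(176)}
    annotated = {f"gene{i}" for i in range(72, 186)}
    result = concordance(predicted, annotated)
    print(result["concordant"], result["annotated_not_predicted"],
          result["predicted_not_annotated"])
    print(f"{result['fraction_recovered']:.0%} of annotated secreted genes recovered")
```

With these sizes, the 104 concordant, 10 discordant, and 72 predicted-only genes in the text are mutually consistent (104 + 10 = 114 and 104 + 72 = 176).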
We used tissue expression profiles to estimate the signal-to-background expression level of a candidate marker and to determine the likelihood that the marker's diagnostic signal will be distinguishable in a serum assay. For candidate markers, we derived the relative expression pattern across different tissue types from a compendium of public transcriptomic databases, including the Cancer Genome Anatomy Project's Serial Analysis of Gene Expression (SAGE) database (9), the Ludwig Institute for Cancer Research's Massively Parallel Signature Sequence (MPSS) database (10), and NCBI's UniGene database (11). The SAGE database contains gene expression data on 22 major tissue types in both nondiseased and cancer tissues. The MPSS database reports gene expression in 32 tissues, with a much higher dynamic range of transcripts per cell than either of the other methods. We used the transcripts-per-million counts reported by these methods to manually construct tissue specificity profiles on a gene-by-gene basis. From these profiles, we can categorize genes as specific or ubiquitous by use of either a binary classification for expression/nonexpression in each tissue type or a numeric classification by percentage of total transcripts in each tissue type, normalized by the tissue-type library size.
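The two classifications mentioned above (binary expressed/not expressed, and percentage of total normalized transcripts per tissue) can be sketched as follows. The function names, the `min_tpm` threshold, and the example counts are all hypothetical; the text does not give the thresholds or the exact normalization the authors used beyond scaling by library size.

```python
def transcripts_per_million(tag_counts: dict, library_sizes: dict) -> dict:
    """Scale raw tag counts by each tissue library's size."""
    return {t: 1e6 * tag_counts.get(t, 0) / size for t, size in library_sizes.items()}


def specificity_profile(tag_counts: dict, library_sizes: dict) -> dict:
    """Fraction of the gene's normalized expression found in each tissue."""
    tpm = transcripts_per_million(tag_counts, library_sizes)
    total = sum(tpm.values())
    if total == 0:
        return {t: 0.0 for t in tpm}
    return {t: v / total for t, v in tpm.items()}


def expressed_tissues(tag_counts: dict, library_sizes: dict, min_tpm: float = 0.0) -> set:
    """Binary call: tissues whose normalized expression exceeds min_tpm (assumed cutoff)."""
    tpm = transcripts_per_million(tag_counts, library_sizes)
    return {t for t, v in tpm.items() if v > min_tpm}


if __name__ == "__main__":
    # Invented counts for a single gene, for illustration only.
    counts = {"prostate": 90, "liver": 5, "brain": 5}
    libraries = {"prostate": 100_000, "liver": 100_000, "brain": 50_000}
    for tissue, share in specificity_profile(counts, libraries).items():
        print(f"{tissue}: {share:.1%}")
```

Normalizing before computing shares matters because a small library inflates the apparent contribution of a few tags; in the invented example, the brain library is half the size of the others, so its 5 tags count twice as much as the liver's 5.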
Within the 643 gene products analyzed by the secreted protein prediction methods, several genes with well-characterized properties can be used as controls for our approach (Table 1). For these genes, we used tissue expression profiles to classify the relative prostate tissue specificity of the gene products. Included in the list of genes are 2 positive controls, KLK3 [kallikrein 3 (prostate specific antigen)] and ACPP (acid phosphatase, prostate), prostate biomarkers that are currently used in clinical testing. We predicted these 2 gene products to be secreted (confirming the SwissProt annotation) and to possess strong expression in prostate tissue and minimal expression in other tissues. AMACR (α-methylacyl-CoA racemase) (12), HPN [hepsin (transmembrane protease, serine 1)] (13), and ZWINT (ZW10 interactor) (14) are all associated with prostate cancer in the scientific literature but are known not to be secreted and were correctly identified as such by our methods. It should be noted that AMACR is used as a prostate tissue immunohistochemical marker; however, the lack of prostate specificity and the intracellular localization of its gene product make this gene a poor serum biomarker. FN1 (fibronectin 1) and VEGF (vascular endothelial growth factor) are associated with prostate cancer and encode secreted proteins; however, these genes lack prostate tissue specificity (15)(16)(17)(18). The lack of prostate tissue specificity in the expression of these 2 genes may be a major reason why they are not yet used clinically as prostate cancer serum biomarkers.
We have developed a bioinformatics protocol for screening candidate serum biomarker sets to identify high-quality markers for experimental evaluation. The in silico secreted protein pipeline provides a rapid screen for identifying biomarkers that are found extracellularly and are likely to be detectable by serum assays. Tissue specificity profiling complements secreted protein prediction by identifying the originating tissue components of a biomarker's serum signal and by allowing investigators to select candidate markers with a higher probability of having distinguishable signals. We hope that the use of intelligent bioinformatics analysis before costly experimental evaluation will accelerate the selection of candidate biomarkers that can be successfully translated into novel, clinically useful assays.
Acknowledgments
This work was supported under a grant by Invitrogen.
References