CAMDA Discussion Board

Message Board
Subject File Author Date
Only ~3800 genes but 20,160 probes on Atlas Human 3.8 I Microarray? Earl F. Glynn 29/Nov/2005
 

Re: Whistler's paper, "Integration of gene expression, clinical, and epidemiologic data to
characterize Chronic Fatigue Syndrome" says this (is this "the microarray paper" from the FAQ?):

"Forty-three CFS subjects were identified …"
"…leaving 23 women for analysis"

But there are 177 microarray datasets. Are the 43 or 23 subjects identified anywhere in the set of 177? Or can we use this paper only as general background, and must we reconnect the clinical data and the microarray data ourselves?

The Atlas Human 3.8I glass microarray was used:

Preparation and hybridization of labelled cDNA

The cDNA probe was hybridized to the Atlas™ Human 3.8I oligonucleotide glass microarrays (CLONTECH Laboratories, Inc., Palo Alto, CA)

Later the paper talks about

"the mean gene expression for each of 3,800 genes"

The value 3,800 is consistent with the number of genes listed in the Clontech Atlas Glass Human 3.8I information here:

http://www.clontech.com/clontech/atlas/genelists/index.shtml

This 3800 number is roughly consistent with the geometry of the slide given in the PDF
http://www.clontech.com/clontech/atlas/genelists/7903-1_HuGlass38I.pdf

The diagram on p. 1 of this PDF shows 12 rows of 4 blocks, with each block being a 9-by-9 grid. This means 12 * 4 * 9 * 9 = 3888 spots.

Since block E3 is all controls, and two spots in each block are used for other purposes (orientation marker and negative control), this means the number of genes should be approx:
(12*4 – 1) * (9*9 – 2) = 3713 genes.
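The arithmetic above can be checked with a few lines of R. This is only a sketch of the calculation as described in this post; the variable names are mine, and the layout numbers (12 rows of 4 blocks, 9-by-9 grid, one all-control block, two reserved spots per block) are taken from the vendor PDF as I read it.

```r
# Layout as read from the vendor PDF: 12 rows of 4 blocks, each a 9-by-9 grid.
block_rows      <- 12
block_cols      <- 4
spots_per_block <- 9 * 9

# Every spot on the slide.
total_spots <- block_rows * block_cols * spots_per_block

# Drop one all-control block (E3) and two reserved spots in each remaining block.
gene_spots <- (block_rows * block_cols - 1) * (spots_per_block - 2)

total_spots
gene_spots
```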

This 3713 value is approximately equal to the 3800 value, so that seems to make sense until one inspects the files in the gene_expression_values.zip. These .txt files all seem to have 20,160 probes. So, how does one reduce the data for 20,160 probes to about 3800 genes?

The following short R script plots the spot locations from the PosX and PosY values (columns 6 and 7) for one of the datasets:

expression <- read.delim("U:/camda/2006/gene_expression_values/10043905-A-03R.txt")
dim(expression)
pdf("AtlasGlassHuman3.8I.pdf", width=8, height=10.5)
plot(expression[,6], expression[,7], type="p", pch=".", xlab="PosX", ylab="PosY",
main="Atlas Human 3.8I Microarray: 10043905-A-03R.txt")
dev.off()

The resulting PDF shows the 12 rows of 4 blocks expected from the Atlas Human 3.8I glass microarray documentation:
http://research.stowers-institute.org/efg/2006/CAMDA/AtlasGlassHuman3.8I.pdf

BUT notice each block has 20 rows and 21 columns, NOT a 9-by-9 block.

This means (12 * 4 blocks) * (20 * 21 spots per block) = 20,160 spots.

This was verified with two other datasets, 21656101A.txt and 29430601A_027A.txt (roughly, first, middle and last from the list of file names). The same number of spots, 20,160, was observed in all three.
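The same check could be run over all 177 files rather than three. The sketch below assumes every file is tab-delimited with one header line, as the three I looked at were; `count_probes` is a hypothetical helper name, and the directory in the commented usage is an assumption about where the zip was unpacked.

```r
# Hypothetical helper: number of data rows in each tab-delimited expression file.
count_probes <- function(files) {
  vapply(files, function(f) nrow(read.delim(f, as.is = TRUE)), integer(1))
}

# Spot count expected if every one of the 12 * 4 blocks is 20 rows by 21 columns.
expected <- 12 * 4 * 20 * 21

# Usage (path is an assumption; adjust to your own copy of the data):
# files  <- list.files("gene_expression_values", pattern = "\\.txt$", full.names = TRUE)
# counts <- count_probes(files)
# table(counts == expected)
```

Any file whose count differs from `expected` would be worth inspecting separately.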

So, why do all the 177 gene expression datasets have 20,160 probes, but the paper only talks about 3800 genes? And the vendor's web page only talks about 3800 genes?

What am I missing? Are 16,000+ of the probes in the expression datasets to be ignored? Should most of the data be ignored and only probes with NM_xxxx GenBank identifiers be used?

The vendor gives a list of the 3757 probes here with additional information for the Atlas Glass Human 3.8I Microarray:
http://www.clontech.com/clontech/atlas/genelists/excel/huglass3.8k.xls
http://www.clontech.com/clontech/atlas/genelists/7903-1_7904-1_HuGlass38.txt

Of the 3757 almost all (3747) have a GenBank Accession Number prefixed with "NM_".

The 177 microarray datasets appear to each have 3102 probes with a prefix of "NM_". Do most of the 3102 match the 3757 since both sets have a prefix of "NM_"?
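For reference, counts like these can be obtained by matching the prefix at the start of each identifier. The sketch below uses a made-up vector of labels; `count_nm` is a hypothetical helper name, and `Spot.labels` and `GenBank.Accession` in the commented usage are the column names R produced when I read the two files.

```r
# Hypothetical helper: how many identifiers start with "NM_".
count_nm <- function(ids) sum(grepl("^NM_", ids))

# Made-up labels for illustration only (not taken from the real files).
ids <- c("NM_000020", "NM_000025", "blank", "X12345", "NM_080923")
count_nm(ids)

# With the real data (objects as read elsewhere in this post):
# count_nm(expression$Spot.labels)
# count_nm(atlas$GenBank.Accession)
```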

To answer this, I created a slightly modified version of 7903-1_7904-1_HuGlass38.txt (from http://www.clontech.com/clontech/atlas/genelists/7903-1_7904-1_HuGlass38.txt) so R will read the file
[delete the first two lines, remove "#" and "(s)" from column names so the first line appears like this:

Gene<T>Coordinate<T>GenBank Accession<T>LocusLink Accession<T>Swiss-Prot Accession<T>Gene Name, where <T> is a tab]
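The hand edits described above could probably be scripted instead, so the vendor file stays untouched. This is a sketch under the assumption that the file has exactly two leading lines to discard and that "#" and "(s)" occur only in the header line; `read_atlas_annotation` is a hypothetical function name, not something from the vendor or CAMDA.

```r
# Hypothetical helper: read the vendor annotation file without editing it by hand.
read_atlas_annotation <- function(path) {
  lines <- readLines(path)
  lines <- lines[-(1:2)]                               # drop the first two lines
  lines[1] <- gsub("#", "", lines[1], fixed = TRUE)    # remove "#" from the header
  lines[1] <- gsub("(s)", "", lines[1], fixed = TRUE)  # remove "(s)" from the header
  read.delim(text = lines, as.is = TRUE)
}

# Usage (file name as downloaded from the vendor site):
# atlas <- read_atlas_annotation("7903-1_7904-1_HuGlass38.txt")
# dim(atlas)
```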

>filename <- "U:/camda/2006/www.clontech.com/clontech/atlas/genelists/7903-1_7904-1_HuGlass38X.txt"
>atlas <- read.delim(filename, as.is=TRUE)
>dim(atlas)
[1] 3757 6

># Look at what matches between "expression" dataset and annotation info about the chip.
>both <- merge(expression, atlas, by.x="Spot.labels", by.y="GenBank.Accession")
>dim(both)
[1] 1156 22

So only 1156 "NM_" prefixed genes match between each of the 177 microarray datasets and the vendor-provided information about the genes (including Swissprot ID, Locus Link Accession – now called Entrez gene, and gene name).
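One caveat on reading the merge result: merge() returns one row per matching pair, so if an accession is spotted more than once on the array, the row count will be larger than the number of distinct shared accessions. I have not checked whether that happens in the real files. The toy example below (made-up tables, not the real data) shows the difference between the two counts.

```r
# Toy stand-ins for the two tables; the accession values are for illustration only.
expression_toy <- data.frame(
  Spot.labels = c("NM_000020", "NM_000020", "NM_000025", "blank"),
  stringsAsFactors = FALSE
)
atlas_toy <- data.frame(
  GenBank.Accession = c("NM_000014", "NM_000020", "NM_000025"),
  stringsAsFactors = FALSE
)

# Row count of the merge: counts a duplicated spot once per occurrence.
merged <- merge(expression_toy, atlas_toy,
                by.x = "Spot.labels", by.y = "GenBank.Accession")
nrow(merged)

# Distinct accessions present in both tables.
shared <- intersect(expression_toy$Spot.labels, atlas_toy$GenBank.Accession)
length(shared)
```

Running the intersect() line on the real `expression` and `atlas` objects would show whether 1156 is a count of spots or of distinct genes.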

What am I missing about the gene expression datasets? Are they really Atlas Human 3.8 I microarrays as documented by the vendor? Am I supposed to ignore 20,160 -1,156 = 19,004 of the probes?

A look at the first 20 and last 21 "NM_" IDs doesn't reveal a pattern (to me) but suggests many mismatches:

From Expression datasets:

>expressionNM[1:20]
[1] "NM_000020" "NM_000025" "NM_000026" "NM_000028" "NM_000030" "NM_000032" "NM_000037"
[8] "NM_000042" "NM_000044" "NM_000047" "NM_000051" "NM_000053" "NM_000057" "NM_000059"
[15] "NM_000060" "NM_000068" "NM_000073" "NM_000076" "NM_000079" "NM_000081"

>expressionNM[ (length(expressionNM)-20) : length(expressionNM) ]
[1] "NM_080718" "NM_080722" "NM_080730" "NM_080731" "NM_080733" "NM_080741" "NM_080758"
[8] "NM_080760" "NM_080797" "NM_080798" "NM_080804" "NM_080808" "NM_080838" "NM_080874"
[15] "NM_080881" "NM_080914" "NM_080915" "NM_080918" "NM_080921" "NM_080922" "NM_080923"


From Atlas Glass Human 3.8I Microarray vendor annotation dataset:

>atlasNM[1:20]
[1] "NM_000014" "NM_000015" "NM_000016" "NM_000017" "NM_000018" "NM_000019" "NM_000020"
[8] "NM_000021" "NM_000022" "NM_000023" "NM_000025" "NM_000026" "NM_000027" "NM_000029"
[15] "NM_000030" "NM_000031" "NM_000032" "NM_000034" "NM_000035" "NM_000036"

>atlasNM[ (length(atlasNM)-20) : length(atlasNM) ]
[1] "NM_007117" "NM_007118" "NM_007168" "NM_007193" "NM_007202" "NM_007203" "NM_007212"
[8] "NM_007231" "NM_007245" "NM_007246" "NM_007249" "NM_007264" "NM_007269" "NM_007294"
[15] "NM_007312" "NM_007315" "NM_007317" "NM_007341" "NM_007353" "NM_007355" "NM_007374"
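One rough way to compare the two ID lists is to look at the numeric part of each accession. The sketch below uses a handful of IDs copied from the listings above; `accession_number` is a hypothetical helper, and comparing accession numbers is only a heuristic I am assuming might be informative, not an established method.

```r
# Hypothetical helper: numeric part of an "NM_" accession, e.g. "NM_000020" -> 20.
accession_number <- function(ids) as.integer(sub("^NM_", "", ids))

# A few IDs copied from the head and tail of each listing above.
expression_ids <- c("NM_000020", "NM_000025", "NM_080922", "NM_080923")
atlas_ids      <- c("NM_000014", "NM_000015", "NM_007355", "NM_007374")

# Smallest and largest accession number in each list.
range(accession_number(expression_ids))
range(accession_number(atlas_ids))

# Fraction of the expression IDs that lie beyond the largest vendor ID.
mean(accession_number(expression_ids) > max(accession_number(atlas_ids)))
```

On the full vectors (`expressionNM` and `atlasNM`), this would show how much of the expression list falls outside the range covered by the vendor annotation.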

More explanation of the gene expression datasets is needed.

Download attachment: AtlasGlassHuman3.8I.R
Re: GAL files from microarrays images   Patrick McConnell 30/Nov/2005
RE: About Microarray   Patrick McConnell 30/Nov/2005
Re: SNP Data Identification   Patrick McConnell 30/Nov/2005
FAQ is wrong about "the microarray paper": No publication describes actual microarray data to be analyzed   Earl F. Glynn 06/Dec/2005
ArrayVision RLS was used to analyze Gene Expression Data   Earl F. Glynn 16/Dec/2005
Proteomic Data: Howto get patient identifiers? Meaning of Spectrum names is not clear.   Peter Bewerunge 28/Dec/2005
New relevant paper: Chronic Fatigue Syndrome - A clinically empirical approach to its definition and study   Earl F. Glynn 09/Jan/2006
Paper that seems to explain SNP Excel worksheets, COMT, CRHR1, ... TH, TPH2   Earl F. Glynn 18/Jan/2006
No notification when SNP download file was changed on 11/7/2005?   Earl F. Glynn 18/Jan/2006

Last modified on 01/28/2004 10:50:38

© Duke Comprehensive Cancer Center