Message Board
Subject: FAQ answer is ambiguous about microarray data
Author: Earl F. Glynn
Date: 29/Nov/2005
From existing CAMDA 2006 Data Set FAQ:

Question: Why are there only 177 subjects
that have microarray data while there are
227 subjects with clinical data.

Answer: Details of the microarray data
acquisition can be found in the microarray
paper in the publications zip file.

This answer is still ambiguous: What does "the microarray paper" mean?

- The Whistler 2003 paper describes a microarray study of 23 women chosen from 43 CFS subjects using the Atlas Human 3.8I microarray. There's no mention (that I can find) of the 177 subjects.

- The Whistler 2005 paper describes a study with 5 CFS women and 5 female controls using the Atlas Human 3.8I microarray (same as earlier paper).

- The Nicholson 2004 paper describes a microarray study of 12 individuals using the Human 10K A, B and C microarrays from MWG Biotech.

So the papers describe two kinds of microarrays, but the 177 files all appear to come from the same kind of microarray? [I'll post a separate item on why it's not clear that the data are from the Atlas Human 3.8I microarray.]


Nowhere can I find a description of these 177 microarray files, nor do any of the papers cite a supplementary web page with this information.

The first 8 characters of the microarray data filenames are assumed to match the "ABTID" field in the clinical and blood work datasets.
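
A minimal sketch of that matching step in R, assuming the subject ID really is just the first 8 characters of each file name. The file names and the `abtid` vector below are hypothetical placeholders, not values from the data set.

```r
# Hypothetical file names; only the 8-character prefix convention comes from the post.
files <- c("00000001_a.txt", "00000001_b.txt", "00000002_a.txt", "00000009_a.txt")

# Hypothetical ABTID values standing in for the clinical / blood work data sets.
abtid <- c("00000001", "00000002", "00000003")

# Subject ID = first 8 characters of each file name.
file_id <- substr(files, 1, 8)

# Files whose ID appears in the clinical data, and file IDs that do not.
matched   <- files[file_id %in% abtid]
unmatched <- setdiff(file_id, abtid)

print(matched)
print(unmatched)
```

If the prefix assumption is wrong for some files, they would show up in `unmatched` rather than being silently dropped.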

Starting with the 227 patients with clinical data, I see only 221 with complete blood evaluation data. Of these 221, 156 have a single microarray file and 8 have two files each (8*2=16 files), which seem to be some sort of replicates.

> replicates
[1] "10243501" "20052705" "20129103" "20717901" "20866603" "22803203" "26153406"
[8] "28542303"

In addition there are five arrays with no corresponding clinical data:
> GeneExpressionNoClinical
[1] "22149604" "22340204" "23366103" "24120805" "28293201"

So the 177 gene expression datasets consist of 156 from subjects with complete blood evaluation data and a single file, 16 from the 8 subjects with blood evaluation data and two replicate files each (8*2=16), and 5 without clinical data: 156+16+5=177.
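
The tally can be cross-checked in a few lines of R. This is only a sketch of the arithmetic: the two ID vectors are copied from the output in this post, the count of 156 is taken as given, and the variable names are my own.

```r
# IDs copied from the output shown in this post.
replicates  <- c("10243501", "20052705", "20129103", "20717901",
                 "20866603", "22803203", "26153406", "28542303")
no_clinical <- c("22149604", "22340204", "23366103", "24120805", "28293201")

n_single <- 156  # subjects with blood data and exactly one file (taken as given)

# Total files: single-file subjects + two files per replicated subject
# + one file per ID without clinical data.
total_files <- n_single + 2 * length(replicates) + length(no_clinical)

# Unique 8-character IDs: each replicated subject counts once.
unique_ids <- n_single + length(replicates) + length(no_clinical)

cat("files:", total_files, " unique IDs:", unique_ids, "\n")
```

The second figure agrees with the count of unique 8-character prefixes among the .txt files reported in the R session.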

So if the Whistler 2003 paper only addresses 23 of 43 CFS subjects, where are the details of this microarray data described? Where are the 23 from this study identified by their "ABTID" field?

The values cited above come from this analysis using R:

> # Gene Expression
> txt.list <- list.files(path="U:/camda/2006/gene_expression_data/Gene Expression", pattern="\\.txt$")
> txt.list <- sub("\\.txt$", "", txt.list)
> cat("txt", length(txt.list), "\n" )
txt 177
> txt.unique <- unique(substr(txt.list,1,8))
> print (length(txt.unique))
[1] 169
>
> tif.list <- list.files(path="U:/camda/2006/gene_expression_data/Gene Expression", pattern="\\.tif$")
> tif.list <- sub("\\.tif$", "", tif.list)
> cat("tif", length(tif.list), "\n" )
tif 175
> tif.unique <- unique(substr(tif.list,1,8))
> print (length(tif.unique))
[1] 167
>
> cat(" txt files without tifs\n")
txt files without tifs
> TxtNoTif <- setdiff(substr(txt.list,1,8), substr(tif.list,1,8))
> print(TxtNoTif)
[1] "10860201" "20465002"
>
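> # WithBlood is defined earlier (not shown); presumably the ABTIDs of the 221 subjects with complete blood evaluation data
> GeneExpressionNoClinical <- setdiff(txt.unique, WithBlood)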
> GeneExpressionNoClinical
[1] "22149604" "22340204" "23366103" "24120805" "28293201"
> setdiff(tif.unique, WithBlood)
[1] "22149604" "22340204" "23366103" "24120805" "28293201"
>
> cat(" txt duplicates\n")
txt duplicates
> CountExpression <- table(substr(txt.list,1,8))
> replicates <- names(CountExpression[CountExpression > 1])
> replicates
[1] "10243501" "20052705" "20129103" "20717901" "20866603" "22803203" "26153406"
[8] "28542303"
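
One caveat when reproducing this: `list.files()` and `gsub()` treat their pattern as a regular expression, so an unescaped dot matches any character. A small sketch of the difference, using hypothetical file names (none of these are from the data set):

```r
# Hypothetical file names for illustration only.
files <- c("00000001_a.txt", "00000002_a.tif", "notes_txt.csv")

# Unescaped dot: "." matches any character, so "_txt" in the CSV name also matches.
loose <- grepl(".txt", files)

# Escaped dot, anchored at the end: matches only a real ".txt" extension.
strict <- grepl("\\.txt$", files)

# Strip the extension from the files that truly end in ".txt".
stems <- sub("\\.txt$", "", files[strict])

print(data.frame(files, loose, strict))
print(stems)
```

With a directory that holds only the expected .txt and .tif files the two patterns should select the same files, so the anchored form is mainly a safeguard.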

 