CAMDA 2006 Conference Contest Datasets

This year the CAMDA 2006 challenge dataset comes from the CDC Chronic Fatigue Syndrome Research Group.  The dataset currently contains microarray, proteomics, SNP, and clinical data:

Data        *Subjects  Description                                              Format
microarray  177        Single-channel, gold labelling                           CSV, TIFF
proteomics  60         SELDI, 6 fractions per sample                            CSV, XML
SNPs        50         Single nucleotide polymorphisms                          CSV
clinical    227        Blood profile, urine metabolite, physical, demographic   CSV

*Data on all subjects will be released if it becomes available.

Downloadable data (zip format and gzip/tar format) (ftp.camda.duke.edu/CAMDA06_DATASETS/):

Note: only 9 concurrent downloads are allowed at any one time.  Web browsers tend to open multiple connections when using FTP, so we strongly encourage you to download a single file at a time with a command-line ftp client during off-peak hours (between 6 PM and 8 AM on weekdays, or any time on weekends).
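
The download advice above can be sketched in Python. This is a minimal sketch, not an official CAMDA tool: the host and directory come from the listing above, while the helper names, the use of anonymous login, and the reading of "off-peak" as local time at or after 6 PM or before 8 AM on weekdays (and all day Saturday and Sunday) are assumptions.

```python
from datetime import datetime
from ftplib import FTP

HOST = "ftp.camda.duke.edu"      # from the download listing
DIRECTORY = "CAMDA06_DATASETS"   # from the download listing


def is_off_peak(now: datetime) -> bool:
    """True on weekends, or on weekdays at/after 6 PM or before 8 AM."""
    if now.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return True
    return now.hour >= 18 or now.hour < 8


def download_one(filename: str, host: str = HOST, directory: str = DIRECTORY) -> None:
    """Fetch a single file over a single FTP connection."""
    with FTP(host) as ftp:
        ftp.login()  # anonymous login; an assumption about the server
        ftp.cwd(directory)
        with open(filename, "wb") as out:
            ftp.retrbinary(f"RETR {filename}", out.write)
```

A caller would check `is_off_peak(datetime.now())` and then call `download_one(...)` once per file, waiting for each transfer to finish before starting the next, so that only one connection is open at a time.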

Some people have experienced problems unzipping the files with standard unzip utilities. However, WinZip should work in all cases.

CAMDA 2006 Data Set Policies:

In the past, decoding the data has been part of the challenge.  However, this year's dataset is more complex than any previously used in CAMDA.  We therefore encourage participants to use the public discussion board to ask questions of each other and of the CAMDA staff.  We will try to be responsive to questions without compromising the integrity of the competition.

CAMDA 2006 Data Set FAQ:

Question: What is the chip name for the microarray data?
Answer: The chip is custom printed, so it has no name.

Question: Why are there only 177 subjects with microarray data when there are 227 subjects with clinical data?
Answer: Details of the microarray data acquisition can be found in the microarray paper in the publications zip file.

Question: Is there a description of the proteomics data?
Answer: Yes - see "Researc Plan_CAMDA.doc" in the proteomics_data.zip file.

Question: For the proteomics data, which files are control and which are case?
Answer: See the *.csv files in the root directory of the proteomics_data.zip file.

Question: Is it possible to get the CEL files for the microarray data?
Answer: No.  CEL files are only generated by Affymetrix microarray experiments, whereas the dataset is acquired from glass-slide arrays.

Question: Is there any more information about the microarray data fields, such as Spot labels, ARM Dens - Levels, etc.?
Answer: Yes.  The publications describe the software used for spot identification.  You can then go to the software company's website to get definitions of the column names.

Question: What steps were used in background subtraction of the proteomics data?
Answer: Data in the "No Bkgd" folders are the raw, unprocessed proteomics data.  Data in the "Bkgd Subtracted" folders were processed with the following parameters:

Noise segment from = 0.0
Noise segment from type = 12
Noise segment to = 1.0
Noise segment to type = 11
Noise fit width = 100.0
Noise fit width type = 5
Noise fit at least = 1.0
Noise fit at least type = 13
Noise sigma multiplier = 3.0
Noise number points = 10
Baseline smooth enabled = true
Baseline smooth width = 25
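
For anyone who wants to record these settings alongside their own analysis, the listing can be read into a dictionary. This is a minimal sketch: the `parse_parameters` helper is hypothetical and is not part of the software that produced the data. It assumes only the `name = value` layout shown above and makes no claim about what the numeric "type" codes mean.

```python
PARAMETER_TEXT = """\
Noise segment from = 0.0
Noise segment from type = 12
Noise segment to = 1.0
Noise segment to type = 11
Noise fit width = 100.0
Noise fit width type = 5
Noise fit at least = 1.0
Noise fit at least type = 13
Noise sigma multiplier = 3.0
Noise number points = 10
Baseline smooth enabled = true
Baseline smooth width = 25
"""


def parse_parameters(text):
    """Turn 'name = value' lines into a dict of bool, int, or float values."""
    params = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank or malformed lines
        name, raw = (part.strip() for part in line.split("=", 1))
        if raw in ("true", "false"):
            value = raw == "true"
        else:
            try:
                value = int(raw)
            except ValueError:
                value = float(raw)
        params[name] = value
    return params


PARAMS = parse_parameters(PARAMETER_TEXT)
```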

Question: Could you please explain the meaning of the spectra names in the H50_Proteinchip Info table and how they relate to patient identifiers?
Answer: The sample name refers to the Subject ID, and each serum sample from each subject is run under all conditions.  The spectrum name is the Barcode of the Proteinchip with the well number - this is the unique ID for each spectrum.  For example:

Line 738 = 1 080055198 (Barcode) = 29030401 (Subject ID), who is NF (Sample group: not fatigued).  This sample was run in B (SpotName: its position on the ProteinChip) = F1 (Fraction name: sera were fractionated into F1 to F6) on an H50 chip (Array type) at a laser reading of 180 (Laser low).

So, each subject has several readings:

There are 4 chips
There are 6 fractions
There are 2 laser settings (Low and high)
And each sample is done in duplicate.
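
Taken together, these factors give the number of spectra to expect per subject if every combination was run exactly once. This is a minimal sketch, assuming a full factorial design and reading "4 chips" as four chip types. Only H50 is named in this FAQ, so the other three chip labels are hypothetical placeholders.

```python
from itertools import product

# Only "H50" appears in this FAQ; the remaining labels are placeholders.
CHIPS = ["H50", "chip_2", "chip_3", "chip_4"]
FRACTIONS = [f"F{i}" for i in range(1, 7)]  # F1 to F6
LASER_SETTINGS = ["low", "high"]
REPLICATES = [1, 2]                         # each sample is run in duplicate

# Every (chip, fraction, laser, replicate) combination for one subject.
conditions = list(product(CHIPS, FRACTIONS, LASER_SETTINGS, REPLICATES))
spectra_per_subject = len(conditions)
print(spectra_per_subject)  # 4 * 6 * 2 * 2 = 96
```

The actual number of spectrum files can differ from this count, since some chip/fraction combinations were read at additional laser intensities.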

Question: There are 288 F4 records for H50 High energy in the Excel spreadsheet but only 142 actual spectra files.  What accounts for this?

Answer: A glitch in the spot protocol caused several chip/fraction combinations to be read at different laser intensities.  Thus, some fractions have "extra" spectra.  H50 high energy F4 was not in this group; the spreadsheet does not reflect this, but the number of spectra files does.

Question: Is there a way to connect a given peak in a spectrum with its identified peptide or protein?

Answer: There is only "indirect" information linking a protein to an MZ value.  At the website http://us.expasy.org/tools/tagident.html you can enter the information you have on the protein regarding MZ (and pI, the isoelectric point, which is related to the fraction that the peak comes off in).  You will get several possibilities for the protein ID.  However, the "correct" ID of a protein would involve further experimental work.  SELDI is an explorative tool; it does not give definitive answers on peptide and protein identification.


Last modified on 08/23/2006 12:54:08

© Duke Comprehensive Cancer Center