Introduction
Constellation Technologies, a company based at the Rutherford Appleton Laboratory near Oxford, used technology from CERN, Geneva, to carry out a cloud computing pilot project for a large pharmaceutical company. The project was done in collaboration with colleagues from EBI and the University of Cambridge. It showed that cloud computing services delivered a considerable performance improvement (calculations up to 12 times faster). In addition to the main pilot project, there was a “stretch task” to parallelise the open source MAT application used in the main task. This report presents the conclusions of that “stretch” work.
Summary of MAT theory http://liulab.dfci.harvard.edu/MAT/
Model-based Analysis of Tiling-arrays (MAT) is open source software based on the algorithm developed at the X. Shirley Liu Lab, Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Harvard School of Public Health. MAT estimates the baseline probe behaviour from probe sequence characteristics and genome copy number. It performs array normalisation and background adjustment simultaneously, which enables single-chip standardisation with no need for sample normalisation (i.e. T values for probes can be compared directly across a number of chips). The model has 81 parameters in total. For whole-genome tiling arrays, MAT estimates them from 400,000 randomly selected probes rather than from all of the probes on the array, to save memory. Because each whole-genome tiling array contains ~6 million probes, over-fitting is not a concern even for models with a few hundred parameters. Using the fitted model, MAT predicts a baseline intensity for each probe from its sequence and copy number. To estimate the probe variance, MAT divides the probes on the array into “affinity bins”, each containing a few thousand probes with similar baseline intensities (in the original paper: 100 affinity bins of ~3,000 probes each). It estimates the observed sample variance within each affinity bin and uses it as the variance of every probe in the bin.
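The affinity-bin step can be illustrated with a minimal sketch in Python. It is not taken from the MAT source code: the function name `standardise_probes`, its arguments, and the t-value formula (observed intensity minus predicted baseline, divided by the standard deviation of the probe's bin) are assumptions made for illustration. The baseline intensities are assumed to have been predicted already by the fitted 81-parameter model.

```python
from math import sqrt
from statistics import variance


def standardise_probes(observed, baseline, n_bins=100):
    """Hypothetical helper: one t-value per probe using affinity bins.

    observed -- measured (e.g. log-scale) intensity of each probe
    baseline -- model-predicted baseline intensity of each probe
    n_bins   -- number of affinity bins (100 in the original paper)
    """
    if len(observed) != len(baseline):
        raise ValueError("observed and baseline must have the same length")
    n = len(observed)
    if n < 2:
        raise ValueError("need at least two probes to estimate a variance")

    # Rank probes by predicted baseline so that each bin holds probes
    # with similar baseline intensities.
    order = sorted(range(n), key=lambda i: baseline[i])
    n_bins = max(1, min(n_bins, n // 2))  # keep at least two probes per bin

    t_values = [0.0] * n
    for b in range(n_bins):
        members = order[b * n // n_bins:(b + 1) * n // n_bins]
        # Sample variance of the observed intensities within this bin,
        # shared by every probe in the bin.
        sd = sqrt(variance([observed[i] for i in members]))
        for i in members:
            # Assumed standardisation: (observed - baseline) / bin sd.
            t_values[i] = (observed[i] - baseline[i]) / sd if sd > 0 else 0.0
    return t_values
```

Because every probe is standardised against its own bin, the resulting t-values are on a common scale, which is what allows them to be compared directly across chips.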
MAT can detect regions enriched by TF ChIP-chip in three different scenarios: a single sample, multiple replicates, and multiple replicates of both ChIP-chips and controls. For a single sample, a MAT score is calculated from all of the probes within each 600-bp sliding window. With multiple replicates, a MAT score is calculated for each window by pooling all of the probes across all of the replicates. Even though the replicates might have similar trimmed mean T values, having more replicates and more probes in the window gives higher confidence in the prediction. When controls are available, comparing the ChIP samples against them removes any cell-specific variations that are not modelled in MAT and increases the confidence of ChIP region predictions that are only marginally significant in the ChIP-only samples.
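A rough Python sketch of the sliding-window scoring follows. The text above only states that the score is built from trimmed mean T values and that more probes give higher confidence; scaling the trimmed mean by the square root of the probe count is an assumption made here to reflect that, and the names `trimmed_mean` and `window_scores`, the 10% trim fraction, and the choice to centre one window on each probe are likewise illustrative.

```python
from math import sqrt


def trimmed_mean(values, trim=0.1):
    """Mean after dropping the lowest and highest `trim` fraction of values."""
    ordered = sorted(values)
    k = int(len(ordered) * trim)
    kept = ordered[k:len(ordered) - k] or ordered  # never trim everything
    return sum(kept) / len(kept)


def window_scores(positions, t_values, window_bp=600, trim=0.1):
    """Score one window centred on each probe (hypothetical helper).

    positions -- genomic coordinate of each probe on one chromosome
    t_values  -- standardised t-value of each probe
    Returns a list of (centre, score) pairs in genomic order.
    """
    probes = sorted(zip(positions, t_values))
    half = window_bp / 2
    scores = []
    lo = hi = 0
    for centre, _ in probes:
        # Advance the two pointers so probes[lo:hi] covers the window.
        while probes[lo][0] < centre - half:
            lo += 1
        while hi < len(probes) and probes[hi][0] <= centre + half:
            hi += 1
        in_window = [t for _, t in probes[lo:hi]]
        # Assumed score: sqrt(number of probes) * trimmed mean of t-values.
        scores.append((centre, sqrt(len(in_window)) * trimmed_mean(in_window, trim)))
    return scores
```

To pool replicates, the positions and t-values of all replicates can simply be concatenated before calling `window_scores`: the trimmed mean stays similar, while the larger probe count raises the score.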
Parallelised implementation of MAT
For parallel analyses, we re-ordered the probes in genomic order and built a model for each batch of 400,000 probes. The CEL files were split into a number of pieces; the number could vary depending on the number of CPU cores available. The result is a set of “virtual chips” that can be analysed in parallel on multicore machines or on the Grid, thus improving the performance of the system.
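The splitting scheme described above could be sketched as follows in Python. This is an illustration under stated assumptions, not the code used in the project: probes are represented as hypothetical `(chromosome, position, intensity)` tuples rather than read from CEL files, and `analyse_chip` stands in for whatever per-chip model fitting and scoring is run on each piece.

```python
from concurrent.futures import ProcessPoolExecutor


def split_into_virtual_chips(probes, batch_size=400_000):
    """Sort probes into genomic order and cut them into consecutive batches.

    probes -- iterable of (chromosome, position, intensity) tuples
              (an assumed in-memory representation, not the CEL format)
    Note: chromosome names are compared as plain strings here.
    """
    ordered = sorted(probes, key=lambda p: (p[0], p[1]))
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


def analyse_in_parallel(chips, analyse_chip, max_workers=None,
                        executor_cls=ProcessPoolExecutor):
    """Run `analyse_chip` on every virtual chip and keep the results in order.

    analyse_chip -- hypothetical per-chip function; with the default process
                    pool it must be a picklable, module-level function
    max_workers  -- typically the number of CPU cores available
    """
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(analyse_chip, chips))
```

Each virtual chip is self-contained, so the same pieces could equally be dispatched as independent jobs to Grid or cloud nodes instead of local worker processes.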
Conclusions
Our MAT parallelisation solution demonstrated a significant performance improvement. Consultations with the algorithm's inventors and with industry experts suggest that our approach to parallelisation does not affect the statistical validity of the results, though some minor discrepancies with the results from the original version of MAT are possible. In addition to the speed gain, the suggested approach is potentially more accurate in cases of large regional signal bias. This should be tested and validated using several different methods (e.g. on datasets such as the ENCODE spike-in; see Johnson et al., Genome Res, 2008) before parallelised MAT is used in real-life projects.
Last Updated ( Monday, 08 February 2010 15:41 )