Madcow Tutorial for MicroArray Data Correlation
http://cardioserve.nantes.inserm.fr/ptf-puce/
Table of Contents
Introduction
Chapter I: data processing and databases
  1/ Data source
  2/ Filtering and normalization
  3/ Calculation of distances
  4/ Validation of coexpression links
  5/ Databases
Chapter II: User’s Guide
  1/ The input interface
  2/ The results interface
  3/ Create a coexpression network
Introduction:
Madcow is a web tool that identifies genes that are coexpressed with a specified gene in multiple microarray data sets. Most existing studies have shown that confirmation of coexpression in multiple data sets is correlated with functional relatedness. Madcow queries a coexpression database, with filtering of experiments and several levels of significance. Results can be filtered, compared, and annotated by identifying statistically overrepresented Gene Ontology terms. The user may also visualize a coexpression network built from the results by using the Cytoscape tool.
Chapter I: data processing and databases
1/ Data source:
Raw data come from GEO (Gene Expression Omnibus), a public repository of microarray data sets (available at www.ncbi.nlm.nih.gov/geo). We have data from:
- GDS (GEO DataSets) that contain raw data of samples assembled into biologically meaningful and comparable data sets.
- GPL (GEO Platforms) that describe the list of elements on the array.
2/ Filtering and normalization:
DataSets in GEO have one of two value types (count or log ratio), depending on the platform used. Before any analysis is conducted, genes with low signal intensity (this filter is not applied to log ratio data), missing values, and genes exhibiting little variation across the collection of arrays are excluded. Data are then adjusted by median-centering the genes and applying a log transform (except for log ratio data).
3/ Calculation of distances:
We have selected only GDS containing a minimum of 10 samples to calculate similarity scores for all pairs of genes using Pearson’s correlation:
$$ r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \, \sum (Y - \bar{Y})^2}} $$
Comparisons between genes involving fewer than five data points because of missing values have been discarded. Pearson's correlation reflects the degree of linear relationship between two variables (here, the expression profiles of two genes). It ranges from +1 (perfect positive correlation) to -1 (perfect negative correlation).
Expression profile of two genes across 6 samples. Red indicates a positive correlation and black indicates a negative correlation.
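The calculation described in this section can be illustrated with a minimal Python sketch. It is not the Madcow implementation: the function name `pearson`, the use of `None` for a missing value, and the toy expression profiles are assumptions made for illustration. Only the formula and the minimum of five paired data points come from the text above.

```python
import math


def pearson(x, y, min_points=5):
    """Pearson's r over the samples where both genes have a value.

    None marks a missing value (an assumption of this sketch).
    Returns None when fewer than min_points paired values remain,
    or when one of the profiles is constant.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if len(pairs) < min_points:
        return None
    mean_x = sum(a for a, _ in pairs) / len(pairs)
    mean_y = sum(b for _, b in pairs) / len(pairs)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


# Hypothetical expression profiles across 6 samples.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
gene_b = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]     # rises with gene_a
gene_c = [6.0, 5.0, None, 3.0, 2.0, 1.0]      # falls as gene_a rises
gene_d = [1.0, None, None, 4.0, None, 6.0]    # only 3 usable points

r_positive = pearson(gene_a, gene_b)
r_negative = pearson(gene_a, gene_c)
r_discarded = pearson(gene_a, gene_d)
print(r_positive, r_negative, r_discarded)
```

The third comparison is discarded because only three samples have a value for both genes.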
4/ Validation of coexpression links:
To validate the distance between two genes in one data set, the bootstrap method is applied. After filtering and normalization, the data are shuffled and the similarity scores are recalculated. These scores are sorted in ascending order, and the thresholds are positioned on the resulting distribution.
Example: for GDS184, the 0.001% threshold for positive correlation corresponds to a Pearson’s coefficient of 0.852. This means that a Pearson’s correlation of at least 0.852 between two genes has one chance in 100,000 of arising by chance in GDS184.
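The thresholding step can be sketched as follows. This is only one possible reading of the procedure: the sketch assumes that "shuffling" means permuting each gene's values across samples, and the names `null_threshold`, `rounds` and `seed`, as well as the random toy data, are hypothetical. The exact resampling scheme used by Madcow is not specified in this tutorial.

```python
import math
import random


def pearson(x, y):
    """Pearson's r for two equal-length profiles without missing values."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def null_threshold(profiles, alpha, rounds=50, seed=0):
    """Correlation reached by only a fraction alpha of shuffled gene pairs."""
    rng = random.Random(seed)
    null_scores = []
    for _ in range(rounds):
        # Shuffle each gene's values across samples (assumed scheme)
        # so that any real coexpression is destroyed.
        shuffled = [rng.sample(p, len(p)) for p in profiles]
        for i in range(len(shuffled)):
            for j in range(i + 1, len(shuffled)):
                null_scores.append(pearson(shuffled[i], shuffled[j]))
    null_scores.sort()  # ascending order, as in the text
    k = max(1, int(alpha * len(null_scores)))
    return null_scores[-k]  # k-th largest score of the shuffled data


# Hypothetical data set: 30 genes x 10 samples of random values.
data_rng = random.Random(1)
toy = [[data_rng.gauss(0, 1) for _ in range(10)] for _ in range(30)]

t_5pct = null_threshold(toy, alpha=0.05)
t_1pct = null_threshold(toy, alpha=0.01)
print(t_5pct, t_1pct)
```

A stricter threshold (smaller alpha) gives a higher correlation cutoff. Reaching a level as strict as 0.001% requires at least 100,000 shuffled scores, far more than this toy example generates.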
5/ Databases:
The Madcow tool uses 3 databases:
- Madcow database, which contains significant coexpression links, thresholds associated with GDS, and information on GDS.
- Madgene database, which contains the links between disparate IDs for the same gene (symbol, accession number, clone ID, cluster unigene ID, synonym, GPL ID) and the links between orthologous genes. Orthologs are genes in different species that evolved from a common ancestral gene by speciation. Normally, orthologs retain the same function in the course of evolution.
- Gene Ontology (www.geneontology.org) database, which contains a dynamic, controlled vocabulary that can be applied to all eukaryotes even as knowledge of gene and protein roles in cells is accumulating and changing.
Chapter II: User’s Guide
1/ The input interface:
Figure 1- Input window
The input interface is composed of 5 areas.
The first time you use Madcow, you must create an account (“Register”) with a valid email address. Madcow sends your login name and password by email. Afterwards, you can change them by clicking on “Your Account”.
The first area (Fig. 1) contains the fields where the identifiers for the genes of interest are entered. Supported ID types include gene symbols, synonyms, clone IDs, GenBank accession numbers and Unigene cluster IDs. The species of the gene identifier depends on the options selected in the second area. If you choose to search for coexpressions across all species, a new list appears from which you choose the species of the gene identifier; otherwise, the species of the gene identifier is the same as the study species.
Madcow gives you the option of entering a single identifier or a batch file (a text file with one identifier per line, maximum of 20 genes). “Search genes by GO” allows you to enter a Gene Ontology identifier (example: “GO:0006350” for “transcription”). All selected genes (symbols) are written to a batch file that you can download.
The second area concerns the search options: the threshold, the type of correlation (positive by default) and the study species (Fig. 1). The significance of each correlation is calculated by bootstrap. The threshold is the probability that two genes appear coexpressed in a data set by chance. Madcow contains both positive and negative correlations between genes. You can select one type of correlation or both.
The third area concerns the options for microarray experiments (Fig. 1). In the Madcow database, datasets have at least 10 samples (microarrays). You can filter the experiments used to calculate coexpression links by the minimum number of samples in the dataset or by a keyword present in the description of the datasets. When filtering experiments by keyword, the interface asks you to choose from a list of experiments before searching for the coexpression neighbors. All selected experiments will be used.
The fourth area (Fig. 1) is optional. It allows you to append comments to the current analysis, to help distinguish the different analyses in your personal space.
The fifth area has options to start the analysis (‘send’) or clear the input interface (‘clear’) (Fig. 1). You will receive an alert by email when your results are ready to view in your personal space.
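Before uploading a batch file, it can be useful to check that it respects the format described above (one identifier per line, at most 20 genes). The following is a minimal local sketch, not part of Madcow: the function name `read_batch` and the example gene symbols are hypothetical, and skipping blank lines is an assumption of this sketch.

```python
MAX_GENES = 20  # limit stated in this tutorial


def read_batch(text):
    """Return the identifiers of a batch file, one per non-empty line."""
    identifiers = [line.strip() for line in text.splitlines() if line.strip()]
    if len(identifiers) > MAX_GENES:
        raise ValueError(
            f"{len(identifiers)} identifiers found, maximum is {MAX_GENES}"
        )
    return identifiers


# Hypothetical batch file content with three gene symbols.
batch_text = "MYH7\nNPPA\n\nACTC1\n"
genes = read_batch(batch_text)
print(genes)
```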
2/ The results interface:
Figure 2- Output window
A user’s account includes space for the storage of results (Fig. 2). You can download results in XML format, delete them, or view them by clicking on the eye icon. In the latter case, another window appears containing the results (Fig. 3). In this window, the results are shown gene by gene. For batch requests, the “previous page” and “next page” links at the top and bottom of the page are used to navigate among the genes of the request. On each page, a summary of the request is displayed at the top (Fig. 3). The official symbol and name of your input identifier are given (Fig. 3), along with the Gene Ontology terms associated with your gene (by clicking on “GO annotation”) and the gene ID card with Madsense.
Figure 3- a result view
Genes that are coexpressed with your input gene are ranked in descending order by the number of distinct experiments in which the pair of genes is coexpressed (“number of links”) (Fig. 3). The list of experiments in which coexpression is found is given, with a link to the description of each experiment (Fig. 3). The same information that is available for the input gene is also given for each neighbor (Fig. 3). Finally, the “GO summary” link allows you to find statistically overrepresented Gene Ontology terms within the list of best neighbors (Fig. 3). You can choose the number of neighbors (with an official symbol) included in the annotation. Results are ranked in ascending order of the p-value calculated by Fisher’s exact test (Fig. 4).
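The idea behind the “GO summary” ranking can be sketched with a one-sided Fisher’s exact test, which for overrepresentation amounts to the upper tail of the hypergeometric distribution. The sketch below is an illustration under assumptions, not the Madcow code: the function name, the background size and the annotation counts are all hypothetical, and the tutorial does not state which background gene set Madcow uses.

```python
from math import comb


def fisher_overrepresentation(k, n, K, N):
    """One-sided Fisher's exact test p-value for overrepresentation.

    k: neighbors annotated with the GO term
    n: neighbors included in the annotation
    K: background genes annotated with the GO term
    N: background genes in total
    """
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


# Hypothetical counts: 50 best neighbors, background of 2000 genes.
n_neighbors, n_background = 50, 2000
term_counts = {
    "GO:0006350": (8, 150),  # term -> (k, K), invented for illustration
    "GO:0006412": (1, 40),
}
p_values = {
    term: fisher_overrepresentation(k, n_neighbors, K, n_background)
    for term, (k, K) in term_counts.items()
}
# Rank terms in ascending order of p-value, as in the results table.
ranked = sorted(p_values, key=p_values.get)
print(ranked)
```

A term seen 8 times where fewer than 4 annotated genes are expected by chance ranks ahead of a term seen once where about one is expected.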
Figure 4- Statistical GO annotation of neighbors
Madcow allows you to re-analyze several results (Fig. 2). First, select the results by ticking their checkboxes. Next, choose "filter", "intersection" or "union":
• "filter by ontology" allows the user to recursively filter the genes by ontology. It offers 2 operating modes: select or cut. With the select option, the neighbors that have an ontology corresponding to the entered GO ID or keyword are kept, and all others are excluded. Conversely, with the cut option, the neighbors that do not correspond to the entered GO ID or keyword are kept. This filter is recursive, since the results of a filtration can be filtered again.
• "intersection" finds, in the selected files, the neighbors that are most often present among the lists of neighbors. It provides a table in which the official symbol, the symbol name and the ontology of the “most popular” genes are displayed.
• "union" concatenates all the selected files. It creates a new file in the user’s space. This can be useful for running the ‘filter’ job on several files.
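The three re-analysis operations can be pictured with a small Python sketch. It assumes that a result is represented as a dictionary mapping each neighbor symbol to its set of GO IDs; this representation, the function names and the example genes are hypothetical, and merging duplicate genes in the union is a simplification of the concatenation described above.

```python
from collections import Counter


def filter_by_go(neighbors, go_id, mode="select"):
    """Keep (select) or drop (cut) the neighbors annotated with go_id."""
    if mode not in ("select", "cut"):
        raise ValueError("mode must be 'select' or 'cut'")
    keep_matching = mode == "select"
    return {
        gene: terms
        for gene, terms in neighbors.items()
        if (go_id in terms) == keep_matching
    }


def intersection(results):
    """Count in how many results each neighbor appears, most frequent first."""
    counts = Counter(gene for result in results for gene in result)
    return counts.most_common()


def union(results):
    """Merge several results into one, pooling the GO IDs of each gene."""
    merged = {}
    for result in results:
        for gene, terms in result.items():
            merged.setdefault(gene, set()).update(terms)
    return merged


# Two hypothetical results with invented annotations.
result_1 = {"GENE_A": {"GO:0006350"}, "GENE_B": {"GO:0008150"}}
result_2 = {"GENE_A": {"GO:0006350"}, "GENE_C": {"GO:0006350"}}

merged = union([result_1, result_2])
selected = filter_by_go(merged, "GO:0006350", mode="select")
removed = filter_by_go(merged, "GO:0006350", mode="cut")
popular = intersection([result_1, result_2])
print(sorted(selected), sorted(removed), popular[0])
```

Because `filter_by_go` returns the same kind of dictionary it receives, its output can be filtered again, which mirrors the recursive behavior of the filter.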
3/ Create a coexpression network:
Figure 5- Create a Cytoscape file
In addition, the user may visualize a network from the results by using the Cytoscape tool (http://www.cytoscape.org/). Cytoscape needs an input file (in .sif format) containing the data. Madcow can create this file and allows the user to download it (cf. Fig. 5). To create a file suitable for Cytoscape, Madcow asks the user to select the number of neighbors considered for each result with the “Number of best links” field (cf. Fig. 5). If the user chooses no neighbors for a particular result, that result is ignored. To create a coexpression network, you must choose the species of the gene identifiers (the network is built with gene symbols). The Madcow site contains links to download Cytoscape and the MCODE plug-in for Cytoscape. MCODE finds clusters (highly interconnected regions) in any network loaded into Cytoscape.
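For readers unfamiliar with the .sif format, each line names a source node, an interaction type and a target node, separated by tabs or spaces. The sketch below shows how such a file could be produced from ranked neighbor lists. It is not the file Madcow generates: the function name `to_sif`, the interaction label `coexp` and the gene symbols are hypothetical.

```python
def to_sif(results, best_links, interaction="coexp"):
    """Build SIF text from ranked neighbor lists.

    results: query gene -> list of neighbor symbols, best first
    best_links: query gene -> number of best neighbors to keep
    A result with no neighbors chosen is ignored.
    """
    lines = []
    for gene, neighbors in results.items():
        keep = best_links.get(gene, 0)
        for neighbor in neighbors[:keep]:
            lines.append(f"{gene}\t{interaction}\t{neighbor}\n")
    return "".join(lines)


# Hypothetical results for two query genes.
results = {
    "GENE_A": ["GENE_B", "GENE_C", "GENE_D"],
    "GENE_E": ["GENE_F"],
}
sif_text = to_sif(results, best_links={"GENE_A": 2, "GENE_E": 0})
print(sif_text, end="")
```

Saving `sif_text` to a file with the `.sif` extension gives a network that can be imported into Cytoscape.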