What is this for?
cREMaG is a database interface for detecting the over-representation of
transcription factor binding sites (TFBSs) in a queried set of co-expressed
genes. If genes are co-expressed, it is highly probable that they are
co-regulated. Analysis of the common properties of their promoters can suggest
the transcription factors responsible for their co-regulation.
In this tutorial we will not explain every function in detail. We will just
walk through the whole analysis process. Detailed help can be accessed by
moving the mouse cursor over the question marks, as shown in Figure 1.
Figure 1.
First, enter your query name in the text field highlighted by the red box. We
will use ‘Tutorial’ as the name of our query.
Figure 2.
Next, select the ID species and the ID type. We will use Mus musculus and
Ensembl ID for the tutorial example.
Figure 3.
Finally, we have to enter our IDs. We have chosen the Egr1, Egr2 and Egr4 genes
for this tutorial as genes controlled by the SRF transcription factor
(Ramanan N, 2005). The cREMaG engine is based on Ensembl IDs, so it is
recommended to use them. The Ensembl IDs for the chosen genes are
ENSMUSG00000038418, ENSMUSG00000037868 and ENSMUSG00000071341. You can fill the
text field by copying these IDs or simply by pressing the ‘Example’ button. We
will focus only on the first transcription start sites by selecting the Most
Distal TSSs option. Now you can press the ‘Next Step’ button.
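If you prepare a longer ID list outside the browser, a quick local check can
catch typos before pasting. The sketch below is not part of cREMaG, and the
helper name split_ids is hypothetical. It only relies on the fact that mouse
Ensembl gene IDs consist of the prefix ENSMUSG followed by 11 digits; how
cREMaG itself parses the text field is not described here.

```python
import re

# Mouse Ensembl gene IDs: the prefix "ENSMUSG" followed by 11 digits.
MOUSE_GENE_ID = re.compile(r"ENSMUSG\d{11}")


def split_ids(text):
    """Split pasted text on commas/whitespace and sort tokens into
    (valid, invalid) lists according to the mouse gene ID pattern.

    Hypothetical helper for local pre-checking only.
    """
    tokens = [t for t in re.split(r"[\s,]+", text.strip()) if t]
    valid = [t for t in tokens if MOUSE_GENE_ID.fullmatch(t)]
    invalid = [t for t in tokens if not MOUSE_GENE_ID.fullmatch(t)]
    return valid, invalid


# The three IDs used in this tutorial.
valid, invalid = split_ids(
    "ENSMUSG00000038418, ENSMUSG00000037868, ENSMUSG00000071341"
)
print("valid:", valid)
print("invalid:", invalid)
```

Anything reported as invalid (for example a gene symbol or a truncated ID) is
worth checking by hand before submitting the query.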
Figure 4.
In Step II we see which of the IDs were identified. For every identified gene
there is detailed information about its transcription start sites. For example,
for the Egr2 gene we see its MGI Gene Symbol, its Ensembl ID and the submitted
ID. We submitted Ensembl IDs, so the Ensembl Gene ID and the Query ID are the
same. Next, in a slightly brighter box, there is the Ensembl ID of the
homologous gene used for phylogenetic footprinting. Phylogenetic footprinting
is a procedure that identifies evolutionarily conserved sequences. It is
performed to reduce noise in the analysis. Finally, the brightest boxes contain
information about the particular transcription start sites (TSSs): the distance
in base pairs from the first TSS (0 for the first TSS itself), the length of
the CpG islands around the TSS, and the Ensembl IDs of the particular
transcript forms. You can select or deselect particular TSSs using the
checkboxes to add them to or remove them from the analysis. Please note that
the second TSS (9 bp) for Egr2 is not selected! We will come back to this in
Step III. Finally, press the ‘Next Step’ button.
Figure 5.
In Step III there are multiple options, each described in a popup window that
appears when you move the mouse over it. We will proceed to the next step using
the default values.
Figure 6.
Step III also repeats the information about the TSSs. The included TSSs (those
selected in Step II) are marked with a yellow line, and the excluded ones with
a red line. After inspecting them, press the ‘Next Step’ button.
Figure 7.
Finally, we get the results page. The Query Info table shows the selected
options. The Your genes table shows the queried genes and the number of TFBSs
in their promoters that met the specified criteria (highlighted by the red
box).
Figure 8.
The JASPAR TFBSs table lists the most over-represented JASPAR matrices in our
query gene set. The most over-represented binding site is SRF, as expected
(Ramanan N, 2005), followed by ELK4, ELK1 and CREB1, which are also true
positive results.
Figure 9.
The JASPAR TFBSs table contains detailed information. ‘IC’ is the Information
Content of the matrix. The higher the IC, the more specific the matrix is. The
‘Genes’ column contains the symbols of the genes that have the binding site
(here, SRF) in their promoters. ‘TFBSs Number’ is the total number of TFBSs
found in the promoters of the queried genes, compared with the number of TFBSs
that would be expected by chance. Next come the fold-difference and the p value
of this difference. ‘Genes Number’ is the number of genes that have the binding
site in their promoters, compared with the expected number of genes to give a
fold-difference and its p value. A result is most likely to be a true positive
when both TFBSs Fold p and Genes Fold p are < 0.01.
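The rule of thumb above can be written down as a small filter. This is only an
illustrative sketch: the function names, the row layout and all numbers are
hypothetical (they are not cREMaG output), and the fold-difference is assumed
here to be the plain ratio of the observed to the expected count, which this
tutorial does not state explicitly.

```python
def fold_difference(observed, expected):
    """Ratio of the observed count to the count expected by chance.

    Assumption: fold-difference = observed / expected.
    """
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return observed / expected


def likely_true_positive(tfbs_fold_p, genes_fold_p, threshold=0.01):
    """Rule of thumb from the tutorial: both p values below 0.01."""
    return tfbs_fold_p < threshold and genes_fold_p < threshold


# Hypothetical rows with made-up numbers, for illustration only.
rows = [
    {"matrix": "MATRIX_A", "tfbs_obs": 12, "tfbs_exp": 1.5,
     "tfbs_p": 0.0004, "genes_p": 0.004},
    {"matrix": "MATRIX_B", "tfbs_obs": 4, "tfbs_exp": 2.0,
     "tfbs_p": 0.03, "genes_p": 0.008},
]

for row in rows:
    fold = fold_difference(row["tfbs_obs"], row["tfbs_exp"])
    keep = likely_true_positive(row["tfbs_p"], row["genes_p"])
    print(row["matrix"], "TFBSs fold:", fold, "passes both thresholds:", keep)
```

In this made-up example only MATRIX_A passes, because MATRIX_B has a TFBSs p
value above 0.01 even though its Genes p value is below it.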
Figure 10.
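As background for the ‘IC’ column, the sketch below computes the information
content of a small count matrix using the common definition: for each position,
2 bits minus the Shannon entropy of the base frequencies, summed over all
positions. It assumes a uniform background and no small-sample correction.
Whether the IC values shown by cREMaG are calculated in exactly this way is not
stated in this tutorial, and the example matrix is made up.

```python
import math


def information_content(count_matrix):
    """Total information content (in bits) of a count matrix.

    count_matrix: list of positions, each a list of counts for A, C, G, T.
    Assumes a uniform background (maximum of 2 bits per position).
    """
    total_ic = 0.0
    for column in count_matrix:
        n = sum(column)
        if n == 0:
            raise ValueError("each position needs at least one count")
        ic = 2.0
        for count in column:
            if count > 0:
                p = count / n
                ic += p * math.log2(p)  # adds a negative term (minus entropy)
        total_ic += ic
    return total_ic


# Made-up three-position matrix: one fully specific position, one position
# split between two bases, and one with no base preference.
example = [
    [10, 0, 0, 0],
    [5, 5, 0, 0],
    [5, 5, 5, 5],
]
print(information_content(example))
```

A position that always shows the same base contributes 2 bits, while a position
with no base preference contributes 0, which is why a higher total IC
corresponds to a more specific matrix.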
You can also visualize your results by ticking the checkboxes of the matrices
of interest and pressing the ‘Visualize’ button. We will select SRF and CREB1
for visualization.
Figure 11.
The visualization screen contains plenty of information that is not as easy to
see in text form. We will not go into detail, but it is clearly visible that
the binding sites for SRF (the orange ones) are located in the highly conserved
CpG islands of the core promoter.
Figure 12.
You can also get detailed information about the binding sites by clicking the
‘Detailed Table’ button. We will use the SRF, ELK1, ELK4 and CREB1
transcription factors to see the detailed information, because they work
together as a transcriptional module.
Figure 13.
The detailed view shows the position, score and conservation level of the
matrices. In this type of view it is possible to identify regulatory modules,
for example ELK1-SRF.
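One simple way to look for such modules in exported positions is to search for
binding sites of two factors that lie close to each other. The sketch below
illustrates the idea only: the function name, the record layout, the 50 bp
window and the positions are all hypothetical, and this is not how cREMaG
itself reports modules.

```python
def co_occurring_sites(sites, factor_a, factor_b, max_gap=50):
    """Return (site_a, site_b) pairs whose positions differ by at most
    max_gap base pairs.

    sites: list of dicts with keys "factor" and "position" (hypothetical
    layout; positions are relative to the TSS).
    """
    a_sites = [s for s in sites if s["factor"] == factor_a]
    b_sites = [s for s in sites if s["factor"] == factor_b]
    return [
        (a, b)
        for a in a_sites
        for b in b_sites
        if abs(a["position"] - b["position"]) <= max_gap
    ]


# Made-up positions for a single promoter, for illustration only.
sites = [
    {"factor": "ELK1", "position": -95},
    {"factor": "SRF", "position": -80},
    {"factor": "SRF", "position": -400},
    {"factor": "CREB1", "position": -60},
]

for elk1, srf in co_occurring_sites(sites, "ELK1", "SRF"):
    print("ELK1 at", elk1["position"], "near SRF at", srf["position"])
```

The window size is an arbitrary choice here. A tighter or looser value changes
which pairs are reported, so it should be chosen with the biology of the
factors in mind.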
Figure 14.
Now you are ready to use cREMaG. If you have any questions, please send an
e-mail to marpiech@if-pan.krakow.pl.