
Help Documentation
Help Contents
1. Development of the PRI Project
2. Tutorial: Search for PRIs with selected transcription factor binding site(s)
3. Tutorial: Search for PRIs within a gene of interest
4. Search for PRIs within user-defined sequence sets
1. Development of the PRI Project
We began with the hypothesis that the binding sites responsible for regulating transcription of a gene are conserved in order and spacing across mammalian genomes. To build the algorithm, we used the genomes of human (May 2004 assembly), mouse (March 2005 assembly), and rat (June 2003 assembly). To identify putative transcription factor binding sites, we used an in-house curated version of the transcription factor database (TFD). The main difference between the curated version and the public version is that we removed much of the redundancy, which could otherwise affect the robustness of the algorithm. Starting with the genome database and the curated transcription factor database, we located all potential binding sites in the range of -20 kb to +20 kb centered on the start codon of each annotated gene in each of the mammalian genomes. The sequences extracted for this purpose are in the same orientation as the gene feature. For a binding site to be flagged as a member of a regulatory island, it has to be conserved across all of the mammalian genomes tested; binding sites that are not conserved are discarded. To enforce the order and spacing between binding sites, we calculated the difference between the relative positions of each binding site in the human and mouse genomes or the human and rat genomes, with each position measured with respect to the translational start site in the corresponding genome. We termed this set of numbers the genomic position offset. Each binding site that is conserved across the mammalian genomes is labeled with its genomic position offset. In the last step, the binding sites that share the same genomic position offset are grouped together, and the group is termed a Pattern-defined Regulatory Island (PRI).
It should be stressed that an island is assembled from the binding sites that share the same genomic position offset. The spacing between sites must therefore be conserved across genomes, but the length of the spacers themselves is not constrained. Any observed clustering of binding sites is thus a natural phenomenon and not an artificial imposition of the algorithm.
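The grouping step described above can be illustrated with a minimal Python sketch. It is not the code behind the PRI database: the function name, the input layout (site name mapped to positions relative to the translational start site), and the reading of the genomic position offset as the pair (human minus mouse, human minus rat) are assumptions made for illustration, and the site names and positions are invented.

```python
from collections import defaultdict
from itertools import product


def group_sites_by_offset(human, mouse, rat):
    """Group conserved sites that share a genomic position offset.

    Each argument maps a binding-site name to a list of positions,
    measured relative to the translational start site of the gene.
    (Hypothetical data layout, chosen for this sketch.)
    """
    islands = defaultdict(list)
    # Keep only sites found in all three genomes; the rest are discarded.
    conserved = set(human) & set(mouse) & set(rat)
    for name in sorted(conserved):
        for h, m, r in product(human[name], mouse[name], rat[name]):
            # Assumed form of the offset: (human - mouse, human - rat).
            offset = (h - m, h - r)
            islands[offset].append((name, h))
    # Order the sites in each group by their human position.
    return {k: sorted(v, key=lambda s: s[1]) for k, v in islands.items()}


# Invented toy positions, not real genomic coordinates.
human = {"GATA-1": [-1200], "SP1": [-1150], "AP-1": [-900], "NF-kB": [300]}
mouse = {"GATA-1": [-1180], "SP1": [-1130], "AP-1": [-880], "NF-kB": [100]}
rat = {"GATA-1": [-1210], "SP1": [-1160], "AP-1": [-910]}

islands = group_sites_by_offset(human, mouse, rat)
print(islands)
```

In this toy input, the three sites present in all genomes share the offset (-20, 10) and fall into one group, while NF-kB is dropped because it has no rat counterpart. The distances between neighbouring sites (50 and 250 bases) differ, which reflects the point that spacer length is not constrained.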
We applied this algorithm to every gene in the genome for which orthologs have been identified in all three species (as defined by NCBI Homologene). At this point it was not clear how many binding sites are necessary to constitute a real regulatory region. To identify the minimum number of binding sites required in a PRI, we performed a computational false positive analysis. We first asked how many binding sites can be found conserved in order and spacing by chance across three random sequences using our algorithm. To do so, we selected the 1000 genes with the largest PRIs (those with the greatest number of binding sites) in our database and extracted the -20 kb to +20 kb region centered on the start codon of each gene in each genome. Each of these sequences was then randomized while holding its GC distribution constant, and the randomized sequences were searched for PRIs using our algorithm. We performed this analysis 100 times on each of the genes, resulting in 100,000 iterations. We then compared the number of binding sites in PRIs found in the natural sequences with that in the scrambled sequences. In the natural sequences, we observed many PRIs that contain a very large number of binding sites. In the scrambled sequences, we did not observe any PRI containing more than 9 binding sites, even though there were 100 times more sequences. From this comparison, we found that 99th percentile confidence is achieved when a PRI contains at least 7 binding sites. This is simply a computational index, and it does not mean that PRIs with fewer than 7 binding sites are unlikely to be functional; however, the chances that an island with fewer than 7 binding sites reflects redundancy in the transcription factor database are higher. This website returns results for all islands with 5 or more binding sites conserved in order and spacing, but those with fewer than 7 should be approached with more caution.
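The logic of this false positive analysis can be sketched in Python as follows. This is only an illustration under stated assumptions: the randomization here is a simple permutation of the bases, which keeps the overall base composition (and therefore GC content) fixed but may be simpler than the procedure actually used; the function names are hypothetical; and the null island sizes are made-up toy values, not data from the analysis.

```python
import random


def shuffle_preserving_composition(seq, rng):
    """Permute the bases of seq; base composition (and GC content) is unchanged."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def gc_fraction(seq):
    """Fraction of bases in seq that are G or C."""
    return sum(b in "GC" for b in seq) / len(seq)


def size_threshold(null_sizes, confidence=0.99):
    """Smallest island size s such that at least `confidence` of the
    islands found in randomized sequences are smaller than s."""
    n = len(null_sizes)
    for s in range(min(null_sizes), max(null_sizes) + 2):
        if sum(x < s for x in null_sizes) / n >= confidence:
            return s


rng = random.Random(0)
original = "ATGCGCGATATTTACG"
scrambled = shuffle_preserving_composition(original, rng)
print(gc_fraction(original) == gc_fraction(scrambled))  # True

# Made-up toy distribution of largest island sizes in scrambled sequences.
toy_null = [3] * 60 + [4] * 25 + [5] * 10 + [6] * 4 + [9]
print(size_threshold(toy_null))  # 7
```

In a full analysis, each scrambled sequence set would be passed to the PRI search itself, and the size of the largest island found in each iteration would populate the null distribution.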
All islands explored in our publication had at least 7 binding sites, and as these serve as the validation of the algorithm, we define a PRI to be a region in the genome in which at least 7 distinct binding sites are conserved in order and spacing across the mammalian genomes. We then preprocessed every orthologous gene in the three mammalian genomes, as defined by NCBI Homologene, and assembled the PRI database. The PRI website was created to search this database in a variety of ways. For more information on the development and use of the PRI database, please see our work in (*citation*).
2. Search for PRIs with selected transcription factor binding site(s)
This search is intended to answer questions regarding what genes harbor islands
containing binding sites for a selected transcription factor, or combination
of transcription factors, or what other transcription factor binding sites co-occur
in islands with the selected binding site(s). To use this tool, first select
the radio button next to “Search for PRIs with selected transcription
factor binding site(s).” Then, click on the name of a transcription factor
in which you are interested or hold down “Ctrl” to select more than
one transcription factor from the list provided. Click “Submit search.”
The results page is organized as a table with columns for “Island #”,
“Homolo ID”, “Taxon ID”, “Gene Name”, “Gene
ID”, “Binding Site Name”, “TF Name”, “Binding
Site Sequence”, “Chromosome”, “Location”, and
“Is in coding region?” (Figure 1). By default, each PRI is grouped
together (sorted by “Homolo ID”), but you can sort the table with
respect to any of the columns by clicking on the column name (e.g. “Taxon
ID”). For further analysis, this table can easily be copied and pasted
into a spreadsheet program such as Microsoft Excel.
Figure 1:

3. Search for PRIs within a gene of interest
This search is intended to answer questions regarding what PRIs are associated
with a gene of interest and how the PRIs are organized along the chromosome.
To utilize this tool to its full potential, the user should install the SVG
viewer (link here to download?). To begin with, select the radio button next
to “Search for PRIs within a gene of interest.” Then, type the gene
name (as defined by NCBI Gene, aliases accepted) into the field below the text,
and click “Submit search.” The results page initially shows radio
buttons for all PRIs discovered within the 40 kilobase search region and a graphic
display (Figure 2) representing the subset of this 40 kilobase region
that is covered by PRIs (numbering is with respect to the annotated translational
start site).
Figure 2:

If you wish to dissect an individual PRI to learn more about the specific binding sites that comprise that PRI, click on either the radio button corresponding to the island number at the top or on the graphic representing the island. When an island is clicked on, a zoomed-in version that is interactive will appear below (Figure 3).
Figure 3:

You can hover over individual binding sites to bring up a window with more information about that binding site (including binding site name, consensus sequence, and genomic location) (Figure 4). To reveal any other binding sites that overlap a given site, click on that binding site. To view the data in a tabular format similar to the results returned by a transcription factor binding site-oriented search (see Topic 2 above), click on “View data in tabular format.” The resulting table is organized with columns for “Island #”, “Homolo ID”, “Taxon ID”, “Gene Name”, “Gene ID”, “Binding Site Name”, “TF Name”, “Binding Site Sequence”, “Chromosome”, “Location”, and “Is in coding region?”. By default, each PRI is grouped together (sorted by “Homolo ID”), but you can sort the table with respect to any of the columns by clicking on the column name (e.g. “Taxon ID”). For further analysis, this table can easily be copied and pasted into a spreadsheet program such as Microsoft Excel.
Figure 4:
4. Search for PRIs within user-defined sequence sets
This search is intended to answer questions about what types of regions conforming to the PRI hypothesis can be found within user-defined sets of two or more sequences. One defining feature of this tool is that the user can search any sequence from any genome, or from regions of the human, mouse, and/or rat genomes that were not covered by the initial -20 kb to +20 kb extraction of Homologene sequences. To use this search, paste two or more sequences in FASTA format into the “Input Sequences” box (Figure 5). Then define the minimum PRI size and the minimum number of distinct binding sites (the default is 7; please read the manuscript for details). For example, a region with 7 conserved GATA-1 binding sites would not be counted as an island under the default threshold: although it meets the minimum total number of binding sites, it contains only 1 distinct binding site. Clicking “Submit Search” then submits the sequences to be searched with the PRI algorithm. The output is in SVG format, similar to that of the gene-centric search tool (see “3. Search for PRIs within a gene of interest”).
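The difference between the total and the distinct number of binding sites can be made concrete with a short Python sketch. The function and parameter names are hypothetical and do not correspond to the names of fields on the search form; only the default of 7 comes from the text above.

```python
def passes_thresholds(site_names, min_total=7, min_distinct=7):
    """Return True if a candidate island meets both thresholds.

    site_names lists the binding-site name of every conserved site in
    the candidate island, with repeats kept.
    """
    total = len(site_names)
    distinct = len(set(site_names))
    return total >= min_total and distinct >= min_distinct


# Seven conserved GATA-1 sites: 7 sites in total, but only 1 distinct site.
repeated = ["GATA-1"] * 7
print(passes_thresholds(repeated))  # False

# Lowering the distinct-site threshold would let the same region through.
print(passes_thresholds(repeated, min_distinct=1))  # True
```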
For further questions regarding this help documentation, please email Tom.Cheung@colorado.edu.
| Copyright © 2006 University of Colorado, Boulder. All Rights Reserved. |