repository of small ORFs identified by ribosome profiling

Guide to default query interface

The repository provides 2 database query interfaces. The default query interface enables quick lookup with limited query possibilities. The BioMart implementation was designed for complex querying and export options.

Figure 1: default query interface

The figure above provides a snapshot of the default query interface and its functionality. The distinct query functions are explained and illustrated below.

Users can choose the number of records shown on each page. By default, 10 records are shown. Clicking the "show entries" drop-down button (Figure 2) displays more sORFs per page.

Figure 2: "show entries" dropdown button

Clicking the Premade filters dropdown button makes additional query options available (see Figure 3). These filters allow searching on transcript sequence, amino-acid sequence or Ensembl gene ID. The user can also choose to show only conserved sORFs (PhastCons > 0.7 and PhyloP > 4) or Bazzini sORFs (in-frame coverage > 0.1 and ORFscore > 6) by checking the corresponding checkboxes. A text box next to the transcript/amino-acid sequence filter accepts an input sequence and displays the sORFs that contain it. Important note: these filters do not perform an exact match; all sORFs containing the input sequence are shown in the output.

Figure 3: premade filters
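The thresholds and the containment search described above can be sketched in a few lines of Python. This is only an illustration of the filter logic, not the site's implementation: the Sorf record, its field names and the example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sorf:
    # Hypothetical record; field names are assumptions made for this sketch.
    sorf_id: str
    aa_seq: str
    phastcons: float
    phylop: float
    in_frame_coverage: float
    orfscore: float


def is_conserved(s: Sorf) -> bool:
    # Conserved filter: PhastCons > 0.7 and PhyloP > 4 (both strict).
    return s.phastcons > 0.7 and s.phylop > 4


def is_bazzini(s: Sorf) -> bool:
    # Bazzini filter: in-frame coverage > 0.1 and ORFscore > 6 (both strict).
    return s.in_frame_coverage > 0.1 and s.orfscore > 6


def contains_sequence(s: Sorf, query: str) -> bool:
    # Sequence filter: a containment test, not an exact match.
    return query in s.aa_seq


# Made-up example records.
records = [
    Sorf("demo:1", "MKTLLAV", 0.9, 4.5, 0.05, 2.0),
    Sorf("demo:2", "MAKTLQ", 0.5, 1.0, 0.4, 7.5),
    Sorf("demo:3", "MSSRPLR", 0.8, 5.0, 0.2, 6.0),
]

conserved_ids = [s.sorf_id for s in records if is_conserved(s)]
bazzini_ids = [s.sorf_id for s in records if is_bazzini(s)]
ktl_ids = [s.sorf_id for s in records if contains_sequence(s, "KTL")]
```

In this sketch demo:3 fails the Bazzini filter because its ORFscore of exactly 6 is not greater than 6, and the query "KTL" returns both demo:1 and demo:2 because each contains it as a substring.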

The table header contains the column names of the attributes shown. Clicking once on a column name sorts the sORFs in ascending order on that column. Clicking the same column name a second time rearranges the sORFs in descending order. The sorting state is indicated by a blue arrow near the column name. This arrow points upwards (ascending order), points downwards (descending order) or is blank (no ordering on this column). By default the database is ordered on sORF ID. In Figure 4, the data has been sorted by length in ascending order.

Figure 4: sorting data based on length
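The click-to-sort behaviour can be modelled as a small state machine. The sketch below is a hypothetical illustration: the row dictionaries and function names are invented, and the assumption that a third click returns to ascending order is not stated in this guide.

```python
def next_sort_state(current):
    # No ordering (None) or descending -> ascending; ascending -> descending.
    # What a third click does is an assumption made for this sketch.
    return "desc" if current == "asc" else "asc"


def sort_rows(rows, column, state):
    # Return a new list ordered on `column`; None leaves the order untouched.
    if state is None:
        return list(rows)
    return sorted(rows, key=lambda r: r[column], reverse=(state == "desc"))


# Made-up example rows.
rows = [
    {"sorf_id": "demo:1", "length": 45},
    {"sorf_id": "demo:2", "length": 12},
    {"sorf_id": "demo:3", "length": 88},
]

state = next_sort_state(None)   # first click on "Length"
ascending = [r["length"] for r in sort_rows(rows, "length", state)]

state = next_sort_state(state)  # second click on "Length"
descending = [r["length"] for r in sort_rows(rows, "length", state)]
```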

Column-specific search boxes are included below the table header. They enable quick filtering on values of that column attribute. There are 2 main types of column-specific search boxes: text boxes, which require text input from the user, and dropdown lists, from which a value is selected. Additionally, the "Begin pos" and "End pos" query boxes allow specifying a genomic window for sORFs. The Begin pos query box shows all sORFs with a start site greater than or equal to the specified value (>=), while the End pos query box shows all sORFs with a stop site smaller than or equal to the specified value (<=). To look for human intergenic sORFs on chromosome 7 located between nucleotides 160000 and 2000000, the column-specific search boxes should be filled in as illustrated in Figure 5. A table with detailed information on the different column attributes is provided below.

Figure 5: column specific search example
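The combined effect of the column-specific search boxes in this example can be expressed as a single predicate. The following Python sketch is illustrative only: the row layout, the key names and the stored values ("human", "7") are assumptions, and the rows are made up.

```python
def matches(row, species=None, chromosome=None, annotation=None,
            begin_pos=None, end_pos=None):
    # Each criterion left as None is ignored, like an empty search box.
    if species is not None and row["species"] != species:
        return False
    if chromosome is not None and row["chr"] != chromosome:
        return False
    if annotation is not None and row["annotation"] != annotation:
        return False
    # Begin pos: start site must be >= the specified value.
    if begin_pos is not None and row["begin"] < begin_pos:
        return False
    # End pos: stop site must be <= the specified value.
    if end_pos is not None and row["end"] > end_pos:
        return False
    return True


def make_row(sorf_id, species, chrom, annotation, begin, end):
    return {"sorf_id": sorf_id, "species": species, "chr": chrom,
            "annotation": annotation, "begin": begin, "end": end}


# Made-up example rows.
rows = [
    make_row("demo:1", "human", "7", "intergenic", 160000, 160150),
    make_row("demo:2", "human", "7", "intergenic", 159990, 160100),
    make_row("demo:3", "human", "7", "intergenic", 1999900, 2000050),
    make_row("demo:4", "human", "7", "exonic", 500000, 500120),
    make_row("demo:5", "mouse", "7", "intergenic", 500000, 500120),
    make_row("demo:6", "human", "7", "intergenic", 1999880, 2000000),
]

hits = [r["sorf_id"] for r in rows
        if matches(r, species="human", chromosome="7",
                   annotation="intergenic",
                   begin_pos=160000, end_pos=2000000)]
```

Because both bounds are inclusive, demo:1 (starting exactly at 160000) and demo:6 (ending exactly at 2000000) are kept, while demo:2 and demo:3 fall just outside the window.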

Default query column attributes

sORF ID All sORFs have a unique ID, which starts with the dataset followed by ':' and an incrementally assigned number.
ENSEMBL ID All sORFs except intergenic sORFs have a corresponding Ensembl transcript ID for the transcript on which the sORF was identified. This value is null for intergenic sORFs, as no transcript is associated with them. Although intergenic sORFs have no ENSEMBL ID, the distance to the closest upstream and downstream gene can be retrieved from the BioMart query interface.
Species Represents the corresponding species
dataset Represents the corresponding dataset in which the sORF was identified. For more information about the distinct datasets, visit our dataset information page.
CHR Chromosome name
Begin pos The genomic start coordinate of the sORF. The query box enables lookup of sORFs whose start position is greater than or equal to the specified value (>=).
End pos The genomic end coordinate of the sORF. The query box enables lookup of sORFs whose end position is smaller than or equal to the specified value (<=).
Length Represents the length (in AA) of the micropeptide translated from the sORF. A dropdown box allows selecting a length range for sORFs.
Annotation Annotation is determined by the location of the sORF relative to the ENSEMBL mRNA annotation. The following annotations are available to choose from:
* exonic (sORFs located in the exonic part of a gene)
* intronic (sORFs located in the intronic part of a gene)
* intergenic (sORFs located between genes)
* ncRNA (sORFs located on non coding RNA)
* 3UTR (sORFs located in the 3'-UTR)
* 5UTR (sORFs located in the 5'-UTR)
* sORF (sORFs corresponding to an Ensembl protein coding ORF of less than or equal to 100 AA)
* NMD (nonsense mediated decay: if the coding sequence (following the appropriate reference) of a transcript finishes >50bp from a downstream splice site, it is tagged as NMD)
* TEC (to be experimentally confirmed: used for non-spliced EST clusters that have polyA features)
* NSD (non stop decay: transcripts that have polyA features (including signal) without a prior stop codon in the CDS)
Biotype The Ensembl automatic annotation system classifies genes and transcripts into biotypes including: protein coding, pseudogene, processed pseudogene, miRNA, rRNA, scRNA, snoRNA, snRNA. The biotypes can be grouped into protein coding, pseudogene, long noncoding and short noncoding.

Examples of biotypes in each group are as follows:
* Protein coding: IGC gene, IGD gene, IG gene, IGJ gene, IGLV gene, IGM gene, IGV gene, IGZ gene, nonsense mediated decay, nontranslating CDS, non stop decay, polymorphic pseudogene, TRC gene, TRD gene, TRJ gene.
* Pseudogene: disrupted domain, IGC pseudogene, IGJ pseudogene, IG pseudogene, IGV pseudogene, processed pseudogene, transcribed processed pseudogene, transcribed unitary pseudogene, transcribed unprocessed pseudogene, translated processed pseudogene, TRJ pseudogene, unprocessed pseudogene
* Long noncoding: 3prime overlapping ncrna, ambiguous orf, antisense, antisense RNA, lincRNA, ncrna host, processed transcript, sense intronic, sense overlapping
* Short noncoding: miRNA, miRNA_pseudogene, miscRNA, miscRNA pseudogene, Mt rRNA, Mt tRNA, rRNA, scRNA, snlRNA, snoRNA, snRNA, tRNA, tRNA_pseudogene

NOTE: when a sORF could originate from multiple transcripts with different biotypes, the highest ranked biotype is taken (see INFO page).

More information about the biotype annotation can be found on
Overlap The exon overlap attribute specifies the amount of overlap between a sORF and the exon regions of annotated protein coding sequences. It ranges between 0 (completely outside exonic regions) and 1 (completely overlapping with exonic regions).
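To make the 0 to 1 range of the Overlap attribute concrete, the sketch below computes the fraction of a sORF's positions that fall inside exon intervals. How the database actually derives this value is not described here, so the function, the inclusive coordinate convention and the treatment of the sORF as a single interval are all assumptions made for illustration.

```python
def exon_overlap(sorf_start, sorf_end, exons):
    """Fraction of sORF positions covered by at least one exon.

    Coordinates are treated as inclusive (an assumption); `exons` is a
    list of (start, end) tuples that may overlap each other.
    """
    length = sorf_end - sorf_start + 1
    covered = 0
    last_end = sorf_start - 1  # last position already counted
    for ex_start, ex_end in sorted(exons):
        # Clip the exon to the sORF and skip positions counted earlier.
        start = max(ex_start, sorf_start, last_end + 1)
        end = min(ex_end, sorf_end)
        if start <= end:
            covered += end - start + 1
            last_end = end
    return covered / length


outside = exon_overlap(100, 199, [])                      # no exon at all
inside = exon_overlap(100, 199, [(50, 300)])              # fully exonic
half = exon_overlap(100, 199, [(150, 400)])               # second half exonic
merged = exon_overlap(100, 199, [(100, 149), (120, 174)])  # overlapping exons
```

Positions shared by overlapping exons are counted once, so the value never exceeds 1.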

This interface also includes a navigation bar (Figure 6). The left side of the navigation bar shows information about the currently displayed sORFs. In Figure 6 it states: 'showing 1 to 10 of 185,814 entries', meaning that entries 1 to 10 of the 185,814 entries (sORFs) matching the current filters are shown. On the right, a navigation function allows easy traversal through multiple pages.

Figure 6: database navigation bar
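The status text in the navigation bar follows from the page number, the page size and the number of matching entries. A minimal Python sketch of that arithmetic is given below; the function names are hypothetical and it assumes at least one matching entry.

```python
import math


def page_status(page, page_size, total):
    # First and last entry shown on a 1-based page; the last page may be short.
    first = (page - 1) * page_size + 1
    last = min(page * page_size, total)
    return f"showing {first:,} to {last:,} of {total:,} entries"


def page_count(page_size, total):
    # Number of pages needed to show every matching entry.
    return math.ceil(total / page_size)


first_page = page_status(1, 10, 185814)
pages = page_count(10, 185814)
last_page = page_status(pages, 10, 185814)
```

With 10 records per page, the 185,814 entries of the example span 18,582 pages, and the last page holds only entries 185,811 to 185,814.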