To determine in which genic subregion a given tract resides, TRACTS needs to know where each exon and intron starts and ends. It obtains this by parsing the annotation file. The sequence file is therefore entered in this window together with its annotation, in GenBank, EMBL or DDBJ style, in flat file format (suffix .gbk). The extracted data are shown in the sequence output by different background colors for exon +, exon -, intron etc., and are listed in the Gene table. If the sequence to be analyzed differs from the one attached to the annotation, it can be entered separately in the "Alternate sequence entry file" window below, and that sequence will be processed instead.
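As an illustration of how exon and intron boundaries can be derived from an annotation, the following minimal Python sketch parses a GenBank-style location string such as `join(100..200,300..400)` into exon intervals and infers the introns as the gaps between consecutive exons. It is not the parser used by TRACTS; the function names and the example coordinates are hypothetical, and uncertain positions (`<`, `>`) are simply stripped.

```python
import re

def parse_exons(location):
    """Return (strand, [(start, end), ...]) from a GenBank-style location string.

    Coordinates are 1-based and inclusive, as in the flat file.
    Assumption: only simple spans 'a..b', optionally wrapped in join()/complement().
    """
    strand = "-" if "complement" in location else "+"
    # '<' and '>' mark uncertain ends; they are ignored in this sketch.
    cleaned = location.replace("<", "").replace(">", "")
    exons = [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", cleaned)]
    return strand, sorted(exons)

def introns_between(exons):
    """Introns are taken to be the gaps between consecutive exons of one gene."""
    return [(prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(exons, exons[1:])
            if next_start - prev_end > 1]

# Hypothetical three-exon gene on the minus strand.
strand, exons = parse_exons("complement(join(100..200,300..400,550..>600))")
print(strand, exons, introns_between(exons))
```

Because the intron coordinates are computed from the exon coordinates, any offset between the annotation and the sequence shifts every sub-region, which is why exact numerical correspondence between the two matters.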
Important notes:
Exact numerical correspondence between the annotation and the sequence is required for inspection of outputs.
Extract using mRNA/CDS
These buttons are in the upper right corner of the Annotated sequence window. Annotation files may contain mRNA feature keys, CDS feature keys, or both. When both are present, the user can choose between the mRNA and CDS (and exon/intron) features. mRNA is the preferred feature whenever it is provided. However, when the mRNA data are incomplete (many mRNA start or end positions are uncertain) or missing (check the annotation file), the CDS data need to be used instead. In that case noncoding mRNA will be read as intergenic ("intercoding" now), so that 5' and 3' UTR regions will be assigned and counted as intercoding. RNA genes (tRNA, ribosomal etc.) are extracted and listed in both cases. The default option is CDS.
Browse
Use this box to transfer your input annotation file directly, if it is saved on your machine.
Alternate sequence entry file
If, for any reason, the sequence attached to the annotation file is not the required one, you can enter your own sequence separately in this window. You can also process the sequence alone, without annotation. In that case only the "Tracts list", "Tract frequencies" and "Sequence output" outputs will be generated, without the sub-regional distribution and Gene tables.
The user can choose between the three possible pairs: R.Y (purine.pyrimidine), K.M (keto.imino) or S.W (strong.weak). For R.Y and K.M, both members of the pair can be run together, because these tracts usually distribute evenly between the two DNA strands. S and W are better run separately, as weak and strong sequences tend to behave quite differently.
An option "None" is provided to produce a color-annotated sequence without any tracts being indicated. At the request of a referee, an option to run unary sequences (polyA, polyC etc.) has been added.
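To make the notion of a binary tract concrete, here is a minimal Python sketch that finds uninterrupted runs of one member of a binary pair. The base assignments are the standard IUPAC two-base codes; the helper `find_tracts` and the example sequence are hypothetical and are not taken from the TRACTS source.

```python
import re

# IUPAC two-base codes for the members of the three binary pairs.
BINARY_CODES = {
    "R": "AG", "Y": "CT",   # purine / pyrimidine
    "K": "GT", "M": "AC",   # the K.M pair
    "S": "CG", "W": "AT",   # strong / weak
}

def find_tracts(seq, code, min_len=15):
    """Yield (start, end, length) for uninterrupted runs of one motif member.

    Positions are 1-based and inclusive. min_len=15 mirrors the default of
    the Tracts list output; the function itself is only an illustration.
    """
    bases = BINARY_CODES[code]
    pattern = re.compile("[%s]{%d,}" % (bases, min_len))
    for m in pattern.finditer(seq.upper()):
        yield m.start() + 1, m.end(), m.end() - m.start()

seq = "CCTAGGAGAAGGAGGAAGAGTTCC"
print(list(find_tracts(seq, "R", min_len=10)))
print(list(find_tracts(seq, "Y", min_len=3)))
```

Running both members of a pair amounts to calling the helper once per member, e.g. once with "R" and once with "Y".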
Match Level
TRACTS can also identify binary tracts in which a limited percentage of nonspecified bases is included. Thus, at a 90% match level, one nonspecified base ("nonbase") in ten is permitted, e.g. 3 C in a 30 nt R tract. TRACTS handles match levels down to 70%, in intervals of 5%. Below 70%, tracts would cover most of the genome, and difficulties in calculating expected values are encountered. 100% is the default level.
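The percentage arithmetic behind the match level can be sketched as follows. This only checks whether a given stretch reaches a requested level; how TRACTS decides where an imperfect tract starts and ends is not described here, so that part is deliberately left out. The function names are hypothetical.

```python
def match_level(tract, bases):
    """Percentage of positions in `tract` that belong to the motif `bases`."""
    hits = sum(1 for b in tract.upper() if b in bases)
    return 100.0 * hits / len(tract)

def passes(tract, bases, level=100):
    """True if the tract reaches the requested match level (70-100, steps of 5)."""
    if level not in range(70, 101, 5):
        raise ValueError("match level must be 70, 75, ..., 100")
    return match_level(tract, bases) >= level

# A 30 nt purine (R = A or G) stretch with 3 C "nonbases": 27/30 = 90%.
tract = "AGGAGAAGGC" * 3
print(match_level(tract, "AG"), passes(tract, "AG", 90), passes(tract, "AG", 100))
```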
Important - when a Match level other than 100% is chosen, the "Tracts frequencies" table will not be generated, because the expected values calculated here are valid at 100% only. Those interested in calculating expected values for match levels below 100% should contact gad.yagil@weizmann.ac.il .
A line for free text, to enter auxiliary data about your run and comments, to be displayed at the top of the output reports.
In these five checkboxes you can select the output files (A-E) you need. For certain inputs not all outputs can be generated, as described below. By default, all output boxes except "Annotated Sequence" are checked. The "Annotated Sequence" output is the largest one, so to save time (determined mainly by data transfer rates) it is left unchecked by default.
A. Tracts list
Produces a list of all tracts at or above a minimum length, which is selected in the pull down box on the right. The range is 10 - 50; the default is 15 bases. This feature is enabled only when a Binary Motif is chosen.
B. Tract frequencies table
Produces a table that summarizes the frequencies of the chosen tracts: their found and expected values, as well as the found/expected ratios. Note: The lower limits selected in A. or D. are for display purposes only. For the Tract frequencies table, all tract lengths are identified and listed.
C. Sub-region distribution table
The sub-regional distributions are calculated for all tracts equal to or longer than the tract length selected in the pull down box on the right. This table is generated only when a Binary Motif is chosen and an annotation file is provided.
D. Annotated Sequence
Relists the sequence, with exons and introns colored in the background. The tracts found at or above a chosen minimum length are shown as colored letters. The minimum length is selected in the pull down box on the right. The lower the number chosen, the more tracts will be marked on the sequence. The range is 7 - 50; the default is 10 bases.
E. Gene table
A list of all genes (exons/introns) in the chromosome/contig/scaffolds analyzed. (Note: an exon/intron list is produced whether CDS or mRNA is specified.)
A. Tracts list
A list of all the tracts that are longer than or equal to a given length, as specified by the user during input. The tracts are listed in their order of appearance in the sequence. Each line shows:
B. Tracts frequencies
A table in which the lines show:
· Column 1: The length l of the tracts, in nt.
· Column 2: The number of tracts of one member of the binary pair chosen.
· Column 3: The number of tracts of the other member (this and the next column will be absent if only a single tract is chosen).
· Column 4: The sum of both members, i.e. the number of tracts of length l found in the input sequence.
· Column 5: The number of tracts of that length expected in a random DNA sequence of the same length and base composition as the input sequence, calculated by $L(p^{l}q^{2} + q^{l}p^{2})$.
· Column 6: The difference between columns 5 and 4.
· Column 7: The number of bases found in tracts of the length listed in column 1 (i.e. column 4 multiplied by tract length l).
· Column 8: The number of bases expected in random DNA (i.e. column 5 multiplied by tract length l).
· Column 9: The ratio between the number of bases found and the number expected, i.e. column 7 divided by column 8. This yields the same value as column 4 divided by column 5, i.e. the value is the same whether the number of tracts or the number of bases in these tracts is counted. The ratio is the best indicator of under- or over-representation of the binary tracts.
· Column 10: The number of found bases in tracts equal to or longer than (GE) the length listed in column 1.
· Column 11: The number of expected bases in random DNA in tracts GE l. (For the formula see paper.)
· Column 12: The ratio of found/expected GE l values (i.e. column 10 divided by column 11).
The numbers of tracts or bases expected (and the ratio values) in this table are valid only when a match level of 100% is specified (see: Options); if another match level is specified, the table will not be generated.
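The expected-value formula of column 5 and the found/expected ratio of column 9 can be sketched in a few lines of Python. The sketch assumes that L is the sequence length, p is the fraction of bases belonging to one member of the binary pair, and q = 1 - p; these readings of the symbols are assumptions, as are the function names and the example numbers.

```python
def expected_tracts(seq_len, p, l):
    """Expected number of tracts of exactly length l in random DNA.

    Implements L * (p**l * q**2 + q**l * p**2), with q = 1 - p.
    Assumption: p is the fraction of bases belonging to one member of the pair.
    """
    q = 1.0 - p
    return seq_len * (p ** l * q ** 2 + q ** l * p ** 2)

def found_over_expected(found_tracts, seq_len, p, l):
    """Found/expected ratio, computed once from tracts and once from bases."""
    expected = expected_tracts(seq_len, p, l)
    by_tracts = found_tracts / expected
    by_bases = (found_tracts * l) / (expected * l)
    return by_tracts, by_bases

# Hypothetical numbers: 1 Mb sequence, 50% purines, 12 tracts of length 20 found.
print(expected_tracts(1_000_000, 0.5, 20))
print(found_over_expected(12, 1_000_000, 0.5, 20))
```

Multiplying both the found and the expected tract counts by the same length l leaves the ratio unchanged, which is why the ratio is the same whether tracts or bases are counted.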
C. Sub region distributions
A summary table showing:
· Number of bases in exons, introns, and intergenic regions in the input sequence.
· Percentage of exons, introns, and intergenic regions in the input sequence.
· Number of bases found in tracts of each genomic sub region. The numbers shown are for tracts equal to or longer than the length selected by the user.
· Number of bases expected in tracts of each genomic sub region.
This table can be generated only when the annotation data gives the
subregional composition information.
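The first three rows of this summary can be illustrated with a small Python sketch that labels every base as exon, intron or intergenic and then counts the bases and the tract bases per label. It is a simplified illustration with hypothetical names and coordinates, it assumes non-overlapping regions, and it does not compute the expected values, whose formula is not given here.

```python
def subregion_summary(seq_len, regions, tracts):
    """Count bases and tract bases per sub-region.

    regions: list of (start, end, kind) with kind 'exon' or 'intron',
             1-based inclusive, non-overlapping (an assumption of this sketch).
    tracts:  list of (start, end) intervals of found tracts.
    Anything not covered by a region is counted as 'intergenic'.
    """
    label = ["intergenic"] * seq_len
    for start, end, kind in regions:
        for i in range(start - 1, end):
            label[i] = kind

    total = {"exon": 0, "intron": 0, "intergenic": 0}
    in_tracts = dict(total)  # separate counter, also starting at zero
    for kind in label:
        total[kind] += 1
    for start, end in tracts:
        for i in range(start - 1, end):
            in_tracts[label[i]] += 1

    percent = {k: 100.0 * n / seq_len for k, n in total.items()}
    return total, percent, in_tracts

regions = [(11, 30, "exon"), (31, 60, "intron"), (61, 80, "exon")]
tracts = [(25, 40), (85, 100)]
print(subregion_summary(100, regions, tracts))
```

A per-base label list keeps the sketch short; for a whole chromosome an interval-based approach would be more economical.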
D. Annotated Sequence
The full sequence analyzed, in a convenient 100-base format (in "blocks" of 10). Found tracts have their letters colored according to their binary motif. Exons and introns are indicated by their background colors; introns are in italics. The minimum tract length to be colored is selected by the user. Moving the mouse over a colored region shows a tool tip indicating the gene name, gene product (function), sub region type and where the region starts and ends. Clicking on the region makes the display jump to the corresponding entry in the Gene table.
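The 100-base layout in blocks of 10 can be reproduced with a short Python helper. This is only a sketch of the text layout (no colors or tool tips); the function name and the position prefix on each line are assumptions, not features taken from the TRACTS output.

```python
def format_sequence(seq, line_len=100, block=10):
    """Return the sequence as lines of `line_len` bases in blocks of `block`.

    Each line is prefixed with the 1-based position of its first base
    (the position prefix is an assumption of this sketch).
    """
    lines = []
    for i in range(0, len(seq), line_len):
        chunk = seq[i:i + line_len]
        blocks = [chunk[j:j + block] for j in range(0, len(chunk), block)]
        lines.append("%8d  %s" % (i + 1, " ".join(blocks)))
    return "\n".join(lines)

# Hypothetical 240-base sequence: two full lines and one partial line.
print(format_sequence("ACGT" * 60))
```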
E. Gene table
A one-line summary of each exon and intron of all genes (RNAs) extracted from the annotated sequence. Each line shows:
This table can be generated only if the annotation data supplies the regional information.