GTF format - Flux Capacitor

GTF format

The GTF (Gene Transfer Format) was developed to facilitate the exchange of genome annotations (i.e., transcripts aligned to the genome) in human-readable flat files. A complete description is available, for instance, from Washington University or from UCSC.

The standard format description requires 8 mandatory, tab-separated fields. These are followed by a list of optional attributes with the structure

key "value"; key2 "value2"; ...
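
The layout described above can be read with a few lines of Python. The following is a minimal sketch, not the parser used by the Flux Capacitor: the function name `parse_gtf_line` is hypothetical, the field names follow the usual GTF naming, and the sketch assumes that attribute values contain no semicolons.

```python
def parse_gtf_line(line):
    """Split one GTF line into its 8 mandatory fields and an attribute dict."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 8:
        raise ValueError("expected at least 8 tab-separated fields")
    keys = ("seqname", "source", "feature", "start", "end",
            "score", "strand", "frame")
    record = dict(zip(keys, columns[:8]))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])

    attributes = {}
    # Everything after the 8th field is treated as attribute text.
    # Assumption: values do not contain ';' themselves.
    for chunk in "\t".join(columns[8:]).split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        key, _, value = chunk.partition(" ")
        attributes[key] = value.strip().strip('"')
    record["attributes"] = attributes
    return record
```

Calling `parse_gtf_line` on an exon line returns a dictionary whose `attributes` entry maps each key (for example `transcript_id`) to its unquoted value.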

Attention: file sorting, which may be triggered by the Flux Capacitor or the Flux Simulator, expects a consistent order of the first 8 fields AND of the attribute transcript_id across lines. Specifically, the key transcript_id is expected to always appear in the same field (i.e., column) in which it is found in the first line.

Note: in contrast to the general format description, the Flux Capacitor and the Flux Simulator depend crucially on the attribute transcript_id, which has to be unique on the chromosome on which a transcript has been annotated (as required by the UCSC standard). The attribute gene_id is not necessary, as both programs perform an intrinsic clustering of transcripts into loci, i.e., spliceforms that overlap on the same strand. Further, each transcript requires at least one exon feature. Additional CDS features are optional and mark the corresponding transcript as coding.
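
These requirements can be checked before running either program. The sketch below is an illustration only: `check_transcripts` is a hypothetical helper, and it assumes the annotation has already been reduced to `(seqname, feature, transcript_id)` tuples, with `None` standing for a missing transcript_id.

```python
from collections import defaultdict


def check_transcripts(records):
    """Check the transcript_id and exon requirements on parsed GTF records.

    records: iterable of (seqname, feature, transcript_id-or-None) tuples.
    Returns (lines without transcript_id, transcripts without an exon,
    transcripts marked as coding by a CDS feature).
    """
    features = defaultdict(set)
    missing_id = 0
    for seqname, feature, transcript_id in records:
        if transcript_id is None:
            missing_id += 1
            continue
        # Transcripts are keyed per chromosome, since the ID only
        # needs to be unique on its own chromosome.
        features[(seqname, transcript_id)].add(feature)

    without_exon = sorted(k for k, f in features.items() if "exon" not in f)
    coding = sorted(k for k, f in features.items() if "CDS" in f)
    return missing_id, without_exon, coding
```

A transcript listed in the second return value has only non-exon features (for example CDS lines) and would need at least one exon line added.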

Flux Capacitor

The Flux Capacitor reads reference annotations in GTF format, considering exclusively exon features and transcript_id attributes. Adjacent exons, i.e., exon pairs with the same transcript_id in which one exon starts at the position directly after the end of the other, are merged into one exon. This may result in a lower number of exon feature lines in the output. Intrinsically, all transcript models are clustered into loci, i.e., groups of transcripts that overlap on the same strand in the genomic region they align to. The attribute gene_id is set to a UCSC genome browser compatible string chr:start-end plus an additional character, W or C, indicating the strand.
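
The merging and clustering steps can be sketched as follows. This is an illustration under stated assumptions, not the Flux Capacitor's implementation: all function names are hypothetical, coordinates are taken as 1-based and inclusive, overlap is tested on whole transcript spans, W is assumed to denote the forward (+) strand and C the reverse (-) strand, and the strand character is assumed to be appended directly to the coordinate string.

```python
def merge_adjacent_exons(exons):
    """Merge exons of ONE transcript where one starts directly after another ends.

    exons: list of (start, end) tuples.
    """
    merged = []
    for start, end in sorted(exons):
        if merged and start == merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], end)   # extend the previous exon
        else:
            merged.append((start, end))
    return merged


def cluster_into_loci(transcripts):
    """Group transcripts whose spans overlap on the same chromosome and strand.

    transcripts: list of (transcript_id, chrom, strand, start, end) tuples.
    """
    loci = []
    by_position = sorted(transcripts, key=lambda t: (t[1], t[2], t[3]))
    for tid, chrom, strand, start, end in by_position:
        last = loci[-1] if loci else None
        if (last and last["chrom"] == chrom and last["strand"] == strand
                and start <= last["end"]):
            last["end"] = max(last["end"], end)
            last["ids"].append(tid)
        else:
            loci.append({"chrom": chrom, "strand": strand,
                         "start": start, "end": end, "ids": [tid]})
    return loci


def locus_id(chrom, start, end, strand):
    """Build chr:start-end followed by W or C (strand mapping is an assumption)."""
    return f"{chrom}:{start}-{end}{'W' if strand == '+' else 'C'}"
```

Because transcripts are sorted by chromosome, strand, and start before the sweep, a transcript joins the current locus whenever it starts at or before that locus's current end.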

In the output, the Flux Capacitor writes the following features:

Feature	Description
`exon`	Standard GTF-compliant feature.
`as_event`	Alternative Splicing events, compliant with the AStalavista definition. For a description of attributes used for `as_event` features, see the AStalavista GTF format description.
`transcript`	A spliceform as annotated in the reference annotation.
`locus`	Spliceforms that overlap on the same strand form a splicing locus.
`segment`	A segment, (part of) an exon.
`junction`	A junction, joining two adjacent exon segments or a splice-junction joining two exon segments across an intron.

The Flux Capacitor can output multiple abundance measures in a 3-token format, with the tokens separated by `_` (underscore). The 3 tokens denote the base, the resolution, and the type of the measurement.

Base	Description
`obs`	Measurement is based on the number of reads that are observed to map to the feature according to the rules described in mapping.
`pred`	Measurement is based on the number of reads that are predicted for the feature after flow network decomposition.

Resolution	Description
`all`	all reads that overlap the area of the feature (i.e., that fall into one of the possible `slots`) are taken into account for the measurement.
`split`	reads that stem from the spliceforms listed in the `transcript_id` attribute of the feature are taken into account for the measurement.
`uniq`	exclusively reads that fall into unique regions of the spliceforms listed in the `transcript_id` attribute are taken into account for the measurement.

Measurement	Description
`freq`	the absolute number of read(-pairs) that map to the feature.
`rfreq`	the relative frequency of the feature, i.e., the fraction of read mappings from all read mappings in the experiment.
`rcov`	the relative coverage, i.e., read mappings `freq` divided by the number of distinct read mappings in the feature. See the description of the `slots` attribute below.

For the calculation of the coverage measures, the Flux Capacitor counts the number of distinct read locations ("slots") in the feature with respect to the corresponding resolution. These numbers are included in the output as the 2-token attributes slots_all, slots_split, and slots_uniq.
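
The naming scheme can be taken apart mechanically. The sketch below is a hypothetical helper, `split_measure_key`, that recognises the 3-token abundance attributes and the 2-token `slots_*` attributes by their base and resolution tokens and rejects other keys. The measurement token is passed through unchecked, as an assumption that the set of measurement names may differ between versions.

```python
BASES = {"obs", "pred"}
RESOLUTIONS = {"all", "split", "uniq"}


def split_measure_key(key):
    """Return the tokens of an abundance or slots attribute, else None."""
    tokens = key.split("_")
    # 2-token form: slots_<resolution>
    if len(tokens) == 2 and tokens[0] == "slots" and tokens[1] in RESOLUTIONS:
        return ("slots", tokens[1])
    # 3-token form: <base>_<resolution>_<measurement>
    if len(tokens) == 3 and tokens[0] in BASES and tokens[1] in RESOLUTIONS:
        return tuple(tokens)
    return None
```

For instance, `split_measure_key("pred_split_rfreq")` yields the base `pred`, the resolution `split`, and the measurement `rfreq`, while a key such as `transcript_id` yields `None`.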

Examples

chrX myGenes transcript 123 455 . + . transcript_id "myTranscript"; slots_all "297"; slots_split "297"; slots_uniq "0"; obs_all_reads "69"; obs_split_reads "31"; obs_uniq_reads "0"; obs_all_rfreq "1.3467e-5"; obs_split_rfreq "6.051e-6"; obs_uniq_rfreq "0"; obs_all_rcov "45.3434"; obs_split_rcov "20.0372"; obs_uniq_rcov "0"; pred_all_reads "73"; pred_split_reads "62"; pred_uniq_reads "3"; pred_all_rfreq "1.4248e-5"; pred_split_rfreq "1.2102e-5"; pred_uniq_rfreq "5.8554e-7"; pred_all_rcov "40.9731"; pred_split_rcov "40.744"; pred_uniq_rcov "1.9715";

Attribute	Value	Description
`slots_all`	297	number of different mappings to the feature
`slots_split`	297	subset of `slots_all` that map to the spliceforms listed in `transcript_id` that fall into the feature
`slots_uniq`	0	subset of `slots_split` that map exclusively to the spliceforms listed in `transcript_id`
`obs_all_reads`	69	read mappings that fall into exonic areas of transcript `myTranscript`
`obs_split_reads`	31	for each segment, `obs_all_reads` is divided by the number of transcripts in that segment. The sum of these fractions forms the split reads. Use this to assess the difficulty of decomposition.
`obs_uniq_reads`	0	reads that align in unique regions of the transcript
`obs_all_rfreq`	1.3467e-5	`obs_all_reads` divided by the number of read mappings
`obs_split_rfreq`	6.051e-6	`obs_split_reads` divided by the number of read mappings
`obs_uniq_rfreq`	0	`obs_uniq_reads` divided by the number of read mappings
`obs_all_rcov`	45.3434	`obs_all_rfreq` divided by `slots_all`
`obs_split_rcov`	20.0372	`obs_split_rfreq` divided by `slots_split`
`obs_uniq_rcov`	0	`obs_uniq_rfreq` divided by `slots_uniq`
`pred_all_reads`	73	`pred_split_reads` plus reads that have been predicted for other transcripts in overlapping segments
`pred_split_reads`	62	reads that have been predicted for the transcript. Use this value in combination with `obs_split_reads` to assess the reliability of the prediction.
`pred_uniq_reads`	3	reads that have been predicted in unique segments. Use this value in combination with `obs_uniq_reads` to assess the reliability of the prediction.
`pred_all_rfreq`	1.4248e-5	`pred_all_reads` divided by the number of read mappings. Use this value in combination with `obs_all_rfreq` to assess the reliability of the prediction.
`pred_split_rfreq`	1.2102e-5	`pred_split_reads` divided by the number of read mappings. **Use this value to compare the same transcript in different experiments.**
`pred_uniq_rfreq`	5.8554e-7	`pred_uniq_reads` divided by the number of read mappings. Use this value in combination with `obs_uniq_rfreq` to assess the reliability of the prediction.
`pred_all_rcov`	40.9731	`pred_all_rfreq` divided by `slots_all`.
`pred_split_rcov`	40.744	`pred_split_rfreq` divided by `slots_split`. **Use this value to compare different transcripts across different experiments.**
`pred_uniq_rcov`	1.9715	`pred_uniq_rfreq` divided by `slots_uniq`. Use this value in combination with `obs_uniq_rcov` to assess the reliability of the prediction.
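
Since each `rfreq` value is the corresponding read count divided by the number of read mappings, dividing a read count by its `rfreq` recovers that total. The sketch below applies this to the values of the example line as a quick consistency check. The dictionary and function names are illustrative only, and the `obs_uniq` pair is left out because both of its values are 0.

```python
# (reads, rfreq) pairs copied from the example line above.
example = {
    "obs_all": (69, 1.3467e-5),
    "obs_split": (31, 6.051e-6),
    "pred_all": (73, 1.4248e-5),
    "pred_split": (62, 1.2102e-5),
    "pred_uniq": (3, 5.8554e-7),
}


def implied_total_mappings(reads, rfreq):
    """Invert rfreq = reads / total to estimate the total number of mappings."""
    return reads / rfreq


totals = {name: implied_total_mappings(reads, rfreq)
          for name, (reads, rfreq) in example.items()}

for name, total in totals.items():
    print(f"{name:>10}: {total:,.0f}")
```

All five pairs imply a total of about 5.12 million read mappings, which is what one would expect if every `rfreq` in the line was computed against the same experiment-wide count.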
