GTF format

The GTF (Gene Transfer Format) has been developed to facilitate the exchange of genome annotations (i.e., transcripts aligned to the genome) in human-readable flat files. A complete description is available, for instance, at Washington University or at UCSC.

The standard format description requires 8 mandatory fields, which are tab-separated. They are followed by a list of optional attributes with the structure

key "value"; key2 "value2"; ...
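The attribute structure above can be read with a few lines of code. The following is a minimal sketch, not the parser used by the Flux programs; the names ATTRIBUTE_PATTERN and parse_attributes are hypothetical, and values are assumed not to contain double quotes.

```python
import re

# Matches one `key "value";` pair of the attribute list.
# Assumption: values never contain a double quote.
ATTRIBUTE_PATTERN = re.compile(r'(\S+)\s+"([^"]*)"\s*;')


def parse_attributes(column):
    """Return the key/value pairs of one attribute list as a dict."""
    return dict(ATTRIBUTE_PATTERN.findall(column))


attributes = parse_attributes('transcript_id "myTranscript"; slots_all "297";')
print(attributes["transcript_id"])
```

All values are returned as strings; numeric attributes have to be converted by the caller.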

Attention: file sorting, which may be triggered by the Flux Capacitor or the Flux Simulator, expects a consistent order of the first 8 fields AND of the attribute transcript_id across lines. Concretely, the key transcript_id is expected to always be in the same field (i.e., column) as in the first line.
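Whether a file meets this expectation can be checked before it is passed to either program. The sketch below assumes that "same field" means the same position within the attribute list, and it splits attributes naively on ";" (values containing a semicolon would break it). Both function names are hypothetical.

```python
def transcript_id_position(line):
    """Return the index of transcript_id among the attributes of a GTF line."""
    attribute_column = line.rstrip("\n").split("\t")[8]
    keys = [pair.split()[0] for pair in attribute_column.split(";") if pair.strip()]
    return keys.index("transcript_id")  # raises ValueError if the key is missing


def lines_with_shifted_transcript_id(lines):
    """Return 1-based numbers of lines where transcript_id is not where line 1 has it."""
    expected = transcript_id_position(lines[0])
    return [number for number, line in enumerate(lines, start=1)
            if transcript_id_position(line) != expected]


gtf_lines = [
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\ttranscript_id "t1"; gene_id "g1";',
    'chr1\tsrc\texon\t300\t400\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
]
print(lines_with_shifted_transcript_id(gtf_lines))
```

In this sample the second line is reported, because its transcript_id comes after gene_id.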

Note: in contrast to the general format description, the Flux Capacitor and the Flux Simulator crucially depend on the attribute transcript_id, which has to be unique on the chromosome on which a transcript has been annotated (as met by the UCSC standard). The attribute gene_id is not necessary, as both programs perform an intrinsic clustering of transcripts into loci, i.e., spliceforms that overlap on the same strand. Further, each transcript requires at least one exon feature. Additional CDS features are optional and mark the corresponding transcript as coding.
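The requirement of at least one exon feature per transcript can be verified with a short script. This is a sketch under the assumption that a transcript is identified by the pair (chromosome, transcript_id), as the uniqueness rule above suggests; the function name transcripts_without_exon is hypothetical.

```python
import re

# Extracts the value of the transcript_id attribute from the attribute list.
TRANSCRIPT_ID = re.compile(r'transcript_id\s+"([^"]*)"')


def transcripts_without_exon(lines):
    """Return (chromosome, transcript_id) pairs that have no exon feature."""
    seen, with_exon = set(), set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        chromosome, feature, attribute_column = fields[0], fields[2], fields[8]
        match = TRANSCRIPT_ID.search(attribute_column)
        if match is None:
            continue  # lines without transcript_id are ignored in this sketch
        key = (chromosome, match.group(1))
        seen.add(key)
        if feature == "exon":
            with_exon.add(key)
    return sorted(seen - with_exon)


annotation = [
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\ttranscript_id "t1";',
    'chr1\tsrc\tCDS\t120\t180\t.\t+\t0\ttranscript_id "t1";',
    'chr1\tsrc\tCDS\t500\t600\t.\t+\t0\ttranscript_id "t2";',
]
print(transcripts_without_exon(annotation))
```

Here t2 is reported, since it is annotated with a CDS feature only.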

Flux Capacitor

The Flux Capacitor reads reference annotations in GTF format, considering exclusively exon features and transcript_id attributes. Adjacent exons, i.e., exon pairs with the same transcript_id of which one exon starts at the position directly after the end of the other exon, are merged into one exon. This may result in a lower number of exon feature lines in the output. Intrinsically, all transcript models are clustered into loci, i.e., transcripts that overlap on the same strand in the genomic region they align to. The attribute gene_id is adjusted to a UCSC genome browser compatible string chr:start-end followed by an additional character, W or C, indicating the strand.
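The merging rule for adjacent exons can be illustrated as follows. This is a minimal sketch for the exons of a single transcript, given as (start, end) pairs in 1-based inclusive GTF coordinates; the function name merge_adjacent_exons is hypothetical and the code is not taken from the Flux Capacitor.

```python
def merge_adjacent_exons(exons):
    """Merge exons of one transcript where one starts directly after the other ends."""
    merged = []
    for start, end in sorted(exons):
        if merged and start == merged[-1][1] + 1:
            # Exon starts at the position directly after the previous end: extend it.
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


print(merge_adjacent_exons([(100, 200), (201, 300), (500, 600)]))
# [(100, 300), (500, 600)]
```

The first two exons are adjacent (201 follows 200) and become one exon, which is why the output can contain fewer exon lines than the input.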

In the output, the Flux Capacitor writes the following features

Feature Description
exon Standard GTF-compliant exon feature.
as_event Alternative Splicing events, compliant with the AStalavista definition. For a description of attributes used for as_event features, see the AStalavista GTF format description.
transcript A spliceform as annotated in the reference annotation.
locus Spliceforms that overlap on the same strand form a splicing locus.
segment A segment, (part of) an exon.
junction A junction, joining two adjacent exon segments or a splice-junction joining two exon segments across an intron.
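The clustering of spliceforms into loci and the adjusted gene_id string can be sketched as below. Several points are assumptions: overlap is tested on the full transcript span rather than on exons, W is taken to denote the forward (+) strand and C the reverse (-) strand, and the strand character is appended directly to chr:start-end. The function names are hypothetical.

```python
def cluster_into_loci(transcripts):
    """Group (chromosome, strand, start, end, transcript_id) tuples into loci."""
    loci = []
    # Sorting orders by chromosome, then strand, then start position.
    for chromosome, strand, start, end, transcript_id in sorted(transcripts):
        last = loci[-1] if loci else None
        if (last is not None and last["chromosome"] == chromosome
                and last["strand"] == strand and start <= last["end"]):
            # Overlaps the current locus on the same strand: extend it.
            last["end"] = max(last["end"], end)
            last["transcripts"].append(transcript_id)
        else:
            loci.append({"chromosome": chromosome, "strand": strand,
                         "start": start, "end": end,
                         "transcripts": [transcript_id]})
    return loci


def locus_gene_id(locus):
    """Format a locus as chr:start-end plus W or C (assumed: W for +, C for -)."""
    strand_char = "W" if locus["strand"] == "+" else "C"
    return f'{locus["chromosome"]}:{locus["start"]}-{locus["end"]}{strand_char}'


models = [
    ("chr1", "+", 100, 500, "t1"),
    ("chr1", "+", 400, 900, "t2"),
    ("chr1", "-", 450, 800, "t3"),
    ("chr1", "+", 1000, 1200, "t4"),
]
for locus in cluster_into_loci(models):
    print(locus_gene_id(locus), locus["transcripts"])
```

Transcripts t1 and t2 overlap on the same strand and share one locus, while t3 lies in the same region on the opposite strand and forms its own locus.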

The Flux Capacitor can output multiple abundance measures, named in a 3-token format with the tokens separated by "_" (underscore). The 3 tokens denote the base, the resolution, and the type of the measurement.

Base Description
obs Measurement is based on the number of reads that are observed to map to the feature according to the rules described in mapping.
pred Measurement is based on the number of reads that are predicted for the feature after flow network decomposition.

Resolution Description
all All reads that overlap the area of the feature (i.e., that fall into one of the possible slots) are taken into account for the measurement.
split Reads that stem from the spliceforms listed in the transcript_id attribute of the feature are taken into account for the measurement.
uniq Exclusively reads that fall into unique regions of the spliceforms listed in the transcript_id attribute are taken into account for the measurement.

Measurement Description
freq The absolute number of reads (or read pairs) that map to the feature.
rfreq The relative frequency of the feature, i.e., the fraction of read mappings out of all read mappings in the experiment.
rcov The relative coverage, i.e., the number of read mappings (freq) divided by the number of distinct read mappings in the feature. See the description of the slots attributes below.
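Attribute names following this 3-token scheme can be split back into their parts. The sketch below validates only the base and the resolution tokens listed in the tables above and leaves the measurement token unchecked; the function name split_measure_name is hypothetical.

```python
BASES = {"obs", "pred"}
RESOLUTIONS = {"all", "split", "uniq"}


def split_measure_name(name):
    """Return (base, resolution, measurement), or None if name is not a measure."""
    tokens = name.split("_")
    if len(tokens) != 3 or tokens[0] not in BASES or tokens[1] not in RESOLUTIONS:
        return None
    return tuple(tokens)


print(split_measure_name("pred_split_rfreq"))
print(split_measure_name("slots_all"))
```

Names with a different number of tokens, such as transcript_id or slots_all, are rejected and can be handled separately.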

For the calculation of the coverage measures, the Flux Capacitor counts the number of distinct read locations ("slots") in the feature with respect to the corresponding resolution. These numbers are included in the output as the 2-token attributes slots_all, slots_split, and slots_uniq.

Examples

chrX myGenes transcript 123 455 . + . transcript_id "myTranscript"; slots_all "297"; slots_split "297"; slots_uniq "0"; obs_all_reads "69"; obs_split_reads "31"; obs_uniq_reads "0"; obs_all_rfreq "1.3467e-5"; obs_split_rfreq "6.051e-6"; obs_uniq_rfreq "0"; obs_all_rcov "45.3434"; obs_split_rcov "20.0372"; obs_uniq_rcov "0"; pred_all_reads "73"; pred_split_reads "62"; pred_uniq_reads "3"; pred_all_rfreq "1.4248e-5"; pred_split_rfreq "1.2102e-5"; pred_uniq_rfreq "5.8554e-7"; pred_all_rcov "40.9731"; pred_split_rcov "40.744"; pred_uniq_rcov "1.9715";

Attribute Value Description
slots_all 297 number of different mappings to the feature
slots_split 297 subset of slots_all that map to the spliceforms listed in transcript_id that fall into the feature
slots_uniq 0 subset of slots_split that map exclusively to the spliceforms listed in transcript_id
obs_all_reads 69 read mappings that fall into exonic areas of transcript myTranscript
obs_split_reads 31 for each segment, the number of obs_all_reads is divided by the number of transcripts there. The sum of these fractions forms the split reads. Use this to assess the difficulty of decomposition.
obs_uniq_reads 0 reads that align in unique regions of the transcript
obs_all_rfreq 1.3467e-5 obs_all_reads divided by the number of read mappings
obs_split_rfreq 6.051e-6 obs_split_reads divided by the number of read mappings
obs_uniq_rfreq 0 obs_uniq_reads divided by the number of read mappings
obs_all_rcov 45.3434 obs_all_rfreq divided by slots_all
obs_split_rcov 20.0372 obs_split_rfreq divided by slots_split
obs_uniq_rcov 0 obs_uniq_rfreq divided by slots_uniq
pred_all_reads 73 pred_split_reads plus reads that have been predicted for other transcripts in overlapping segments
pred_split_reads 62 reads that have been predicted for the transcript. Use this value in combination with obs_split_reads to assess the reliability of the prediction.
pred_uniq_reads 3 reads that have been predicted in unique segments. Use this value in combination with obs_uniq_reads to assess the reliability of the prediction.
pred_all_rfreq 1.4248e-5 pred_all_reads divided by the number of read mappings. Use this value in combination with obs_all_rfreq to assess the reliability of the prediction.
pred_split_rfreq 1.2102e-5 pred_split_reads divided by the number of read mappings. Use this value to compare the same transcript in different experiments.
pred_uniq_rfreq 5.8554e-7 pred_uniq_reads divided by the number of read mappings. Use this value in combination with obs_uniq_rfreq to assess the reliability of the prediction.
pred_all_rcov 40.9731 pred_all_rfreq divided by slots_all.
pred_split_rcov 40.744 pred_split_rfreq divided by slots_split. Use this value to compare different transcripts across different experiments.
pred_uniq_rcov 1.9715 pred_uniq_rfreq divided by slots_uniq. Use this value in combination with obs_uniq_rcov to assess the reliability of the prediction.
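Since each rfreq value is described as a read count divided by the number of read mappings, the total number of read mappings in the experiment can be recovered from any pair of reads and rfreq values. The sketch below applies this to the values of the example line; the function name implied_total_mappings is hypothetical, and pairs with a zero count are left out.

```python
def implied_total_mappings(reads, rfreq):
    """Invert rfreq = reads / total to recover the total number of read mappings."""
    return reads / rfreq


# (reads, rfreq) pairs copied from the example line above.
example = {
    "obs_all": (69, 1.3467e-5),
    "obs_split": (31, 6.051e-6),
    "pred_all": (73, 1.4248e-5),
    "pred_split": (62, 1.2102e-5),
    "pred_uniq": (3, 5.8554e-7),
}

for name, (reads, rfreq) in example.items():
    print(name, round(implied_total_mappings(reads, rfreq)))
```

All five pairs point to roughly 5.12 million read mappings, differing only through rounding of the printed rfreq values.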

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License