Flux Capacitor Manual

Version

New Features

Bugs

To Do

The list is not sorted by priority

  • output reads observed in introns
  • command-line flag to deactivate prediction / deconvolution
  • enable gene_id for sub-locus information

To comment on the To Do list, go to the details page.

Usage

usage: flux

Parameter Argument Default Description
-cb,--costBounds Float,Float cost boundaries in the form of $\mathit{factor_{min},factor_{max}}$ where $\mathit{factor_{min}}$ is the factor to determine the minimum (and $\mathit{factor_{max}}$ correspondingly the factor for obtaining the maximum) number of reads that can be predicted with respect to the originally observed reads. See the page about the objective function for details about the cost boundaries.
-cm,--costModel lin|log[Float] log cost model, either linear or logarithmic. In the case of linear costs, a slope different from 1 can be set with the optional Float argument.
-cs,--costSplit Integer 5 the number of linear segments that approximate the function underlying the cost model
-d,--decompose perform the flow network decomposition. Important: in Capacitor 20090718 there is no output produced if this option is not applied.
-f,--force suppresses communication on stderr
-g,--graph outputs graph information, i.e., GTF features fragment and junction
-i,--lib String path to the lpsolve native libraries
-l,--locus additionally output elements of the locus, i.e., GTF features locus and transcript
-m,--map perform the mapping (i.e., "profiling") step. Important: in Capacitor 20090718 this step alone does not produce any output. If applied, it collects information about the read distribution; otherwise, uniformly distributed reads are assumed.
-o,--out String stdout write output to a file with the given path
-p,--pair Integer,Integer activate paired-end, specify insert size range by $\mathit{size_{min},size_{max}}$
-r,--ref String Mandatory: file with the reference annotation (GTF format)
-s,--sra String Mandatory: file with the short read alignment (BED format)
-t,--strand activate strand information
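
The flags in the table above can be combined as in the following minimal sketch, which only assembles an argument list and does not run anything. The executable name flux is taken from the usage line; the file names, the helper build_flux_command and its keyword arguments are hypothetical, and how the program is actually launched on a given system is an assumption.

```python
def build_flux_command(ref_gtf, sra_bed, out_path=None, profile=True,
                       decompose=True, pair=None, strand=False):
    """Assemble an argument list from the flags documented in the usage table."""
    # -r and -s are the two mandatory inputs (GTF annotation, BED alignments).
    cmd = ["flux", "-r", ref_gtf, "-s", sra_bed]
    if profile:
        cmd.append("-m")          # mapping / profiling step
    if decompose:
        cmd.append("-d")          # decomposition; needed for output in 20090718
    if pair is not None:
        # insert size range given as size_min,size_max
        cmd += ["-p", f"{pair[0]},{pair[1]}"]
    if strand:
        cmd.append("-t")          # activate strand information
    if out_path is not None:
        cmd += ["-o", out_path]   # otherwise output goes to stdout
    return cmd


# Hypothetical file names, for illustration only.
print(" ".join(build_flux_command("annotation.gtf", "reads.bed", out_path="out.gtf")))
```

Keeping -m and -d switched on by default in the sketch reflects the notes in the table that neither step produces useful output alone in Capacitor 20090718.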

The Segment Graph


Read mapping

Single read mapping

Assigning reads to annotated features, after they have been mapped to genomic locations, is not straightforward. The Flux Capacitor follows a conservative annotation assignment, i.e., reads are assigned uniquely to genomic regions ("segments" or "junctions"). These regions are defined by the exon-intron structure of each locus; an example is shown in Fig.1.

read_location4
Fig.1: An example locus with two transcripts $\textrm{I}$ and $\textrm{II}$ (names to the left) that overlap in segments of their exons (green boxes denoted by letters A through E, indices indicate segments of overlapping exons). The Flux Capacitor further distinguishes 5 non-exonic areas. 19 sequencing reads (arrows with heart labels) have been mapped in the area of the locus as shown.

The locus sketched in Fig.1 consists of 8 exons that cluster in 8 segments (A1, A2, $\ldots$, E) separated by 5 non-exonic regions, i.e., the 5' proximal area (F), 3 introns (G, H, J), and the 3' proximal area (K). Additionally, there exist junctions between all adjacent segments (e.g., FA1, A1A2, etc.), and between non-adjacent segments that are spliced together (so-called splice-junctions, for instance A2B1). Reads are assigned to the region they completely fall into.

category          FA1  A1A2  A2     G   GB1  B1B2  B1C1  B2H  C1     C1C2  C2  C3  C3J  J   E   EK  none
assigned read ID  1    2     3, 19  18  4    5     17    6    7, 16  15    8   14  9    10  11  12  13

Note: Given its mapping, read number 13 is not compatible with the annotation and remains unassigned.
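
The assignment rule for contiguous reads can be illustrated with a minimal sketch. The coordinates below are hypothetical (Fig.1 gives none), the helper assign_read is not part of the Flux Capacitor, and splice-junctions between non-adjacent segments are deliberately left out.

```python
# Hypothetical, contiguous regions as (name, start, end), 1-based inclusive.
REGIONS = [("F", 1, 10), ("A1", 11, 20), ("A2", 21, 30), ("G", 31, 40)]


def assign_read(start, end, regions=REGIONS):
    """Return the segment or junction a contiguous read falls into, or None."""
    # A read that extends beyond the described area cannot be assigned.
    if start < regions[0][1] or end > regions[-1][2]:
        return None
    # Names of all regions the read overlaps, in genomic order.
    hits = [name for name, r_start, r_end in regions
            if start <= r_end and end >= r_start]
    if len(hits) == 1:
        return hits[0]             # completely inside one region
    if len(hits) == 2:
        return hits[0] + hits[1]   # crosses the junction of two adjacent regions
    return None                    # crosses several boundaries: unassigned in this sketch


print(assign_read(12, 18))  # A1
print(assign_read(8, 13))   # FA1
print(assign_read(38, 45))  # None
```

Each read receives at most one label, which mirrors the conservative, unique assignment described above.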

Read pair mapping

A read pair is validly mapped iff both mates map to a segment or junction and their mapping distance, on at least one of the transcripts that support both mapping locations, falls within the boundaries of the expected insert sizes. Fig.2 summarizes how paired reads are counted and how coverage by read pairs is determined.

how_to_count_reads.png

Fig.2: Examples of exonic structures (green boxes are exons, introns are not drawn to scale) and distinct possible read mappings, for single reads (above the structure) and paired-end reads (below). The read length is 3 and, for paired-end reads, the insert size is 4 (no variation). For simplification, junctions are not shown. (A) There are 10 possible mapping locations ("slots") in a mono-exonic transcript of 12nt. Reads starting at positions 11 or 12 fall partially outside of the annotation, as do reads that start before position 1, and such reads are not considered to belong to the exon as annotated. Correspondingly, 4 slots for paired-end reads can be observed. (B) Example of a transcript with 2 exons. Disregarding the splice-junction, which is assigned read mappings starting at position 6 or 7, we observe 8 slots for single reads and 3 slots for paired-end reads. (C) Example of a transcript with 3 exons (splice-junctions disregarded). There are 7 slots for single reads, and 2 for paired-end reads.
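
The single-read slot count of panel (A) can be reproduced with a short sketch. The function count_slots is a hypothetical helper, and it covers only the contiguous, mono-exonic case; paired-end slots depend on how the insert size translates into the total extent of a pair, which is not modelled here.

```python
def count_slots(length, span):
    """Number of start positions at which a stretch of `span` nt fits entirely
    into a linear sequence of `length` nt."""
    return max(0, length - span + 1)


# Panel (A): a 12nt mono-exonic transcript and reads of length 3.
print(count_slots(12, 3))  # 10
```

Start positions 11 and 12 are excluded because a read of length 3 placed there would extend beyond position 12.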

Flow Network Decomposition

Abundance Measures

Measurements

Different measures can be thought of for measuring abundances of RNA features, in general loci, genes, transcripts, exons, or alternative splicing events. In the following, the terms read and mapping are used as synonyms. However, we have to bear in mind that all observed read alignments are mappings, and loosely speaking of reads instead holds only for datasets with exactly one alignment per read.

(Absolute) Frequency

We use the terms frequency and absolute frequency equivalently to describe the number of observations $o$ for a certain feature, i.e., an exon, transcript or gene. This number is directly derived from the number of reads that map to the corresponding feature. As Next Generation Sequencing technologies involve an intermediary step of library construction, frequency measures are biased by the size of the feature as well as by the size of the experiment.

Coverage

We adopt the term coverage to describe the occupancy of features (i.e., exons, transcripts, genes, events, etc.) by sequencing reads. Straightforwardly, this can be measured in two ways, as nucleotide coverage or as read coverage. As different sequencing experiments produce reads of different lengths, and datasets with mixed read lengths are also to be considered, the read coverage is a universal way to measure abundances of RNA molecules.

Definition: Read coverage $c$ is the number of observed reads $o$ aligning to a certain feature divided by the number of hypothetically possible different reads $s$ in the feature: $c=\frac{o}{s}$

By the definition stated above, the Flux Capacitor determines $c$ from the number of reads $o$ that map to a certain feature, relative to the number of different mapping possibilities $s$. In this way, coverage can be calculated for (linear) sequences (e.g., exons and transcripts), and for constructs of partially overlapping or disconnected sequences (as for instance in alternatively spliced genes). The basic idea follows along the lines of the coverage measure proposed in there. It generalizes the measurements focused on linear stretches of sequences as described there.

Coverage measures naturally normalize for the extent of a certain feature, and consequently one can compare coverages of features with different sizes. However, a linear correlation between observations and size is assumed intrinsically in the fraction $\frac{o}{s}$.

Variable $o$ is denoted by freq in the Flux Capacitor's output. For a complete list of all possible abundance tags, see the GTF format description.

Experiment Size Normalization

In order to compare experiments of different sizes, both measurements can be expressed relative to the number of reads from which they have been derived. It remains important to consider which number of reads is the basis for the comparison. Possibilities are the total number of reads in the experiment $n_{exp}$ (before or after a potential quality filtering process), the number of reads $n_{dna}$ that can be mapped to a reference genome, or the number of reads $n_{rna}$ that can be mapped to a reference transcriptome.

Clearly, $n_{exp}$ does not contain information about the quality of the reads in terms of mismatches when aligned to a reference. Therefore, across technical replicates this number may introduce a bias in favor of a technically better run over one where fewer reads can be mapped due to high error rates. Furthermore, sample contamination is not accounted for by a normalization over $n_{exp}$, whereas it is accounted for when considering the number of sequences that can be mapped to a reference genome, $n_{dna}$. But $n_{dna}$ still carries no information about the fraction of "undesired" reads, such as those derived from ribosomal RNA or unspliced transcripts, which can vary substantially between experiments depending on the applied protocols. In fact, when comparing mRNA frequencies it seems most sensible to take $n_{rna}$ of the considered transcriptome into account. In this way, biases from differing degrees of incompleteness when comparing against different annotations are naturally balanced as well.

We therefore define relative frequency $rfreq$ and relative coverage $rcov$ as follows

$rfreq=\frac{o}{n_{rna}}$

and

$rcov=\frac{c}{n_{rna}}$

For single reads, our $rcov$ measurement is close to the RPKM measure (reads per kilobase per million mapped reads). The differences are that (i) not the transcribed length, but merely the number of different alignment positions ("slots") is taken into account ($\textrm{slots} = \textrm{length} - \textrm{readlength}$), and (ii) the RPKM measure scales the numeric space of the obtained values up by $10^9$. In order to produce measurements comparable to the currently popular RPKM values, the Flux Capacitor produces $rcov$ measures scaled by the factor explained in (ii).
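
A small numeric sketch may help to relate the three quantities. It reads $rcov$ as the coverage $c=\frac{o}{s}$ divided by $n_{rna}$ and applies the factor of $10^9$ mentioned in (ii). The function name and the input values are hypothetical, and the sketch makes no claim about how the Flux Capacitor computes these values internally.

```python
def abundance_measures(o, s, n_rna, scale=10**9):
    """Coverage, relative frequency and relative coverage for one feature.

    o     -- number of reads observed for the feature
    s     -- number of slots (possible different reads) in the feature
    n_rna -- number of reads mapped to the reference transcriptome
    """
    c = o / s                                  # read coverage
    rfreq = o / n_rna                          # relative frequency
    rcov = c / n_rna                           # relative coverage
    rcov_scaled = scale * o / (s * n_rna)      # rcov on an RPKM-like scale
    return {"c": c, "rfreq": rfreq, "rcov": rcov, "rcov_scaled": rcov_scaled}


# Hypothetical feature: 50 reads, 1000 slots, one million transcriptome reads.
print(abundance_measures(o=50, s=1000, n_rna=1_000_000))
```

The slot count s is passed in directly, so the sketch does not depend on how slots are derived from feature and read lengths.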

Scope of measurement

The section before introduced the different measures of frequency and coverage, and their counterparts relative frequency and relative coverage after normalization according to the size of the respective experiment. All these measures have one component in common, the number of read mappings counted as frequency. Read mappings are based on a certain base (see below) and can be counted in different scopes considering a certain feature, i.e., an exon, transcript, etc. …

The Flux Capacitor considers two bases, i.e., observation obs and prediction pred. Base obs is the observation after mapping to the reference annotation, and distributes reads equally amongst overlapping features (a trivial deconvolution algorithm, so to say). Base pred considers the values after flow network deconvolution of the reads. Both bases are considered in 3 scopes:

  • all - the number of all mapped reads that fall into the feature
  • split - the number of read mappings from all that are assigned to the transcript(s) listed in the transcript_id field of the feature
  • unique - the subset of mappings in split that are in regions where exactly and exclusively the transcript(s) of the transcript_id field are annotated

all_split_unique

Here is a toy example for the different measurements. The figure sketches a locus L with two transcripts T1 and T2 and 3 exons E1, E2 and E3, of which E2 is shared by both transcripts. In total, 6 reads align within the 3 exons (splice junction mappings are not shown for simplification). We count the following frequencies:

feature  transcript ID  measurement      value
E1       T1             obs_freq_all     2
                        obs_freq_split   2
                        obs_freq_unique  2
E2       T1,T2          obs_freq_all     3
                        obs_freq_split   1.5
                        obs_freq_unique  0
E3       T2             obs_freq_all     1
                        obs_freq_split   1
                        obs_freq_unique  1
T1       T1             obs_freq_all     5
                        obs_freq_split   3.5
                        obs_freq_unique  2
T2       T2             obs_freq_all     4
                        obs_freq_split   2.5
                        obs_freq_unique  1
L        T1,T2          obs_freq_all     6
                        obs_freq_split   6
                        obs_freq_unique  3

By definition of the measurement, all equals split for the locus. Moreover, for a locus the unique measure counts the read mappings in regions where all transcripts of the locus are present. Consequently, it excludes reads from regions that are unique to a single transcript or to a subset of the transcripts in the locus. In this respect, the tag unique may be misleading for a locus.
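
The transcript rows of the toy example can be recomputed with the following minimal sketch for base obs. The read counts per exon are those of the example; the data layout and the function transcript_frequencies are hypothetical, the tag spelling obs_freq_unique is assumed, and equal distribution of reads among the transcripts sharing an exon is the trivial deconvolution described for base obs.

```python
# Toy locus: reads per exon and the transcripts annotated for each exon.
EXON_READS = {"E1": 2, "E2": 3, "E3": 1}
EXON_TRANSCRIPTS = {"E1": ("T1",), "E2": ("T1", "T2"), "E3": ("T2",)}


def transcript_frequencies(transcript):
    """Observed frequencies of one transcript in the scopes all, split, unique."""
    freq_all, freq_split, freq_unique = 0, 0.0, 0
    for exon, reads in EXON_READS.items():
        owners = EXON_TRANSCRIPTS[exon]
        if transcript not in owners:
            continue
        freq_all += reads                    # every read falling into the exon
        freq_split += reads / len(owners)    # equal share among the owners
        if owners == (transcript,):
            freq_unique += reads             # exon exclusive to this transcript
    return {"obs_freq_all": freq_all,
            "obs_freq_split": freq_split,
            "obs_freq_unique": freq_unique}


print(transcript_frequencies("T1"))
print(transcript_frequencies("T2"))
```

For T1 this yields 5, 3.5 and 2, and for T2 it yields 4, 2.5 and 1, in agreement with the table.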

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License