# Abundance Measures

## Measurements

Several measures can be used to quantify the abundance of RNA features, which in general are loci, genes, transcripts, exons, or alternative splicing events. In the following, the terms read and mapping are used as synonyms. Keep in mind, however, that every observed read alignment is a mapping, and loosely speaking of reads instead is only accurate for datasets with exactly one alignment per read.

### (Absolute) Frequency

We use the terms frequency and absolute frequency interchangeably to describe the number of observations $o$ for a certain feature, i.e., an exon, transcript, or gene. This number is derived directly from the number of reads that map to the corresponding feature. Because Next Generation Sequencing technologies involve an intermediary step of library construction, frequency measures are biased by the size of the feature as well as by the size of the experiment.

### Coverage

We adopt the term coverage to describe the occupancy of features (e.g., exons, transcripts, genes, events) by sequencing reads. This can be measured in two straightforward ways: as nucleotide coverage or as read coverage. Because different sequencing experiments produce reads of different lengths, and datasets with mixed read lengths must also be considered, read coverage is a universal way to measure the abundance of RNA molecules.

Definition: Read coverage $c$ is the number of observed reads $o$ aligning to a certain feature divided by the number $s$ of hypothetically possible different reads in the feature: $c=\frac{o}{s}$

By the definition stated above, the Flux Capacitor determines $c$ as the number of reads $o$ that map to a certain feature relative to the number of different mapping possibilities $s$. In this way, coverage can be calculated for (linear) sequences (e.g., exons and transcripts) as well as for constructs of partially overlapping or disconnected sequences (for instance, alternatively spliced genes). The basic idea follows along the lines of the coverage measure proposed in there. It generalizes the measurements focused on linear stretches of sequences as described there.

Coverage measures naturally normalize for the extent of a feature, so coverages of features with different sizes can be compared. However, the fraction $\frac{o}{s}$ intrinsically assumes a linear correlation between observations and size.
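The definition $c=\frac{o}{s}$ can be illustrated with a minimal Python sketch. The function name `read_coverage` and the example numbers are hypothetical and chosen only for illustration; this is not the Flux Capacitor's implementation.

```python
def read_coverage(o, s):
    """Read coverage c = o / s.

    o: number of observed reads aligning to the feature
    s: number of hypothetically possible different reads in the feature
    (names follow the symbols in the text; the function itself is hypothetical)
    """
    if s <= 0:
        raise ValueError("s must be positive")
    return o / s


# Hypothetical example: two features of different size.
# The larger one collects three times as many reads, yet both
# have the same coverage because s grows by the same factor.
small_feature = read_coverage(o=30, s=300)
large_feature = read_coverage(o=90, s=900)

print(small_feature, large_feature)
```

The example makes the linearity assumption visible: tripling both $o$ and $s$ leaves $c$ unchanged.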

Variable $o$ is denoted by freq in the Flux Capacitor's output. For a complete list of all possible abundance tags, see the GTF format description.

## Experiment Size Normalization

In order to compare experiments of different sizes, both measurements can be expressed relative to the number of reads from which they have been derived. It is important to consider which number of reads serves as the basis for the comparison. Possibilities are the total number of reads in the experiment $n_{exp}$ (before or after a potential quality filtering process), the number of reads $n_{dna}$ that can be mapped to a reference genome, or the number of reads $n_{rna}$ that can be mapped to a reference transcriptome.

Clearly, $n_{exp}$ does not contain information about the quality of the reads in terms of mismatches when aligned to a reference. Between technical replicates, this number may therefore bias toward a technically better run compared with one where fewer reads can be mapped due to high error rates. Furthermore, normalization over $n_{exp}$ does not account for sample contamination, which in contrast is captured when considering the number of sequences that can be mapped to a reference genome, $n_{dna}$. But $n_{dna}$ still carries no information about the fraction of "undesired" reads, such as those derived from ribosomal RNA or unspliced transcripts, which can vary substantially between experiments depending on the applied protocols. In fact, when comparing mRNA frequencies it seems most sensible to take $n_{rna}$ for the considered transcriptome into account. This naturally also balances biases that arise from different degrees of incompleteness when comparing against different annotations.

We therefore define relative frequency $rfreq$ and relative coverage $rcov$ as follows

$rfreq=\frac{o}{n_{rna}}$

and

$rcov=\frac{c}{n_{rna}}$

For single reads, our $rcov$ measurement is close to the RPKM measure (reads per kilobase per million mapped reads). The differences are that (i) not the transcribed length but only the number of different alignment positions ("slots") is taken into account ($\textrm{slots} = \textrm{length} - \textrm{readlength}$), and (ii) the RPKM measure scales the numeric space of the obtained values up by $10^9$. In order to produce measurements comparable to the currently popular RPKM values, the Flux Capacitor produces $rcov$ measures scaled by the factor explained in (ii).
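The relation between relative frequency, relative coverage, and RPKM can be sketched in Python as follows. All function names and numbers are hypothetical. The sketch assumes that relative coverage is the read coverage (observations divided by slots) over $n_{rna}$, that slots are computed as length minus read length as stated above, and that RPKM is normalized by the same number of mapped reads. It is an illustration under these assumptions, not the Flux Capacitor's code.

```python
import math

SCALE = 1e9  # factor (ii): per kilobase (1e3) times per million reads (1e6)


def slots(length, read_length):
    # Number of alignment positions as given in the text: length - readlength.
    return length - read_length


def rfreq(o, n_rna):
    # Relative frequency: observations over reads mapped to the transcriptome.
    return o / n_rna


def rcov(o, s, n_rna, scale=SCALE):
    # Relative coverage (assumed: coverage o/s over n_rna), scaled like RPKM.
    return scale * (o / s) / n_rna


def rpkm(o, length, n_mapped):
    # Standard RPKM: reads per kilobase of feature per million mapped reads.
    return o * 1e9 / (length * n_mapped)


# Hypothetical transcript: 1036 nt, 36 nt reads, 500 reads observed,
# 10 million reads mapped to the transcriptome.
length, read_length, o, n_rna = 1036, 36, 500, 10_000_000

s = slots(length, read_length)          # 1000 slots
relative_frequency = rfreq(o, n_rna)
relative_coverage = rcov(o, s, n_rna)   # uses slots in the denominator
rpkm_value = rpkm(o, length, n_rna)     # uses full length in the denominator

print(s, relative_frequency, relative_coverage, rpkm_value)
```

Because the number of slots is smaller than the feature length, the scaled relative coverage in this sketch comes out slightly higher than the RPKM value for the same counts, and the gap shrinks as features get longer relative to the read length.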

## Scope of Measurement

The previous section introduced the measures frequency and coverage, and their counterparts relative frequency and relative coverage after normalization by the size of the respective experiment. All these measures have one component in common: the number of read mappings counted as frequency. Read mappings are derived on a certain base (see below) and can be counted in different scopes with respect to a certain feature, e.g., an exon or transcript.

The Flux Capacitor considers two bases, observation (obs) and prediction (pred). Base obs is the observation after mapping to the reference annotation; it distributes reads equally among overlapping features (a trivial deconvolution algorithm, so to speak). Base pred considers the values after flow network deconvolution of the reads. Both bases are considered in three scopes: all, the number of all mapped reads that fall into the feature; split, the number of read mappings from all that are assigned to the transcript(s) listed in the transcript_id field of the feature; and unique, the subset of mappings in split that lie in regions where exactly and exclusively the transcript(s) of the transcript_id field are annotated.

Here is a toy example for the different measurements. The figure sketches a locus L with two transcripts, T1 and T2, and three exons E1, E2, and E3, of which E2 is shared by both transcripts. In total, 6 reads align within the 3 exons (splice junction mappings are not shown for simplicity). We count the following frequencies:

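| feature | transcript ID | measurement | value |
|---|---|---|---|
| E1 | T1 | obs_freq_all | 2 |
| | | obs_freq_split | 2 |
| | | obs_freq_unique | 2 |
| E2 | T1,T2 | obs_freq_all | 3 |
| | | obs_freq_split | 1.5 |
| | | obs_freq_unique | 0 |
| E3 | T2 | obs_freq_all | 1 |
| | | obs_freq_split | 1 |
| | | obs_freq_unique | 1 |
| T1 | T1 | obs_freq_all | 5 |
| | | obs_freq_split | 3.5 |
| | | obs_freq_unique | 2 |
| T2 | T2 | obs_freq_all | 4 |
| | | obs_freq_split | 2.5 |
| | | obs_freq_unique | 1 |
| L | T1,T2 | obs_freq_all | 6 |
| | | obs_freq_split | 6 |
| | | obs_freq_unique | 3 |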

By definition of the measurement, all equals split for the locus. Moreover, for the locus the unique measure counts the sum of read mappings in regions where all of the transcripts in the locus are present. Consequently, it excludes reads from regions that are unique to a single transcript or to a subset of the transcripts in the locus. For this reason, the tag unique may be misleading for a locus.
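The toy example can be reproduced with a short Python sketch of the three scopes on base obs. The `Region` class and the `obs_freq` function are hypothetical names introduced for illustration, and the equal-share rule is modeled directly from the description above. One interpretive assumption: the E2 values in the table (split 1.5, unique 0) are obtained here by evaluating E2 for one of its transcripts at a time rather than for both together.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    transcripts: frozenset  # transcripts annotated in this region
    reads: int              # read mappings that fall into the region


# The locus from the toy example: 6 reads over 3 exons.
E1 = Region("E1", frozenset({"T1"}), 2)
E2 = Region("E2", frozenset({"T1", "T2"}), 3)
E3 = Region("E3", frozenset({"T2"}), 1)


def obs_freq(regions, target):
    """Return (all, split, unique) for a feature made of `regions`,
    evaluated for the transcript set `target`."""
    target = frozenset(target)
    # all: every mapping that falls into the feature
    freq_all = sum(r.reads for r in regions)
    # split: reads shared equally among the transcripts of each region,
    # keeping only the shares that belong to the target transcripts
    freq_split = sum(
        r.reads * len(target & r.transcripts) / len(r.transcripts)
        for r in regions
    )
    # unique: mappings in regions annotated for exactly the target set
    freq_unique = sum(r.reads for r in regions if r.transcripts == target)
    return freq_all, freq_split, freq_unique


rows = {
    "E1": obs_freq([E1], {"T1"}),
    "E2": obs_freq([E2], {"T1"}),  # per-transcript view (assumption)
    "E3": obs_freq([E3], {"T2"}),
    "T1": obs_freq([E1, E2], {"T1"}),
    "T2": obs_freq([E2, E3], {"T2"}),
    "L": obs_freq([E1, E2, E3], {"T1", "T2"}),
}

for feature, (freq_all, freq_split, freq_unique) in rows.items():
    print(feature, freq_all, freq_split, freq_unique)
```

For the locus, the target set is {T1, T2}, so only E2 counts as unique, which is the behavior the last paragraph describes as potentially misleading.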