Expression of genes and transcripts

Requires: PRO_FILE_NAME Columns 1-4, NB_MOLECULES, EXPRESSION_K, EXPRESSION_X1
Outputs: PRO_FILE_NAME Columns 5 (relative abundance) and 6 (molecule count), both after gene expression

The cell group of the experiment is assigned a random expression profile in which not all transcripts of the reference are necessarily expressed. Expression levels $y$ are connected with the relative expression rank $x$ by a mixed power / exponential law of the general form

(1)
\begin{align} y=\left(\frac{x}{x_0}\right)^k \exp\left(-\frac{x}{x_1}-\left(\frac{x}{x_1}\right)^2\right) \end{align}

where $x$ denotes the rank number of a gene, $x_0$ the expression level of the most abundant gene, and $k$ and $x_1$ are cell type and/or experiment specific parameters. In the simulation, expression ranks are assigned uniformly at random to the transcripts of the reference, and subsequently their expression levels, as number of molecules and as relative abundance, are determined according to the law in Formula 1. Depending on the corresponding settings, a more or less substantial part of the transcripts from the reference annotation will remain unexpressed in the simulated run.
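The rank assignment and Formula 1 can be sketched in a few lines of Python. This is a minimal illustration, not the FLUX SIMULATOR's actual code: the function names are hypothetical, `x0` is passed explicitly, and converting relative abundance to molecule counts by rounding against NB_MOLECULES is an assumption made here for simplicity.

```python
import math
import random


def expression_level(x, x0, x1, k):
    """Formula 1: mixed power / exponential law for expression rank x (x >= 1)."""
    r = x / x1
    return (x / x0) ** k * math.exp(-r - r * r)


def simulate_expression(transcript_ids, nb_molecules, k, x0, x1, seed=None):
    """Return {transcript_id: (relative_abundance, molecule_count)}.

    Hypothetical helper: ranks are assigned uniformly at random by shuffling,
    and molecule counts are obtained by rounding (an assumption of this sketch).
    """
    ids = list(transcript_ids)
    if not ids:
        return {}
    rng = random.Random(seed)
    rng.shuffle(ids)  # position i in the shuffled list gets rank i + 1

    weights = [expression_level(rank, x0, x1, k)
               for rank in range(1, len(ids) + 1)]
    total = sum(weights)

    profile = {}
    for tid, w in zip(ids, weights):
        rel = w / total                       # relative abundance (column 5)
        count = int(round(rel * nb_molecules))  # molecule count (column 6)
        profile[tid] = (rel, count)
    return profile
```

With a negative $k$, low ranks receive the highest levels, and transcripts whose rounded molecule count is 0 correspond to the unexpressed part of the reference annotation.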

After the number of RNA molecules has been determined for each transcript, in silico expressed transcripts are assigned individual variations in transcription start and in the length of the attached poly-A tail. The FLUX SIMULATOR models differences in transcription start by random variables under an exponential model with a mean of around 10 nt. During polyadenylation in the nucleus, usually 200-250 adenine residues are added to the primary transcript. Disregarding other polyadenylation mechanisms, such as cytoplasmic polyadenylation, and the exact mechanisms of degradation by exo- and endonucleases, our model describes poly-A lengths by random sampling from a Gaussian distribution with a mean of 125 nt and a shape adapted such that >99.5% of the random variables fall in the interval [0;250].
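The two sampling steps above can be illustrated as follows. This is a rough sketch under stated assumptions: the function names are hypothetical, the standard deviation is derived so that exactly 99.5% of an untruncated Gaussian lies in [0;250] (the text only requires more than 99.5%), and values outside the interval are redrawn, which the description above does not specify.

```python
import random
from statistics import NormalDist


def sample_tss_offset(rng, mean_nt=10.0):
    """Transcription start variation in nt, exponential with the given mean."""
    # random.expovariate expects the rate lambda = 1 / mean
    return int(round(rng.expovariate(1.0 / mean_nt)))


def polya_sigma(mean_nt=125.0, upper=250.0, coverage=0.995):
    """Standard deviation such that `coverage` of the mass lies in [2*mean - upper, upper]."""
    # two-sided coverage -> upper quantile at (1 + coverage) / 2
    z = NormalDist().inv_cdf((1.0 + coverage) / 2.0)
    return (upper - mean_nt) / z


def sample_polya_length(rng, mean_nt=125.0, lower=0.0, upper=250.0):
    """Poly-A tail length in nt, Gaussian around mean_nt, restricted to [lower, upper]."""
    sigma = polya_sigma(mean_nt, upper)
    while True:
        value = rng.gauss(mean_nt, sigma)
        if lower <= value <= upper:  # assumption: redraw the rare outliers
            return int(round(value))


rng = random.Random(42)
tss_offsets = [sample_tss_offset(rng) for _ in range(10000)]
polya_lengths = [sample_polya_length(rng) for _ in range(10000)]
```

With these settings the standard deviation comes out at roughly 44.5 nt; a smaller value would satisfy the ">99.5%" condition with more margin.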