Flux Simulator Manual

# Version History

## New Features

• FLUX-9: Splice Junction extractor. The simulator can extract the sequence of splice junctions — hypothetical and real — from an input annotation, applying a certain intron model.
• FLUX-8: Command line interface — the simulator can be run in command line mode now. There are 3 flags, "-x" for simulating the expression, "-l" for simulated library construction, and "-s" for simulated sequencing. The graphical user interface can be activated with "-X".
• FLUX-6: PRO file now contains the abundance of each transcript after sequencing (column 9 and 10).
• BED file: read name identifiers now comply with the FMRD format.

## Bugfixes

• FLUX-4: biases in read sampling
• FLUX-5: parameter READ_NUMBER 0

# To Do

The list is not sorted by priority

• optimize sequencing speed

To comment on the To Do list, go to the details page.

# Usage

See the usage of the Flux Simulator interface.

# Simulating RNAseq experiments

Let me start what this program does not want to be: the simulation will not give you an identical reproduction of the processes that happen in the corresponding experiments. Hence, do not assume that the output of the program can substitute real data for purposes of training. The Flux Simulator gives an impression of differential systematic biases that can be introduced in biological experiments. It has been designed as an in silico benchmarking environment that - as close as current knowledge permits - mimics scenarios that happen in RNAseq experiments. It can also be seen as a tool to understand laboratory techniques — as such often lack data from intermediate steps — and hopefully helps to improve these. The models described here-in hopefully also get improved as more real data sets become available.

Fig.1: The 4 main steps of the Flux Simulation pipeline are depicted in different tab panes in the graphical user interface.

The Flux Simulator separates each RNAseq experiment — a so-called "run" — in 3 different steps, i.e., gene expression, library construction and sequencing. Together with the initial assumption of an reference annotation, i.e., transcript structures, the program comprises 4 steps which combinely are representing the consecutive pipeline (from left to right in Fig.1) of a run, described by a specific parameter file (usually with the extension .par, see formats).

The status of each step is indicated by a color dot, grey means the module is missing data or obligatory parameters, red means input data/parameters are ready, but the simulated experiment has not yet been performed and green means the corresponding step has been finished. "Run" and "Stop" buttons refer to the currently selected tab and initiate or end the corresponding simulation step. A typical starting point is to either load and copy data of an existing Run, or, to start creating a new project specifying location of the annotation and project files. Finally, at the bottom of the window there is a status bar which indicates the progress while simulations are carried out.

## Reading the reference annotation

Requires: REF_FILE_NAME, LOAD_CODING, LOAD_NONCODING
Outputs: PRO_FILE_NAME Column 1 (locus name), 2 (transcript name), 3 (CDS/NC) and 4 (spliced length)

The necessary first step in order to simulate the experiment is the loading of a reference annotation. Input data has to be in GTF format at the path specified by REF_FILE_NAME. Each transcript has to have "exon" features, LOAD_CODING takes into account the ones that have additionally "CDS" features, LOAD_NONCODING those which don't. Initiating the reading of the reference annotation (button "Run" in the toolbar) first causes a check whether the GTF structure is well sorted for efficiency of the subsequent operations. In case, the FLUX SIMULATOR will sort your file in the temporary directory, and subsequently a sorted copy of the file (with the suffix "_sorted" before the extension) should appear in the folder containing the project. Make sure to use sorted files instead of the original files in future runs, because file sorting can contribute a substantial part of the running time.

Upon termination of reading and parsing the annotation, you see several statistics including a histogram of spliced transcript lengths (upper panel) and a zoom-in onto the first 3 quartiles (lower panel). This step also initiates the pro file by writing the first 4 columns, i.e., splice locus ID, transcript ID, CDS/NC and spliced transcript length.

# Expression of genes and transcripts

Requires: PRO_FILE_NAME Column 1-4, NB_MOLECULES, EXPRESSION_K, EXPRESSION_X1
Outputs: PRO_FILE_NAME Column 5 (relative abundance) and 6 (molecule count), both after gene expression

The cell group of the experiment is assigned a random expression profile where not necessarily all transcripts of the reference are expressed. Expression levels $y$ are connected with the relative expression rank $x$ by a mixed power / exponential law of the general form

(1)
\begin{align} y=\left(\frac{x}{x_0}\right)^k exp^{-\left(\frac{x}{x_1}\right)-\left(\frac{x}{x_1}\right)^2} \end{align}

where $x$ denotes the rank number of a gene, $x_0$ the expression level of the highest abundant gene, and $k$ and $x_1$ are cell type and/or experiment specific parameters. In the simulation, expression ranks are uniformly random assigned to transcripts of the reference, and subsequently their expression level in number of molecules and relative abundance is determined according to the law in Formula 1. Certainly, according to the corresponding settings a more or less substantial part of the transcripts from the reference annotation will remain unexpressed in the simulated run.

After the number of RNA molecules has been determined for each transcript, in silico expressed transcripts are assigned individual variations in transcription start and the length of the attached poly-A tail. The FLUX SIMULATOR modeles differences in transcription start are modelled by random variables under an exponential model with a mean around 10nt. During poly-adenylation in the nucleus usually 200-250 adenine residues get added to the primary transcript. Disregarding other poly-adenylation mechanisms, as cytoplasmatic polyadenylation, and the exact mechanisms of degrading processes by exo- and endonucleases, our model describes poly-A lengths by randomly sampling under a Gaussian distribution with a mean of 125nt and shape adapted s.t. >99.5% of the random variables fall in the interval [0;250].

# Library construction

Requires: PRO_FILE_NAME Column 1-6, RT_PRIMER, RT_MIN, RT_MAX, FRAGMENTATION, FRAG_B4_RT, FRAG_MODE, FRAG_LAMBDA, FILTERING, FILT_MIN, FILT_MAX
Outputs: LIB_FILE_NAME

## Reverse transcription

This step simulates the reverse transcription (RT) and an optional fragmentation or filtering process which can influence the outcome of the run substantially. You may either choose a random priming strategy or poly-dT primers (RT_PRIMER), but you have to provide the maximum and minimum length of the expected RT products (RT_MIN, RT_MAX) — which certainly are a function of the specific RT protocol (enzymes and timing) applied. If the RT protocol is POLY-DT primed, a single successful priming event on each RNA molecule is assumed — which probably does not reflect reality, but prevents from a statistical even spread of molecule loss. In the case of RANDOM priming, the number of successful priming events, i.e., primers that recruit a reverse transcriptase, for a certain $RNA_x$ molecule is drawn from sampling poisson distribution with mean

(2)
\begin{align} \mu= \frac{n* length(RNA_x)}{\sum_{i=1}^n length(RNA_i)} \end{align}

Eq.(2) compares $length()$ of the considered molecule $RNA_x$ to the average length of the RNA molecules in the reaction. To prevent the loss, especially of shorter transcripts, that may incur due to molecule numbers that are lower than in real experiments, at least one successful priming event is enforced on every RNA molecule. The length of the generated RNA molecule then is determined by an uniform randomly distributed variable $U= [0;1[$.

(3)
$$RT_{min} + U * (RT_{max}- RT_{min})$$

In the case of multiple priming events are extended along the same RNA molecule, upstream primed first strand cDNA synthesis can displace further downstream bound primers. The chance of for a DNA-RNA hybrid to be displaced is most likely a function of its distance $dist$ to the closest upstream priming event. The FLUX SIMULATOR currently decides on the displacement of overlapping RT extensions in a Bernoulli trial. To be specific, downstream primers get displaced if for a uniformly drawn random variable $U=[0;1[$ holds

(4)
\begin{align} U> \frac{dist}{RT_{min}}- 1 \end{align}

## Fragmentation and filtering

Next, an optional fragmentation process (nebulization, hydrolysis, adaptive focusing acoustics, …) may be comprised in the simulation. In general the simulation distinguishes 2 different mechanisms of RNA degradation, a mechanical/physical breaking process (PHYSICAL) and cleavage that is less dependant on physical properties (CHEMICAL). Choice of the fragmentation nature will influence the distribution of fragment lengths after the simulated fragmentation. Furthermore, also dependant on the adopted method, you should provide a realistic estimation of the maximum molecule length (FRAG_LAMBDA) that is not broken in the applied protocol. For instance, ~500nt cDNA molecules are known to be problematic to break with usual nebulization strategies. Naturally, this will mark about the upper limit of the fragment distribution yielded.

Simulation of fragmentation is an iterative process, where in each round a fragment is assigned a certain breaking probability, for PHYSICAL fragmentation

(5)
\begin{align} P_b= 1-exp^{-\frac{length(cDNA)}{FRAG\_LAMBDA}} \end{align}

and for CHEMICAL fragmentation

(6)
\begin{align} P_b= 1-(length(cDNA)- \lambda)^{-2} \end{align}

On the occurrences of breaks is decided in Bernoulli trials, the location of the respective breakpoint is normally distributed around the middle of the molecule (PHYSICAL), respectively uniformly distributed along the molecule (CHEMICAL). Finally, you specify whether the fragmentation step is carried out after or before the reverse transcription from RNA to DNA. Finally, in some protocols there is a step after RT and fragmentation that filters the generated cDNA fragments by size (FILTERING). If so, provide the minimum , (FILT_MIN) and the maximum (FILT_MAX) length of the fragments you want to retain for the sequencing.

When started ("Run" in the toolbar), the FLUX SIMULATOR will show you the number of initial molecules, as well as the number of molecules after each of the steps you programmed. In the gel, you can follow the length distribution of your molecules during the multi-step process. Additionally, a summary of the distribution of subsequently generated reads along the original transcript molecules is shown by 3x3 plots (so, in total 9 plots): 3 for short (<1500 nt), 3 for medium (<3000 nt, and longer than 1500, clearly), 3 for long (>=3000 nt) transcripts, and 3 for low (up to ~15 molecules/cell), 3 for medium (>15 and up to ~500 molecules per cell) and 3 for highly expressed ones (its still only 9 plots, check in the program).

Please note that the final number of molecules you obtain provides an upper limit on your sequencing capacity, as over sampling a small amount of molecules will not enlarge the diversity in the produced reads — it means, if you would produce a 1000 reads from 10 molecules left after RT/fragmentation, you will find groups of about 100 that map to identical locations. Upon termination the step copies the .frg file from the temporary directory to the project directory and updates column 7 and 8 of the .PRO file.

# Sequencing

Requires: LIB_FILE_NAME, READ_LENGTH, READ_NUMBER, PAIRED_END, FASTQ, GEN_DIR, ERR_FILE_NAME, QTHOLD
Outputs: SEQ_FILE_NAME

This step produces about READ_NUMBER sequencing reads from the library in LIB_FILE_NAME. The simulator iterates the input annotation and maps the READ_LENGTH long stretches from the ends of cDNA molecules in the library LIB_FILE_NAME to genomic coordinates. In the case of READ_LENGTH exceeds the length of a cDNA molecule, the read is truncated to the length of the molecule. For each fragment a Bernoulli trial is carried out, by $r<p$ with $r$ being a uniformly sampled random variable in the boundaries $[0;1[$ compared to the sequencing probability

(7)
\begin{align} p= \frac{READ\_NUMBER}{LIBRARY\_NUMBER} \end{align}

By this, never more reads (respectively, read pairs) are generated than there are LIBRARY_NUMBER cDNA molecules. In the case of single end sequencing, randomly one end of the cDNA molecule that succeeded the Bernoulli trial is sequenced, and if PAIRED_END is set, correspondingly both ends are sequenced. The FLUX SIMULATOR shows you the number of reads and their fraction (relative to the planned number), the number of splicing loci represented in these reads (and the ratio they constitute of the total number of expressed loci), and the number of transcripts (ratio of total expressed spliceforms, respectively).

The default output formad is BED, which describes the genomic region of the read. Reads that fall partially into the poly-A tail are truncated to their respective content of genomic sequence. Reads that fall completely into the poly-A tail and that are sequenced receive poly-A as a special sequence name in the generated BED file. The tag name is composed according to the description of the FMRD (Flux Mapped Read Descriptor) convention. Obviously, multi-map information is not provided and has to be obtained by a subsequent alignment of the reads. If the genomic sequence is provided, additionally FASTA/FASTQ sequences can be produced (see sequencing error models). The corresponding tag equals the BED name field plus the additional information about the genomic alignment, i.e., field 1,2,3,6,11 and 12 of the BED format.

Examples:
Here an example for a BED line that represents a spliced read

chr1    2082    2503    chr1:1116-4272W:uc009vip.1:105:2772:695:1003:968:1003|P2 0    -    0    0    0,0,0 2 8,28    0,393


The complete region of the read spans from 2083 (note the 0-base in BED format) to position 2503 (which is the first excluded position in BED format and therefore directly translates to the last included position in a 1-based coordinate system) on the reference sequence chr1. The the read alignment is split in two parts, one from 2083 to 2083+8-1=2090, and the other one from 2083+393=2476 to 2476+28-1=2502. The name field denotes that the read has been the downstream mate P2 of a read pair, derived from the 105th transcript copy of the annotated uc009vip.1 structure (which has spliced length 2772) in splicing locus chr1:1116-4272W. The fragment of this transcript that has been sequenced starts at position 695 and ends at position 1003 in the spliced sequence, relative to the annotated transcription start. From this fragment, the subarea 968-1003 relative to the annotated transcription start has generated the read.

This bed line is translated to FASTA format as:

>chr1:1116-4272W:uc009vip.1:105:2772:695:1003:968:1003|P2|chr1:2082:2503:-:8,28:0,393
GAAGGGCATGCCTGGCATCACCACACACTGGCCTAG


[[/code]]
where the tag corresponds to the name field (number 4) in the BED format and the information about its genomic location as by BED coordinates concatenated from fields 1,2,3,6,11 and 12.

## Sequence output and error

Given a directory with genomic sequences split by chromosome GEN_DIR, FLUX SIMULATOR provides the possibility to additionally output the read sequences in FASTA or FASTQ format. If no error model ERR_FILE_NAME is provided, read sequences are an exact copy of the genomic sequence. Sequences of reads that are sequenced in antisense to the cDNA molecule are reverse complemented. Parts of the read that fall into the poly-A tail are correspondingly filled with a, respectively t characters whenever the read is produced in antisense direction. The read identifiers are unique tags, composed of locus, transcript and fragment information from which they have been derived.

Additionally, error models may be provided. Currently only position-specific errors are supported, see the original discussion on that topic. In short, positional error models base on the simple idea that "problems" that are observed in a certain region of the generated sequence cluster accross multiple reads of the output as they have been influenced by a common temporal and spatial problem during sequencing. Such errors can be estimated empirically after aligning the sequences of a run to the genomic reference sequence and identifying mismatching positions. Especially suited for obtaining such estimations are alignment programs that are not affected by strict limitations on the number of mismatch and hence allow for more complete pictures. The FLUX SIMULATOR provides an automatic error model estimation for alignments output by the GEM aligner. Error models are stored in form of ERR files.

Fig.2: Description of the positional error model in the FLUX SIMULATOR.

A: in order to enable fastaQ output (C), you need an error model. The model can be estimated either from an alignment or already be established in form of an .err file (format to described elsewhere).

B: If you parse the error profile from an alignment, you may provide a quality threshold on bases that are considered as unproblematic. Every basecall below this threshold is considered as part of an "accident" and correspondingly included in one of the error models.

C: once you have an error model you can activate fastaq output additonal to the bed files

D: The error model splits "accidents" according to their start and extension in the time scale of the experiment. E.g., an accident that is observed from cycle 5 to 7 has length 3, a different "accident" of length 3 could be observed from cycle 30 to 32. These "accidents" may overlap, so the accident of (start,extension) = (5,7) is separated from an accident (5,8).

E: The error models measures the substitution rate for each nucleotide with each other, respectively with "N" (black) split accross the different quality levels. If no substitution rate could be estimated for a certain quality, it gets interpolated from its clostest datapoints.

F: distribution of accident locations along the readlength. Red are cumulative starts, blue are cumulative ends and black shows all accidents overlapping a certain position.

G: distribution of qualities along the read. Red are the minimum, yellow the 1st quartile, green the median, blue the 3rd quartile and black the maximum value that has been recorded for each read position.

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License