Sequence output and error

Given a directory GEN_DIR that contains the genomic sequences split by chromosome, the FLUX SIMULATOR can additionally output the read sequences in FASTA or FASTQ format. If no error model ERR_FILE_NAME is provided, read sequences are an exact copy of the genomic sequence. Sequences of reads that are sequenced in antisense to the cDNA molecule are reverse complemented. Parts of a read that fall into the poly-A tail are filled with a characters, or with t characters when the read is produced in antisense direction. The read identifiers are unique tags composed of the locus, transcript and fragment information from which the reads have been derived.
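The rules above can be illustrated with a minimal Python sketch. It is not taken from the FLUX SIMULATOR sources: the function names, the 0-based half-open coordinates and the convention that positions past the transcript end belong to the poly-A tail are all assumptions made for this example.

```python
# Illustrative sketch only; names and coordinate conventions are assumptions,
# not part of the FLUX SIMULATOR code base.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def read_sequence(transcript_seq, start, end, antisense):
    """Return the read covering [start, end) of a cDNA molecule (0-based).

    Positions at or beyond len(transcript_seq) are assumed to lie in the
    poly-A tail and are filled with 'a' characters.
    """
    body = transcript_seq[start:min(end, len(transcript_seq))]
    tail_len = max(0, end - max(start, len(transcript_seq)))
    sense = body + "a" * tail_len
    # Antisense reads are reverse complemented, which turns the 'a' fill into 't'.
    return reverse_complement(sense) if antisense else sense
```

For a 6-base transcript and a read spanning positions 4 to 9, the sense read ends in three a characters, while the antisense read starts with three t characters.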

Additionally, error models may be provided. Currently only position-specific errors are supported; see the original discussion on that topic. In short, positional error models are based on the simple idea that "problems" observed in a certain region of the generated sequence cluster across multiple reads of the output, because these reads were affected by a common temporal and spatial problem during sequencing. Such errors can be estimated empirically by aligning the sequences of a run to the genomic reference sequence and identifying mismatching positions. Alignment programs that do not impose strict limits on the number of mismatches are especially suited for such estimates, because they give a more complete picture. The FLUX SIMULATOR provides an automatic error model estimation for alignments output by the GEM aligner. Error models are stored in the form of ERR files.
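The empirical idea of counting mismatching positions can be sketched as follows. This is a simplified illustration, not the estimation the FLUX SIMULATOR performs on GEM output: it assumes ungapped alignments that are already available as (read, reference) string pairs, and the function name is hypothetical.

```python
# Simplified illustration; assumes ungapped (read, reference) pairs and does
# not parse GEM alignment files. The function name is hypothetical.
def positional_mismatch_rates(pairs):
    """Return the fraction of mismatching bases for every read position."""
    mismatches = []
    totals = []
    for read, ref in pairs:
        for i, (read_base, ref_base) in enumerate(zip(read, ref)):
            if i == len(totals):
                # First time this read position is seen: open a new counter.
                totals.append(0)
                mismatches.append(0)
            totals[i] += 1
            if read_base.upper() != ref_base.upper():
                mismatches[i] += 1
    return [m / t for m, t in zip(mismatches, totals)]
```

Positions at which the rate rises well above the rest of the read are candidates for the clustered "problems" that a positional error model tries to capture.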

pem.png
Fig.2: Description of the positional error model in the FLUX SIMULATOR.

A: In order to enable FASTA/FASTQ output (C), you need an error model. The model can either be estimated from an alignment or be provided as an existing .err file (format described elsewhere).

B: If you parse the error profile from an alignment, you may provide a quality threshold that separates unproblematic bases. Every base call below this threshold is considered part of an "accident" and is accordingly included in one of the error models.

C: Once you have an error model, you can activate FASTA/FASTQ output in addition to the BED files.

D: The error model splits "accidents" according to their start and extension on the time scale of the experiment. For example, an accident that is observed from cycle 5 to 7 has length 3; a different "accident" of length 3 could be observed from cycle 30 to 32. These "accidents" may overlap, so the accident of (start,extension) = (5,7), i.e. cycles 5 to 7, is kept separate from an accident (5,8).

E: The error model measures the rate at which each nucleotide is substituted by each other nucleotide, or by "N" (black), split across the different quality levels. If no substitution rate could be estimated for a certain quality, it is interpolated from its closest data points.

F: Distribution of accident locations along the read length. Red shows cumulative starts, blue shows cumulative ends, and black shows all accidents overlapping a certain position.

G: Distribution of qualities along the read. Red is the minimum, yellow the 1st quartile, green the median, blue the 3rd quartile and black the maximum value that has been recorded for each read position.
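Panels B, D and F can be illustrated with a small Python sketch. It is only an approximation of the idea under stated assumptions: an "accident" is taken to be a maximal run of base calls with quality below the threshold, cycles are numbered from 1, accidents are keyed by (start, length), and all function names are hypothetical rather than taken from the FLUX SIMULATOR.

```python
from collections import Counter


# Sketch under assumptions: an accident is a maximal run of qualities below
# the threshold; cycles are 1-based; names are hypothetical.
def find_accidents(qualities, threshold):
    """Return a (start, length) tuple for every low-quality run in one read."""
    accidents = []
    run_start = None
    for cycle, quality in enumerate(qualities, start=1):
        if quality < threshold:
            if run_start is None:
                run_start = cycle
        elif run_start is not None:
            accidents.append((run_start, cycle - run_start))
            run_start = None
    if run_start is not None:
        # The run extends to the last cycle of the read.
        accidents.append((run_start, len(qualities) - run_start + 1))
    return accidents


def accident_profile(reads_qualities, threshold):
    """Count how often each (start, length) accident occurs across reads."""
    counts = Counter()
    for qualities in reads_qualities:
        counts.update(find_accidents(qualities, threshold))
    return counts


def overlap_per_cycle(counts, read_length):
    """Number of accidents overlapping each cycle (the black curve in F)."""
    covered = [0] * read_length
    for (start, length), n in counts.items():
        for cycle in range(start, start + length):
            covered[cycle - 1] += n
    return covered
```

With this keying, an accident covering cycles 5 to 7 is stored as (5, 3) and one covering cycles 5 to 8 as (5, 4), so the two are counted separately even though they overlap.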
