Formats

Read Descriptors

The BARNA Descriptor

The BARNA (Barcelona Attributes for RNA-Seq) descriptor is a proposal for annotating unique read IDs with additional information obtained from special experiments, e.g., read mates derived from the same cDNA molecule, information about the orientation of the original transcript sequence, etc. It replaces the FMRD descriptor proposition for the Flux Capacitor. BARNA expects all attributes as flags, i.e., a pre-defined set of characters, which are recognized in the read suffix after the last '/' character of the read ID. Currently defined flags are:

Flag    Meaning                                         Context
1       first mate of a read pair                       paired-end sequencing
2       second mate of a read pair                      paired-end sequencing
s       read in sense to transcription orientation      strand-specific RT
a       read in antisense to transcription orientation  strand-specific RT

Example:

chr1    10041   10109   BILLIEHOLIDAY:6:90:1240:1493/2a 1       +       0       0       0,0,0   1       68
chr1    10105   10181   TUPAC:7:103:90:14/1a    10      -       0       0       0,0,0   1       76      0
chr1    10138   10206   TUPAC:7:117:1290:277/2s 2       +       0       0       0,0,0   1       68      0
chr1    10223   10294   TUPAC:8:48:251:1564/1  10      -       0       0       0,0,0   1       71      0

These read identifiers have been retrieved from an experiment with strand-specific reverse transcription (RT) and paired-end sequencing.
1st line: the read is the (non-ordered) second mate of a read pair with the common and unique ID BILLIEHOLIDAY:6:90:1240:1493. Moreover, it is known that the read aligns in antisense to the transcription directionality and to the + strand of the genome (i.e., the so-called "Watson strand"). Thus, the transcript sequence it has been derived from stems from the - strand of the genomic sequence ("Crick strand"), and, given the mate orientations produced by current paired-end technologies, the mating read BILLIEHOLIDAY:6:90:1240:1493/1s should map to the - strand of the reference genomic sequence.
2nd line: first mate of read pair TUPAC:7:103:90:14, which aligns to the negative strand. Its mate TUPAC:7:103:90:14/2s should align to the Watson strand.
3rd line: second mate on a transcript that has been transcribed from the + strand of the genomic sequence.
4th line: first mate of read pair TUPAC:8:48:251:1564, which aligns to the - strand of the genome and for which the information about transcription directionality has been lost (this can happen). If the strandedness information has been preserved in its mate TUPAC:8:48:251:1564/2, it can be recovered.
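The flag rules above can be illustrated with a minimal Python sketch. The function name parse_barna and its return shape are assumptions made for this illustration only; they are not part of the Flux Capacitor.

```python
def parse_barna(read_id):
    """Split a read ID into (common ID, mate number or None, strand flag or None).

    Hypothetical helper: flags are read from the suffix after the last '/'.
    """
    base, sep, suffix = read_id.rpartition("/")
    if not sep:
        # No '/' at all: the ID carries no BARNA flags.
        return read_id, None, None
    mate = strand = None
    for flag in suffix:
        if flag in "12":
            mate = int(flag)      # 1 = first mate, 2 = second mate
        elif flag in "sa":
            strand = flag         # s = sense, a = antisense
        else:
            raise ValueError(f"unknown BARNA flag {flag!r} in {read_id!r}")
    return base, mate, strand


print(parse_barna("BILLIEHOLIDAY:6:90:1240:1493/2a"))
print(parse_barna("TUPAC:8:48:251:1564/1"))
```

For the fourth example line, the strand flag comes back as None, which mirrors the case where the directionality information has been lost.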

Regular Expressions for Descriptors

Regular expressions allow for a flexible description of attribute retrieval from generic read descriptors. A summary of the regular expression system adopted in the Flux Capacitor can be found there.

Example 1

chr1    10033   10109   PAN:2:27:1091:1987/2

The example shows a standard Illumina/Solexa identifier for paired sequencing, where /1 indicates the first mate of a pair and /2 the second mate. A corresponding regular expression would target the suffix by

/([12])$

Example 2

chr1    10033   10109   PAN:2:27:1091:1987/2_strand1

Here, /2 indicates the 2nd mate, and '_strand1' additionally indicates read orientation in sense to the transcription directionality. Such an identifier could be described by the regular expression

/([12])_strand([12]??)$
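How such patterns extract attributes can be tried out with Python's re module. This is only a sketch: the regular expression dialect of the Flux Capacitor may differ in details, and the helper name extract as well as the two pattern constants are assumptions for this illustration.

```python
import re

# Pattern targeting only the mate number in the suffix (as in Example 1).
MATE_ONLY = re.compile(r"/([12])$")
# Pattern capturing mate number and strand symbol (as in Example 2).
MATE_AND_STRAND = re.compile(r"/([12])_strand([12])$")


def extract(pattern, read_id):
    """Return the captured groups, or None if the identifier does not match."""
    match = pattern.search(read_id)
    return match.groups() if match else None


print(extract(MATE_ONLY, "PAN:2:27:1091:1987/2"))
print(extract(MATE_AND_STRAND, "PAN:2:27:1091:1987/2_strand1"))
```

An identifier without the '_strand' part does not match the second pattern, so extract returns None for it.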

Note: exactly two symbols are expected for identifying paired mates as well as sense/anti-sense orientation. In the case the information is optional, add

Standard file formats

BED format

The BED (Browser Extensible Data) format has been developed by UCSC for displaying transcript structures in the genome browser; a full description of the format can be found there.

The following fields are used by the Flux Capacitor and the Flux Simulator.

Field number  Name         Description
1             chrom        the name of the chromosome or scaffold
2             chromStart   start position, first included position, in chromosomal coordinates, 0-based
3             chromEnd     end position, first excluded position, in chromosomal coordinates, 0-based
4             name         read identifier in FMRD format
6             strand       the genome strand the read aligns to
10            blockCount   number of blocks, for spliced reads which are subdivided in blocks
11            blockSizes   extension of each block, comma-separated list
12            blockStarts  start of each block, relative to chromStart, 0-based

Note: the sanity check $chromStart + blockStarts_k + blockSizes_k = chromEnd$ must always hold, with $k$ being the last (or the only) element of the block vector.

Examples

chr1    1723    1759    chr1:1116-4272W#uc009vip.1#45:234|P1    0    +    0    0    0,0,0    1    36    0
chr1    1921    1957    chr1:1116-4272W#uc009vip.1#45:234|P2    0    -    0    0    0,0,0    1    36    0
chr1    2062    2483    chr1:1116-4272W#uc009vip.1#46:216|P1    0    +    0    0    0,0,0    2    28,8    0,413
chr1    2627    2663    chr1:1116-4272W#uc009vip.1#46:216|P2    0    -    0    0    0,0,0    1    36    0
chr1    856324    861042    chr1:850393-869824W#uc001abw.1#2:224|P1    0    +    0    0    0,0,0    2    8,28    0,4690
chr1    864337    864518    chr1:850393-869824W#uc001abw.1#2:224|P2    0    -    0    0    0,0,0    2    35,1    0,180
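The sanity check on the block fields can be verified mechanically. The following Python sketch assumes whitespace-separated 12-column lines like the examples above; the function name bed_blocks_consistent is hypothetical.

```python
def bed_blocks_consistent(line):
    """Check chromStart + blockStarts_k + blockSizes_k == chromEnd for the last block k."""
    fields = line.split()
    chrom_start, chrom_end = int(fields[1]), int(fields[2])
    block_count = int(fields[9])
    # Tolerate a trailing comma in the lists by skipping empty tokens.
    sizes = [int(x) for x in fields[10].split(",") if x]
    starts = [int(x) for x in fields[11].split(",") if x]
    if block_count < 1 or len(sizes) != block_count or len(starts) != block_count:
        return False
    return chrom_start + starts[-1] + sizes[-1] == chrom_end


spliced = "chr1 2062 2483 chr1:1116-4272W#uc009vip.1#46:216|P1 0 + 0 0 0,0,0 2 28,8 0,413"
print(bed_blocks_consistent(spliced))  # 2062 + 413 + 8 == 2483
```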

Go to the format's detail page to see ongoing discussions.

GTF format

The GTF (Gene Transfer Format) has been developed to facilitate the exchange of genome annotations (i.e., transcripts aligned to the genome) in human readable flat files. A complete description is available for instance at the Washington University or at UCSC.

The standard format description requires 8 mandatory, tab-separated fields. These are followed by a list of optional attributes with the structure

key "value"; key2 "value2"; ...

Attention: file sorting that may be triggered by the Flux Capacitor or Simulator expects a consistent order of the first 8 fields AND of the attribute transcript_id across lines. Concretely, the key transcript_id is always expected in the same field (i.e., column) in which it is found in the first line.

Note: in contrast to the general format description, the Flux Capacitor and the Flux Simulator crucially depend on the attribute transcript_id, which has to be unique on the chromosome on which a given transcript has been annotated (as met by the UCSC standard). The attribute gene_id is not necessary, as both programs perform an intrinsic clustering of transcripts into loci, i.e., spliceforms that overlap on the same strand. Further, each transcript requires at least one exon feature. Additional CDS features are optional and mark the corresponding transcript as coding.
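To check a file against the ordering requirement before running either program, something along the following lines may help. It is a sketch under assumptions: the function names are hypothetical, and "same column" is interpreted here as the same position within the attribute list.

```python
import re

# One attribute: key, whitespace, quoted value, terminating semicolon.
ATTRIBUTE = re.compile(r'(\S+)\s+"([^"]*)"\s*;')


def parse_gtf_line(line):
    """Return (the 8 mandatory fields, list of (key, value) attribute pairs)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a GTF line needs 8 tab-separated mandatory fields")
    attributes = ATTRIBUTE.findall(fields[8]) if len(fields) > 8 else []
    return fields[:8], attributes


def transcript_id_index(attributes):
    """Position of transcript_id within the attribute list, or None if absent."""
    keys = [key for key, _ in attributes]
    return keys.index("transcript_id") if "transcript_id" in keys else None


def consistent_transcript_id(lines):
    """True if transcript_id is present at the same position on every line."""
    positions = {transcript_id_index(parse_gtf_line(l)[1]) for l in lines}
    return len(positions) == 1 and None not in positions


a = 'chr1\tsrc\texon\t100\t200\t.\t+\t.\ttranscript_id "tx1"; gene_id "g1";'
b = 'chr1\tsrc\texon\t300\t400\t.\t+\t.\tgene_id "g1"; transcript_id "tx1";'
print(consistent_transcript_id([a, a]), consistent_transcript_id([a, b]))
```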

Flux Capacitor

The Flux Capacitor reads reference annotations in GTF format, considering exclusively exon features and transcript_id attributes. Adjacent exons, i.e. exon pairs with the same transcript_id of which one exon starts at the position directly after the end of the other exon, are merged into one exon. This may result in a lower number of exon feature lines in the output. Intrinsically, all transcript models are clustered into loci, i.e., transcripts that overlap on the same strand in the genomic region they align to. The attribute gene_id is adjusted to a UCSC genome browser compatible string chr:start-end and an additional character, W respectively C, indicating the strand.
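The merging of adjacent exons and the construction of the locus identifier can be sketched as follows. This is an illustration only, assuming inclusive GTF coordinates and that W denotes the + strand; the function names are hypothetical and do not reflect the actual implementation.

```python
def merge_adjacent_exons(exons):
    """Merge exons of ONE transcript where one starts directly after the other ends.

    exons: iterable of (start, end) tuples with inclusive coordinates.
    """
    merged = []
    for start, end in sorted(exons):
        if merged and start == merged[-1][1] + 1:
            # Directly adjacent: extend the previous exon instead of adding a new one.
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


def locus_id(chrom, start, end, strand):
    """Build a 'chr:start-end' string followed by W (+ strand) or C (- strand)."""
    return f"{chrom}:{start}-{end}{'W' if strand == '+' else 'C'}"


print(merge_adjacent_exons([(100, 200), (201, 300), (400, 500)]))
print(locus_id("chr1", 1116, 4272, "+"))
```

In the first call, three input exons yield two output exons, which is why the output may contain fewer exon lines than the input.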

In the output, the Flux Capacitor writes the following features

Feature Description
exon Standard GTF-compliant exon feature.
as_event Alternative Splicing events, compliant with the AStalavista definition. For a description of attributes used for as_event features, see the AStalavista GTF format description.
transcript A spliceform as annotated in the reference annotation.
locus Spliceforms that overlap on the same strand form a splicing locus.
segment A segment, (part of) an exon.
junction A junction, joining two adjacent exon segments or a splice-junction joining two exon segments across an intron.

The Flux Capacitor can output multiple abundance measures, named in a 3-token format separated by "_" (underscore). The 3 tokens denote the base, the resolution, and the type of the measurement.

Base Description
obs Measurement is based on the number of reads that are observed to map to the feature according to the rules described in mapping.
pred Measurement is based on the number of reads that are predicted for the feature after flow network decomposition.
Resolution Description
all all reads that overlap the area of the feature (i.e., that fall into one of the possible slots) are taken into account for the measurement.
split reads that stem from the spliceforms listed in the transcript_id attribute of the feature are taken into account for the measurement.
uniq exclusively reads that fall into unique regions of the spliceforms listed in the transcript_id attribute are taken into account for the measurement.
Measurement Description
freq the absolute number of read(-pairs) that map to the feature.
rfreq the relative frequency of the feature, i.e., the fraction of read mappings from all read mappings in the experiment.
rcov the relative coverage, i.e., read mappings freq divided by the number of distinct read mappings in the feature. See the description of the slots attribute below.

For the calculation of the coverage measures, the Flux Capacitor counts the number of distinct read locations ("slots") in the feature with respect to the corresponding resolution. These numbers are included in the output as the 2-token attributes slots_all, slots_split, and slots_uniq.

Examples

chrX myGenes transcript 123 455 . + . transcript_id "myTranscript"; slots_all "297"; slots_split "297"; slots_uniq "0"; obs_all_reads "69"; obs_split_reads "31"; obs_uniq_reads "0"; obs_all_rfreq "1.3467e-5"; obs_split_rfreq "6.051e-6"; obs_uniq_rfreq "0"; obs_all_rcov "45.3434"; obs_split_rcov "20.0372"; obs_uniq_rcov "0"; pred_all_reads "73"; pred_split_reads "62"; pred_uniq_reads "3"; pred_all_rfreq "1.4248e-5"; pred_split_rfreq "1.2102e-5"; pred_uniq_rfreq "5.8554e-7"; pred_all_rcov "40.9731"; pred_split_rcov "40.744"; pred_uniq_rcov "1.9715";

Attribute Value Description
slots_all 297 number of different mappings to the feature
slots_split 297 subset of slots_all that map to the spliceforms listed in transcript_id that fall into the feature
slots_uniq 0 subset of slots_split that map exclusively to the spliceforms listed in transcript_id
obs_all_reads 69 read mappings that fall into exonic areas of transcript myTranscript
obs_split_reads 31 for each segment, the number of obs_all_reads is divided by the number of transcripts there. The sum of these fractions forms the split reads. Use this to assess the difficulty of decomposition.
obs_uniq_reads 0 reads that align in unique regions of the transcript
obs_all_rfreq 1.3467e-5 obs_all_reads divided by the number of read mappings
obs_split_rfreq 6.051e-6 obs_split_reads divided by the number of read mappings
obs_uniq_rfreq 0 obs_uniq_reads divided by the number of read mappings
obs_all_rcov 45.3434 obs_all_rfreq divided by slots_all
obs_split_rcov 20.0372 obs_split_rfreq divided by slots_split
obs_uniq_rcov 0 obs_uniq_rfreq divided by slots_uniq
pred_all_reads 73 pred_split_reads plus reads that have been predicted for other transcripts in overlapping segments
pred_split_reads 62 reads that have been predicted for the transcript. Use this value in combination with obs_reads to assess the reliability of the prediction.
pred_uniq_reads 3 reads that have been predicted in unique segments. Use this value in combination with obs_uniq_reads to assess the reliability of the prediction.
pred_all_rfreq 1.4248e-5 pred_all_reads divided by the number of read mappings. Use this value in combination with obs_all_rfreq to assess the reliability of the prediction.
pred_split_rfreq 1.2102e-5 pred_split_reads divided by the number of read mappings. Use this value to compare the same transcript in different experiments.
pred_uniq_rfreq 5.8554e-7 pred_uniq_reads divided by the number of read mappings. Use this value in combination with obs_uniq_rfreq to assess the reliability of the prediction.
pred_all_rcov 40.9731 pred_all_rfreq divided by slots_all.
pred_split_rcov 40.744 pred_split_rfreq divided by slots_split. Use this value to compare different transcripts across different experiments.
pred_uniq_rcov 1.9715 pred_uniq_rfreq divided by slots_uniq. Use this value in combination with obs_uniq_rcov to assess the reliability of the prediction.
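When post-processing the output, the 3-token attribute names can be split back into base, resolution, and measurement. The sketch below is an assumption-laden helper (the name split_measure is hypothetical); the measurement token is deliberately not validated, since the measurement table lists freq whereas the example line uses reads.

```python
BASES = {"obs", "pred"}
RESOLUTIONS = {"all", "split", "uniq"}


def split_measure(key):
    """Return (base, resolution, measurement) for a 3-token attribute key, else None."""
    tokens = key.split("_")
    if len(tokens) != 3 or tokens[0] not in BASES or tokens[1] not in RESOLUTIONS:
        return None
    return tuple(tokens)


print(split_measure("pred_split_rcov"))
print(split_measure("slots_all"))  # 2-token slots attribute, not an abundance measure
```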

Go to the format's detail page to see ongoing discussions.

FASTA formats (.FA and .FQ)

FASTA formats are very commonly used, as they provide simple (descriptor, sequence) tuples. Generally, one distinguishes single-FASTA files, which contain a single sequence, from multi-FASTA files, which contain more than one sequence. The Flux Capacitor and Simulator programs usually output multi-FASTA files; an exception is the genomic sequence files, which are to be located in a common directory, with a file chr.fa for each chr annotated in the corresponding GTF annotation file.

FA, FASTA format

The original Fasta (FA) file format is rather simple. Each fasta block contains a description line that starts with a ">" ("greater than") symbol and multiple lines containing the sequence itself. Further examples for FASTA format can be found for instance here.

Often, the description line is tokenized into different tags, separated by either "|" ("pipe", as in NCBI standard) or ";" ("semi-colon", as in the Pearson FASTA format). The Flux Capacitor and the Flux Simulator use these separators to divide the descriptor line into the fields of the Flux Mapped Read Descriptor.

Go to the format's detail page to see ongoing discussions.

FQ, FASTQ format

In contrast to FA files, the leading character of the description line in FASTQ (FASTA Quality) files is "@" ("at"). The sequence follows, and afterwards a separator line introduced by a "+" ("plus") character indicates the start of the qualities. See here for some examples of the FASTQ file format. The Flux Simulator encodes qualities as ASCII characters by adding an offset of 64, according to the Illumina standard.
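The offset-64 encoding can be illustrated in a few lines of Python. The function names and the example quality values are assumptions for this sketch.

```python
OFFSET = 64  # offset added to each quality value before conversion to ASCII


def encode_qualities(qualities):
    """Turn integer quality values into the ASCII quality string."""
    return "".join(chr(q + OFFSET) for q in qualities)


def decode_qualities(quality_string):
    """Recover integer quality values from the ASCII quality string."""
    return [ord(c) - OFFSET for c in quality_string]


print(encode_qualities([40, 30, 2]))  # h^B
print(decode_qualities("h^B"))        # [40, 30, 2]
```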

Go to the format's detail page to see ongoing discussions.

Proprietary Flux Capacitor formats

ERR: error model format

Important, read this: as of build 20090729 of the Simulator, the .ERR format has changed to allow for both quality-based models and models without qualities. The article here describes the new version (after build 20090729); go there to see the description of the .ERR format before build 20090729.

Error model (ERR) files are proprietary to the Flux Simulator and are optionally used during the sequencing process. Data is organized in blocks and presented in tokens separated by whitespace. There are 4 different block types: the model pool summary, the crosstalk table, position-based error models, and sequence-based error models.

Probability distributions over a discrete value space (e.g., quality values, substitution symbols, etc.) are consistently described by their cumulative distribution functions (CDFs). By their nature, the number series in a CDF have to be monotonically non-decreasing, with (at least) the last value of a series being 1.0.
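The CDF constraints, and the way a value can be drawn from such a series, can be sketched in Python as follows. This is a generic inverse-CDF lookup and not necessarily how the Simulator samples; the function names and the example series are assumptions.

```python
import bisect


def is_valid_cdf(values, tol=1e-9):
    """Check that a series is non-decreasing, non-negative and ends at 1.0."""
    if not values or values[0] < 0:
        return False
    non_decreasing = all(a <= b for a, b in zip(values, values[1:]))
    return non_decreasing and abs(values[-1] - 1.0) <= tol


def sample_index(cdf, u):
    """Map a uniform number u in [0, 1) to the index of the drawn value."""
    return bisect.bisect_right(cdf, u)


cdf = [0.1, 0.4, 1.0]
print(is_valid_cdf(cdf), sample_index(cdf, 0.25))
```

For a CDF over qualities, the drawn quality would then be minQual plus the returned index.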

Model Pool Summary

#MODEL readLen nrInstances [minQual maxQual tholdQual]
[p(minQual) p(minQual+1) ... p(maxQual-1) p(maxQual)]
expression (example) verbose explanation
#MODEL tag introducing the model description block
readLen (36) The read length for which the model has been built. Important: in the Simulator you cannot apply error models to sequencing reads of a different length
nrInstances (916311) number of instances: the number of observations (i.e., reads) on which the error model has been estimated
minQual (-40) minimum quality: the minimum value for qualities in the described error models. Currently exclusively integer quality models (as Illumina and phred qualities) are addressed. Therefore, subsequent CDFs over quality spectra all have the length (maxQual - minQual + 1). Only for error files that have been built with quality values.
maxQual (40) maximum quality: highest value of the quality spectrum, an integer - see above. Only for error files that have been built with quality values.
tholdQual (.) the threshold quality: level below which all base-calls have been considered "problematic" or "accident", regardless of whether the corresponding base had been called correctly or not. If no such threshold has been applied, tholdQual should be set to "." Only for error files that have been built with quality values.
p(minQual), $\ldots$, p(maxQual) CDF over qualities of "unproblematic" base calls. A base call is considered as unproblematic iff it is (i) correct and (ii) equal or above the level specified by tholdQual. Only for error files that have been built with quality values.

Crosstalk Table

#CROSSTALK letter                                                          
[minQual] p(A) p(C) p(G) p(N) p(T)                                            
[minQual+1] p(A) p(C) p(G) p(N) p(T)                                      
...                                                                  
[maxQual-1] p(A) p(C) p(G) p(N) p(T)                                        
[maxQual] p(A) p(C) p(G) p(N) p(T)
expression (example) verbose explanation
#CROSSTALK tag that introduces a crosstalk description block
letter (A) Symbol, for which the crosstalk is specified as the observed substitution rates broken down by quality levels.
minQual $\ldots$ maxQual (-40,…,40) quality level to which the following observed substitution rates p(X) apply. Only for error files that have been built with quality values.
p(A),p(C),p(G),p(N),p(T) probabilities (or CDF) for the symbol specified by letter to be substituted by A, C, G, N, or T.

Position-based error models

#PositionErrorProfile start length baseProb
[start p(minQual) p(minQual+1) ... p(maxQual-1) p(maxQual)
(start+1) p(minQual) p(minQual+1) ... p(maxQual-1) p(maxQual)
...
(start+length-1) p(minQual) p(minQual+1) ... p(maxQual-1) p(maxQual)]
expression (example) verbose explanation
#PositionErrorProfile tag that introduces position error profile block
start (26) first position in the read affected by this error model (1-based) (0-based)
length (11) extension of the "problem" captured in this error profile. Consequently, the 0-based index of the last position affected is (start+length-1).
baseProb (6.875394925958544E-5) probability as fraction of reads that shared this problem in the observed dataset. Multiplying this probability with the value nrInstances in the #MODEL block recasts the number of instances in which this error has been observed.
start+i p(minQual) p(minQual+1) $\ldots$ p(maxQual) (26 0.11 0.11 0.13 0.26 $\ldots$) probabilities (or CDF) of the distribution of qualities at the corresponding position. Only for error files that have been built with quality values.
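As a worked example of the baseProb description, the following Python sketch multiplies the example values given in the tables above. It assumes that both example values stem from the same error model file, which the text does not state explicitly.

```python
nr_instances = 916311             # example value of nrInstances in the #MODEL block
base_prob = 6.875394925958544E-5  # example value of baseProb
start, length = 26, 11            # example values of start and length

# Number of reads in which this error profile has been observed.
affected_reads = round(base_prob * nr_instances)
# Index of the last affected position, in the same convention as start.
last_position = start + length - 1

print(affected_reads, last_position)  # 63 36
```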

Sequence-based error models

(forthcoming)

Go to the format's detail page to see ongoing discussions.

IML: intron model format

Intron model files describe the splice site combinations that are considered as potential introns. Discriminatory attributes of biological introns are (1st) the distance of the donor/acceptor pair and (2nd) the combination of their splice site sequences. Each model block is introduced by a header line.

#MODEL minDist maxDist

where #MODEL introduces a new model, and minDist respectively maxDist delimit the boundaries on the lengths of valid introns that are described by the model. Subsequently, a list of donor/acceptor sequences that may co-occur in valid introns is provided.

donorSeq1 acceptorSeq1
donorSeq2 acceptorSeq2
$\ldots$ $\ldots$

The sequences are the strings directly adjacent to exons. They may be redundant, since combinations are evaluated and their lengths may vary, even amongst donors and acceptors.
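A reader for this layout, together with a simple intron test, could look like the sketch below. Several points are assumptions made for illustration: the distance is taken to be the intron length, the donor sequence is matched against the start and the acceptor sequence against the end of the intron, and the function names as well as the example model values are hypothetical.

```python
def parse_iml(text):
    """Parse '#MODEL minDist maxDist' blocks followed by donor/acceptor pairs."""
    models = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0] == "#MODEL":
            models.append({"min": int(tokens[1]), "max": int(tokens[2]), "pairs": []})
        elif models:
            models[-1]["pairs"].append((tokens[0], tokens[1]))
    return models


def is_valid_intron(models, intron_seq):
    """True if some model accepts both the length and a donor/acceptor combination."""
    n = len(intron_seq)
    for model in models:
        if model["min"] <= n <= model["max"]:
            # Sequences of different lengths are handled by prefix/suffix matching.
            if any(intron_seq.startswith(d) and intron_seq.endswith(a)
                   for d, a in model["pairs"]):
                return True
    return False


models = parse_iml("#MODEL 4 20\nGT AG\nGC AG\n")  # made-up example model
print(is_valid_intron(models, "GTAAGTTTTCAG"))
```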

Go to the format's detail page to see ongoing discussions.

PAR: the parameter file format

The PAR (Simulation Parameters) format is used by the Flux Simulator to manage all parameters of a run. It is a tab-separated 2-column list of key-value pairs and comprises the parameters:

key values description
REF_FILE_NAME String path to the reference annotation, either absolute or relative to the location of the parameter file
PRO_FILE_NAME String path to the profile of the run, either absolute or relative to the location of the parameter file
LIB_FILE_NAME [String] path to the library file, either absolute or relative to the location of the parameter file
BED_FILE_NAME SEQ_FILE_NAME String path to the bed file with the genomic annotation of the simulated sequencing reads, either absolute or relative to the location of the parameter file
GEN_DIR String path to the directory with the genomic sequences of chromosomes or scaffolds used in the reference annotation.
NB_MOLECULES [Integer] Number of initial RNA molecules in the simulation
LOAD_CODING [YES|NO] Flag to load coding transcripts from the reference annotation.
LOAD_NONCODING [YES|NO] Flag to load the non-coding transcripts, i.e., transcripts without CDS features, from the reference annotation
EXPRESSION_K Float Power law parameter $k$ of the expression simulation, should be <0.
EXPRESSION_X0 Integer Number of molecules for the highest expressed transcript, depends on NB_MOLECULES
EXPRESSION_X1 Float Parameter determining the exponential decay in the expression simulation
RT_PRIMER [RANDOM|POLY-DT] Flag to switch between random priming and poly-dT priming for the first strand synthesis of the reverse transcription
RT_MIN Integer Minimum length (in [nt]) of the expected reversely transcribed cDNA molecules
RT_MAX Integer Maximum length (in [nt]) of the expected reverse transcription products
FRAGMENTATION [YES|NO] Optional: flag that determines whether a fragmentation step is carried out
FRAG_B4_RT [YES|NO] flag to schedule the fragmentation before (YES), or after (NO) the reverse transcription. Note for fragmentations carried out before reverse transcription, exclusively random priming strategies are reasonable.
FRAG_MODE [PHYSICAL|CHEMICAL] flag to switch between fragmentation according to physical or chemical attributes.
FRAG_LAMBDA Integer Upper boundary of fragment lengths (in [nt]) that are not expected to be fragmented by the applied technique
FILTERING [YES|NO] Flag to indicate whether a length filtering step is carried out on the cDNA library.
FILT_MIN Integer Minimum length that is retained during filtering.
FILT_MAX Integer Maximum length that is retained during filtering.
READ_NUMBER Integer Number of reads that are intended to be produced. Note: this number is an upper boundary and gets adapted to the actual size of the intermediary generated library.
READ_LENGTH Integer Length of the generated reads, depends on filtering settings.
PAIRED_END [YES|NO] Flag to indicate whether read pairs are produced.
FASTQ [YES|NO] Flag that indicates whether additionally the read sequences and qualities are output. Depends on GENOME_DIR and ERR_FNAME.
QTHOLD Integer Quality value below which base-calls are considered problematic.
TMP_DIR String Path to folder for temporary files, if different from system standard (commonly /tmp on Unix clones).
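Since the format is a plain tab-separated key-value list, it can be read with a few lines of Python. The sketch below uses hypothetical function names and made-up example values; it does not validate keys or value types.

```python
def parse_par(text):
    """Read tab-separated key/value lines into a dictionary of strings."""
    params = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition("\t")
        params[key.strip()] = value.strip()
    return params


def flag(params, key, default=False):
    """Interpret a YES/NO parameter as a boolean, falling back to a default."""
    value = params.get(key)
    return default if value is None else value.upper() == "YES"


# Made-up example content.
example = "REF_FILE_NAME\tannotation.gtf\nNB_MOLECULES\t5000000\nPAIRED_END\tYES\n"
params = parse_par(example)
print(int(params["NB_MOLECULES"]), flag(params, "PAIRED_END"), flag(params, "FASTQ"))
```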

Go to the format's detail page to see ongoing discussions.

PRO: the profile format

The PRO (Simulation Profile) format is designed to describe the characteristics of each transcript (i.e., line) from the reference annotation, initially and after each step of the simulation (i.e., columns). Columns are tab-separated and describe the attributes:

Column Number Name Value Description
1 LOCUS_ID chrom:start-end[W|C] identifier for the intrinsic splicing locus, given by the chromosome (chrom), start and end position and the strand (Watson or Crick).
2 TRANSCRIPT_ID String transcript identifier from the reference annotation.
3 CODING [CDS|NC] specifies whether the transcript has an annotated CDS or not (NC)
4 LENGTH Integer the spliced length of the transcript molecule as annotated in the reference annotation
5 RFREQ_EXP Float relative frequency of RNA copies of this transcript after simulated expression
6 AFREQ_EXP Integer absolute number of expressed RNA molecules
7 RFREQ_LIB Float relative frequency of cDNA molecules derived from this transcript after library construction
8 AFREQ_LIB Integer absolute number of cDNA molecules generated from this transcript
9 RFREQ_SEQ Float relative frequency of reads sequenced from this transcript
10 AFREQ_SEQ Integer absolute number of reads sequenced from this transcript

Go to the format's detail page to see ongoing discussions.

LIB: the library format

The format of LIB (Simulated Library) files is simple and condenses the information needed to describe a fragment (RNA or cDNA) of an original transcript. Each line corresponds to one such fragment and gives, in 3 tab-delimited fields, the start (estart) and end (eend) in the spliced (exonic) sequence of the transcript, followed by the transcript_id from the original annotation.
Note: because the simulated transcription start and the length of the poly-A tail may vary from the annotation in the reference, values for estart can be negative, and values for eend can exceed the transcript length.

Example

estart eend transcript_id
1396 1623 uc001aaa.2
355 583 uc001aaa.2
-43 195 uc001aaa.2
407 635 uc001aaa.2
278 519 uc009vjk.1
330 562 uc009vjk.1
-99 136 uc001aaz.1
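The example lines can be read and summarized with a short Python sketch. The function names are hypothetical; the parser accepts any whitespace as separator so that it also works on the example as printed here.

```python
from collections import Counter


def parse_lib_line(line):
    """Return (estart, eend, transcript_id) for one fragment line."""
    estart, eend, transcript_id = line.split()
    return int(estart), int(eend), transcript_id


def fragments_per_transcript(lines):
    """Count how many fragments each transcript contributes to the library."""
    return Counter(parse_lib_line(line)[2] for line in lines)


lines = [
    "1396 1623 uc001aaa.2", "355 583 uc001aaa.2", "-43 195 uc001aaa.2",
    "407 635 uc001aaa.2", "278 519 uc009vjk.1", "330 562 uc009vjk.1",
    "-99 136 uc001aaz.1",
]
print(fragments_per_transcript(lines))
# Fragments whose start lies before the annotated transcript start.
print([parse_lib_line(l) for l in lines if parse_lib_line(l)[0] < 0])
```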

Go to the format's detail page to see ongoing discussions.

Proprietary third-party formats

GEM mapping (MAP) format

MAP (mapping) files are produced by the GEM mapper, a suffix-array based technique to efficiently map short reads to the genome.

Go to the format's detail page to see ongoing discussions.

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License