Recent Forum Posts
From categories:
page 1123...next »

Hi Arun,

thank you again for your detailed report, I transferred your observations to ticket 213 in the issue tracker. In the future, please directly use our Jira tracker for the kind of observations that are possibly linked to errors (bugs). Some brief instructions on how to file and track issues are summarized in a section of the documentation.

For more general questions and discussions please use in the future the new forum in the Flux Simulator documentation, because it is easier for us (and we think also for the other users) to keep things together in one system.

Cheers, micha

Re: profile (pro) format by gmichagmicha, 03 Aug 2012 20:59

Hi,

The profile file of the runs are like this:

Chr1:3631-5899W AT1G01010.1     CDS     1688    2.7144710738819023E-6   19      2.9198003423841305E-6   35      3.4286017961873377E-6   48      0.8524881601333618      1187    0.43380793895775216

I count 13 columns. However, the flux simulator website where all file format specifications are contained accounts for 10 columns. Could you please explain which is which and what the other columns are?

Thanks,
Arun.

profile (pro) format by adragonflyadragonfly, 03 Aug 2012 09:38

Hey,

thanks for the info. I created a JIRA issue here ( http://sammeth.net/jira/browse/BARNA-206 ) and added you to the watch list. I will try to look into this as soon as possible. Please don't hesitate to create Issues in the JIRA directly. We are in the process of moving all our resources, including this forum, into one central location.

Best,

-thasso

Re: non-unique fastq read-ids by ThassoThasso, 02 Aug 2012 09:34

Some reads are identical in all respects and occur twice (both in fastq and bed of course). Here's an example of the header from a simulated fastq:

@Chr1:8017397-8021712W:AT1G22660.1:541:2005:723:872:S/2
@Chr1:8017397-8021712W:AT1G22660.1:541:2005:723:872:A/1
@Chr1:8017397-8021712W:AT1G22660.1:541:2005:723:872:S/2
@Chr1:8017397-8021712W:AT1G22660.1:541:2005:723:872:A/1

Fastq files are supposed to have unique ids. I am wondering how softwares usually handle this (do they just skip the 2nd occurrence?) and hence, the impact on the reads after mapping.

I did a test on a fastq generated from flux-simulator directly, that contained a total of about 14 million reads and found 2 * 489353 = app. 1 million reads duplicated. I have removed them for now. Now, the reads may occur duplicated (due to amplification or other factors or due to chance which is highly unlikely here though), however, the read IDs in those cases must be somehow changed. Right?

Arun.

non-unique fastq read-ids by adragonflyadragonfly, 01 Aug 2012 23:05

Hi Micha,

Thanks for taking a look at this issue, I've signed up for the watch list. Do you know why it is trying to sequence past the exon in the first place? In this case it is valid because the 'exon' I gave it is just the open reading frame, but is the behaviour of the simulator such that it tries to infer the 3'UTR (similar to the TSS parameter but at the end), or is this not expected?

Albi

by Albi (guest), 26 Jul 2012 14:01

Hi,

that sounds related to a bug we fixed recently, I re-opened issue http://sammeth.net/jira/browse/BARNA-194
You maybe want to put yourself in the watchlist of the ticket to monitor the progress (requires account).

Thanks!

by gmichagmicha, 25 Jul 2012 22:56

I managed a work-around wherein I add 100 N nucleotides to the end of each of my 'chromosomes' as buffer and updated the ranges of the GTF file accordingly. Obviously this is not ideal (I would rather it not sequence past the boundary rather than having to add garbage to the simulator which I will have to remove later)

by Albi (guest), 25 Jul 2012 22:05

I am getting the following error with the latest nightly build (I am trying to simulate expression from a reference genome with many contigs, hence many short chromosomes)

SEQUENCING] getting the reads
        Initializing Fragment Index
        Indexing ********** OK (00:00:03)
        2818814 lines indexed (2818814 fragments, 2630 entries)
        sequencing [WARN] skipped line ACCK01000001.1   mgb2gtf gene    148     1200    .       +       .       transcript_id "CATMIT_00001"; gene_id "CATMIT_00001";
*[ERROR] Error while sequencing : Problems reading BED object ACCK01000003.1    14785   14861   ACCK01000003.1:14115-14861W:CATMIT_00041:5:747:466:746:A/2      0       -       .       .       0,0,0   1       76      0
java.lang.RuntimeException: Problems reading BED object ACCK01000003.1  14785   14861   ACCK01000003.1:14115-14861W:CATMIT_00041:5:747:466:746:A/2      0       -       .       .       0,0,0   1       76      0
        at barna.model.bed.BEDobject2.readSequence(BEDobject2.java:335)
        at barna.flux.simulator.Sequencer.createQSeq(Sequencer.java:600)
        at barna.flux.simulator.Sequencer$SequenceWriter.writeRead(Sequencer.java:936)
        at barna.flux.simulator.Sequencer$Processor.process(Sequencer.java:771)
        at barna.flux.simulator.Sequencer.sequence(Sequencer.java:247)
        at barna.flux.simulator.Sequencer.call(Sequencer.java:120)
        at barna.flux.simulator.SimulationPipeline.call(SimulationPipeline.java:436)
        at barna.flux.simulator.SimulationPipeline.call(SimulationPipeline.java:53)
        at barna.commons.launcher.Flux.main(Flux.java:212)
Caused by: java.lang.RuntimeException: Problems reading chromosome ACCK01000003.1 from 14785 to 14780

This is because the corresponding genome file does indeed not contain sequence at that range:

grep -v '>' genomes/ACCK01000003.1.fa | grep -o '[ATGC]' | wc
  14780   14780   29560

(The file is a fasta header and one long line of sequence so the length isn't affected by whitespace characters)

But I'm not sure why it's trying to sequence from there anyway:

grep 'CATMIT_00041' gtf/Catenibacterium_mitsuokai_DSM_15897.gtf
ACCK01000003.1  mgb2gtf gene    14015   14761   .       +       .       gene_id "CATMIT_00041"; transcript_id "CATMIT_00041";
ACCK01000003.1  mgb2gtf CDS     14015   14761   .       +       .       gene_id "CATMIT_00041"; transcript_id "CATMIT_00041";
ACCK01000003.1  mgb2gtf exon    14015   14761   .       +       .       gene_id "CATMIT_00041"; transcript_id "CATMIT_00041";

Here are my parameters:

## File locations
REF_FILE_NAME   gtf/Catenibacterium_mitsuokai_DSM_15897.gtf
GEN_DIR         genomes

TSS_MEAN        NaN (I usually use 25, I am using NaN to debug, but it does not make a difference)
#
FRAG_SUBSTRATE  RNA
##

POLYA_SCALE     NaN
POLYA_SHAPE     NaN
##
RTRANSCRIPTION  NO

## Expression
NB_MOLECULES    20000000
FRAG_METHOD     UR

## Sequencing
READ_NUMBER     35149
READ_LENGTH     76
PAIRED_END      YES

## FILTERING
FILTERING       YES

SIZE_SAMPLING   AC

TMP_DIR tmp
FASTA true
ERR_FILE empirical.err

PCR_DISTRIBUTION none

RT_LOSSLESS false
FRAG_UR_ETA     300
SIZE_DISTRIBUTION       distribution.txt

It seems changing the TSS does not solve it (it is on the positive strand anyway).

Hi Micha,
Thank you very much for the quick response and your efforts. I've created an account and added to the watch list.

Best,
Arun.

Hi Arun,

thank you for your detailed posts! I transferred all 4 points you raised to our issue tracker:

1) http://sammeth.net/jira/browse/BARNA-195
2) http://sammeth.net/jira/browse/BARNA-196
3) http://sammeth.net/jira/browse/BARNA-197
4) http://sammeth.net/jira/browse/BARNA-198

It would be best if you create yourself an account and put yourself in watchlist of these issues, then you get informed as soon as me or another developer updates on either one of the points.

Thanks again,

micha

Hi,

In my last post, I mentioned about the error during "PCR amplification". The error was described mainly due to TSS_MEAN parameter which at times crosses the valid boundary of a chromosome. I have been experimenting with other parameters I find necessary and I have not had success with most of them I require. I'd like to point them out in the hopes that they would get fixed. I am using flux-simulator-1.0.2 and the recent nightly build as pointed by micha after fixing the negative location access as well. The ones I mention here are tested in both. I'll post the parameter file once, for which it works, and then quote the changes that are not possible (leading to error) under each sub-section.

Parameter file that works!

LOAD_CODING    true
LOAD_NONCODING    true
NB_MOLECULES    8000000
EXPRESSION_K    -0.6
EXPRESSION_X0    9500.0
EXPRESSION_X1    9.025E7
TSS_MEAN    NaN
POLYA_SCALE    NaN
POLYA_SHAPE    NaN
RTRANSCRIPTION    true
RT_LOSSLESS    true
RT_MAX    5500
RT_MIN    50
RT_PRIMER    RH
FRAGMENTATION    true
FRAG_SUBSTRATE    RNA
FRAG_METHOD    UR
FRAG_UR_D0    1
FRAG_UR_DELTA    NaN
FILTERING    true
PCR_DISTRIBUTION    default
PAIRED_END    true
READ_LENGTH    76
ERR_FILE    76
READ_NUMBER    15000000
FASTA    true

1) Error: TSS_MEAN still gives error. Yes, the nightly build version also gives error when using PCR amplification.

TSS_MEAN 25
POLYA_SCALE    NaN
POLYA_SHAPE    NaN

This doesn't work, ends up with the same error: I was not able to perform even filtering before. But now, it passes that stage, but fails at PCR amplification.

[INFO] Loading default PCR distribution
    preparing transcript sequences ******* ERROR
[ERROR] Error while preparing sequences: Problems reading sequence Chr3: pos -1, len 100,
...
FAILED
[ERROR] Error while fragmenting : Problems reading sequence Chr3: pos -1, len 100,
check whether chromosomal sequence exists / has the correct size
java.lang.RuntimeException: Problems reading sequence Chr3: pos -1, len 100,
check whether chromosomal sequence exists / has the correct size
   ... 8 more

2) Error: SIZE_DISTRIBUTION with N(mean, sd) won't work. It just doesn't.

FILTERING    true
SIZE_DISTRIBUTION    N(200,50)
SIZE_SAMPLING    AC

I get [Error] 100. I have tried with UR and with NB fragmentation, I have tried setting FRAG_UR_ETA and by not setting ETA. Nothing works if I give N(mean, sd). Note that the size distribution mean is being set to NaN.

[LIBRARY] creating the cDNA libary
[INFO] **Size Distribution : Distribution mean: NaN**
...
    Processing Fragments ********** OK (00:05:13)
        71948002 mol: in 71914801, new 33201, out 71931246
        avg Len 165.75595, maxLen 575
        initializing Selected Size distribution
[ERROR] 100

3) question: When I use the parameter file listed at the very beginning, that is, no SIZE_DISTRIBUTION, TSS_MEAN, POLYA_SCALE, POLYA_SHAPE and it runs successfully, The UR fragmentation results in average fragment length of 188 with maximum of 555 and a total of 76 million molecules. And when doing a size selection on this, the output average is 181 with max: 299 and gives me 2.1 million molecules which is approximately 3% of total molecules from fragmentation. How is this possible? The average is 188 out of 76 million and you get at the end 2 million between 181 and 299…?? It just doesn't add up, does it? Or did I get it completely wrong?

[LIBRARY] Configuration
        Mode: RH
        PWM: No
        RT MIN: 50
        RT MAX: 5500

    Processing Fragments ********** OK (00:05:34)
        76054743 mol: in 76027642, new 27101, out 76054743
        avg Len 156.77885, maxLen 555
        initializing Selected Size distribution
[LIBRARY] Segregating cDNA (Acceptance)
    Processing Fragments ********** OK (00:02:05)
        76054743 mol: in 76054743, new 0, out 2057527
        avg Len 181.46268, maxLen 299
        start amplification

4) question: When I continue to PCR amplification of these 2 million fragments, it gives me 55 million fragments and then the fastq file. Now, when I map them using tophat to my genome, only 30% of the reads mapped. When I tried to find the error by mapping using BWA, I found that the inner distance between the reads were around 500-1200. Again, how is this possible? BWA is not a spliced read aligner and it gives me inner distance this high when my fragments are supposedly between 181 and 300 (most of them anyways, assuming there would be more reads from non-spliced regions than splice junctions). When you mean fragment size, does it include adapters? If so, assuming 35bp on both side the effective insert size is 300-70 = 230 bp. If you mean insert size as fragment, then its 300. Eitherways, the inner distance between the two pairs must be between (230 - 76*2)=78 and (300-76*2)=148 on an average. But when I calculated from my script to obtain mean and sd of inner distance, mean=360 and sd=760 (after mapping). Is there an explanation for this? Or am I understanding something totally wrong? This is not possible unless almost all such reads are generated over splice junctions… isn't it?

BWA output:

[infer_isize] (25, 50, 75) percentile: (96, 212, 732)
[infer_isize] inferred external isize from 133773 pairs: 406.571 +/- 449.596
[infer_isize] inferred maximum insert size: 3199 (6.21 sigma)
...
[infer_isize] (25, 50, 75) percentile: (234, 705, 1613)
[infer_isize] inferred external isize from 141160 pairs: 962.510 +/- 918.874
[infer_isize] inferred maximum insert size: 6568 (6.10 sigma)

I am attaching my genomic fasta and gtf here. It would be great if you could look in to these errors/questions.

Best,
Arun.

Hi gmicha,

Thank you for your reply. You can download from here: https://dl.dropbox.com/u/3851628/Flux_simulator_test.zip
It contains 1) Tair10.gtf 2) par file 3) pro file for which the run ended up in error.
I ran the command flux-simulator -x -l -s -p test2.par

After quite a bit of trial and error I managed to successfully implement filtering and PCR amplification However, its a very narrow margin. In general these are the findings I have so far:

tss_mean should be NaN - absolutely necessary (for the reasons you mentioned above… reg. Chr position being out of scope). However, even then, only a specific combination works for me.
fragment_substrate is RNA
fragment_method UR (nebulization doesn't work on RNA)
rt_motif default
filtering true (even here, setting size_distribution to N(mean,sd) doesn't work)
pcr_distribution default

If I change substrate to DNA and use nebulization, it doesn't work. If I set mean and sd for filtering, it doesn't work. and if I remove RT_motif from default, it doesn't work. But most of them point to Chr pos -1 error. Otherwise it ends with [Error] 100 so far.

I hope this helps you in finding/fixing (the bug if any).

Hi Dragonfly,

the issue seems to be triggered when reading the sequence of a gene that is very close to the chromosome border. Could you please provide a .pro file from the run? I opened ticket http://sammeth.net/jira/browse/BARNA-194 to track on that issue.

Thank you for bringing that up

Hi, (Edit: I just realized that I posted in the wrong section. Sorry about that. I couldn't delete it to make a new post in the library section).

I have issues with certain parameters in flux simulator. This is the parameter file I am using:

## File locations
GEN_DIR                ./genome
## gene expression
PRO_FILE_NAME        ./test.pro
LIB_FILE_NAME        ./test.lib
REF_FILE_NAME        ./TAIR10_gene_models.gtf
LOAD_CODING    true
LOAD_NONCODING    true
## gene profile
NB_MOLECULES    5000000
EXPRESSION_K    -0.6
EXPRESSION_X0    9500.0
EXPRESSION_X1    9.025E7
## transcript profile
TSS_MEAN    25
POLYA_SCALE    50
POLYA_SHAPE    2

## reverse transcription
RTRANSCRIPTION    true
RT_LOSSLESS    true
RT_MAX    5500
RT_MIN    200
RT_PRIMER    RH

## Fragmentation
FRAGMENTATION    true
FRAG_SUBSTRATE    RNA
FRAG_METHOD    UR
FRAG_UR_ETA    250
FRAG_UR_D0    1
FRAG_UR_DELTA    NaN

## size selection
FILTERING    true
SIZE_DISTRIBUTION    N(250,30)
SIZE_SAMPLING    AC    

## pcr
PCR_DISTRIBUTION    default
PCR_PROBABILITY    0.7
GC_MEAN    0.5
GC_SD    0.1

PAIRED_END    true
READ_LENGTH    76
ERR_FILE    76
READ_NUMBER    16000000
FASTA    true

Here, it runs successfully if I don't use both "filtering" and "PCR distribution". Using either one of them ends in error.

When I set ** filtering TRUE**, then the error is:

[LIBRARY] Configuration
        Mode: RH
        PWM: No
        RT MIN: 200
        RT MAX: 5500

    Processing Fragments ********** OK (00:07:51)
        57088824 mol: in 57088725, new 99, out 57081277
        avg Len 214.97769, maxLen 747
        initializing Selected Size distribution
[ERROR] 100

And when I use PCR distribution default:

[LIBRARY] Configuration
        Mode: RH
        PWM: No
        RT MIN: 200
        RT MAX: 5500

    Processing Fragments ********** OK (00:03:57)
        34569813 mol: in 34569746, new 67, out 34565024
        avg Len 214.94418, maxLen 744
        start amplification
[INFO] Loading default PCR distribution
    preparing transcript sequences ******* ERROR
[ERROR] Error while preparing sequences: Problems reading sequence Chr3: pos -1, len 100,
check whether chromosomal sequence exists / has the correct size
java.lang.RuntimeException: Problems reading sequence Chr3: pos -1, len 100,
check whether chromosomal sequence exists / has the correct size
    at barna.model.Graph.readSequence(Graph.java:585)
    at barna.model.Graph.readSequence(Graph.java:473)
    at barna.model.Transcript.getSplicedSequence(Transcript.java:459)
    at barna.flux.simulator.fragmentation.Fragmenter.getMapTxSeq(Fragmenter.java:375)
    at barna.flux.simulator.fragmentation.Fragmenter.process(Fragmenter.java:539)
    at barna.flux.simulator.fragmentation.Fragmenter.call(Fragmenter.java:258)
    at barna.flux.simulator.SimulationPipeline.call(SimulationPipeline.java:424)
    at barna.flux.simulator.SimulationPipeline.call(SimulationPipeline.java:53)
    at barna.commons.launcher.Flux.main(Flux.java:212)
Caused by: java.lang.NegativeArraySizeException
    at barna.model.Graph.readSequence(Graph.java:507)
    at barna.model.Graph.readSequence(Graph.java:473)
    at barna.model.Graph.readSequence(Graph.java:529)
    ... 8 more
 FAILED

I am using GTF file I created from TAIR 10 GFF file. I have also tried with the TAIR9 GTF file that's downloadable from flux-simulator manual page (with example from A. thaliana). Both of them end up in this error every single time. It would be great if I can get some hints on fixing this.

Thank you

Hi,

thank you for your suggestion, we will think about adding the functionality to output sam files instead of fastq and bed files. I filed a ticket with the request for enhancement: http://barnaserver.com/jira/browse/BARNA-192

Cheers, micha

by gmichagmicha, 04 Jul 2012 13:19

Hi Yang,

sorry for the late reply. If I understand, you are asking about the simulated gene expression. The genes or splice-forms that are expressed in a cell depend on the tissue. The Flux simulations assign expression levels according to general laws, but they do not assume a specific tissue. Thus, it is rather unlikely that you see all 15 isoforms in your example expressed, at least not at substantial levels, and much more likely to see close to 2 isoforms expressed.

The second part of your question is what happens if you provide annotations that have many more spliceforms than usual annotations. My guess is, that you have to adjust the parameters EXPRESSION_K, EXPRESSION_X0 and EXPRESSION_X1 to end up with a reasonable amount of expressed transcripts. Refer to the manual pages athttp://sammeth.net/confluence/display/SIM/4.1.1+-+Gene+Expression+Profile, or to the paper (DOI: 10.1093/nar/gks666) that should be available soon.

Let me know what are your experiences, I may be able to help you more.

Best, micha

Hi Duplexa,

no, currently there is no command line flag; you could set the constant MAX_LENGTH_INTRON_IS_GAP in class Transcript to < 1 in order to prevent from exon merging. However, the current value of 4nt is a rather relaxed criterion to distinguish real introns from alignment errors—by no means the usual splicing mechanism will work on a <=4nt stretch.

Nevertheless, I created ticket http://barnaserver.com/jira/browse/BARNA-191

Best,

micha

by gmichagmicha, 27 Jun 2012 09:53

Is it possible for a user to turn off the exon merging feature?

by duplexa (guest), 27 Jun 2012 03:57

Hey,

thank you for investigating, we looked into the scenario you reported: with the parameter file belowthough probably in another organismwe obtain an overall error rate of ~2.2%, as you expect. To compute this number, we determined the fraction of nucleotides with sequencing errors. This is relatively easy with the current workspace version, where we output alterated nucleotides as lowercase characters. To exclude biases from poly-A tales, we ignored stretches of 3 or more 'a' respectively 't' characters.

Can you provide us with a little bit more inforamtion about the simulator version you are using and the parameters you set?

Best

—- parameter file:
## File locations
REF_FILE_NAME mm9_refseq.gtf
GEN_DIR <genome_dir>

## Expression
NB_MOLECULES 5000000
TSS_MEAN 25
POLYA_SCALE 100
POLYA_SHAPE 2

## Fragmentation
FRAG_SUBSTRATE RNA
FRAG_METHOD UR
FRAG_UR_ETA 170
FRAG_UR_D0 1

## RT parameters
RTRANSCRIPTION YES
RT_MOTIF default
RT_PRIMER RH
RT_LOSSLESS YES
RT_MIN 500
RT_MAX 5500

## PCR / Filterint
PCR_DISTRIBUTION none
FILTERING YES
SIZE_SAMPLING AC

by ThassoThasso, 22 Jun 2012 17:41

So to follow up on this - I made a script that maps the reads you guys from FluxSimulator back to the transcriptome to create a .sam file, and then I viewed the results using Tablet.

The top of the picture is an alignment to a transcript using the ERRFILE 76 parameter that is provided by default, and the bottom is using an error model I made myself.

i.imgur dot com/mG4Dk.png

The default one is in no way realistic - no sequencer would give a >30% error rate - a 2% error rate is much more reasonable

Also, I am noticing that in all of my contigs, there is a stack of reads near the end of the transcript, as you can see in the picture (I have simulated paired end assembly). Is there a reason this is happening? I don't recall ever seeing such a thing for real data.

by Albi (guest), 20 Jun 2012 15:17
page 1123...next »
Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License