Hi,
In my last post, I mentioned about the error during "PCR amplification". The error was described mainly due to TSS_MEAN parameter which at times crosses the valid boundary of a chromosome. I have been experimenting with other parameters I find necessary and I have not had success with most of them I require. I'd like to point them out in the hopes that they would get fixed. I am using flux-simulator-1.0.2 and the recent nightly build as pointed by micha after fixing the negative location access as well. The ones I mention here are tested in both. I'll post the parameter file once, for which it works, and then quote the changes that are not possible (leading to error) under each sub-section.
Parameter file that works!
LOAD_CODING true
LOAD_NONCODING true
NB_MOLECULES 8000000
EXPRESSION_K -0.6
EXPRESSION_X0 9500.0
EXPRESSION_X1 9.025E7
TSS_MEAN NaN
POLYA_SCALE NaN
POLYA_SHAPE NaN
RTRANSCRIPTION true
RT_LOSSLESS true
RT_MAX 5500
RT_MIN 50
RT_PRIMER RH
FRAGMENTATION true
FRAG_SUBSTRATE RNA
FRAG_METHOD UR
FRAG_UR_D0 1
FRAG_UR_DELTA NaN
FILTERING true
PCR_DISTRIBUTION default
PAIRED_END true
READ_LENGTH 76
ERR_FILE 76
READ_NUMBER 15000000
FASTA true
1) Error: TSS_MEAN still gives error. Yes, the nightly build version also gives error when using PCR amplification.
TSS_MEAN 25
POLYA_SCALE NaN
POLYA_SHAPE NaN
This doesn't work, ends up with the same error: I was not able to perform even filtering before. But now, it passes that stage, but fails at PCR amplification.
[INFO] Loading default PCR distribution
preparing transcript sequences ******* ERROR
[ERROR] Error while preparing sequences: Problems reading sequence Chr3: pos -1, len 100,
...
FAILED
[ERROR] Error while fragmenting : Problems reading sequence Chr3: pos -1, len 100,
check whether chromosomal sequence exists / has the correct size
java.lang.RuntimeException: Problems reading sequence Chr3: pos -1, len 100,
check whether chromosomal sequence exists / has the correct size
... 8 more
2) Error: SIZE_DISTRIBUTION with N(mean, sd) won't work. It just doesn't.
FILTERING true
SIZE_DISTRIBUTION N(200,50)
SIZE_SAMPLING AC
I get [Error] 100. I have tried with UR and with NB fragmentation, I have tried setting FRAG_UR_ETA and by not setting ETA. Nothing works if I give N(mean, sd). Note that the size distribution mean is being set to NaN.
[LIBRARY] creating the cDNA libary
[INFO] **Size Distribution : Distribution mean: NaN**
...
Processing Fragments ********** OK (00:05:13)
71948002 mol: in 71914801, new 33201, out 71931246
avg Len 165.75595, maxLen 575
initializing Selected Size distribution
[ERROR] 100
3) question: When I use the parameter file listed at the very beginning, that is, no SIZE_DISTRIBUTION, TSS_MEAN, POLYA_SCALE, POLYA_SHAPE and it runs successfully, The UR fragmentation results in average fragment length of 188 with maximum of 555 and a total of 76 million molecules. And when doing a size selection on this, the output average is 181 with max: 299 and gives me 2.1 million molecules which is approximately 3% of total molecules from fragmentation. How is this possible? The average is 188 out of 76 million and you get at the end 2 million between 181 and 299…?? It just doesn't add up, does it? Or did I get it completely wrong?
[LIBRARY] Configuration
Mode: RH
PWM: No
RT MIN: 50
RT MAX: 5500
Processing Fragments ********** OK (00:05:34)
76054743 mol: in 76027642, new 27101, out 76054743
avg Len 156.77885, maxLen 555
initializing Selected Size distribution
[LIBRARY] Segregating cDNA (Acceptance)
Processing Fragments ********** OK (00:02:05)
76054743 mol: in 76054743, new 0, out 2057527
avg Len 181.46268, maxLen 299
start amplification
4) question: When I continue to PCR amplification of these 2 million fragments, it gives me 55 million fragments and then the fastq file. Now, when I map them using tophat to my genome, only 30% of the reads mapped. When I tried to find the error by mapping using BWA, I found that the inner distance between the reads were around 500-1200. Again, how is this possible? BWA is not a spliced read aligner and it gives me inner distance this high when my fragments are supposedly between 181 and 300 (most of them anyways, assuming there would be more reads from non-spliced regions than splice junctions). When you mean fragment size, does it include adapters? If so, assuming 35bp on both side the effective insert size is 300-70 = 230 bp. If you mean insert size as fragment, then its 300. Eitherways, the inner distance between the two pairs must be between (230 - 76*2)=78 and (300-76*2)=148 on an average. But when I calculated from my script to obtain mean and sd of inner distance, mean=360 and sd=760 (after mapping). Is there an explanation for this? Or am I understanding something totally wrong? This is not possible unless almost all such reads are generated over splice junctions… isn't it?
BWA output:
[infer_isize] (25, 50, 75) percentile: (96, 212, 732)
[infer_isize] inferred external isize from 133773 pairs: 406.571 +/- 449.596
[infer_isize] inferred maximum insert size: 3199 (6.21 sigma)
...
[infer_isize] (25, 50, 75) percentile: (234, 705, 1613)
[infer_isize] inferred external isize from 141160 pairs: 962.510 +/- 918.874
[infer_isize] inferred maximum insert size: 6568 (6.10 sigma)
I am attaching my genomic fasta and gtf here. It would be great if you could look in to these errors/questions.
Best,
Arun.