I am getting the following error with the latest nightly build (I am trying to simulate expression from a reference genome with many contigs, hence many short chromosomes)
SEQUENCING] getting the reads
Initializing Fragment Index
Indexing ********** OK (00:00:03)
2818814 lines indexed (2818814 fragments, 2630 entries)
sequencing [WARN] skipped line ACCK01000001.1 mgb2gtf gene 148 1200 . + . transcript_id "CATMIT_00001"; gene_id "CATMIT_00001";
*[ERROR] Error while sequencing : Problems reading BED object ACCK01000003.1 14785 14861 ACCK01000003.1:14115-14861W:CATMIT_00041:5:747:466:746:A/2 0 - . . 0,0,0 1 76 0
java.lang.RuntimeException: Problems reading BED object ACCK01000003.1 14785 14861 ACCK01000003.1:14115-14861W:CATMIT_00041:5:747:466:746:A/2 0 - . . 0,0,0 1 76 0
at barna.model.bed.BEDobject2.readSequence(BEDobject2.java:335)
at barna.flux.simulator.Sequencer.createQSeq(Sequencer.java:600)
at barna.flux.simulator.Sequencer$SequenceWriter.writeRead(Sequencer.java:936)
at barna.flux.simulator.Sequencer$Processor.process(Sequencer.java:771)
at barna.flux.simulator.Sequencer.sequence(Sequencer.java:247)
at barna.flux.simulator.Sequencer.call(Sequencer.java:120)
at barna.flux.simulator.SimulationPipeline.call(SimulationPipeline.java:436)
at barna.flux.simulator.SimulationPipeline.call(SimulationPipeline.java:53)
at barna.commons.launcher.Flux.main(Flux.java:212)
Caused by: java.lang.RuntimeException: Problems reading chromosome ACCK01000003.1 from 14785 to 14780
This is because the corresponding genome file does indeed not contain sequence at that range:
grep -v '>' genomes/ACCK01000003.1.fa | grep -o '[ATGC]' | wc
14780 14780 29560
(The file is a fasta header and one long line of sequence so the length isn't affected by whitespace characters)
But I'm not sure why it's trying to sequence from there anyway:
grep 'CATMIT_00041' gtf/Catenibacterium_mitsuokai_DSM_15897.gtf
ACCK01000003.1 mgb2gtf gene 14015 14761 . + . gene_id "CATMIT_00041"; transcript_id "CATMIT_00041";
ACCK01000003.1 mgb2gtf CDS 14015 14761 . + . gene_id "CATMIT_00041"; transcript_id "CATMIT_00041";
ACCK01000003.1 mgb2gtf exon 14015 14761 . + . gene_id "CATMIT_00041"; transcript_id "CATMIT_00041";
Here are my parameters:
## File locations
REF_FILE_NAME gtf/Catenibacterium_mitsuokai_DSM_15897.gtf
GEN_DIR genomes
TSS_MEAN NaN (I usually use 25, I am using NaN to debug, but it does not make a difference)
#
FRAG_SUBSTRATE RNA
##
POLYA_SCALE NaN
POLYA_SHAPE NaN
##
RTRANSCRIPTION NO
## Expression
NB_MOLECULES 20000000
FRAG_METHOD UR
## Sequencing
READ_NUMBER 35149
READ_LENGTH 76
PAIRED_END YES
## FILTERING
FILTERING YES
SIZE_SAMPLING AC
TMP_DIR tmp
FASTA true
ERR_FILE empirical.err
PCR_DISTRIBUTION none
RT_LOSSLESS false
FRAG_UR_ETA 300
SIZE_DISTRIBUTION distribution.txt
It seems changing the TSS does not solve it (it is on the positive strand anyway).