doc="Barcode file (tsv) matching sample names to barcode combinations.\n"+
"### SIMPLE Barcode File Format (for backward compatibility); please see the GENERAL format described below.\n"+
"The file must have 2 columns with the sample in col1 and the corresponding barcode in col2.\n"+
"In this format, a simple BARCODE slot is expected in the ReadLayout and NO headers are needed, e.g.:\n"+
"\t"+"\t"+"sample1\tGAGG\n"+
"\t"+"\t"+"sample2\tCCAA\n"+
"\t"+"The format accepts the following shortcuts:\n"+
"\t"+"1. If multiple barcodes map to the same sample, the sample line can be repeated once per barcode, e.g.\n"+
"\t"+"\t"+"sample1\tATAT\n"+
"\t"+"\t"+"sample1\tGAGG\n"+
"\t"+"\t"+"sample2\tCCAA\n"+
"\t"+"\t"+"sample2\tTGTG\n"+
"\t"+"Or barcodes can be combined using the OR operator '|', i.e. the file above can be rewritten as\n"+
"\t"+"\t"+"sample1\tATAT|GAGG\n"+
"\t"+"\t"+"sample2\tCCAA|TGTG\n"+
"\t"+"2. For the special situation of paired-end data in which barcodes differ at both ends, i.e. with "+
"BARCODE1 and BARCODE2 described for read one and two respectively, barcodes for BARCODE1 and BARCODE2 can be"+
" distinguished using a ':' separator, i.e.\n"+
"\t"+"\t"+"sample1\tATAT:GAGG\n"+
"\t"+"\t"+"sample2\tCCAA:TGTG\n"+
"\t"+"The above syntax means that sample1 is encoded with the ATAT barcode from the BARCODE1 slot AND the GAGG barcode from the BARCODE2 slot. "+
"Note that you can still combine barcodes using '|', e.g.\n"+
"\t"+"\t"+"sample1\tATAT|GAGG:CCAA|TGTG\n"+
"\t"+"3. Extended barcode file format: 3 (single-end) or 4 (paired-end) tab-delimited columns.\n"+
"\t"+"Same as the simple barcode file format, but the extra columns contain the file name(s) used to name output files."+
" A single extra column is expected for single-end while 2 extra columns are expected for paired-end. If lines are duplicated (multiple barcodes "+
"mapping to the same sample), the same file name should be indicated in the third (and fourth) column(s).\n"+
doc="Enables emBASE mode, i.e. fetches information from emBASE and places demultiplexed files directly in the emBASE repository structure.\n"+
"This option is mutually exclusive with BARCODE_FILE.\n"+
"Note: this option forces O=null GZ=true UN=true UF1=null UF2=null STATS_ONLY=false (all other user options supported).\n"
)
public boolean USE_EMBASE=false;
@Option(shortName="RL",optional=true,
printOrder=20,
doc="Describes the read layout(s), e.g. 'RL=1:<BARCODE1:6><SAMPLE:x>', with the ':' character as the delimiter.\n"+
"The first number (i.e. from '1:' above) is the read layout index and it must be unique across all 'RL' inputs as "+
"it is used to match up the read layouts with the fastq files.\n"+
"This number is optional but highly recommended when more than one RL (or FASTQ) input is provided (e.g. paired-end situation).\n"+
"Read layouts are only needed for complex layouts and when the read layout(s) was/were not already embedded in the FASTQ options.\n"+
"This option must not be given when read layouts were embedded in the FASTQ option.\n"+
"When not provided, the index is inferred from the command line order."
)
public List<String> READ_LAYOUT;
@Option(shortName="OL",optional=true,
printOrder=20,
doc="Describes the output file layout(s) using the slots defined in read layouts and ':' to delimit three parts, e.g. 'OL=1:<BARCODE1><UMI1><UMI2>:<SAMPLE1>':\n"+
"\t"+"1. The number in the first part (i.e. from '1:' above) is the output file index and it must be unique across all 'OL' inputs. "+
"It is inferred from the order on the command line when not given.\n"+
"\t"+"2. The second part (i.e. '<BARCODE1><UMI1><UMI2>' above) is the read header layout; when writing multiple UMI and BARCODE slots "+
"in output read headers, these are always separated with the RCHAR (':' by default).\n"+
"\t"+"3. The third part (i.e. '<SAMPLE1>' above) is the read sequence layout.\n"+
"One output file is created for each sample and each OL index. Output file names default to samplename_outputfileindex with the original extensions.\n"+
"### When no OL is described, Je considers that an output file should be created for each input FASTQ (containing a SAMPLE slot) and for each sample.\n"+
"In this scenario:\n"+
"\t"+"1. The output files only contain the BARCODE slots (concatenated if multiple BARCODE slots are described within the same read layout) unless CLIP is set to false\n"+
"\t"+"2. The barcode(s) and sample names are injected in the output file names according to the pattern 'FASTQFILENAMEn_SAMPLENAME_BARCODES.ORIGINALEXTENSIONS'\n"+
"\t"+"3. All UMI slots (if any) are placed in the fastq headers following their slot index, i.e. UMI1:UMI2:...:UMIn, separated with ':' (all UMIs are added to all "
+"file sets identically), unless ADD is set to false."
)
public List<String> OUTPUT_LAYOUT;
@Override
protected String[] customCommandLineValidation(){
/*
* Parse and validate the barcode file
* we first check that a valid file was given, then guess the format, convert it if necessary and parse it
*/
if(!BARCODE_FILE.canRead()){
return new String[]{
"File is not readable: "+BARCODE_FILE.getAbsolutePath()
};
}
File bcFileToParse=BARCODE_FILE;
try{
CSVLine l=FileUtil.readFirstValidLine(BARCODE_FILE.getAbsolutePath(),"\t",false);//hasHeaders is set to false so that the header line, if any, is returned
return new String[]{"Command line option FASTQ="+FASTQ.get(i)+" is not valid, a maximum of three slots delimited with '"+FASTQ_OPTION_SLOT_DELIMITER+"' is allowed"};
}
//the three info to extract from the FASTQ option
String filepath=fileDescr;//the only mandatory one => init with fileDescr in case neither an index nor a read layout was given
int fileindex=i;// init with position in command line, in case it is not given in the option
ReadLayout rl=null;
if(parts.length>=2){
//we have multiple parts, the file path is the last slot , always
filepath=parts[parts.length-1];
//parts[0] CANNOT be empty (see above)
try{
fileindex=Integer.parseInt(parts[0]);
}catch(NumberFormatException nfe){
//parts[0] is not a number; it is then a ReadLayout, but only if the user provided exactly 2 parts
if(parts.length>2){
return new String[]{"Command line option FASTQ="+FASTQ.get(i)+" is not valid: when three slots delimited with '"+FASTQ_OPTION_SLOT_DELIMITER+"' are used, the first one must be the file index"};
}
try{
rl=new ReadLayout(parts[0]);
}catch(Exception e){
log.error(ExceptionUtil.getStackTrace(e));
return new String[]{e.getMessage()+"\nCannot make sense of command line option FASTQ="+FASTQ.get(i)+"! Please check the doc"};
}
}
}
if(parts.length==3){
//then the file index was successfully parsed and parts[1] MUST be a ReadLayout or an empty slot ie '1;;file.txt'
try{
if(StringUtils.isNotBlank(parts[1]))
rl=new ReadLayout(parts[1]);
}catch(Exception e){
log.error(ExceptionUtil.getStackTrace(e));
return new String[]{e.getMessage()+"\nInvalid read layout in command line option FASTQ="+FASTQ.get(i)+"! Please check the doc"};
}
}
//save in tmp maps
if(idx2file.containsKey(fileindex)){
return new String[]{"All FASTQ indices must be unique but at least 2 FASTQ options have been assigned the index "+fileindex+"; for example: FASTQ="+FASTQ.get(i)};