We will use the analysis pipeline NanoCLUST to characterise the relative abundance of species in our kefir community.
NanoCLUST takes as input the sequencing reads as fastq files and returns a csv file with the relative abundances of the species which it identified to be present in the sample.
For more information on how NanoCLUST works have a look at the Applications Note.
First, you will need to copy the NanoCLUST pipeline from another location to your home directory ($HOME).
cp -r /mnt/data/NanoCLUST/ ${HOME}
Then you need to define a variable NCROOT with the path to it. We also need to create a variable to indicate at which of the hubs you are located. Replace <your_hub> with HD_hub or BA_hub, depending on where you are.
NCROOT="${HOME}/NanoCLUST/"
HUB=<your_hub>
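Since the later paths are built from HUB, a typo in it only shows up later as a missing directory. The following is a minimal sketch of a sanity check, assuming HD_hub and BA_hub are the only valid values; the variable hub_status and the example value are for illustration only.

```shell
# Example value for illustration; use the hub you are actually located at.
HUB="HD_hub"

# hub_status is a hypothetical helper variable, not used by NanoCLUST.
case "${HUB}" in
    HD_hub|BA_hub) hub_status="ok" ;;
    *)             hub_status="invalid" ;;
esac

echo "HUB=${HUB} (${hub_status})"
```

If the check reports "invalid", correct HUB before continuing.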
## 2. Specify your barcode and locate the data

To save time, we will split up the analysis of the different barcodes amongst you. Set the variable barcode to the barcode that was assigned to you (replace 05 in the code chunk with your barcode number):
barcode=barcode05
The data are in a shared directory, for which we create a variable as well. For the further analysis we use the reads that passed quality control. You will find them in fastq_pass. We change directory to there.
DTROOT="/g/teachinglab/data/MCD2022/${HUB}/TestData"
cd $DTROOT/fastq_pass
The files containing the reads have the ending .fastq.gz.
The sequencer partitions the reads for each barcode into files with 4000 reads each or provides all reads in one file per barcode. You first have to concatenate the individual files before you analyse them with NanoCLUST.
To do so, first navigate to the folder that contains the individual files:
cd $DTROOT/fastq_pass/$barcode
Then, concatenate all files in the folder and save the new file one level up in the path under the name <your_barcode>_concat.fastq.gz like this:
cat * > ../"${barcode}_concat.fastq.gz"
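Plain cat works on compressed files because a gzip file may consist of several compressed members placed back to back, and decompression tools read them in sequence. The sketch below illustrates this on two made-up one-read files in a temporary directory; all file names and reads are invented for the demonstration.

```shell
# Create two tiny gzip-compressed FASTQ files with one made-up read each.
demo_dir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' | gzip > "${demo_dir}/part_0.fastq.gz"
printf '@read2\nTTGA\n+\nIIII\n' | gzip > "${demo_dir}/part_1.fastq.gz"

# Concatenate the compressed files directly, without decompressing them.
cat "${demo_dir}"/part_*.fastq.gz > "${demo_dir}/concat.fastq.gz"

# Each FASTQ record spans four lines, so reads = lines / 4.
n_lines=$(gzip -dc "${demo_dir}/concat.fastq.gz" | wc -l | tr -d ' ')
n_reads=$((n_lines / 4))
echo "reads in concatenated file: ${n_reads}"
```

The same line count can be used on your own concatenated file to check that no reads were lost.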
To run NanoCLUST, first navigate to the root path of the copy of NanoCLUST on your VM.
cd $NCROOT
Define variables for the path to the data and the path that NanoCLUST should write its output to (both are given here as absolute paths):
data_dir="${DTROOT}/fastq_pass"
output_dir="/g/teachinglab/data/MCD2022/${HUB}/OutputData/${barcode}"
Now you are ready to run NanoCLUST. Note that the command below can fail initially, in which case it should be sufficient to start the command a second time.
nextflow run main.nf -profile conda --umap_set_size 130000 \
    --cluster_sel_epsilon 0.25 \
    --min_cluster_size 50 \
    --reads "${data_dir}/${barcode}_concat.fastq.gz" \
    --db db/16S_ribosomal_RNA --tax db/taxdb \
    --outdir "${output_dir}"
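Because the first attempt can fail, restarting the command can also be automated. The following is a minimal sketch of a retry wrapper; run_with_retry and flaky are hypothetical helpers written for this illustration and are not part of NanoCLUST or Nextflow. The stand-in command flaky fails once and then succeeds, mimicking the behaviour described above.

```shell
# Hypothetical helper: run a command, retrying up to max_attempts times.
run_with_retry() {
    max_attempts=$1
    shift
    attempt=1
    until "$@"; do
        if [ "${attempt}" -ge "${max_attempts}" ]; then
            echo "giving up after ${attempt} attempts" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        echo "command failed, retrying (attempt ${attempt})" >&2
    done
}

# Stand-in for the real command: fails on the first call, succeeds afterwards.
marker=$(mktemp -u)
flaky() {
    if [ -e "${marker}" ]; then
        return 0
    fi
    touch "${marker}"
    return 1
}

run_with_retry 2 flaky
retry_status=$?
rm -f "${marker}"
echo "exit status: ${retry_status}"
```

To use the wrapper for the real run, the stand-in would be replaced by the full command, for example `run_with_retry 2 nextflow run main.nf ...` with the same options as above.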
When NanoCLUST is finished, you will find your results in $output_dir/<your_barcode>/, where <your_barcode> corresponds to the barcode you were assigned.
To prepare for the visualization of the NanoCLUST results, the NanoCLUST results for all individual barcodes need to be combined in one file.
One of the trainers will do that for you with the following commands:
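input_dir="/g/teachinglab/data/MCD2022/${HUB}/OutputData/"
output_file="${HOME}/embo_mcd22/data/NanoClust_output/combined_full2.csv"
# Write the header line (this also empties any existing file)
echo "sample,tax,taxid,rel_abundance" > "${output_file}"
# Loop over the taxonomic level codes used in the result file names
for tax in F G O S; do
    echo "${tax}"
    while read -r f; do
        # Derive the sample name from the file name
        sample=$(basename "${f}" | sed "s/\.fastq_${tax}\.csv//g")
        # Skip the file's header and prepend sample and level to each row
        tail -n +2 "${f}" | sed "s/^/${sample},${tax},/g" >> "${output_file}"
    done < <(find "${input_dir}" -name "*.fastq_${tax}.csv" | sort)
done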
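To see what the combining step produces, the sketch below applies the same logic to a single toy result file in a temporary directory. The file name, column names and values are made up for illustration, and the loop reads from a pipe instead of process substitution so that it also runs in a plain POSIX shell.

```shell
demo_dir=$(mktemp -d)

# Toy per-barcode result file; column names and values are invented.
printf 'taxid,rel_abundance\n111,0.75\n222,0.25\n' \
    > "${demo_dir}/barcode05_concat.fastq_G.csv"

combined="${demo_dir}/combined.csv"
echo "sample,tax,taxid,rel_abundance" > "${combined}"

tax=G
find "${demo_dir}" -name "*.fastq_${tax}.csv" | sort | while read -r f; do
    # Sample name = file name without the ".fastq_G.csv" suffix
    sample=$(basename "${f}" | sed "s/\.fastq_${tax}\.csv//g")
    # Drop the per-file header and prefix each row with sample and level
    tail -n +2 "${f}" | sed "s/^/${sample},${tax},/g" >> "${combined}"
done

cat "${combined}"
first_row=$(sed -n '2p' "${combined}")
```

Each data row of the combined file starts with the sample name and the level code, followed by the columns of the original file.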
To visualise the results, move on to this notebook.