Loop tracing in Python.

LoopTrace is a Python package for processing and analysing 3D DNA tracing data from multiplexed FISH.

Installation

The simplest way to install the package and all dependencies is to clone the repository and create a new environment using the environment.yml file. This installs the looptrace environment, used primarily for image processing and trace extraction. A second environment, specified in environment_analysis.yml, contains a different selection of packages required for postprocessing, analysis and visualization of results. The two environments are compatible and can also be combined into one by manually installing a few extra packages if desired (see the respective environment files).

To install, use a terminal (e.g. a miniforge prompt, https://github.com/conda-forge/miniforge):

git clone https://git.embl.de/grp-ellenberg/looptrace
cd looptrace
conda env create -f environment.yml
conda activate looptrace
pip install .

For the analysis environment:

git clone https://git.embl.de/grp-ellenberg/looptrace
cd looptrace
conda env create -f environment_analysis.yml
conda activate looptrace_analysis
pip install .

If desired, deconvolution of images can be performed to improve spot fitting efficiency during tracing; this is done using the flowdec package (https://github.com/hammerlab/flowdec). A GPU is strongly recommended, and this requires a TensorFlow version compatible with your GPU and CUDA versions. To install flowdec with TensorFlow for GPU:

conda activate looptrace
pip install flowdec[tf_gpu]

If using CellPose for nucleus segmentation:

conda activate looptrace
pip install cellpose

Basic usage (tracing):

The LoopTrace package contains a set of Python modules (located in the looptrace directory) and Python command-line scripts (located in the bin/cli directory) that conveniently run a number of processing steps. These scripts are designed especially for use in an HPC environment, but can also be used on a single PC. The scripts can be run directly from the command line or, more conveniently, via the jupyter notebooks (see the examples directory). Each script includes minimal usage documentation describing its required and optional command-line options.

Data and module organization

The main looptrace modules handle reading of images and tabular data (ImageHandler.py), deconvolution (Deconvolver.py), image registration (Drifter.py), nucleus segmentation (NucDetector.py), tracing region segmentation (SpotDetector.py), and spot fitting and trace extraction (Tracer.py). General functions are located in separate files in the looptrace directory, including image reading and writing (image_io.py), general image processing functions (image_processing_functions.py) and tracing analysis functions (trace_analysis_functions.py).

Examples of direct usage of the modules are found in the relevant command line scripts in the bin/cli directory.

Most command line scripts require two inputs: the location of a master experiment-specific config file that defines all user-adjustable parameters, and the location of a directory containing the image files. An example of such a config file, with documentation of the adjustable parameters, is found in the examples directory.

The image folder is expected to contain subfolders with images in any of the following formats:

  • Sequential 3D (single stack) Nikon nd2 images with the naming scheme Time00000_Point0000_Channel546,638_Seq0000.nd2 (example of the first stack). This is the standard naming format of Nikon JOBS output.
  • Typically, data is converted to ZARR format to reduce storage requirements and improve throughput. The ZARR ZipStore format is preferred to reduce the number of files, but other ZARR formats (e.g. OME-ZARR images) are also compatible. The scripts assume that each subfolder contains a series of zip files or directories of ZARR images corresponding to separate fields of view, each holding 3D stacks with one or more time points and channels.
  • Series of numpy arrays (ending in .npy) with a sequential naming scheme.

In addition, the following formats will be loaded if placed directly in the top-level image folder:

  • Single Zip files ending in .npz containing a series of numpy arrays.
  • Single numpy arrays (.npy).

Internally, all files are virtually loaded into a list of numpy or Dask arrays, where each list entry is a field of view in the format of TCZYX (time/frame, channel, 3D stack). If other file formats were to be used, these would be compatible with all modules if they can be read into the same (Dask) array structure. See image_io.py and ImageHandler.py for inspiration.
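
For orientation, here is a minimal sketch of that in-memory structure, assuming standard zarr/dask stores (the package's own reading logic lives in image_io.py and ImageHandler.py and may differ in detail):

import dask.array as da

def load_positions_as_tczyx(store_paths):
    """Lazily load one array per field of view, each shaped (T, C, Z, Y, X)."""
    positions = []
    for path in store_paths:                  # one ZARR store (directory or zip) per field of view
        positions.append(da.from_zarr(path))  # lazy Dask array backed by the on-disk store
    return positions

# positions[0][t, c] would then be one 3D (Z, Y, X) stack of the first field of view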

Processing preparation

  • Create a master experiment folder with "analysis" and "images" sub-directories.
  • Copy the example config file into the analysis directory and update the parameters as appropriate for the experiment (a scripted version of these preparation steps is sketched after this list).
  • Place the sequential images and other images (e.g. corresponding DAPI or IF images) in the images directory, using the image organization as described above.
  • Run the appropriate command line scripts directly, or follow the examples in the jupyter notebooks in the examples folder, to complete the processing.
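
The first two preparation steps can also be scripted; a minimal sketch (the experiment root and example config file name below are placeholders, not paths defined by the package):

from pathlib import Path
import shutil

exp = Path("/path/to/my_experiment")                   # placeholder experiment root
(exp / "analysis").mkdir(parents=True, exist_ok=True)  # analysis sub-directory
(exp / "images").mkdir(exist_ok=True)                  # images sub-directory
# copy the example config from the cloned repository and edit its parameters
shutil.copy("looptrace/examples/config_example.yml", exp / "analysis")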

Typical workflow

  • convert_datasets_to_zarr.py: Images are converted to ZARR format. The number of fields of view is set manually here, and the --zipstore option is recommended to enforce use of the ZARR ZipStore format. Output: subfolder in the image directory (or another directory if specified) with images in the chosen ZARR format.
  • nuc_image_save.py: Save a separate set of (usually DAPI-stained) nucleus images, specified in the config file, either in 3D or in 2D as single slices or maximum projections for convenient segmentation. Output: nuc_images subfolder in the images directory.
  • nuc_label.py: Segment nuc_images using the segmentation approach defined in the config file: a simple threshold, CellPose, or a trained random forest model (requires a file called nuc_rf_model.pickle in the analysis folder, generated in a separate random forest training notebook; see the example in the examples folder). Output: nuc_masks subfolder in the images directory.
  • nuc_label.py: (Optional) Apply classification to nuc_images based on the nuc_masks segmentation. No automatic classification routines are currently implemented, so this simply creates a new dataset ready for manual classification. Output: nuc_classes subfolder in the images directory.
  • nuc_label_qc.py: (Optional) Manually refine nuc_masks segmentations or nuc_classes classifications using napari. Label 0 is generally background; otherwise the classification scheme is up to the user.
  • chromatic_abberation.py: If tracing in multiple channels, calculate the chromatic aberration correction between a reference channel and the other channels. This is applied in post-processing. Output: experiment_name_image_name_frame_XX_ref_ch_Y_mov_chs_[Z]_affine.npy in the analysis directory.
  • extract_exp_psf.py: An experimental PSF is calculated by averaging fiducial bead stacks located in the images directory. Set which image to extract from in the config file; the segmentation parameters are reused from the drift-correction fiducial segmentation parameters. Output: exp_psf.npy in the images directory.
  • decon.py: Images are deconvolved using an experimental or theoretical PSF (via flowdec; an experimental PSF is recommended). Set which image and which channels to deconvolve in the config file; all frames/timepoints are deconvolved by default. Output: input_image_name_decon subfolder in the images folder containing ZARR images.
  • drift_correct.py: Drift correction/registration is calculated from cross-correlation of the fiducial bead signal, first at the pixel level and then at the subpixel level, either by calculating centroid offsets of segmented fiducials (recommended) or by subpixel cross-correlation of segmented fiducials; 3D rigid translation only (see the sketch after this list). A template (fixed) frame in the same or a different image is set in the config, as are the parameters for fiducial segmentation. Output: experiment_name_input_image_name_drift_correction.csv tabular data in the analysis folder with the calculated offsets at pixel and sub-pixel level. For more challenging drift correction, drift_correct_elastix.py is provided; it requires the ITK elastix package and additional configuration files.
  • save_dc_images.py: (Optional) Save the indicated images after drift correction has been calculated and applied, for visualizing 3D stacks or maximum projections. Usually only done on a few fields of view for quality control and visualization, as downstream analysis typically does not use this dataset. Output: input_image_name_dc subfolder in the images directory.
  • detect_spots.py: Segment spots based on regional or library-specific barcode signals; the parameters for this are defined in the config file. Can subtract fiducial signal using blank frames to reduce segmentation of fiducials. Segmentation can be based on defined thresholds in the intensity image, on thresholds after a difference-of-gaussians filter, or on a random forest model (random forest is usually the most robust). The random forest needs to be pre-trained in a separate notebook to generate spots_rf_model.pickle in the analysis folder. Input can be either non-deconvolved or deconvolved images; non-deconvolved often works best if the signal is sufficient. Use detect_spots_preview.py to test parameters and preview segmentation results; this should be done in the looptrace_analysis environment, as it requires napari. Output: experiment_name_input_image_name_rois.csv tabular data defining the detected tracing ROIs.
  • extract_spots_table.py: Generates a new table of all the ROIs to be extracted from the sequential images; this table takes the previously calculated drift correction into account and is used for extraction of ROI images and tracing. Output: experiment_name_input_image_name_rois_dc.csv in the analysis folder.
  • extract_spots.py: Extracts the ROIs defined by the ...rois_dc.csv file from the input image. The input here is usually the deconvolved images, to boost detection/fitting efficiency. Output: subfolder in the images directory named "spot_images_dir" with 5D (TCZYX) numpy arrays (.npy) named after the original field of view and ROI ID.
  • tracing.py: Performs 3D Gaussian centroid fitting on the images in the trace_input_image folder (i.e. spot_images_dir); see the sketch after this list. Different parameters can be used, but the defaults generally work well. Output: experiment_name_input_image_name_traces.csv with the fitting parameters (used for quality control during analysis) and the centroid positions of the fits (z, y, x), defined in the coordinate system of the drift-correction reference frame of the full field of view (for making overlays).
  • assign_spots_to_nucs.py: If nuc_masks and optionally nuc_classes have been created, this script assigns ROIs and traces (if the files are present in the analysis folder) to a nucleus segmentation and classification label, or 0 (background, outside segmented nuclei). Can be used to filter out ROIs/traces outside nuclei and to assign traces to different cell classes based on the defined classes. Note: if segmentation of out-of-nucleus background signal is a prevalent issue, this script can be run directly after detect_spots.py, and the ..._rois.csv table can be filtered to only contain spots found inside nuclei before proceeding. Reset the index before manually overwriting the ...rois.csv file. Output: overwrites the ..._rois.csv and ..._traces.csv files, adding extra columns with the nuc_label and nuc_class assignments for each fit.
  • full_chromosome_decode.py: Alternative to tracing.py for fitting and decoding multiplexed labeling of e.g. whole chromosomes, assuming a combined segment-spot decoding scheme. Needs additional parameters in the config file, especially a list of the probe sets used, so that spot fitting and decoding frames can be identified automatically. Output: experiment_name_input_image_name_full_chromosome_decode.csv
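
The drift-correction step above combines pixel-level cross-correlation with a subpixel refinement from fiducial centroids. A minimal, self-contained sketch of that idea using scikit-image and scipy (not the package's own implementation in Drifter.py, which segments individual fiducials using the parameters in the config file; threshold here is an illustrative stand-in for that segmentation):

import numpy as np
from scipy import ndimage as ndi
from skimage.registration import phase_cross_correlation

def coarse_and_fine_shift(fixed, moving, threshold):
    """Estimate a 3D rigid translation between two fiducial bead stacks."""
    # pixel-level shift from cross-correlation of the full stacks
    coarse, _, _ = phase_cross_correlation(fixed, moving)
    shifted = ndi.shift(moving, coarse)
    # subpixel refinement: offset between centroids of the thresholded (segmented) beads
    fixed_centroid = np.array(ndi.center_of_mass(fixed > threshold))
    moving_centroid = np.array(ndi.center_of_mass(shifted > threshold))
    return coarse, fixed_centroid - moving_centroid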

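The 3D Gaussian fitting in tracing.py is, in essence, a least-squares fit of a 3D Gaussian to each extracted spot image. A simplified, standalone sketch of that idea (not the package's own fitting code; the initial guesses are illustrative):

import numpy as np
from scipy.optimize import curve_fit

def gauss3d(coords, amp, z0, y0, x0, sz, sy, sx, offset):
    z, y, x = coords
    g = offset + amp * np.exp(-((z - z0) ** 2 / (2 * sz ** 2)
                                + (y - y0) ** 2 / (2 * sy ** 2)
                                + (x - x0) ** 2 / (2 * sx ** 2)))
    return g.ravel()

def fit_spot(img):
    """Fit one 3D spot image and return its centroid (z, y, x) in pixel units."""
    coords = np.indices(img.shape)
    p0 = [img.max() - img.min(), *(s / 2 for s in img.shape), 1.5, 1.5, 1.5, img.min()]
    popt, _ = curve_fit(gauss3d, coords, img.ravel(), p0=p0)
    return popt[1:4]
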
Slurm cluster usage

All of the command-line scripts are designed for use in an HPC environment, especially a Slurm cluster. Parallelisation is done across fields of view, which are generally assigned to individual nodes using an array job (see the sketch below). The submit_cluster_job.py script is a helper that automatically creates SBATCH files for running the scripts described above on the cluster. An example of cluster processing through a jupyter notebook is located in the examples folder. The cluster_analysis_cleanup.py script should be run after all scripts that output tabular data have finished; it combines the individual .csv files output by each cluster node before proceeding.
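
As a sketch of this parallelisation scheme, each array task can select its field of view from the SLURM_ARRAY_TASK_ID environment variable that Slurm sets for array jobs (the variable handling below is illustrative; submit_cluster_job.py generates the actual SBATCH files):

import os

# Slurm sets SLURM_ARRAY_TASK_ID for each task of an array job;
# use it to pick one field of view per task.
fov_index = int(os.environ.get("SLURM_ARRAY_TASK_ID", 0))
print(f"Processing field of view {fov_index}")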

Postprocessing and analysis

Once the ..._traces.csv or ..._full_chromosome_decode.csv tabular files have been created, postprocessing typically proceeds in a jupyter notebook using the functions in trace_analysis_functions.py, including:

  • Assigning genomic positions to fits by cross-referencing the probe libraries.
  • Quality control based on fitting parameters (signal-to-background, Gaussian standard deviation, overall trace length)
  • Chromatic aberration correction (affine_transform_points function: the output of chromatic_abberation.py is the affine matrix used as input and applied to all points from the non-reference channel; it must be applied to points in pixel units, before nm conversion)
  • Visualization of median pairwise distance maps (plot_heatmap function; see the sketch after this list)
  • Visualization of single traces (trace_3d_mayavi function)
  • Procrustes analysis (general_procrustes_analysis function)
  • Calculating single trace metrics (trace_metrics function, full_chr_analysis function for additional metrics relevant to full chromosomes)
  • Clustering approaches such as Fiedler seriation (fiedler_seriation function) or different scikit-learn based clustering approaches (pwd_clustering function)
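
The median pairwise distance map itself is a per-trace distance matrix followed by an element-wise median over all traces. A standalone sketch of that computation with scipy (independent of the plot_heatmap function; it assumes all traces contain the same number of probes, with NaN rows for probes rejected by quality control):

import numpy as np
from scipy.spatial.distance import pdist, squareform

def median_pwd_map(traces):
    """traces: iterable of (n_probes, 3) arrays of fitted (z, y, x) positions."""
    # distances involving a NaN position become NaN and are ignored by nanmedian
    maps = [squareform(pdist(t)) for t in traces]
    return np.nanmedian(np.stack(maps), axis=0)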

Authors

Created by Kai Sandvold Beckwith (kai.beckwith@embl.de), Ellenberg group, CBB, EMBL Heidelberg.

License

MIT

Citation

Please cite our paper describing "LoopTrace": https://www.biorxiv.org/content/10.1101/2021.04.12.439407v5