Jean-Karim Heriche · 40577294
--- a/Using-the-IDE.md
+++ b/Using-the-IDE.md
@@ -14,7 +14,7 @@ Upon starting, the app opens with the data input workspace. Other tools can only
 ### Data input
 This is where the data table is uploaded to the app. Because the app makes few assumptions on the table content, it is up to the user to indicate which columns contain relevant information. The data input workspace is divided into boxes corresponding to the different input required from the user:
 * **Input data file**  
-This is where the tabular data file is selected and uploaded to the app server. Before uploading a file, make sure that the file has a header.  
+This is where the tabular data file is selected and uploaded to the app server. Before uploading a file, make sure that it conforms to the specifications laid out in the section [Preparing the data for use with the IDE](https://git.embl.de/heriche/image-data-explorer/-/wikis/Preparing%20the%20data%20for%20use%20with%20the%20IDE).  
 * **Plot variables**  
 This allows to select one or two columns whose values will be shown in the plot area. If two columns are selected but the plot type needs only one, the first selected column will be used. If the values are associated with plate/well information, they will be rendered as a colour gradient on the wells of the corresponding plate in the plate viewer. When only one column is selected, it will be rendered on the y-axis of the scatterplot with the index of the data points on the x-axis. 
 * **Additional variables to display on hover**  
@@ -22,7 +22,7 @@ By default when hovering over a point in the plot, the values of the plot variab
 * **Columns to hide**  
 Hiding columns minimizes the amount of horizontal scrolling needed to reach columns on the right-hand side of the table when all columns can't fit on screen.
 * **Groups**  
-This allows to select one column whose values will be used to set colours for the points in the scatterplot and split the values of the plot variable for plotting one histogram per group. Only 9 distinct colours are available so any selected column with more than 9 distinct values will be ignored.
+This allows selecting one column whose values will be used to colour the points in the scatterplot and to split the values of the plot variable so that one histogram is plotted per group. Only 24 distinct colours are available, so any selected column with more than 24 distinct values will be ignored.
 * **Images**  
 If images are associated with the data file, select the image root directory. The image root directory is the top level directory relative to which the image file paths are given in the data table. The image root directory can be an S3-compatible object store. Then select one or two columns containing the paths to the images relative to the selected root directory. If a column name contains the pattern 'image.*path', this column will be preselected in the image 1 field.  
 * **ROIs**  
@@ -32,11 +32,11 @@ If the rows correspond to time points only (i.e. with no ROI definition), select
 High-throughput microscopy is often carried out in multiwell plates and each row of the data table is associated with a plate and a well of that plate. This is where the columns containing plate and well information are selected. When images from multiple fields of views inside a well are available, these should be identified in a separate column selected under column for fields/positions. 
 * **Save parameters**
 The information entered into the other boxes (except for the input data file) can be saved and downloaded into a configuration file in rds format. When a data file is uploaded, a browse button will appear allowing selection and upload of a previously saved configuration file. Upon upload of this file, input boxes will be populated with the saved values from the file.  
-Currently no attempt is made at checking the validity of an uploaded configuration file. Mismatches between the configuration file values and the column names of the uploaded data file can result in unpredictable behaviour.
+No attempt is made to check the validity of an uploaded configuration file. **Mismatches between the configuration file values and the column names of the uploaded data file can result in unpredictable behaviour and crashes**.

 ### Explore
-The explore workspace is where the interactive data visualization happens. It is divided into 3 areas:
-* **A plot** area on the top left of the screen. By default, this shows a scatterplot of the variables selected in the data input section. If columns for plates and wells have been selected, a plate viewer is also available. Clicking on a data point in the plot or a well in the plate viewer selects it in the data table below and opens the corresponding image(s). If the point is associated with x,y coordinates then a red dot is added to the image(s) at the position given by these coordinates.
+The explore workspace is where the interactive data visualization takes place. It is divided into 3 areas:
+* **A plot** area on the top left of the screen. By default, this shows a scatterplot of the variables selected in the data input section. If columns for plates and wells have been selected, a plate viewer is also available. Clicking on a data point in the plot or a well in the plate viewer selects it in the data table below and opens the corresponding image(s). If the point is associated with x,y coordinates then a red dot is added to the image(s) at the position given by these coordinates. If a group column has been selected, the corresponding labels will be shown in the plot legend. Clicking on items in the legend will hide/show the corresponding points.
 * **An image viewer** area next to the plot area. This is where images selected under image 1 in the data input section will appear.
 Clicking on the image selects the corresponding row in the data table and highlights the corresponding point in the plot. If rows correspond to ROIs then the click position is indicated by a red dot and the data point corresponding to the closest ROI in the image is selected in both the data table and the plot. Pressing the shift key while clicking anywhere on the image enters the multiple selection mode where each subsequent shift+click is recorded and indicated by a cyan dot. Clicking anywhere on the image without pressing shift exits the multiple selection mode. When zoomed in, the keyboard arrow keys can be used to move the field of view. A list of actions available in the image viewer is available by pressing h. 
 * **A data table** area at the bottom of the screen. The data table shows the content of the uploaded data file. A tab allows switching to a second image viewer where images selected under image 2 in the data input section will appear. This second image viewer behaves like the one described above. Clicking on a table row selects it and highlights the corresponding point in the plot and in the image viewers. No image is shown when selecting multiple rows unless the corresponding objects belong to the same image. The table is searchable globally using the 'Search' box in the top right corner above the table or by column using the boxes atop each column. Searches filter the rows to be displayed in the table. To select all the rows and highlight them in the plot, click the button labeled 'Show filtered rows in plot' above the table. To deselect all selected rows, click the 'Clear selection' button. To annotate the selected rows, click the 'Annotate selection' button. This is only available if an annotation column has been chosen in the 'Annotate' section.
@@ -45,7 +45,7 @@ Clicking on the image selects the corresponding row in the data table and highli
 Individual data points, i.e. rows of the data table, can be associated with a label. Annotation starts by visiting the 'Annotate' workspace (accessible from the sidebar). There, a column to hold the annotations can be selected and labels defined. If an existing column is selected, its distinct values will be available as labels. New labels can also be added. Alternatively, a new column can be created, in which case, new labels must be provided. New labels must be entered as a comma-separated list. Once done, choices must be confirmed by clicking the 'Apply' button. Annotations can then be performed using the 'Annotate selection' button in the 'Explore' workspace.

 ### Dimensionality reduction
-To help visualize the overall structure of the data, several numerical variables can be combined into a 2d projection. The numerical columns and method to use can be selected in the 'Dimensionality reduction' section. Application of a dimensionality reduction method results in the creation of two new columns containing new coordinates for all data points. When running the same method multiple times, coordinates columns are re-used (i.e. new columns are not created for each new run). Upon successful completion of the dimensionality reduction, the new columns are automatically selected for plotting and the view switches back to the 'Explore' workspace.
+To help visualize the overall structure of the data, several numerical variables can be combined into a 2D projection. The numerical columns and the method to use can be selected in the 'Dimensionality reduction' section. Applying a dimensionality reduction method creates two new columns containing new coordinates for all data points. **When the same method is run multiple times, the coordinate columns are reused (i.e. new columns are not created for each new run)**. Upon successful completion of the dimensionality reduction, the new columns are automatically selected for plotting and the view switches back to the 'Explore' workspace.

 ### Classification and feature selection
 In this workspace, data points can be classified using the XGBoost implementation of gradient boosted decision trees. Its input are a set of numerical features and a training set consisting of rows annotated with the classes to consider in the selected target annotation column. The IDE does 5-fold cross-validation using 2/3 of the annotated data for training and 1/3 for validation. It outputs a plot of feature importance and some statistics on the classifier performance. The plot shows the features ranked by importance (using the gain score which measures the improvement in accuracy by using the feature) and how these features cluster together. The classifier can be applied to the whole data set with the outcome put into an additional column named xgboost.predictions.
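
Two of the column rules described under 'Data input' above can be checked before uploading a table: a group column is ignored if it has more than 24 distinct values, and a column whose name contains the pattern 'image.*path' is preselected for image 1. The following is a minimal Python sketch of those two rules, not the app's own code. The function names are hypothetical, and case-sensitive matching and picking the first matching column are assumptions that the text above does not confirm.

```python
import re

# Limit stated in the description of the 'Groups' box.
MAX_GROUP_COLOURS = 24


def usable_group_column(values):
    """Return True if the column has at most 24 distinct values.

    Hypothetical helper: mirrors the documented rule that a group column
    with more than 24 distinct values is ignored.
    """
    return len(set(values)) <= MAX_GROUP_COLOURS


def preselect_image_column(column_names):
    """Return the first column name containing 'image.*path', or None.

    Assumptions (not confirmed by the documentation): the match is
    case-sensitive and the first matching column is the one chosen.
    """
    pattern = re.compile(r"image.*path")
    for name in column_names:
        if pattern.search(name):
            return name
    return None


if __name__ == "__main__":
    columns = ["plate", "well", "image_file_path", "treatment"]
    print(preselect_image_column(columns))
    print(usable_group_column(["ctrl", "drugA", "drugB", "ctrl"]))
```

Running such a check on the table header and on the intended group column before upload can explain why a selected group column appears to have no effect in the plot.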