Forked from Bio-IT Workshops / embl_hpc

Predoc course

  • Data cleaning (missing values imputation)
  • Normalization
  • PCA
  • LDA
  • K-means
  • Hierarchical clustering
  • SVM
  • Logistic regression
  • KNN

Data cleaning

  1. Load behavioral and neural data from the 'noisy' folders and check for missing values.
  2. Provide estimates for all missing values using different approaches:
    • manual
    • univariate imputer
    • multivariate imputer
    • knn
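
A minimal sketch of steps 1 and 2, using SimpleImputer and KNNImputer on a small made-up frame (the column names and values here are stand-ins for the real 'noisy' behavioral data):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical stand-in for the 'noisy' behavioral data.
df = pd.DataFrame({"speed": [1.0, np.nan, 3.0, 4.0],
                   "lick_rate": [0.2, 0.4, np.nan, 0.8]})

# Step 1: which columns contain missing values?
print(df.isna().any())

# Step 2a: univariate imputation, each NaN replaced by its column mean.
mean_imputed = SimpleImputer(strategy="mean").fit_transform(df)

# Step 2b: KNN imputation, each NaN replaced by the mean of the
# 2 rows closest in the remaining features.
knn_imputed = KNNImputer(n_neighbors=2).fit_transform(df)
```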

Normalization

  1. Given a time series X, imagine applying the three following scalings to it. How do you expect X to look in each of the three cases after the transformation, and why?

X_{scaled} = \frac{X - \bar{X}}{\sigma_X}

X_{scaled} = \frac{X - X_{min}}{X_{max} - X_{min}}

X_{scaled} = \frac{X}{|X|_{max}}

  2. Apply all three scalings to some fluorescence traces and visualize them before and after scaling.

PCA

  1. Apply PCA to a fluorescence dataset, visualise the data in the PC1-PC2 space, plot the factor loadings and print the explained variance.
  2. Try the exercise above with a different fluorescence dataset to see if/how the projected data differ.
  3. Use the behavioral data to color-code the 2D points: are there any clusters?
  4. What happens if we scale the fluorescence dataset before performing PCA?
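
A sketch of step 1 with random stand-in data (a real fluorescence matrix, time points by cells, would replace X):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))   # hypothetical: 50 time points x 5 cells

pca = PCA(n_components=2)
scores = pca.fit_transform(X)            # the data in the PC1-PC2 space
loadings = pca.components_               # factor loadings, shape (2, 5)
print(pca.explained_variance_ratio_)     # fraction of variance per PC
```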

Agglomerative clustering

  1. Explore the blob dataset with different distance thresholds and linkages.
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=2, cluster_std=0.8, random_state=65)

  2. Apply agglomerative clustering to a fluorescence dataset and visualize the result on a lower-dimensional representation of your data (e.g. PCA).
  3. Visualize the dendrogram and explore different numbers of clusters.
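
A sketch combining the blob dataset above with a distance-threshold cut and scipy's dendrogram machinery (the threshold value 5.0 is just an example to vary):

```python
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=2, cluster_std=0.8, random_state=65)

# Cut the tree by distance instead of fixing the number of clusters.
agg = AgglomerativeClustering(n_clusters=None, distance_threshold=5.0,
                              linkage="ward")
labels = agg.fit_predict(X)
print(agg.n_clusters_)   # number of clusters implied by the threshold

# The same hierarchy via scipy, for plotting the dendrogram.
Z = linkage(X, method="ward")
# dendrogram(Z)  # uncomment with matplotlib available to draw the tree
```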

K-Means

  1. Apply K-Means clustering to a fluorescence dataset and visualize the result on a lower-dimensional representation of your data (e.g. PCA).
  2. Compare the result with the one obtained with agglomerative clustering. How do you evaluate the goodness of a clustering?
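
One common answer to the "goodness" question is the silhouette score; a sketch on the blob data introduced above:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=100, centers=2, cluster_std=0.8, random_state=65)

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)

# Silhouette score lies in [-1, 1]; higher means tighter,
# better-separated clusters.
score = silhouette_score(X, labels)
print(score)
```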

LDA

  1. Apply LDA to a fluorescence dataset and color by behavioral label.
  2. Compare result to PCA.
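
A sketch of the supervised LDA projection on made-up two-class data (real fluorescence features and behavioral labels would replace X and y):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Hypothetical data: two behavioral classes, shifted in feature space.
X = np.vstack([rng.normal(0, 1, size=(40, 5)),
               rng.normal(2, 1, size=(40, 5))])
y = np.array([0] * 40 + [1] * 40)

# With 2 classes LDA gives at most 1 discriminant axis
# (PCA, in contrast, ignores the labels entirely).
lda = LinearDiscriminantAnalysis(n_components=1)
X_lda = lda.fit_transform(X, y)
```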

SVMs

  1. Apply SVM (for classification) to a training subset of a fluorescence dataset and test on the remaining data. Use some categorical behavioral data as classification labels. Compute the accuracy score.
  2. Compare different results obtained with different kernels.
  3. Apply PCA to the whole dataset and visualize only the test. Color-code using the labels predicted by SVM.
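
A sketch of the train/test workflow, with blobs standing in for fluorescence features and categorical behavioral labels:

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical stand-in data: 2 'behavioral' classes.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = SVC(kernel="rbf")          # swap in 'linear' or 'poly' to compare kernels
clf.fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print(acc)
```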

Logistic regression

  1. Apply logistic regression to a training subset of a fluorescence dataset and test on the remaining data. Use some categorical behavioral data as classification labels. Compute the accuracy score.
  2. Apply PCA to the whole dataset and visualize only the test. Color-code using the labels predicted by logistic regression.
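
The same split-fit-score pattern with logistic regression, again on blob stand-in data:

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print(acc)
```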

KNN

  1. Apply KNN to a training subset of a fluorescence dataset and test on the remaining data. Use some categorical behavioral data as classification labels. Explore different values of the "weights" parameter and compute the accuracy score.
  2. Apply PCA to the whole dataset and visualize only the test. Color-code using the labels predicted by KNN.
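
A sketch looping over the two standard values of KNeighborsClassifier's "weights" parameter, on blob stand-in data:

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

scores = {}
for weights in ("uniform", "distance"):  # equal vs. distance-weighted votes
    knn = KNeighborsClassifier(n_neighbors=5, weights=weights)
    knn.fit(X_train, y_train)
    scores[weights] = knn.score(X_test, y_test)
print(scores)
```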

Hints

Data cleaning

  1. Use pandas.DataFrame.isna and pandas.DataFrame.any
  2. sklearn.impute.SimpleImputer, sklearn.impute.IterativeImputer, sklearn.impute.KNNImputer. Note that these methods may require the data to be encoded as float values (pandas.DataFrame.replace), and that IterativeImputer is experimental and must be enabled first with from sklearn.experimental import enable_iterative_imputer

Normalization

  2. sklearn.preprocessing.StandardScaler, sklearn.preprocessing.MinMaxScaler, sklearn.preprocessing.MaxAbsScaler

PCA

  1. sklearn.decomposition.PCA

Agglomerative clustering

  2. sklearn.cluster.AgglomerativeClustering
  3. scipy.cluster.hierarchy.dendrogram, scipy.cluster.hierarchy.linkage

K-Means

  1. sklearn.cluster.KMeans
  2. sklearn.metrics.silhouette_score

LDA

  1. sklearn.discriminant_analysis.LinearDiscriminantAnalysis

SVMs

  1. sklearn.svm.SVC, sklearn.model_selection.train_test_split
  2. fit PCA on all data and transform only the test set.

Logistic Regression

  1. sklearn.linear_model.LogisticRegression, sklearn.model_selection.train_test_split
  2. fit PCA on all data and transform only the test set.

KNN

  1. sklearn.neighbors.KNeighborsClassifier, sklearn.model_selection.train_test_split
  2. fit PCA on all data and transform only the test set.