Forked from Bio-IT Workshops / embl_hpc

Predoc course

  • Data cleaning (missing values imputation)
  • Normalization
  • PCA
  • LDA
  • K-means
  • Hierarchical clustering
  • SVM
  • Logistic regression
  • KNN

Data cleaning

  1. Load behavioral and neural data from the 'noisy' folders and check for missing values.
  2. Provide estimates for all missing values using different approaches:
    • manual
    • univariate imputer
    • multivariate imputer
    • knn
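
A minimal sketch of steps 1 and 2, using SimpleImputer and KNNImputer on a small made-up frame (the column names and values here are stand-ins for the real 'noisy' behavioral data):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical stand-in for the 'noisy' behavioral data.
df = pd.DataFrame({"speed": [1.0, np.nan, 3.0, 4.0],
                   "lick_rate": [0.2, 0.4, np.nan, 0.8]})

# Step 1: which columns contain missing values?
print(df.isna().any())

# Step 2a: univariate imputation, each NaN replaced by its column mean.
mean_imputed = SimpleImputer(strategy="mean").fit_transform(df)

# Step 2b: KNN imputation, each NaN replaced by the mean of the
# 2 rows closest in the remaining features.
knn_imputed = KNNImputer(n_neighbors=2).fit_transform(df)
```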

Normalization

  1. Given a time series X, imagine applying the three following scalings to it. How do you expect X to look in each of the three cases after the transformation, and why?

X_{scaled} = \frac{X - \bar{X}}{\sigma_X}

X_{scaled} = \frac{X - X_{min}}{X_{max} - X_{min}}

X_{scaled} = \frac{X}{|X|_{max}}

  2. Apply all three scalings to some fluorescence traces and visualize them before and after scaling.

PCA

  1. Apply PCA to a fluorescence dataset, visualise the data in the PC1-PC2 space, plot the factor loadings and print the explained variance.
  2. Try the exercise above with a different fluorescence dataset to see if/how the projected data differ.
  3. Use the behavioral data to color-code the 2D points: are there any clusters?
  4. What happens if we scale the fluorescence dataset before performing PCA?
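
A sketch of step 1 with random stand-in data (a real fluorescence matrix, time points by cells, would replace X):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))   # hypothetical: 50 time points x 5 cells

pca = PCA(n_components=2)
scores = pca.fit_transform(X)            # the data in the PC1-PC2 space
loadings = pca.components_               # factor loadings, shape (2, 5)
print(pca.explained_variance_ratio_)     # fraction of variance per PC
```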

Agglomerative clustering

  1. Explore the blob dataset with different distance thresholds and linkages.
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=2, cluster_std=0.8, random_state=65)

  2. Apply agglomerative clustering to a fluorescence dataset and visualize the result on a lower-dimensional representation of your data (e.g. PCA).
  3. Visualize the dendrogram and explore different numbers of clusters.
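
A sketch combining the blob dataset above with a distance-threshold cut and scipy's dendrogram machinery (the threshold value 5.0 is just an example to vary):

```python
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=2, cluster_std=0.8, random_state=65)

# Cut the tree by distance instead of fixing the number of clusters.
agg = AgglomerativeClustering(n_clusters=None, distance_threshold=5.0,
                              linkage="ward")
labels = agg.fit_predict(X)
print(agg.n_clusters_)   # number of clusters implied by the threshold

# The same hierarchy via scipy, for plotting the dendrogram.
Z = linkage(X, method="ward")
# dendrogram(Z)  # uncomment with matplotlib available to draw the tree
```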

K-Means

  1. Apply K-Means clustering to a fluorescence dataset and visualize the result on a lower-dimensional representation of your data (e.g. PCA).
  2. Compare the result with the one obtained with agglomerative clustering. How do you evaluate the goodness of a clustering?
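
One common answer to the "goodness" question is the silhouette score; a sketch on the blob data introduced above:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=100, centers=2, cluster_std=0.8, random_state=65)

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)

# Silhouette score lies in [-1, 1]; higher means tighter,
# better-separated clusters.
score = silhouette_score(X, labels)
print(score)
```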

LDA

  1. Apply LDA to a fluorescence dataset and color by behavioral label.
  2. Compare result to PCA.
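
A sketch of the supervised LDA projection on made-up two-class data (real fluorescence features and behavioral labels would replace X and y):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Hypothetical data: two behavioral classes, shifted in feature space.
X = np.vstack([rng.normal(0, 1, size=(40, 5)),
               rng.normal(2, 1, size=(40, 5))])
y = np.array([0] * 40 + [1] * 40)

# With 2 classes LDA gives at most 1 discriminant axis
# (PCA, in contrast, ignores the labels entirely).
lda = LinearDiscriminantAnalysis(n_components=1)
X_lda = lda.fit_transform(X, y)
```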

SVMs

  1. Apply SVM (for classification) to a training subset of a fluorescence dataset and test on the remaining data. Use some categorical behavioral data as classification labels. Compute the accuracy score.
  2. Compare different results obtained with different kernels.
  3. Apply PCA to the whole dataset and visualize only the test. Color-code using the labels predicted by SVM.
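
A sketch of the train/test workflow, with blobs standing in for fluorescence features and categorical behavioral labels:

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical stand-in data: 2 'behavioral' classes.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = SVC(kernel="rbf")          # swap in 'linear' or 'poly' to compare kernels
clf.fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print(acc)
```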

Logistic regression

  1. Apply logistic regression to a training subset of a fluorescence dataset and test on the remaining data. Use some categorical behavioral data as classification labels. Compute the accuracy score.
  2. Apply PCA to the whole dataset and visualize only the test. Color-code using the labels predicted by logistic regression.
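
The same split-fit-score pattern with logistic regression, again on blob stand-in data:

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test))
print(acc)
```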

KNN

  1. Apply KNN to a training subset of a fluorescence dataset and test on the remaining data. Use some categorical behavioral data as classification labels. Explore different values of the "weights" parameter and compute the accuracy score.
  2. Apply PCA to the whole dataset and visualize only the test. Color-code using the labels predicted by KNN.
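
A sketch looping over the two standard values of KNeighborsClassifier's "weights" parameter, on blob stand-in data:

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

scores = {}
for weights in ("uniform", "distance"):  # equal vs. distance-weighted votes
    knn = KNeighborsClassifier(n_neighbors=5, weights=weights)
    knn.fit(X_train, y_train)
    scores[weights] = knn.score(X_test, y_test)
print(scores)
```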

Hints

Data cleaning

  1. Use pandas.DataFrame.isna and pandas.DataFrame.any
  2. sklearn.impute.SimpleImputer, sklearn.impute.IterativeImputer, sklearn.impute.KNNImputer. Note that these methods may require the data to be encoded as float values (pandas.DataFrame.replace), and that IterativeImputer is experimental and must be enabled first with from sklearn.experimental import enable_iterative_imputer

Normalization

  2. sklearn.preprocessing.StandardScaler, sklearn.preprocessing.MinMaxScaler, sklearn.preprocessing.MaxAbsScaler

PCA

  1. sklearn.decomposition.PCA

Agglomerative clustering

  2. sklearn.cluster.AgglomerativeClustering
  3. scipy.cluster.hierarchy.dendrogram, scipy.cluster.hierarchy.linkage

K-Means

  1. sklearn.cluster.KMeans
  2. sklearn.metrics.silhouette_score

LDA

  1. sklearn.discriminant_analysis.LinearDiscriminantAnalysis

SVMs

  1. sklearn.svm.SVC, sklearn.model_selection.train_test_split
  2. fit PCA on all data and transform only the test set.

Logistic Regression

  1. sklearn.linear_model.LogisticRegression, sklearn.model_selection.train_test_split
  2. fit PCA on all data and transform only the test set.

KNN

  1. sklearn.neighbors.KNeighborsClassifier, sklearn.model_selection.train_test_split
  2. fit PCA on all data and transform only the test set.