Predoc course
- Data cleaning (missing values imputation)
- Normalization
- PCA
- LDA
- K-means
- Hierarchical clustering
- SVM
- Logistic regression
- KNN
Data cleaning
- Load behavioral and neural data from the 'noisy' folders and check for missing values.
- Provide estimates for all missing values using different approaches:
- manual
- univariate imputer
- multivariate imputer
- KNN imputer
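A minimal sketch of the imputation approaches, using a tiny synthetic DataFrame as a stand-in for the noisy behavioral/neural data (the column names are placeholders):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Toy data with missing entries (stand-in for the 'noisy' datasets)
df = pd.DataFrame({"roi_1": [1.0, np.nan, 3.0, 4.0],
                   "roi_2": [2.0, 2.5, np.nan, 4.5]})

# Check which columns contain missing values
print(df.isna().any())

# Manual: fill each NaN with its column mean
manual = df.fillna(df.mean())

# Univariate imputer: per-column statistic (here, the mean)
simple = SimpleImputer(strategy="mean").fit_transform(df)

# KNN imputer: estimate from the nearest complete rows
knn = KNNImputer(n_neighbors=2).fit_transform(df)
```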
Normalization
- Given a time series X, imagine applying the three following scalings to it. How do you expect X to look in each of these cases after the transformation, and why?
X_{scaled} = \frac{X - \bar{X}}{\sigma_X}
X_{scaled} = \frac{X - X_{min}}{X_{max} - X_{min}}
X_{scaled} = \frac{X}{|X|_{max}}
- Apply all three scalings to some fluorescence traces and visualize them before and after scaling.
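The three scalings above map directly onto scikit-learn transformers. A minimal sketch on a synthetic trace (a stand-in for a real fluorescence trace):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 1))  # one synthetic trace

X_std = StandardScaler().fit_transform(X)     # (X - mean) / std: zero mean, unit variance
X_minmax = MinMaxScaler().fit_transform(X)    # (X - min) / (max - min): range [0, 1]
X_maxabs = MaxAbsScaler().fit_transform(X)    # X / max(|X|): range within [-1, 1]
```

Plotting the trace before and after each transformation shows that the shape is preserved; only the offset and scale change.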
PCA
- Apply PCA to a fluorescence dataset, visualise the data in the PC1-PC2 space, plot the factor loadings and print the explained variance.
- Try the exercise above with a different fluorescence dataset to see if/how the projected data differ.
- Use the behavioral data to color-code the 2D points: do any clusters emerge?
- What happens if we scale the fluorescence dataset before performing PCA?
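A minimal PCA sketch on synthetic data standing in for a fluorescence dataset (here 200 time points by 10 ROIs; the shapes are assumptions for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))  # stand-in: 200 time points x 10 ROIs

pca = PCA(n_components=2)
X_2d = pca.fit_transform(StandardScaler().fit_transform(X))

print(X_2d.shape)                       # points to scatter in PC1-PC2 space
print(pca.components_.shape)            # factor loadings: one row per PC
print(pca.explained_variance_ratio_)    # fraction of variance per PC
```

Running PCA with and without the `StandardScaler` step makes the effect of scaling on the principal components directly visible.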
Agglomerative clustering
- Explore the blob dataset with different distance thresholds and linkages.
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=100, centers=2, cluster_std=0.8, random_state=65)
- Apply the agglomerative clustering to a fluorescence dataset and visualize the result on a lower-dimensional representation of your data (e.g. PCA).
- Visualize the dendrogram and explore different numbers of clusters.
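A minimal sketch combining the blob dataset above with threshold-based agglomerative clustering and a scipy dendrogram (the threshold value is an arbitrary starting point to vary):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import linkage, dendrogram

X, y = make_blobs(n_samples=100, centers=2, cluster_std=0.8, random_state=65)

# Cut the tree at a distance threshold instead of fixing the cluster count
agg = AgglomerativeClustering(n_clusters=None, distance_threshold=5.0,
                              linkage="ward")
labels = agg.fit_predict(X)
print(agg.n_clusters_)  # number of clusters found at this threshold

# Build the linkage matrix for the dendrogram (plot it with matplotlib)
Z = linkage(X, method="ward")
d = dendrogram(Z, no_plot=True)
```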
K-Means
- Apply K-Means clustering to a fluorescence dataset and visualize the result on a lower-dimensional representation of your data (e.g. PCA).
- Compare the result with the one obtained with Agglomerative Clustering. How do you evaluate the goodness of a clustering?
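A minimal K-Means sketch on synthetic blobs (a stand-in for the fluorescence dataset), using the silhouette score as one way to quantify clustering quality:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=100, centers=3, cluster_std=0.8, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)

# Silhouette score: close to 1 means compact, well-separated clusters
score = silhouette_score(X, labels)
print(score)
```

The same score can be computed for the agglomerative labels, giving a common scale on which to compare the two clusterings.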
LDA
- Apply LDA to a fluorescence dataset and color by behavioral label.
- Compare result to PCA.
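A minimal LDA sketch; the synthetic features and labels stand in for the fluorescence data and the categorical behavioral labels. Unlike PCA, LDA uses the labels to find the projection:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic stand-in: 200 samples, 10 features, 3 behavioral classes
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

lda = LinearDiscriminantAnalysis(n_components=2)  # at most n_classes - 1 components
X_lda = lda.fit_transform(X, y)
print(X_lda.shape)  # 2D points to scatter, colored by y
```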
SVMs
- Apply SVM (for classification) to a training subset of a fluorescence dataset and test on the remaining data. Use some categorical behavioral data as classification labels. Compute the accuracy score.
- Compare different results obtained with different kernels.
- Apply PCA to the whole dataset and visualize only the test set. Color-code using the labels predicted by SVM.
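A minimal SVM sketch on synthetic data (a stand-in for the fluorescence features and behavioral labels), comparing kernels and projecting the test set through a PCA fitted on all data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.decomposition import PCA

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Compare kernels on held-out accuracy
for kernel in ("linear", "rbf", "poly"):
    clf = SVC(kernel=kernel).fit(X_tr, y_tr)
    print(kernel, accuracy_score(y_te, clf.predict(X_te)))

# Fit PCA on the whole dataset, then project only the test points
pca = PCA(n_components=2).fit(X)
X_te_2d = pca.transform(X_te)  # scatter these, colored by clf.predict(X_te)
```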
Logistic regression
- Apply Logistic regression to a training subset of a fluorescence dataset and test on the remaining data. Use some categorical behavioral data as classification labels. Compute the accuracy score.
- Apply PCA to the whole dataset and visualize only the test set. Color-code using the labels predicted by logistic regression.
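The same train/test pattern with logistic regression, again on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))
print(acc)
```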
KNN
- Apply KNN to a training subset of a fluorescence dataset and test on the remaining data. Use some categorical behavioral data as classification labels. Explore different parameters for "weights" and compute the accuracy score.
- Apply PCA to the whole dataset and visualize only the test set. Color-code using the labels predicted by KNN.
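A minimal KNN sketch on synthetic stand-in data, looping over the two built-in options for the `weights` parameter:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# "uniform": all neighbors vote equally; "distance": closer neighbors count more
for weights in ("uniform", "distance"):
    clf = KNeighborsClassifier(n_neighbors=5, weights=weights).fit(X_tr, y_tr)
    print(weights, accuracy_score(y_te, clf.predict(X_te)))
```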
Hints
Data cleaning
- use pandas.DataFrame.isna and pandas.DataFrame.any
- sklearn.impute.SimpleImputer, sklearn.impute.IterativeImputer, sklearn.impute.KNNImputer. Note that these methods may require the data to be encoded as float values (pandas.DataFrame.replace), and IterativeImputer requires importing sklearn.experimental.enable_iterative_imputer first.
Normalization
- sklearn.preprocessing.StandardScaler, sklearn.preprocessing.MinMaxScaler, sklearn.preprocessing.MaxAbsScaler
PCA
- sklearn.decomposition.PCA
Agglomerative clustering
- sklearn.cluster.AgglomerativeClustering
- scipy.cluster.hierarchy.dendrogram, scipy.cluster.hierarchy.linkage
K-Means
- sklearn.cluster.KMeans
- sklearn.metrics.silhouette_score
LDA
- sklearn.discriminant_analysis.LinearDiscriminantAnalysis
SVMs
- sklearn.svm.SVC, sklearn.model_selection.train_test_split
- fit PCA on all data and transform only the test set.
Logistic Regression
- sklearn.linear_model.LogisticRegression, sklearn.model_selection.train_test_split
- fit PCA on all data and transform only the test set.
KNN
- sklearn.neighbors.KNeighborsClassifier, sklearn.model_selection.train_test_split
- fit PCA on all data and transform only the test set.