Commit 15b4c62f authored by Konrad Zych's avatar Konrad Zych

merge conflict

parents a4c922c7 9de8f218
Pipeline #5404 passed with stage
in 1 minute and 39 seconds
......@@ -10,4 +10,3 @@ SIAMCAT.RProj
*.DS_Store*
inst/doc
*.pdf
......@@ -10,7 +10,7 @@ build:
stage: build
script:
- R CMD build ./ --no-build-vignettes
- R CMD INSTALL SIAMCAT_0.99.1.tar.gz
- R CMD INSTALL SIAMCAT_0.99.6.tar.gz
- mv SIAMCAT_* ./Rscript_flavor
- cd Rscript_flavor
- wget http://congo.embl.de/downloads/siamcat_test_data.tar.gz
......@@ -24,7 +24,7 @@ preprocess_data:
stage: preprocess_data
before_script:
- cd Rscript_flavor
- R CMD INSTALL SIAMCAT_0.99.1.tar.gz
- R CMD INSTALL SIAMCAT_0.99.6.tar.gz
script:
- Rscript 01_validate_data.r --metadata_in num_metadata_cancer-vs-healthy_study-pop-I_N141_tax_profile_mocat_bn_genus.tsv --metadata_out valMetaData.tsv --label_in label_cancer-vs-healthy_study-pop-I_N141_tax_profile_mocat_bn_genus.tsv --label_out valLabel.tsv --feat_in feat_cancer-vs-healthy_study-pop-I_N141_tax_profile_mocat_bn_genus.tsv --feat_out valFeat.tsv
- Rscript 02_select_samples.r --metadata_in valMetaData.tsv --metadata_out valMetaData_sel.tsv --label_in valLabel.tsv --label_out valLabel_sel.tsv --feat_in valFeat.tsv --feat_out valFeat_sel.tsv --filter_var="age" --allowed_range="[0,120]"
......@@ -47,7 +47,7 @@ train_model:
stage: train_model
before_script:
- cd Rscript_flavor
- R CMD INSTALL SIAMCAT_0.99.1.tar.gz
- R CMD INSTALL SIAMCAT_0.99.6.tar.gz
script:
- Rscript 09_train_models.r --feat_in valFeat_sel_norm.tsv --label_in valLabel_sel.tsv --method="lasso" --data_split dataSplit.Rdata --mlr_models_list models.RData --stratify="TRUE" --sel_criterion="auprc" --min_nonzero_coeff="2"
......@@ -64,7 +64,7 @@ make_predictions:
stage: make_predictions
before_script:
- cd Rscript_flavor
- R CMD INSTALL SIAMCAT_0.99.1.tar.gz
- R CMD INSTALL SIAMCAT_0.99.6.tar.gz
script:
- Rscript 10_make_predictions.r --label_in valLabel_sel.tsv --feat_in valFeat_sel_norm.tsv --data_split dataSplit.Rdata --pred pred.tsv --mlr_models_list models.RData
......@@ -81,9 +81,9 @@ evaluate_interpret:
stage: evaluate_interpret
before_script:
- cd Rscript_flavor
- R CMD INSTALL SIAMCAT_0.99.1.tar.gz
- R CMD INSTALL SIAMCAT_0.99.6.tar.gz
script:
- Rscript 11_evaluate_predictions.r --label_in valLabel_sel.tsv --plot evalPlot.pdf --pred pred.tsv
- Rscript 11_evaluate_predictions.r --feat_in valFeat_sel_norm.tsv --label_in valLabel_sel.tsv --plot evalPlot.pdf --pred pred.tsv
- Rscript 12_interpret_model.r --label_in valLabel_sel.tsv --feat_in valFeat_sel_norm.tsv --origin_feat valFeat_sel.tsv --metadata_in valMetaData_sel.tsv --mlr_models_list models.RData --plot interpretation.pdf --pred pred.tsv --col_scheme="BrBG" --heatmap_type="zscore" --consens_thres="0.5"
artifacts:
......
Summary
###Summary
(Summarize the bug encountered concisely)
Relevant logs and/or screenshots
###Relevant logs and/or screenshots
(Paste any relevant logs - please use code blocks (```) to format console output,
logs, and code as it's very hard to read otherwise.)
......
###Summary
(Summarize the requested feature encountered concisely)
###Affected functions/datasets/objects
(If possible, mention all functions/datasets/objects that this feature would have an effect on)
/label ~feature-request
/cc @zych
\ No newline at end of file
###Summary
### Summary
(Summarize what are you going to work on)
###Affected functions/datasets/objects
### Affected functions/datasets/objects
(If possible, mention all functions/datasets/objects that your changes will have an effect on)
### Goals/sub-features
- [ ]
/estimate 1h
/cc @zych
\ No newline at end of file
File mode changed from 100644 to 100755
File mode changed from 100644 to 100755
File mode changed from 100644 to 100755
File mode changed from 100644 to 100755
File mode changed from 100644 to 100755
File mode changed from 100644 to 100755
File mode changed from 100644 to 100755
File mode changed from 100644 to 100755
File mode changed from 100644 to 100755
File mode changed from 100644 to 100755
File mode changed from 100644 to 100755
File mode changed from 100644 to 100755
File mode changed from 100644 to 100755
File mode changed from 100644 to 100755
File mode changed from 100644 to 100755
File mode changed from 100644 to 100755
File mode changed from 100644 to 100755
File mode changed from 100644 to 100755
File mode changed from 100644 to 100755
Package: SIAMCAT
Type: Package
Title: Statistical Inference of Associations between Microbial Communities And host phenoTypes
Version: 0.99.1
Author: Georg Zeller [aut,cre], Jakob Wirbel[aut], Konrad Zych [aut], Nicolai Karcher[ctb] and Kersten Breuer[ctb]
Authors@R: c(person("Georg", "Zeller", role = c("aut", "cre"), email = "zeller@embl.de", comment = c(ORCID = "0000-0003-1429-7485")),
person("Konrad", "Zych", role = c("aut", "cre"), email = "konrad.zych@embl.de", comment = c(ORCID = "0000-0001-7426-0516")),
person("Jakob", "Wirbel", role = c("aut", "cre"), email = "jakob.wirbel@embl.de", comment = c(ORCID = "0000-0002-4073-3562"))
)
Maintainer: Konrad Zych <konrad.zych@embl.de>, Georg Zeller <eller@embl.de>, Jakob Wirbel <jakob.wirbel@embl.de>
Description: Pipeline for Statistical Inference of Associations between Microbial Communities And host phenoTypes (SIAMCAT). A primary goal of analyzing microbiome data is to determine changes in community composition that are associated with environmental factors. In particular, linking human microbiome composition to host phenotypes such as diseases has become an area of intense research. For this, robust statistical modeling and biomarker extraction toolkits are crucially needed. SIAMCAT provides a full pipeline supporting data preprocessing, statistical association testing, statistical modeling (LASSO logistic regression) including tools for evaluation and interpretation of these models (such as cross validation, parameter selection, ROC analysis and diagnostic model plots).
Title: Statistical Inference of Associations between Microbial
Communities And host phenoTypes
Version: 0.99.6
Authors@R: c(person("Georg", "Zeller", role = c("aut"),
email = "zeller@embl.de",
comment = c(ORCID = "0000-0003-1429-7485")),
person("Konrad", "Zych", role = c("aut", "cre"),
email = "konrad.zych@embl.de",
comment = c(ORCID = "0000-0001-7426-0516")),
person("Jakob", "Wirbel", role = c("aut"),
email = "jakob.wirbel@embl.de",
comment = c(ORCID = "0000-0002-4073-3562")),
person("Morgan", "Essex",
email = "morgan.essex@embl.de", role = c("ctb")),
person("Nicolai", "Karcher", role = c("ctb")),
person("Kersten", "Breuer", role = c("ctb")))
Description: Pipeline for Statistical Inference of Associations between
Microbial Communities And host phenoTypes (SIAMCAT). A primary goal
of analyzing microbiome data is to determine changes in community
composition that are associated with environmental factors. In particular,
linking human microbiome composition to host phenotypes such as diseases
has become an area of intense research. For this, robust statistical
modeling and biomarker extraction toolkits are crucially needed. SIAMCAT
provides a full pipeline supporting data preprocessing, statistical
association testing, statistical modeling (LASSO logistic regression)
including tools for evaluation and interpretation of these models (such as
cross validation, parameter selection, ROC analysis and diagnostic
model plots).
Depends:
R (>= 3.4.0),
mlr,
......@@ -22,18 +41,20 @@ Imports:
gridBase,
gridExtra,
LiblineaR,
matrixStats,
methods,
ParamHelpers,
pROC,
PRROC,
RColorBrewer,
stats,
utils,
methods
utils
License: GPL-3
Encoding: UTF-8
LazyData: true
RoxygenNote: 6.0.1
biocViews: Metagenomics, Classification, Microbiome, Sequencing, Preprocessing, Clustering, FeatureExtraction
RoxygenNote: 6.0.1.9000
biocViews: Metagenomics, Classification, Microbiome, Sequencing, Preprocessing,
Clustering, FeatureExtraction, GeneticVariability, MultipleComparison,
Regression
Suggests:
BiocStyle,
optparse,
......
# Generated by roxygen2: do not edit by hand
export("data_split<-")
export("eval_data<-")
export("features<-")
export("label<-")
export("meta<-")
export("model_list<-")
export("norm_param<-")
export("orig_feat<-")
export("physeq<-")
export("pred_matrix<-")
export(accessSlot)
export(add.meta.pred)
export(check.associations)
export(check.confounders)
export(create.data.split)
export(create.label.from.metadata)
export(data_split)
export(eval_data)
export(evaluate.predictions)
export(features)
export(filter.features)
export(filter.label)
export(get.eval_data)
export(get.features)
export(get.features.matrix)
export(get.label)
export(get.label.info)
export(get.label.label)
export(get.label.list)
export(get.model.type)
export(get.model_list)
export(get.models)
export(get.phyloseq)
export(get.pred_matrix)
export(get.orig_feat.matrix)
export(label)
export(make.predictions)
export(meta)
export(model.evaluation.plot)
export(model.interpretation.plot)
export(model_list)
export(model_type)
export(models)
export(norm_param)
export(normalize.features)
export(orig_feat)
export(physeq)
export(pred_matrix)
export(read.features)
export(read.labels)
export(read.meta)
export(reset.features)
export(select.samples)
export(siamcat)
export(siamcat.to.lefse)
export(train.model)
export(validate.data)
exportClasses(data_split)
exportClasses(eval_data)
exportClasses(label)
exportClasses(model_list)
exportClasses(orig_feat)
exportClasses(pred_matrix)
exportClasses(siamcat)
import(LiblineaR)
import(glmnet, except = "auc")
import(glmnet, except = 'auc')
import(methods)
import(mlr)
import(phyloseq)
......@@ -84,6 +102,11 @@ importFrom(grid,pushViewport)
importFrom(gridBase,baseViewports)
importFrom(gridExtra,grid.table)
importFrom(gridExtra,ttheme_minimal)
importFrom(matrixStats,colRanks)
importFrom(matrixStats,rowMaxs)
importFrom(matrixStats,rowMedians)
importFrom(matrixStats,rowQuantiles)
importFrom(matrixStats,rowSds)
importFrom(pROC,roc)
importFrom(stats,addmargins)
importFrom(stats,coefficients)
......
#!/usr/bin/Rscript
###
# SIAMCAT - Statistical Inference of Associations between
# Microbial Communities And host phenoTypes
# R flavor
# EMBL Heidelberg 2012-2018
# GNU GPL 3.0
###
### SIAMCAT - Statistical Inference of Associations between
### Microbial Communities And host phenoTypes R flavor EMBL
### Heidelberg 2012-2018 GNU GPL 3.0
#' @title Add metadata as predictors
#' @description This function adds metadata to the feature matrix to be
#' later used as predictors
#' @description This function adds metadata to the feature matrix to be later
#' used as predictors
#' @usage add.meta.pred(siamcat, pred.names = NULL, std.meta =
#' TRUE, verbose = 1)
#' @param siamcat object of class \link{siamcat-class}
#' @param pred.names vector of names of the variables within the metadata to be
#' added to the feature matrix as predictors
#' added to the feature matrix as predictors
#' @param std.meta boolean, should added metadata features be standardized?,
#' defaults to \code{TRUE}
#' @param verbose control output: \code{0} for no output at all, \code{1}
#' for only information about progress and success, \code{2} for normal
#' level of information and \code{3} for full debug information,
#' defaults to \code{1}
#' defaults to \code{TRUE}
#' @param verbose control output: \code{0} for no output at all, \code{1} for
#' only information about progress and success, \code{2} for normal level of
#' information and \code{3} for full debug information, defaults to \code{1}
#' @keywords SIAMCAT add.meta.pred
#' @export
#' @return an object of class \link{siamcat-class} with metadata added to the
#' features
#' features
#' @examples
#' data(siamcat_example)
#' # Add the Age of the patients as potential predictor
#' siamcat_age_added <- add.meta.pred(siamcat_example, pred.names=c('age'))
#' data(siamcat_example)
#' # Add the Age of the patients as potential predictor
#' siamcat_age_added <- add.meta.pred(siamcat_example, pred.names=c('age'))
#'
#' # Add Age, BMI, and Gender as potential predictors
#' # Additionally, prevent standardization of the added features
#' siamcat_meta_added <- add.meta.pred(siamcat_example, pred.names=c('age',
#' 'bmi', 'gender'), std.meta=FALSE)
add.meta.pred <- function(siamcat, pred.names=NULL, std.meta=TRUE, verbose=1){
if(verbose>1) message("+ starting add.meta.pred")
s.time <- proc.time()[3]
### add metadata as predictors to the feature matrix
cnt <- 0
#' # Add Age, BMI, and Gender as potential predictors
#' # Additionally, prevent standardization of the added features
#' siamcat_meta_added <- add.meta.pred(siamcat_example, pred.names=c('age',
#' 'bmi', 'gender'), std.meta=FALSE)
add.meta.pred <- function(siamcat, pred.names = NULL, std.meta = TRUE,
verbose = 1) {
if (verbose > 1)
message("+ starting add.meta.pred")
s.time <- proc.time()[3]
### add metadata as predictors to the feature matrix
cnt <- 0
if (pred.names != '' && !is.null(pred.names)) {
if(verbose>2) message("+ starting to add metadata predictors")
for (p in pred.names) {
if(verbose>2) message("+++ adding metadata predictor:",p)
if(!p%in%colnames(siamcat@phyloseq@sam_data)) {
stop("There is no metadata variable called ",p)
}
idx <- which(colnames(siamcat@phyloseq@sam_data) == p)
if(length(idx) != 1) stop(p, "matches multiple columns in the metada")
if (pred.names != "" && !is.null(pred.names)) {
if (verbose > 2)
message("+ starting to add metadata predictors")
for (p in pred.names) {
if (verbose > 2)
message("+++ adding metadata predictor:", p)
if (!p %in% colnames(meta(siamcat))) {
stop("There is no metadata variable called ", p)
}
idx <- which(colnames(meta(siamcat)) == p)
if (length(idx) != 1)
stop(p, "matches multiple columns in the metada")
m <- unlist(siamcat@phyloseq@sam_data[,idx])
m <- unlist(meta(siamcat)[, idx])
if (!all(is.finite(m))) {
na.cnt <- sum(!is.finite(m))
if (verbose > 1) message(paste('++++ filling in', na.cnt, 'missing values by mean imputation'))
mn <- mean(m, na.rm=TRUE)
m[!is.finite(m)] <- mn
}
if (!all(is.finite(m))) {
na.cnt <- sum(!is.finite(m))
if (verbose > 1)
message(paste("++++ filling in", na.cnt,
"missing values by mean imputation"))
mn <- mean(m, na.rm = TRUE)
m[!is.finite(m)] <- mn
}
if (std.meta) {
if (verbose > 1) message(paste('++++ standardizing metadata feature', p))
m.mean <- mean(m, na.rm = TRUE)
m.sd <- sd(m, na.rm = TRUE)
stopifnot(!m.sd == 0)
m <- (m - m.mean)/m.sd
}
if (std.meta) {
if (verbose > 1)
message(paste("++++ standardizing metadata feature", p))
m.mean <- mean(m, na.rm = TRUE)
m.sd <- sd(m, na.rm = TRUE)
stopifnot(!m.sd == 0)
m <- (m - m.mean)/m.sd
}
features.with.meta <- otu_table(rbind(features(siamcat),m),
taxa_are_rows = TRUE)
rownames(features.with.meta)[nrow(features.with.meta)] <- paste(
"META_", toupper(p), sep = "")
features(siamcat) <- features.with.meta
siamcat@phyloseq@otu_table <- otu_table(rbind(siamcat@phyloseq@otu_table, m),taxa_are_rows=TRUE)
rownames(siamcat@phyloseq@otu_table)[nrow(siamcat@phyloseq@otu_table)] <- paste('META_', toupper(p), sep='')
cnt <- cnt + 1
}
if (verbose > 1) message(paste('+++ added', cnt, 'meta-variables as predictor to the feature matrix'))
} else {
if (verbose > 0) message('+++ Not adding any of the meta-variables as predictor to the feature matrix')
}
stopifnot(all(!is.na(siamcat@phyloseq@otu_table)))
e.time <- proc.time()[3]
if(verbose>1) message(paste("+ finished add.meta.pred in", formatC(e.time-s.time, digits=3) ,"s"))
if(verbose==1) message("Adding metadata as predictor finished")
return(siamcat)
cnt <- cnt + 1
}
if (verbose > 1)
message(paste("+++ added", cnt, "meta-variables as predictor to the
feature matrix"))
} else {
if (verbose > 0)
message("+++ Not adding any of the meta-variables as predictor to
the feature matrix")
}
e.time <- proc.time()[3]
if (verbose > 1)
message(paste("+ finished add.meta.pred in", formatC(e.time - s.time,
digits = 3), "s"))
if (verbose == 1)
message("Adding metadata as predictor finished")
return(siamcat)
}
This diff is collapsed.
......@@ -122,8 +122,7 @@ association.tests.with.metadata <- function(feat, label, meta, mult.corr, alpha,
### feature-wise anova on ranks for labels + metadata; f-ratio heatmaps? bubble plots?
feature.wise.anova.with.metadata <- function(feat, label, meta, pr.cutoff, marker.results, verbose) {
feature.wise.anova.with.metadata <- function(feat, label, meta, pr.cutoff, marker.results, verbose) {
meta <- factorize.metadata(meta)
# total sum of squares
......@@ -367,7 +366,6 @@ feature.glms <- function(feat, label, meta, verbose) {
auc.bars <- barplot(aucs, axes=FALSE, add=TRUE)
for (i in 1:nrow(feat)) {
segments(auc.bars[i], rocs[[i]]$ci[1], auc.bars[i], rocs[[i]]$ci[3])}
}
......@@ -402,4 +400,4 @@ barplot.legend <- function(color.scheme, value.range, type) {
axis(side=1, at=seq(0, 100, length.out=7), labels=key.ticks)
mtext(key.label, side=3, line=0.5, at=50, cex=0.7, adj=0.5)
}
\ No newline at end of file
}
This diff is collapsed.
This diff is collapsed.
#' Documentation for the example siamcat object in the data folder
#'
#' Reduced version of the CRC dataset in inst/extdata, containing 100 features
#' (15 associated features at 5\% FDR in the original dataset and 85 random
#' other features) and 141 samples, saved after the complete SIAMCAT pipeline
#' has been run.
#' Therefore, contains entries in every siamcat-object slot,
#' e.g, \code{eval_data} or \code{data_split}. Mainly used for running
#' the examples in the function documentation
#' (15 associated features at 5\% FDR in the original dataset and 85 random
#' other features) and 141 samples, saved after the complete SIAMCAT
#' pipelinehas been run.
#' Therefore, contains entries in every siamcat-object slot,
#' e.g, \code{eval_data} or \code{data_split}. Mainly used for running
#' the examples in the function documentation
#'
#' @name siamcat_example
#'
#' @docType data
#'
#' @keywords data
NULL
This diff is collapsed.
This diff is collapsed.
#' @import phyloseq mlr LiblineaR methods
#' @rawNamespace import(glmnet, except = "auc")
#' @rawNamespace import(glmnet, except = 'auc')
#' @importFrom grDevices col2rgb colorRampPalette dev.off gray pdf rgb
#' @importFrom graphics abline arrows axis barplot box strheight strwidth
#' boxplot grid image layout lines mtext hist points stripchart
#' par plot plot.new polygon rect text title segments legend
#' boxplot grid image layout lines mtext hist points stripchart
#' par plot plot.new polygon rect text title segments legend
#' @importFrom stats coefficients cor median p.adjust rnorm qqplot
#' predict quantile runif sd wilcox.test addmargins na.omit
#' fisher.test
#' predict quantile runif sd wilcox.test addmargins na.omit
#' fisher.test
#' @importFrom utils head read.table setTxtProgressBar tail
#' txtProgressBar
#' txtProgressBar
#' @importFrom PRROC pr.curve
#' @importFrom pROC roc
#' @importFrom beanplot beanplot
......@@ -17,5 +17,6 @@
#' @importFrom grid popViewport pushViewport plotViewport
#' @importFrom gridBase baseViewports
#' @importFrom gridExtra ttheme_minimal grid.table
#' @importFrom matrixStats rowQuantiles rowMaxs rowSds colRanks rowMedians
NULL
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
#' @details
#'
#' SIAMCAT is a pipeline for Statistical Inference of Associations between
#' Microbial Communities And host phenoTypes. A primary goal of analyzing
#' microbiome data is to determine changes in community composition that are
#' associated with environmental factors. In particular, linking human
#' microbiome composition to host phenotypes such as diseases has become an area
#' of intense research. For this, robust statistical modeling and biomarker
#' extraction toolkits are crucially needed!
#'
"_PACKAGE"
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
#!/usr/bin/Rscript
###
# SIAMCAT - Statistical Inference of Associations between
# Microbial Communities And host phenoTypes
# R flavor
# EMBL Heidelberg 2012-2018
# GNU GPL 3.0
###
#' The S4 class for storing models.
#' @name model_list-class
#' @rdname model_list-class
#' @slot models a list with models obtained from \link{train.model}
#' @slot model.type name of the method used by \link{train.model}
#' @exportClass model_list
setClass("model_list", representation(models = "list", model.type = "character"))
#' The S4 class for storing data splits
#' @name data_split-class
#' @rdname data_split-class
#' @slot training.folds a list - for each cv fold contains ids of
#' samples used for training
#' @slot test.folds a list - for each cv fold contains ids of
#' samples used for testing
#' @slot num.resample number of repetition rounds for cv
#' @slot num.folds number of folds for cv
#' @exportClass data_split
setClass("data_split", representation(training.folds = "list",
test.folds = "list",
num.resample = "numeric",
num.folds = "numeric"))
#' The S4 class for storing label info.
#' @name label-class
#' @rdname label-class
#' @slot label numeric vector, specifying to which category samples belong,
#' usualy made of 1s and -1s
#' @slot header contains information from the header of the label file
#' @slot info list with additional informations about the dataset
#' @slot positive.lab specifies which of two numbers in label is a positive label
#' @slot negative.lab specifies which of two numbers in label is a negative label
#' @slot n.idx numeric vector - on which positions in the label there are samples
#' withc negative label
#' @slot p.idx numeric vector - on which positions in the label there are samples
#' withc positive label
#' @slot n.lab character string with a name for the negative label (e.g. "healthy")