Title: | Deconvolution of Bulk RNA-Seq Data using Single-Cell RNA-Seq Data as Reference |
---|---|
Description: | Streamlined workflow from deconvolution of bulk RNA-seq data to downstream differential expression and gene-set enrichment analysis. Provide various visualization functions. |
Authors: | Yuanhang Liu [aut, cre] |
Maintainer: | Yuanhang Liu <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.0.1 |
Built: | 2024-11-26 05:28:50 UTC |
Source: | https://github.com/liuy12/scdeconr |
Generate artificial bulk RNA-seq samples with random or pre-defined cell-type proportions for benchmarking deconvolution algorithms
bulk_generator( ref, phenodata, num_mixtures = 500, num_mixtures_sprop = 10, pool_size = 100, seed = 1234, prop = NULL, replace = FALSE )
bulk_generator( ref, phenodata, num_mixtures = 500, num_mixtures_sprop = 10, pool_size = 100, seed = 1234, prop = NULL, replace = FALSE )
ref |
a matrix-like object of gene expression values with rows representing genes, columns representing cells. |
phenodata |
a data.frame with rows representing cells, columns representing cell attributes. It should at least contain the first two columns as:
|
num_mixtures |
total number of simulated bulk samples. Have to be multiple of |
num_mixtures_sprop |
number of simulated bulk samples with the same simulated cell type proportions. Only applicable when |
pool_size |
number of cells to use to construct each artificial bulk sample. Default to 100. |
seed |
seed to use for simulation. Default to 1234. |
prop |
a data.frame with two columns. The first column includes unique cell types in phenodata; the second column includes cell type proportions. If specified, bulk samples will be simulated based on the specified cell proportions. |
replace |
logical value indicating whether to sample cells with replacement. Default to FALSE, to sample cells without replacement. |
If prop
is not specified, cell type proportions will be firstly randomly generated with at least two cell types present. Then, for each cell proportion
vector, num_mixtures_sprop
number of samples is simulated. Eventually, a total of num_mixtures
number of samples is simulated. If prop is
specified, then a total of num_mixtures
number of samples will be simulated based on the same cell proportion vector specified.
a list of two objects:
simulated bulk RNA-seq data, with rows representing genes, columns representing samples
cell type proportions used to simulate the bulk RNA-seq data, with rows representing cell types, columns representing samples
## Not run: ref_list <- c(paste0(system.file("extdata", package = "SCdeconR"), "/refdata/sample1"), paste0(system.file("extdata", package = "SCdeconR"), "/refdata/sample2")) phenopath1 <- paste0(system.file("extdata", package = "SCdeconR"), "/refdata/phenodata_sample1.txt") phenopath2 <- paste0(system.file("extdata", package = "SCdeconR"), "/refdata/phenodata_sample2.txt") phenodata_list <- c(phenopath1,phenopath2) # construct integrated reference using harmony algorithm refdata <- construct_ref(ref_list = ref_list, phenodata_list = phenodata_list, data_type = "cellranger", method = "harmony", group_var = "subjectid", nfeature_rna = 50, vars_to_regress = "percent_mt", verbose = FALSE) phenodata <- data.frame(cellid = colnames(refdata), celltypes = refdata$celltype, subjectid = refdata$subjectid) bulk_sim <- bulk_generator(ref = GetAssayData(refdata, slot = "data", assay = "SCT"), phenodata = phenodata, num_mixtures = 20, num_mixtures_sprop = 1) ## End(Not run)
## Not run: ref_list <- c(paste0(system.file("extdata", package = "SCdeconR"), "/refdata/sample1"), paste0(system.file("extdata", package = "SCdeconR"), "/refdata/sample2")) phenopath1 <- paste0(system.file("extdata", package = "SCdeconR"), "/refdata/phenodata_sample1.txt") phenopath2 <- paste0(system.file("extdata", package = "SCdeconR"), "/refdata/phenodata_sample2.txt") phenodata_list <- c(phenopath1,phenopath2) # construct integrated reference using harmony algorithm refdata <- construct_ref(ref_list = ref_list, phenodata_list = phenodata_list, data_type = "cellranger", method = "harmony", group_var = "subjectid", nfeature_rna = 50, vars_to_regress = "percent_mt", verbose = FALSE) phenodata <- data.frame(cellid = colnames(refdata), celltypes = refdata$celltype, subjectid = refdata$subjectid) bulk_sim <- bulk_generator(ref = GetAssayData(refdata, slot = "data", assay = "SCT"), phenodata = phenodata, num_mixtures = 20, num_mixtures_sprop = 1) ## End(Not run)
Compute cell type specific gene expression based on predicted cell proportions and reference data.
celltype_expression(bulk, ref, phenodata, prop, ...)
celltype_expression(bulk, ref, phenodata, prop, ...)
bulk |
a matrix-like object of bulk RNA-seq data with rows representing genes, columns representing samples |
ref |
a matrix-like object of scRNA-seq data with rows representing genes, columns representing cells. |
phenodata |
a data.frame with rows representing cells, columns representing cell attributes. It should at least contain the first two columns as:
|
prop |
a matrix-like object of cell proportion values with rows representing cell types, columns representing samples. |
... |
additional parameters passed to |
this function is inspired by cell-type specific gene expression estimation for doublet mode in spacexr
. See examples from run_de
.
a list with length equal to number of unique cell types in phenodata. Each element in the list represents gene expression matrix for each unique cell type.
Generate a scatter plot of fold changes comparing two differential expression results, e.g. w/wo adjusting for cell proportion differences.
comparedeg_scatter( results1, results2, result_names = NULL, fc_cutoff, pval_cutoff, pvalflag = TRUE, interactive = FALSE )
comparedeg_scatter( results1, results2, result_names = NULL, fc_cutoff, pval_cutoff, pvalflag = TRUE, interactive = FALSE )
results1 |
a data.frame containing differential expression results with five columns: "Gene name", "log2 fold change", "log2 average expression",
"p value", "adjusted p value". The second element of the output from function |
results2 |
similar to |
result_names |
a vector of length 2 indicating the names of the two differential results. If NULL, names will be set to c("results1", "results2") |
fc_cutoff |
fold change cutoff to identify differential expressed genes. |
pval_cutoff |
p value cutoff to identify differential expressed genes. |
pvalflag |
a logical value indicating whether to use adjusted p value in selecting differential expressed genes. |
interactive |
a logical value indicating whether to generate an interactive plot. |
See examples from run_de
.
Generate a scatter plot of normalized enrichment scores comparing two results for gene set enrichment analysis, e.g. w/wo adjusting for cell proportion differences.
comparegsea_scatter( gseares_path1, gseares_path2, result_names = NULL, nes_cutoff = 2, pval_cutoff = 0.1, pvalflag = TRUE, interactive = FALSE )
comparegsea_scatter( gseares_path1, gseares_path2, result_names = NULL, nes_cutoff = 2, pval_cutoff = 0.1, pvalflag = TRUE, interactive = FALSE )
gseares_path1 |
path to GSEA output. |
gseares_path2 |
path to a second GSEA output. |
result_names |
a vector of length 2 indicating the names of the two GSEA results. If NULL, names will be set to c("results1", "results2") |
nes_cutoff |
normalized enrichment score cutoff to identify enriched gene-sets. |
pval_cutoff |
p value cutoff to identify enriched gene-sets. |
pvalflag |
a logical value indicating whether to use adjusted p value in selecting enriched gene-sets. Default to TRUE. |
interactive |
a logical value indicating whether to generate an interactive plot. Default to FALSE. |
this function does not support output from GSEA R implementation
Compute RMSE, bias & variance metrics for predicted cell proportions by comparing with expected cell proportions.
compute_metrics(prop_pred, prop_sim)
compute_metrics(prop_pred, prop_sim)
prop_pred |
a matrix-like object of predicted cell proportion values with rows representing cell types, columns representing samples. |
prop_sim |
a matrix-like object of simulated/expected cell proportion values with rows representing cell types, columns representing samples. |
a list of two objects:
a data.fame of summary metrics containing RMSE, bias & variance grouped by cell types and mixture ids (simulated samples with the same expected cell proportions).
a data.frame of aggregated RMSE values across all cell types within each sample.
## Not run: ## generate artificial bulk samples ref_list <- c(paste0(system.file("extdata", package = "SCdeconR"), "/refdata/sample1"), paste0(system.file("extdata", package = "SCdeconR"), "/refdata/sample2")) phenopath1 <- paste0(system.file("extdata", package = "SCdeconR"), "/refdata/phenodata_sample1.txt") phenopath2 <- paste0(system.file("extdata", package = "SCdeconR"), "/refdata/phenodata_sample2.txt") phenodata_list <- c(phenopath1,phenopath2) # construct integrated reference using harmony algorithm refdata <- construct_ref(ref_list = ref_list, phenodata_list = phenodata_list, data_type = "cellranger", method = "harmony", group_var = "subjectid", nfeature_rna = 50, vars_to_regress = "percent_mt", verbose = FALSE) phenodata <- data.frame(cellid = colnames(refdata), celltypes = refdata$celltype, subjectid = refdata$subjectid) bulk_sim <- bulk_generator(ref = GetAssayData(refdata, slot = "data", assay = "SCT"), phenodata = phenodata, num_mixtures = 20, num_mixtures_sprop = 1) ## perform deconvolution based on "OLS" algorithm decon_res <- scdecon(bulk = bulk_sim[[1]], ref = GetAssayData(refdata, slot = "data", assay = "SCT"), phenodata = phenodata, filter_ref = TRUE, decon_method = "OLS", norm_method = "none", trans_method = "none", marker_strategy = "all") ## compute metrics metrics_res <- compute_metrics(decon_res[[1]], bulk_sim[[2]]) ## End(Not run)
## Not run: ## generate artificial bulk samples ref_list <- c(paste0(system.file("extdata", package = "SCdeconR"), "/refdata/sample1"), paste0(system.file("extdata", package = "SCdeconR"), "/refdata/sample2")) phenopath1 <- paste0(system.file("extdata", package = "SCdeconR"), "/refdata/phenodata_sample1.txt") phenopath2 <- paste0(system.file("extdata", package = "SCdeconR"), "/refdata/phenodata_sample2.txt") phenodata_list <- c(phenopath1,phenopath2) # construct integrated reference using harmony algorithm refdata <- construct_ref(ref_list = ref_list, phenodata_list = phenodata_list, data_type = "cellranger", method = "harmony", group_var = "subjectid", nfeature_rna = 50, vars_to_regress = "percent_mt", verbose = FALSE) phenodata <- data.frame(cellid = colnames(refdata), celltypes = refdata$celltype, subjectid = refdata$subjectid) bulk_sim <- bulk_generator(ref = GetAssayData(refdata, slot = "data", assay = "SCT"), phenodata = phenodata, num_mixtures = 20, num_mixtures_sprop = 1) ## perform deconvolution based on "OLS" algorithm decon_res <- scdecon(bulk = bulk_sim[[1]], ref = GetAssayData(refdata, slot = "data", assay = "SCT"), phenodata = phenodata, filter_ref = TRUE, decon_method = "OLS", norm_method = "none", trans_method = "none", marker_strategy = "all") ## compute metrics metrics_res <- compute_metrics(decon_res[[1]], bulk_sim[[2]]) ## End(Not run)
Integration of scRNA-seq or snRNA-seq data using either harmony or seurat.
construct_ref( ref_list, phenodata_list, data_type = c("cellranger", "h5", "matrix"), method = c("harmony", "seurat"), group_var, nfeature_rna = 200, percent_mt = 40, vars_to_regress = c("percent_mt", "phase"), ex_features = NULL, cluster = TRUE, resolution = 0.8, verbose = TRUE, ... )
construct_ref( ref_list, phenodata_list, data_type = c("cellranger", "h5", "matrix"), method = c("harmony", "seurat"), group_var, nfeature_rna = 200, percent_mt = 40, vars_to_regress = c("percent_mt", "phase"), ex_features = NULL, cluster = TRUE, resolution = 0.8, verbose = TRUE, ... )
ref_list |
a character vector of data paths to scRNA-seq/snRNA-seq. See |
phenodata_list |
a character vector of data paths to metadata for elements in ref_list. All metadata within phenodata_list should have consistent column names. Columns represent
cell attributes, such as cell type, rows represent cells. Each element in
|
data_type |
data type of the input scRNA-seq/snRNA-seq data. Could be either a single character value from "cellranger", "h5", "matrix", or a vector/list of values with the same length as ref_list indicating the data type for each element. |
method |
character value specifying the method to use. Has to be one of "harmony" or "seurat". See details for more information. |
group_var |
a vector of character values indicating which variables within phenodata_list metadata to use for integration. Only applicable when method is set to "harmony". |
nfeature_rna |
minimum # of features with non-zero UMIs. Cells with # of features lower than nfeature_rna will be removed. Default to 200. |
percent_mt |
maximum percentage of mitochondria (MT) mapped UMIs. Cells with MT percentage higher than percent_mt will be removed. Default to 40. |
vars_to_regress |
a list of character values indicating the variables to regress for SCTransform normalization step. Default is to regress out MT percentage ("percent_mt") & cell cycle effects ("phase") |
ex_features |
a vector of character values indicating genes to exclude from anchor features. Those genes will not be considered as anchor features for integration, but will still be present in the integrated data. |
cluster |
logical value indicating whether to perform clustering on the integrated data. If TRUE, unsupervised clustering will be performed, and the results will be saved in "seurat_clusters" metadata in the output Seurat object. |
resolution |
numeric value specifying resolution to use when cluster is set to TRUE. |
verbose |
logical value indicating whether to print messages. |
... |
additional parameters passed to |
data_type can be chosen from:
path to a directory containing the matrix.mtx, genes.tsv (or features.tsv), and barcodes.tsv files outputted by 10x's cell-ranger
path to .h5 file outputted by 10x's cell-ranger
path to a matrix-like file, with rows representing genes, columns representing cells.
SCTransform with vst.flavor = "v2"
is used for normalization of individual data. Integration methods can be chosen from either "harmony"
or "seurat". Harmony typically is more memory efficient and, recommended if you have large # of cells for integration.
a Seurat-class
object.
## Not run: ## random subset of two scRNA-seq datasets for breast tissue ref_list <- c(paste0(system.file("extdata", package = "SCdeconR"), "/refdata/sample1"), paste0(system.file("extdata", package = "SCdeconR"), "/refdata/sample2")) phenopath1 <- paste0(system.file("extdata", package = "SCdeconR"), "/refdata/phenodata_sample1.txt") phenopath2 <- paste0(system.file("extdata", package = "SCdeconR"), "/refdata/phenodata_sample2.txt") phenodata_list <- c(phenopath1,phenopath2) ## Register backend for parallel processing registerDoFuture() plan("multisession", workers = 4) ## construct integrated reference data refdata <- construct_ref(ref_list = ref_list, phenodata_list = phenodata_list, data_type = "cellranger", method = "harmony", group_var = "subjectid", nfeature_rna = 50, vars_to_regress = "percent_mt") ## End(Not run)
## Not run: ## random subset of two scRNA-seq datasets for breast tissue ref_list <- c(paste0(system.file("extdata", package = "SCdeconR"), "/refdata/sample1"), paste0(system.file("extdata", package = "SCdeconR"), "/refdata/sample2")) phenopath1 <- paste0(system.file("extdata", package = "SCdeconR"), "/refdata/phenodata_sample1.txt") phenopath2 <- paste0(system.file("extdata", package = "SCdeconR"), "/refdata/phenodata_sample2.txt") phenodata_list <- c(phenopath1,phenopath2) ## Register backend for parallel processing registerDoFuture() plan("multisession", workers = 4) ## construct integrated reference data refdata <- construct_ref(ref_list = ref_list, phenodata_list = phenodata_list, data_type = "cellranger", method = "harmony", group_var = "subjectid", nfeature_rna = 50, vars_to_regress = "percent_mt") ## End(Not run)
The model was learnt using seven paired FFPE-FFzn samples from BBD patients
ffpemodel
ffpemodel
a gam Object
from mgcv
Heatmap to demonstrate enrichment of selected gene-sets.
gsea_heatmap( normdata, teststats, gmtfile, numgenes, gsname_up, gsname_down, anncol, color, anncolors = NULL, rankcol = TRUE, zscore_range = c(-3, 3) )
gsea_heatmap( normdata, teststats, gmtfile, numgenes, gsname_up, gsname_down, anncol, color, anncolors = NULL, rankcol = TRUE, zscore_range = c(-3, 3) )
normdata |
a matrix-like object of normalized & untransformed bulk RNA-seq data, with rows representing genes and columns representing samples.
The first element of the output from function |
teststats |
a data.frame containing differential expression results with five columns: "Gene name", "log2 fold change", "log2 average expression",
"p value", "adjusted p value". The second element of the output from function |
gmtfile |
path to gmt file used for GSEA analysis. |
numgenes |
Number of genes to include in the heatmap. Will choose |
gsname_up |
a character value indicating selected up-regulated gene-set. |
gsname_down |
a character value indicating selected down-regulated gene-set. |
anncol |
a data.frame of sample meta information to include as column annotation bars. See option annCol from |
color |
color used for heatmap. See option color option from |
anncolors |
optional data.frame to define colors for column annotations in |
rankcol |
a logical value indicating whether to sort samples based on correlation between fold change & gene expression for better visualization. Default to TRUE. |
zscore_range |
a vector of length two indicating the desired range of z-score transformed data. Default to c(-3, 3). |
this function does not support output from GSEA R implementation
Generate high-quality GSEA random-walk figures.
gsea_rwplot(gseares_path, gsname, class_name, metric_range = NULL)
gsea_rwplot(gseares_path, gsname, class_name, metric_range = NULL)
gseares_path |
path to GSEA output. |
gsname |
a character value indicating the gene-set name to generate random-walk plot. |
class_name |
a character value indicating the class of the gene-set, e.g. "GO". |
metric_range |
optional range of the ranking metric. |
this function does not support output from GSEA R implementation. Scripts initially implemented by Thomas Kuilman.
Summary plot of selected up/down regulated gene-sets for gene set enrichment analysis.
gsea_sumplot( gseares_path, pos_sel, neg_sel, pvalflag = TRUE, interactive = FALSE )
gsea_sumplot( gseares_path, pos_sel, neg_sel, pvalflag = TRUE, interactive = FALSE )
gseares_path |
path to GSEA output. |
pos_sel |
a character vector of upregulated gene-set names. |
neg_sel |
a character vector of downregulated gene-set names. |
pvalflag |
a logical value indicating whether to use adjusted p value in the plot. Default to TRUE. |
interactive |
a logical value indicating whether to generate an interactive plot. Default to FALSE. |
this function does not support output from GSEA R implementation
Load and preprocess scRNA-seq/snRNA-seq data using seurat SCTransform workflow.
load_scdata( ref, data_type = c("cellranger", "h5", "matrix"), meta_info, nfeature_rna = 200, percent_mt = 40, cc.genes = NULL, vars_to_regress = c("percent_mt", "phase"), id, verbose, ... )
load_scdata( ref, data_type = c("cellranger", "h5", "matrix"), meta_info, nfeature_rna = 200, percent_mt = 40, cc.genes = NULL, vars_to_regress = c("percent_mt", "phase"), id, verbose, ... )
ref |
path to scRNA-seq/snRNA-seq data. |
data_type |
a character value specifying data type of the input scRNA-seq/snRNA-seq data, should be one of "cellranger", "h5", "matrix". |
meta_info |
a data.frame with rows representing cells, columns representing cell attributes. |
nfeature_rna |
minimum # of features with non-zero UMIs. Cells with # of features lower than nfeature_rna will be removed. Default to 200. |
percent_mt |
maximum percentage of mitochondria (MT) mapped UMIs. Cells with MT percentage higher than percent_mt will be removed. Default to 40. |
cc.genes |
cell-cycle genes curated by Seurat. Can be loaded via |
vars_to_regress |
a list of character values indicating the variables to regress for SCTransform normalization step. Default is to regress out MT percentage ("percent_mt") & cell cycle effects ("phase") |
id |
a character value specifying project or sample id. Only used for printing purposes. |
verbose |
logical value indicating whether to print messages. |
... |
additional parameters passed to |
For more details, refer to construct_ref
a Seurat-class
object.
## Not run: samplepath1 <- paste0(system.file("extdata", package = "SCdeconR"), "/refdata/sample1") samplepath2 <- paste0(system.file("extdata", package = "SCdeconR"), "/refdata/sample2") ref_list <- c(samplepath1, samplepath2) phenopath1 <- paste0(system.file("extdata", package = "SCdeconR"), "/refdata/phenodata_sample1.txt") phenopath2 <- paste0(system.file("extdata", package = "SCdeconR"), "/refdata/phenodata_sample2.txt") phenodata_list <- c(phenopath1,phenopath2) tmp <- load_scdata( ref = ref_list[[1]], data_type = c("cellranger"), meta_info = fread(file = phenodata_list[[1]], check.names = FALSE, header = TRUE), nfeature_rna = 50, vars_to_regress = c("percent_mt"), id = 1, verbose = TRUE) ## End(Not run)
## Not run: samplepath1 <- paste0(system.file("extdata", package = "SCdeconR"), "/refdata/sample1") samplepath2 <- paste0(system.file("extdata", package = "SCdeconR"), "/refdata/sample2") ref_list <- c(samplepath1, samplepath2) phenopath1 <- paste0(system.file("extdata", package = "SCdeconR"), "/refdata/phenodata_sample1.txt") phenopath2 <- paste0(system.file("extdata", package = "SCdeconR"), "/refdata/phenodata_sample2.txt") phenodata_list <- c(phenopath1,phenopath2) tmp <- load_scdata( ref = ref_list[[1]], data_type = c("cellranger"), meta_info = fread(file = phenodata_list[[1]], check.names = FALSE, header = TRUE), nfeature_rna = 50, vars_to_regress = c("percent_mt"), id = 1, verbose = TRUE) ## End(Not run)
Prepare .rnk file for GSEA preranked analysis
prepare_rnk(teststats, outputfile, replace = FALSE)
prepare_rnk(teststats, outputfile, replace = FALSE)
teststats |
a data.frame containing differential expression results with five columns: "Gene name", "log2 fold change", "log2 average expression",
"p value", "adjusted p value". The second element of the output from function |
outputfile |
full path including file name to .rnk file. |
replace |
a logical value indicating whether to replace the output file if it already exists. Default to FALSE. |
Bar plot of cell type proportions across samples
prop_barplot(prop, sort = TRUE, interactive = FALSE)
prop_barplot(prop, sort = TRUE, interactive = FALSE)
prop |
a matrix or data.frame of cell proportion values with rows representing cell types, columns representing samples. |
sort |
a logical value indicating whether to sort the samples based on cell type with highest median cell proportion across samples. Default to TRUE. |
interactive |
a logical value indicating whether to generate interactive plot. Default to FALSE. |
Reformat, read & write .gmt file.
reformat_gmt(gmtfile, outputfile, replace = FALSE) read_gmt(gmtfile) write_gmt(gmt, outputfile, replace = FALSE)
reformat_gmt(gmtfile, outputfile, replace = FALSE) read_gmt(gmtfile) write_gmt(gmt, outputfile, replace = FALSE)
gmtfile |
path to a gene set definition file in .gmt format. |
outputfile |
full path including file name to export reformatted .gmt file. |
replace |
a logical value indicating whether to replace the output file if it already exists. Default to FALSE. |
gmt |
a gmt object returned by |
reformat_gmt
replaces blank spaces within the gene-set names to help string-matching methods in downstream plot functions gsea_sumplot
,gsea_rwplot
, gsea_heatmap
.
for read_gmt, returns a list object with length equal to the total number of gene sets within the .gmt file. Each list contains three elements:
"id", "name", "genes". read_gmt & write_gmt are reimplemented based on functions from ActivePathways
package.
Performing differential expression analysis adjusting for cell proportion differences, and other covariates using a additive model.
run_de( bulk, prop = NULL, sampleinfo, control = NULL, case = NULL, de_method = c("edgeR", "DESeq2", "limma_voom", "limma"), padj_method = "BH", ... )
run_de( bulk, prop = NULL, sampleinfo, control = NULL, case = NULL, de_method = c("edgeR", "DESeq2", "limma_voom", "limma"), padj_method = "BH", ... )
bulk |
a matrix-like object of gene expression values with rows representing genes, columns representing samples |
prop |
a matrix-like object of cell proportion values with rows representing cell types, columns representing samples. Default to NULL, not adjust for cell proportion. |
sampleinfo |
a data.frame of metadata for the samples. Rows represents samples; columns represents covariates to adjust for. The first column of sampleinfo should contains group information for differential analysis. |
control |
a character value indicating the control group in sampleinfo. Set to NULL to perform ANOVA-like analysis. |
case |
a character value indicating the case group in sampleinfo. Set to NULL to perform ANOVA-like analysis. |
de_method |
a character value indicating the method to use for testing differential expression. Should be one of "edgeR", "DESeq2", "limma_voom", "limma" |
padj_method |
method for adjusting multiple hypothesis testing. Default to "BH". See |
... |
parameters pass to DE methods. |
To perform ANOVA like analysis (differences between any groups), set control
& case
options to NULL
and choose one of the following methods:
edgeR, limma_voom or limma. DESeq2 does not provide direct support for this type of comparison.
a list of two elements:
a data.frame containing normalized gene expression data.
a data.frame containing detailed differential expression statistics. Columns represent "Gene name", "log2 fold change", "log2 average expression", "p value", "adjusted p value" respectively
## Not run: ref_list <- c(paste0(system.file("extdata", package = "SCdeconR"), "/refdata/sample1"), paste0(system.file("extdata", package = "SCdeconR"), "/refdata/sample2")) phenopath1 <- paste0(system.file("extdata", package = "SCdeconR"), "/refdata/phenodata_sample1.txt") phenopath2 <- paste0(system.file("extdata", package = "SCdeconR"), "/refdata/phenodata_sample2.txt") phenodata_list <- c(phenopath1,phenopath2) # construct integrated reference using harmony algorithm refdata <- construct_ref(ref_list = ref_list, phenodata_list = phenodata_list, data_type = "cellranger", method = "harmony", group_var = "subjectid", nfeature_rna = 50, vars_to_regress = "percent_mt", verbose = FALSE) phenodata <- data.frame(cellid = colnames(refdata), celltypes = refdata$celltype, subjectid = refdata$subjectid) ## construct a vector with same proportions across cell types prop1 <- data.frame(celltypes = unique(refdata$celltype), proportion = rep(0.125, 8)) ## simulate 20 bulk samples based on specified cell type proportion bulk_sim1 <- bulk_generator(ref = GetAssayData(refdata, slot = "data", assay = "SCT"), phenodata = phenodata, num_mixtures = 20, prop = prop1, replace = TRUE) ## generate another vector with high proportion for a certian cell type prop2 <- data.frame(celltypes = unique(refdata$celltype), proportion = c(0.8, 0.1, 0.1, rep(0, 5))) bulk_sim2 <- bulk_generator(ref = GetAssayData(refdata, slot = "data", assay = "SCT"), phenodata = phenodata, num_mixtures = 20, prop = prop2, replace = TRUE) ## compare data for differential analysis bulk_sim <- list(cbind(bulk_sim1[[1]], bulk_sim2[[1]]), cbind(bulk_sim1[[2]], bulk_sim2[[2]])) ## force to be integer for DE purposes bulk <- round(bulk_sim[[1]], digits=0) colnames(bulk) <- paste0("sample", 1:ncol(bulk)) ## predict cell type proportions using "OLS" algorithm decon_res <- scdecon(bulk = bulk, ref = GetAssayData(refdata, slot = "data", assay = "SCT"), phenodata = phenodata, filter_ref = TRUE, decon_method = "OLS", norm_method = "none", trans_method = "none", marker_strategy = "all") ## create sampleinfo sampleinfo <- data.frame(condition = rep(c("group1", "group2"), each =20)) row.names(sampleinfo) <- colnames(bulk) library(DESeq2) deres <- run_de(bulk = bulk, prop = decon_res[[1]], sampleinfo = sampleinfo, control = "group1", case = "group2", de_method = "edgeR") ## run differential analysis without adjusting for cell proportion differences deres_notadjust <- run_de(bulk = bulk, prop = NULL, sampleinfo = sampleinfo, control = "group1", case = "group2", de_method = "edgeR") ## scatter plot to compare the effect of adjusting cell proportion differences comparedeg_scatter(results1 = deres[[2]], results2 = deres_notadjust[[2]], result_names = c("adjust for cell proportion", "not adjust for cell proportion"), fc_cutoff = 1.5, pval_cutoff = 0.05, pvalflag = TRUE, interactive = T) ## generate cell-type specific gene expression ct_exprs_list <- celltype_expression(bulk = bulk, ref = GetAssayData(refdata, slot = "data", assay = "SCT"), phenodata = phenodata, prop = decon_res[[1]], UMI_min = 0, CELL_MIN_INSTANCE = 1) ## End(Not run)
## Not run: ref_list <- c(paste0(system.file("extdata", package = "SCdeconR"), "/refdata/sample1"), paste0(system.file("extdata", package = "SCdeconR"), "/refdata/sample2")) phenopath1 <- paste0(system.file("extdata", package = "SCdeconR"), "/refdata/phenodata_sample1.txt") phenopath2 <- paste0(system.file("extdata", package = "SCdeconR"), "/refdata/phenodata_sample2.txt") phenodata_list <- c(phenopath1,phenopath2) # construct integrated reference using harmony algorithm refdata <- construct_ref(ref_list = ref_list, phenodata_list = phenodata_list, data_type = "cellranger", method = "harmony", group_var = "subjectid", nfeature_rna = 50, vars_to_regress = "percent_mt", verbose = FALSE) phenodata <- data.frame(cellid = colnames(refdata), celltypes = refdata$celltype, subjectid = refdata$subjectid) ## construct a vector with same proportions across cell types prop1 <- data.frame(celltypes = unique(refdata$celltype), proportion = rep(0.125, 8)) ## simulate 20 bulk samples based on specified cell type proportion bulk_sim1 <- bulk_generator(ref = GetAssayData(refdata, slot = "data", assay = "SCT"), phenodata = phenodata, num_mixtures = 20, prop = prop1, replace = TRUE) ## generate another vector with high proportion for a certian cell type prop2 <- data.frame(celltypes = unique(refdata$celltype), proportion = c(0.8, 0.1, 0.1, rep(0, 5))) bulk_sim2 <- bulk_generator(ref = GetAssayData(refdata, slot = "data", assay = "SCT"), phenodata = phenodata, num_mixtures = 20, prop = prop2, replace = TRUE) ## compare data for differential analysis bulk_sim <- list(cbind(bulk_sim1[[1]], bulk_sim2[[1]]), cbind(bulk_sim1[[2]], bulk_sim2[[2]])) ## force to be integer for DE purposes bulk <- round(bulk_sim[[1]], digits=0) colnames(bulk) <- paste0("sample", 1:ncol(bulk)) ## predict cell type proportions using "OLS" algorithm decon_res <- scdecon(bulk = bulk, ref = GetAssayData(refdata, slot = "data", assay = "SCT"), phenodata = phenodata, filter_ref = TRUE, decon_method = "OLS", norm_method = "none", trans_method = "none", marker_strategy = "all") ## create sampleinfo sampleinfo <- data.frame(condition = rep(c("group1", "group2"), each =20)) row.names(sampleinfo) <- colnames(bulk) library(DESeq2) deres <- run_de(bulk = bulk, prop = decon_res[[1]], sampleinfo = sampleinfo, control = "group1", case = "group2", de_method = "edgeR") ## run differential analysis without adjusting for cell proportion differences deres_notadjust <- run_de(bulk = bulk, prop = NULL, sampleinfo = sampleinfo, control = "group1", case = "group2", de_method = "edgeR") ## scatter plot to compare the effect of adjusting cell proportion differences comparedeg_scatter(results1 = deres[[2]], results2 = deres_notadjust[[2]], result_names = c("adjust for cell proportion", "not adjust for cell proportion"), fc_cutoff = 1.5, pval_cutoff = 0.05, pvalflag = TRUE, interactive = T) ## generate cell-type specific gene expression ct_exprs_list <- celltype_expression(bulk = bulk, ref = GetAssayData(refdata, slot = "data", assay = "SCT"), phenodata = phenodata, prop = decon_res[[1]], UMI_min = 0, CELL_MIN_INSTANCE = 1) ## End(Not run)
Methods to use for data normalization.
scaling( matrix, option, gene_length = NULL, seed = 1234, ffpe_artifacts = FALSE, model = NULL )
scaling( matrix, option, gene_length = NULL, seed = 1234, ffpe_artifacts = FALSE, model = NULL )
matrix |
a matrix-like objector of gene expression values with rows representing genes, columns representing samples or cells |
option |
character value specifying the normalization method to use. Has to be one of "none", "LogNormalize", "TMM", "median_ratios", "TPM", "SCTransform", "scran", "scater", "Linnorm". |
gene_length |
a data.frame with two columns. The first column represents gene names that match with provided bulk data. The second column represents length of each gene. Only applicable when norm_method is selected as "TPM" |
seed |
random seed used for simulating FFPE artifacts. Only applicable when ffpe_artifacts is set to TRUE. |
ffpe_artifacts |
logical value indicating whether to add simulated ffpe artifacts in the bulk data. Only applicable to simulation experiments in evaluating the effect of FFPE artifacts. |
model |
pre-constructed ffpe model data. Can be downloaded from github: https://github.com/Liuy12/SCdeconR/blob/master/data/ffpemodel.rda |
refer to scdecon
for more details.
a matrix-like object with the same dimension of input object after data normalization.
Deconvolution of bulk RNA-seq data based on single-cell reference data. Eight bulk deconvolution methods, along with eight normalization methods and four transformation methods are available.
scdecon( bulk, ref, phenodata, filter_ref = TRUE, marker_genes = NULL, genes_to_remove = NULL, min_pct_ct = 0.05, decon_method = c("scaden", "CIBERSORT", "OLS", "nnls", "FARDEEP", "RLR", "MuSiC", "SCDC", "scTAPE"), norm_method_sc = c("none", "LogNormalize", "SCTransform", "scran", "scater", "Linnorm"), norm_method_bulk = c("none", "TMM", "median_ratios", "TPM"), trans_method_sc = c("none", "log2", "sqrt", "vst"), trans_method_bulk = c("none", "log2", "sqrt", "vst"), gene_length = NULL, lfc_markers = log2(1.5), marker_strategy = c("all", "pos_fc", "top_50p_logFC", "top_50p_AveExpr"), to_remove = NULL, ffpe_artifacts = FALSE, model = NULL, prop = NULL, cibersortpath = NULL, pythonpath = NULL, tmpdir = NULL, remove_tmpdir = TRUE, seed = 1234, nsamples = 1000, return_value_only = FALSE, verbose = FALSE )
scdecon( bulk, ref, phenodata, filter_ref = TRUE, marker_genes = NULL, genes_to_remove = NULL, min_pct_ct = 0.05, decon_method = c("scaden", "CIBERSORT", "OLS", "nnls", "FARDEEP", "RLR", "MuSiC", "SCDC", "scTAPE"), norm_method_sc = c("none", "LogNormalize", "SCTransform", "scran", "scater", "Linnorm"), norm_method_bulk = c("none", "TMM", "median_ratios", "TPM"), trans_method_sc = c("none", "log2", "sqrt", "vst"), trans_method_bulk = c("none", "log2", "sqrt", "vst"), gene_length = NULL, lfc_markers = log2(1.5), marker_strategy = c("all", "pos_fc", "top_50p_logFC", "top_50p_AveExpr"), to_remove = NULL, ffpe_artifacts = FALSE, model = NULL, prop = NULL, cibersortpath = NULL, pythonpath = NULL, tmpdir = NULL, remove_tmpdir = TRUE, seed = 1234, nsamples = 1000, return_value_only = FALSE, verbose = FALSE )
bulk |
a matrix or data.frame of unnormalizaed & untransformed bulk RNA-seq gene expression values with rows representing genes, columns representing samples |
ref |
a matrix or data.frame of untransformed scRNA-seq gene expression counts with rows representing genes, columns representing cells. This data will be used to deconvolute provided bulk RNA-seq data. |
phenodata |
a data.frame with rows representing cells, columns representing cell attributes. It should at least contain the first three columns as:
|
filter_ref |
logical value indicating whether outlier genes & cells should be removed from the provided reference data. Defaults to TRUE |
marker_genes |
a data.frame of two columns. First column represents cell types in ref; second column represents gene names of marker genes. If specified, those genes will be used to construct signature matrix for mark-gene based deconvolution methods, such as CIBERSORT, OLS, nnls, FARDEEP and RLR. Default to NULL, carry out differential analysis to identify marker genes for each cell type in ref. |
genes_to_remove |
a vector of gene names to remove from the reference scRNA-seq data. Default to NULL. |
min_pct_ct |
a numeric value indicating the minimum required proportion of expressing cells per cell type for marker gene identification. Only applicable when marker_genes is NULL. Default to 0.05. |
decon_method |
character value specifying the deconvolution method to use. Has to be one of "scaden", "CIBERSORT", "OLS", "nnls", "FARDEEP", "RLR", "MuSiC", "SCDC", "scTAPE". See details for more information. |
norm_method_sc |
character value specifying the normalization method to use for reference data. Has to be one of "none","LogNormalize", "SCTransform", "scran", "scater", "Linnorm". See details for more information. |
norm_method_bulk |
character value specifying the normalization method to use for bulk data. Has to be one of "none", "TMM", "median_ratios", "TPM". See details for more information. |
trans_method_sc |
character value specifying the transformation method to use for both bulk & reference data. Has to be one of "none", "log2", "sqrt", "vst". See details for more information. |
trans_method_bulk |
character value specifying the transformation method to use for both bulk & reference data. Has to be one of "none", "log2", "sqrt", "vst". See details for more information. |
gene_length |
a data.frame with two columns. The first column represents gene names that match with provided bulk data. The second column represents length of each gene. Only applicable when norm_method is selected as "TPM". |
lfc_markers |
log2 fold change cutoff used to identify marker genes for deconvolution. The option only applicable to marker-gene based approaches, such as CIBERSORT, OLS, nnls, FARDEEP and RLR. Only applicable when marker_genes is NULL. |
marker_strategy |
further strategy in selecting marker genes besides applying the log2 fold change cutoff. Can be chosen from: "all", "pos_fc", "top_50p_logFC" or "top_50p_AveExpr". See details for more information. Only applicable when marker_genes is NULL. |
to_remove |
character value representing the cell type to remove from reference data. Only applicable to simulation experiments in evaluating effect of cell type removal from reference. |
ffpe_artifacts |
logical value indicating whether to add simulated ffpe artifacts in the bulk data. Only applicable to simulation experiments in evaluating the effect of FFPE artifacts. |
model |
pre-constructed ffpe model data. Can be downloaded from github: https://github.com/Liuy12/SCdeconR/blob/master/data/ffpemodel.rda. |
prop |
a matrix or data.frame of simulated cell proportion values with rows representing cell types, columns representing samples. Only applicable to simulation experiments in evaluating the effect of cell type removal from reference. |
cibersortpath |
full path to CIBERSORT.R script. |
pythonpath |
full path to python binary where scaden was installed with. |
tmpdir |
temporary processing directory for scaden or scTAPE. |
remove_tmpdir |
a logical value indicating whether to remove tmpdir once scaden is completed. Default to TRUE. |
seed |
random seed used for simulating FFPE artifacts. Only applicable when ffpe_artifacts is set to TRUE. |
nsamples |
number of artificial bulk samples to simulate for scaden. Default to 1000. |
return_value_only |
return a list of values only without performing deconvolution. This could be helpful in cases where the user want to apply their own deconvolution algorithms. Default to FALSE. |
verbose |
a logical value indicating whether to print messages. Default to FALSE. |
decon_method should be one of the following:
a deep learning based method using three multi-layer deep neural nets. To use scaden,
you need to firstly install scaden via pip
or conda
, the provide the python path to pythonpath
option.
a marker gene based support vectors regression approach. CIBERSOR does not allow redistribution. To use CIBERSORT,
you need to request the source code from the authors & provide the path of CIBERSORT.R script to cibersortpath
option.
ordinary least squares.
non-negative least squares.
robust regression using least trimmed squares
robust regression using an M estimator
multi-subject single-cell deconvolution
an ENSEMBLE method to integrate deconvolution results from different scRNA-seq datasets
Deep autoencoder based deconvolution
norm_method should be one of the following:
no normalization is performed.
LogNormalize
method from seurat.
TMM method from calcNormFactors
function from edgeR.
median ratio method from estimateSizeFactors,DESeqDataSet-method
function from DESeq2.
Transcript per million. TPM has to be chosen if ffpe_artifacts is set to TRUE.
SCTransform
method from Seurat.
computeSumFactors
method from scran.
librarySizeFactors
method from scater.
Linnorm
method from Linnorm.
trans_method should be one of the following:
no transformation is performed.
log2 transformation. 0.1 is added to the data to avoid logarithm of 0s.
square root transformation.
varianceStabilizingTransformation
method from DESeq2.
marker_strategy should be one of the following:
all genes passed fold change threshold will be used.
only genes with positive fold changes will be used.
only genes with top 50 percent positive fold changes will be used.
only genes with top 50 percent average expression will be used.
a list containing two or four elements.
a data.frame of predicted cell-type proportions, with rows representing cell types, columns representing samples.
a data.frame of fitting errors of the algorithm; first column represents sample names, second column represents RMSEs.
a data.frame of simulated cell proportion after removing the specified cell_type. Only applicable to simulation experiments.
a data.frame of marker genes used for deconvolution. Only applicable to marker-gene based deconvolution methods.
## Not run: ref_list <- c(paste0(system.file("extdata", package = "SCdeconR"), "/refdata/sample1"), paste0(system.file("extdata", package = "SCdeconR"), "/refdata/sample2")) phenopath1 <- paste0(system.file("extdata", package = "SCdeconR"), "/refdata/phenodata_sample1.txt") phenopath2 <- paste0(system.file("extdata", package = "SCdeconR"), "/refdata/phenodata_sample2.txt") phenodata_list <- c(phenopath1,phenopath2) # construct integrated reference using harmony algorithm refdata <- construct_ref(ref_list = ref_list, phenodata_list = phenodata_list, data_type = "cellranger", method = "harmony", group_var = "subjectid", nfeature_rna = 50, vars_to_regress = "percent_mt", verbose = FALSE) phenodata <- data.frame(cellid = colnames(refdata), celltypes = refdata$celltype, subjectid = refdata$subjectid) bulk_sim <- bulk_generator(ref = GetAssayData(refdata, slot = "data", assay = "SCT"), phenodata = phenodata, num_mixtures = 20, num_mixtures_sprop = 1) ## perform deconvolution based on "OLS" algorithm decon_res <- scdecon(bulk = bulk_sim[[1]], ref = GetAssayData(refdata, slot = "data", assay = "SCT"), phenodata = phenodata, filter_ref = TRUE, decon_method = "OLS", norm_method = "none", trans_method = "none", marker_strategy = "all") ## End(Not run)
## Not run: ref_list <- c(paste0(system.file("extdata", package = "SCdeconR"), "/refdata/sample1"), paste0(system.file("extdata", package = "SCdeconR"), "/refdata/sample2")) phenopath1 <- paste0(system.file("extdata", package = "SCdeconR"), "/refdata/phenodata_sample1.txt") phenopath2 <- paste0(system.file("extdata", package = "SCdeconR"), "/refdata/phenodata_sample2.txt") phenodata_list <- c(phenopath1,phenopath2) # construct integrated reference using harmony algorithm refdata <- construct_ref(ref_list = ref_list, phenodata_list = phenodata_list, data_type = "cellranger", method = "harmony", group_var = "subjectid", nfeature_rna = 50, vars_to_regress = "percent_mt", verbose = FALSE) phenodata <- data.frame(cellid = colnames(refdata), celltypes = refdata$celltype, subjectid = refdata$subjectid) bulk_sim <- bulk_generator(ref = GetAssayData(refdata, slot = "data", assay = "SCT"), phenodata = phenodata, num_mixtures = 20, num_mixtures_sprop = 1) ## perform deconvolution based on "OLS" algorithm decon_res <- scdecon(bulk = bulk_sim[[1]], ref = GetAssayData(refdata, slot = "data", assay = "SCT"), phenodata = phenodata, filter_ref = TRUE, decon_method = "OLS", norm_method = "none", trans_method = "none", marker_strategy = "all") ## End(Not run)
Methods to use for data transformation.
transformation(matrix, option)
transformation(matrix, option)
matrix |
a matrix-like objector of gene expression values with rows representing genes, columns representing samples or cells |
option |
character value specifying the transformation method to use. Has to be one of "none", "log", "sqrt", "vst". |
refer to scdecon
for more details.
a matrix-like object with the same dimension of input object after data transformation.