
scGraphVerse Case Study: B-cell GRN Reconstruction
Francesco Cecere
Source:vignettes/case_study.Rmd
case_study.Rmd
Introduction
This vignette demonstrates the scGraphVerse workflow on a single-cell RNA-seq dataset. We show how to:
- Load and preprocess public PBMC data.
- Infer gene regulatory networks with GENIE3.
- Build consensus networks and detect communities.
- Validate inferred edges using STRINGdb.
1. Dataset and Preprocessing
In this real-data analysis, we’ll work with two public PBMC (Peripheral Blood Mononuclear Cell) datasets from 10X Genomics. Our strategy is to focus specifically on B cells and identify a common set of highly expressed genes across both datasets. This approach allows us to compare regulatory networks between different experimental batches while controlling for cell type and gene selection effects.
Data Processing Workflow: 1. Load two PBMC datasets (3k and 4k cells) from TENxPBMCData 2. Use SingleR for automated cell type annotation 3. Select top 500 most expressed genes in B cells from each dataset 4. Find intersection of expressed genes to ensure comparable gene sets 5. Subset datasets to B cells only with common gene set
This preprocessing ensures we have clean, comparable data for network inference.
# 1. Load PBMC data
pbmc_obj <- TENxPBMCData("pbmc3k")
#> see ?TENxPBMCData and browseVignettes('TENxPBMCData') for documentation
#> loading from cache
pbmc_obj1 <- TENxPBMCData("pbmc4k")
#> see ?TENxPBMCData and browseVignettes('TENxPBMCData') for documentation
#> loading from cache
sce <- logNormCounts(pbmc_obj)
symbols_tenx <- rowData(sce)$Symbol_TENx
valid <- !is.na(symbols_tenx) & symbols_tenx != ""
sce <- sce[valid, ]
rownames(sce) <- make.unique(symbols_tenx[valid])
logcounts(sce) <- as.matrix(logcounts(sce))
colnames(sce) <- paste0("cell_", seq_len(ncol(sce)))
sce1 <- logNormCounts(pbmc_obj1)
symbols_tenx1 <- rowData(sce1)$Symbol_TENx
valid1 <- !is.na(symbols_tenx1) & symbols_tenx1 != ""
sce1 <- sce1[valid1, ]
rownames(sce1) <- make.unique(symbols_tenx1[valid1])
logcounts(sce1) <- as.matrix(logcounts(sce1))
colnames(sce1) <- paste0("cell_", seq_len(ncol(sce1)))
# 2. Cell type annotation
ref <- celldex::HumanPrimaryCellAtlasData()
pred1 <- SingleR(test = sce1, ref = ref, labels = ref$label.main)
colData(sce1)$predicted_celltype <- pred1$labels
pred <- SingleR(test = sce, ref = ref, labels = ref$label.main)
colData(sce)$predicted_celltype <- pred$labels
# 3. Select top 500 B-cell genes
genes <- selgene(
object = sce,
top_n = 500,
cell_type = "B_cell",
cell_type_col = "predicted_celltype",
remove_rib = TRUE,
remove_mt = TRUE
)
#> Using SCE assay 'logcounts' (log-normalized).
#> Subsetted to 344 cells where predicted_celltype = 'B_cell'.
#> Removed mitochondrial genes matching '^MT-'.
#> Removed ribosomal genes matching '^RP[SL]'.
#> Top 500 genes selected based on mean expression.
genes1 <- selgene(
object = sce1,
top_n = 500,
cell_type = "B_cell",
cell_type_col = "predicted_celltype",
remove_rib = TRUE,
remove_mt = TRUE
)
#> Using SCE assay 'logcounts' (log-normalized).
#> Subsetted to 607 cells where predicted_celltype = 'B_cell'.
#> Removed mitochondrial genes matching '^MT-'.
#> Removed ribosomal genes matching '^RP[SL]'.
#> Top 500 genes selected based on mean expression.
# 4. Find intersection
commong <- intersect(genes, genes1)
# 5. Subset data
pbmc1_sub <- subset(sce, features = commong)
pbmc2_sub <- subset(sce1, features = commong)
pbmc1_sub <- pbmc1_sub[commong, ]
pbmc2_sub <- pbmc2_sub[commong, ]
pbmc1_sub <- pbmc1_sub[, colData(pbmc1_sub)$predicted_celltype == "B_cell"]
pbmc2_sub <- pbmc2_sub[, colData(pbmc2_sub)$predicted_celltype == "B_cell"]
# List for multi-sample analysis
bcell_list <- list(pbmc1_sub, pbmc2_sub)
2. Network Inference
Now we’ll infer gene regulatory networks from our preprocessed B cell data using GENIE3. Unlike the simulation study where we had control over all parameters, here we’re working with real biological data that presents unique challenges: B cells have distinct expression patterns, lower total gene counts compared to the full transcriptome, and batch effects between the two PBMC datasets.
# Choose method: "GENIE3", "GRNBoost2", "ZILGM", "PCzinb" or "JRF"
method <- "GENIE3"
networks <- infer_networks(
count_matrices_list = bcell_list,
method = method,
nCores = 1
)
2.1. Building Adjacency Matrices
Here we apply a more stringent threshold (99th percentile) compared to the simulation study (95th percentile). This is appropriate for real data analysis where we expect higher noise levels and want to focus on the most confident regulatory relationships. With B cell data, we’re particularly interested in high-confidence edges controlling B cell function.
# Weighted adjacency
wadj <- generate_adjacency(networks)
# Symmetrize
swadj <- symmetrize(wadj, weight_function = "mean")
# Binary cutoff (top 1%)
binary_adj <- cutoff_adjacency(
count_matrices = bcell_list,
weighted_adjm_list = swadj,
n = 2,
method = method,
quantile_threshold = 0.99,
nCores = 1
)
# Plot
plotg(binary_adj)
3. Consensus and Community Detection
With only two PBMC datasets, our consensus network will include edges that appear in both batches, providing high confidence in the regulatory relationships. This is different from the simulation study where we had three conditions and used a majority vote.
# Consensus by vote
consensus <- create_consensus(binary_adj, method = "vote")
plotg(list(consensus))
Here we demonstrate two key features not shown in the simulation
study: network comparison with external databases and
the false_plot
parameter in
compare_consensus()
.
# Note: This requires internet connection for STRINGdb
# To run this section, set eval=TRUE and ensure internet connectivity
compare_consensus(consensus_matrix = consensus, false_plot = TRUE)
#> Initializing STRINGdb...
#> Mapping genes to STRING IDs...
#> Warning: we couldn't map to STRING 1% of your identifiers
#> Mapped 389 genes to STRING IDs.
#> 5 genes were not found in STRING but will be included as zero rows/columns.
#> Retrieving **physical** interactions from STRING API...
#> Found 288 STRING physical interactions.
#> Adjacency matrices constructed successfully.
# Community detection
communities <- community_path(consensus)
#> Detecting communities...
#> Running pathway enrichment...
#> 'select()' returned 1:1 mapping between keys and columns
#> Reading KEGG annotation online: "https://rest.kegg.jp/link/hsa/pathway"...
#> Reading KEGG annotation online: "https://rest.kegg.jp/list/pathway/hsa"...
#> 'select()' returned 1:1 mapping between keys and columns
#> 'select()' returned 1:1 mapping between keys and columns
#> 'select()' returned 1:1 mapping between keys and columns
#> 'select()' returned 1:1 mapping between keys and columns
#> 'select()' returned 1:1 mapping between keys and columns
#> 'select()' returned 1:1 mapping between keys and columns
#> 'select()' returned 1:1 mapping between keys and columns
#> 'select()' returned 1:1 mapping between keys and columns
#> 'select()' returned 1:1 mapping between keys and columns
#> 'select()' returned 1:1 mapping between keys and columns
4. Validation with STRINGdb
Unlike the simulation study where we used STRINGdb to create ground
truth, here we use it for validation of our inferred B cell regulatory
network. The keep_all_genes = TRUE
parameter ensures we
retain information for all genes in our dataset, even those not found in
STRING, which is important for comprehensive edge mining analysis.
# Note: This requires internet connection for STRINGdb downloads
# To run this section, set eval=TRUE and ensure internet connectivity
str <- stringdb_adjacency(
genes = rownames(consensus),
species = 9606,
required_score = 900,
keep_all_genes = TRUE
)$binary
#> Initializing STRINGdb...
#> Mapping genes to STRING IDs...
#> Warning: we couldn't map to STRING 1% of your identifiers
#> Mapped 389 genes to STRING IDs.
#> 5 genes were not found in STRING but will be included as zero rows/columns.
#> Retrieving **physical** interactions from STRING API...
#> Found 288 STRING physical interactions.
#> Adjacency matrices constructed successfully.
ground_truth <- symmetrize(list(str), weight_function = "mean")[[1]]
# Edge mining: TP rates
em <- edge_mining(list(consensus),
query_edge_types = "TP",
ground_truth = ground_truth
)
5. Conclusion
This case study illustrates how scGraphVerse enables end-to-end GRN reconstruction and validation in single-cell data. Users can swap inference algorithms, tune thresholds, and incorporate external prior networks.
Session Information
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 20.04.2 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=it_IT.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=it_IT.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=it_IT.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=it_IT.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Europe/Rome
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] celldex_1.16.0 SingleR_2.8.0
#> [3] org.Hs.eg.db_3.20.0 AnnotationDbi_1.68.0
#> [5] scater_1.34.1 scuttle_1.16.0
#> [7] TENxPBMCData_1.24.0 HDF5Array_1.34.0
#> [9] rhdf5_2.50.2 DelayedArray_0.32.0
#> [11] SparseArray_1.6.2 S4Arrays_1.6.0
#> [13] abind_1.4-8 Matrix_1.7-3
#> [15] SingleCellExperiment_1.28.1 SummarizedExperiment_1.36.0
#> [17] Biobase_2.66.0 GenomicRanges_1.58.0
#> [19] GenomeInfoDb_1.42.3 IRanges_2.40.1
#> [21] S4Vectors_0.44.0 BiocGenerics_0.52.0
#> [23] MatrixGenerics_1.18.1 matrixStats_1.5.0
#> [25] lubridate_1.9.4 forcats_1.0.0
#> [27] stringr_1.5.1 dplyr_1.1.4
#> [29] purrr_1.1.0 readr_2.1.5
#> [31] tidyr_1.3.1 tibble_3.3.0
#> [33] ggplot2_3.5.2 tidyverse_2.0.0
#> [35] Seurat_5.3.0 SeuratObject_5.1.0
#> [37] sp_2.2-0 scGraphVerse_0.99.0
#> [39] BiocStyle_2.34.0
#>
#> loaded via a namespace (and not attached):
#> [1] igraph_2.1.4 graph_1.84.1
#> [3] hash_2.2.6.3 ica_1.0-3
#> [5] plotly_4.11.0 Formula_1.2-6
#> [7] zlibbioc_1.52.0 tidyselect_1.2.1
#> [9] bit_4.6.0 doParallel_1.0.17
#> [11] lattice_0.20-44 blob_1.2.4
#> [13] parallel_4.4.2 hdrcde_3.4
#> [15] png_0.1-8 plotrix_3.8-4
#> [17] cli_3.6.5 ggplotify_0.1.2
#> [19] askpass_1.2.1 goftest_1.2-3
#> [21] pkgdown_2.1.3 textshaping_1.0.1
#> [23] BiocNeighbors_2.0.1 fdatest_2.1.1
#> [25] uwot_0.2.3 curl_6.4.0
#> [27] deSolve_1.40 mime_0.13
#> [29] evaluate_1.0.4 tidytree_0.4.6
#> [31] gsubfn_0.7 stringi_1.8.7
#> [33] pROC_1.19.0.1 backports_1.5.0
#> [35] desc_1.4.3 multinet_4.2.2
#> [37] XML_3.99-0.18 Exact_3.3
#> [39] httpuv_1.6.16 magrittr_2.0.3
#> [41] clusterProfiler_4.14.6 rappdirs_0.3.3
#> [43] splines_4.4.2 mclust_6.1.1
#> [45] gbm_2.2.2 rainbow_3.8
#> [47] ggraph_2.2.1 pcaPP_2.0-5
#> [49] rentrez_1.2.4 networkD3_0.4.1
#> [51] ggbeeswarm_0.7.2 sctransform_0.4.2
#> [53] rootSolve_1.8.2.4 DBI_1.2.3
#> [55] jquerylib_0.1.4 reactome.db_1.89.0
#> [57] withr_3.0.2 class_7.3-19
#> [59] systemfonts_1.2.3 enrichplot_1.26.6
#> [61] lmtest_0.9-40 tidygraph_1.3.1
#> [63] BiocManager_1.30.26 htmlwidgets_1.6.4
#> [65] fs_1.6.6 ggrepel_0.9.6
#> [67] labeling_0.4.3 cellranger_1.1.0
#> [69] lmom_3.2 reticulate_1.43.0
#> [71] robin_2.1.0 zoo_1.8-14
#> [73] XVector_0.46.0 knitr_1.50
#> [75] UCSC.utils_1.2.0 timechange_0.3.0
#> [77] mpath_0.4-2.26 foreach_1.5.2
#> [79] fda_6.3.0 patchwork_1.3.1
#> [81] caTools_1.18.3 grid_4.4.2
#> [83] data.table_1.17.8 ggtree_3.14.0
#> [85] R.oo_1.27.1 RSpectra_0.16-2
#> [87] irlba_2.3.5.1 alabaster.schemas_1.6.0
#> [89] fastDummies_1.7.5 gridGraphics_0.5-1
#> [91] DescTools_0.99.60 lazyeval_0.2.2
#> [93] yaml_2.3.10 survival_3.2-11
#> [95] scattermore_1.2 BiocVersion_3.20.0
#> [97] crayon_1.5.3 RcppAnnoy_0.0.22
#> [99] RColorBrewer_1.1-3 progressr_0.15.1
#> [101] tweenr_2.0.3 later_1.4.2
#> [103] ggridges_0.5.6 fds_1.8
#> [105] codetools_0.2-18 KEGGREST_1.46.0
#> [107] Rtsne_0.17 WeightSVM_1.7-16
#> [109] shape_1.4.6.1 ReactomePA_1.50.0
#> [111] filelock_1.0.3 INetTool_0.1.1
#> [113] data.tree_1.1.0 sqldf_0.4-11
#> [115] pkgconfig_2.0.3 spatstat.univar_3.1-4
#> [117] ggpubr_0.6.1 aplot_0.2.8
#> [119] alabaster.base_1.6.1 spatstat.sparse_3.1-0
#> [121] ape_5.8-1 viridisLite_0.4.2
#> [123] xtable_1.8-4 car_3.1-3
#> [125] plyr_1.8.9 httr_1.4.7
#> [127] tools_4.4.2 globals_0.18.0
#> [129] beeswarm_0.4.0 broom_1.0.9
#> [131] nlme_3.1-152 dbplyr_2.5.0
#> [133] ExperimentHub_2.14.0 r2r_0.1.2
#> [135] digest_0.6.37 qpdf_1.4.1
#> [137] numDeriv_2016.8-1.1 bookdown_0.43
#> [139] farver_2.1.2 tzdb_0.5.0
#> [141] reshape2_1.4.4 ks_1.15.1
#> [143] yulab.utils_0.2.0 viridis_0.6.5
#> [145] rpart_4.1-15 glue_1.8.0
#> [147] cachem_1.1.0 BiocFileCache_2.14.0
#> [149] polyclip_1.10-7 generics_0.1.4
#> [151] Biostrings_2.74.1 mvtnorm_1.3-3
#> [153] proto_1.0.0 parallelly_1.45.1
#> [155] pscl_1.5.9 bst_0.3-24
#> [157] RcppHNSW_0.6.0 ragg_1.4.0
#> [159] ScaledMatrix_1.14.0 carData_3.0-5
#> [161] pbapply_1.7-4 httr2_1.1.2
#> [163] glmnet_4.1-10 spam_2.11-1
#> [165] gson_0.1.0 STRINGdb_2.18.0
#> [167] graphlayouts_1.2.2 gtools_3.9.5
#> [169] readxl_1.4.5 alabaster.se_1.6.0
#> [171] ggsignif_0.6.4 gridExtra_2.3
#> [173] shiny_1.11.1 GenomeInfoDbData_1.2.13
#> [175] R.utils_2.13.0 rhdf5filters_1.18.1
#> [177] RCurl_1.98-1.17 memoise_2.0.1
#> [179] rmarkdown_2.29 fmsb_0.7.6
#> [181] scales_1.4.0 R.methodsS3_1.8.2
#> [183] gld_2.6.7 gypsum_1.2.0
#> [185] future_1.67.0 RANN_2.6.2
#> [187] spatstat.data_3.1-8 rstudioapi_0.17.1
#> [189] cluster_2.1.2 perturbR_0.1.3
#> [191] spatstat.utils_3.1-5 hms_1.1.3
#> [193] fitdistrplus_1.2-4 cowplot_1.2.0
#> [195] colorspace_2.1-1 rlang_1.1.6
#> [197] GENIE3_1.28.0 sparseMatrixStats_1.18.0
#> [199] DelayedMatrixStats_1.28.1 dotCall64_1.2
#> [201] ggforce_0.5.0 ggtangle_0.0.7
#> [203] xfun_0.53 alabaster.matrix_1.6.1
#> [205] e1071_1.7-16 iterators_1.0.14
#> [207] GOSemSim_2.32.0 treeio_1.30.0
#> [209] Rhdf5lib_1.28.0 bitops_1.0-9
#> [211] promises_1.3.3 RSQLite_2.4.2
#> [213] qvalue_2.38.0 fgsea_1.32.4
#> [215] proxy_0.4-27 GO.db_3.20.0
#> [217] compiler_4.4.2 alabaster.ranges_1.6.0
#> [219] distributions3_0.2.2 beachmat_2.22.0
#> [221] boot_1.3-28 graphite_1.52.0
#> [223] listenv_0.9.1 Rcpp_1.1.0
#> [225] BiocSingular_1.22.0 AnnotationHub_3.14.0
#> [227] tensor_1.5.1 MASS_7.3-65
#> [229] BiocParallel_1.40.2 spatstat.random_3.4-1
#> [231] R6_2.6.1 fastmap_1.2.0
#> [233] fastmatch_1.1-6 rstatix_0.7.2
#> [235] vipor_0.4.7 ROCR_1.0-11
#> [237] rsvd_1.0.5 gtable_0.3.6
#> [239] KernSmooth_2.23-20 miniUI_0.1.2
#> [241] deldir_2.0-4 htmltools_0.5.8.1
#> [243] bit64_4.6.0-1 spatstat.explore_3.5-2
#> [245] lifecycle_1.0.4 sass_0.4.10
#> [247] vctrs_0.6.5 spatstat.geom_3.5-0
#> [249] DOSE_4.0.1 haven_2.5.5
#> [251] ggfun_0.2.0 future.apply_1.20.0
#> [253] pracma_2.4.4 bslib_0.9.0
#> [255] pillar_1.11.0 gplots_3.2.0
#> [257] jsonlite_2.0.0 expm_1.0-0
#> [259] chron_2.3-62