Define a directory where the ouput will be stored.

0.1 Analytical pipeline

Initialize the COTAN object with the row count table and the metadata for the experiment.

Now we can start the cleaning. Analysis requires and starts from a matrix of raw UMI counts after removing possible cell doublets or multiplets and low quality or dying cells (with too high mtRNA percentage, easily done with Seurat or other tools).

If we do not want to consider the mitochondrial genes we can remove them before starting the analysis.

We want also to define a prefix to identify the sample.

0.2 Data cleaning

First we create the directory to store all information regardin the data cleaning.

Run this when B cells need to be removed.

pdf(paste(out_dir, "cleaning/", t, "_", n_it, 
    "_plots_before_cells_exlusion.pdf", sep = ""))
ttm$pca.cell.2
ggplot(ttm$D, aes(x = n, y = means)) + geom_point() + 
    geom_text_repel(data = subset(ttm$D, 
        n > (max(ttm$D$n) - 15)), aes(n, 
        means, label = rownames(ttm$D[ttm$D$n > 
            (max(ttm$D$n) - 15), ])), nudge_y = 0.05, 
        nudge_x = 0.05, direction = "x", 
        angle = 90, vjust = 0, segment.size = 0.2) + 
    ggtitle("B cell group genes mean expression") + 
    my_theme + theme(plot.title = element_text(color = "#3C5488FF", 
    size = 20, face = "italic", vjust = -5, 
    hjust = 0.02))
dev.off()
#> png 
#>   2

if (length(ttm$cl1) < length(ttm$cl2)) {
    to_rem = ttm$cl1
} else {
    to_rem = ttm$cl2
}
n_it = n_it + 1
obj@raw = obj@raw[, !colnames(obj@raw) %in% 
    to_rem]
# obj@raw = obj@raw[rownames(obj@raw)
# %in% rownames(cells),colnames(obj@raw)
# %in% colnames(cells)]
gc()
#>             used   (Mb) gc trigger   (Mb)  max used   (Mb)
#> Ncells   3706915  198.0    6015681  321.3   6015681  321.3
#> Vcells 231648395 1767.4  347181239 2648.8 347179504 2648.8

ttm = clean(obj)
#> [1] "Start estimation mu with linear method"
#> [1] 14126  2284
#>       rowname X10X82_1_GTTCCGTTCGGGCT. X10X82_1_CGTAAGTAGAAAGG.
#> 12935     Ttr                729.43369               1156.30913
#> 7116   Malat1                140.43227                 68.94445
#> 10455   Rpl41                 60.65046                 85.11661
#> 4618     Ftl1                 38.26270                 31.06756
#> 10502   Rps23                 24.83005                 41.28156
#> 10512   Rps29                 29.30760                 34.04664
#> 10507   Rps27                 23.20185                 28.93965
#> 10489   Rps14                 29.30760                 26.38615
#> 10451  Rpl37a                 25.23710                 33.19548
#> 1792      Bsg                 21.57365                 24.25823
#> 4617     Fth1                 22.79480                 22.13032
#> 10416   Rpl13                 26.45825                 18.30007
#> 3745   Eef1a1                 19.94545                 20.42799
#> 10450   Rpl37                 22.38775                 20.85357
#> 10424  Rpl18a                 21.98070                 19.57682
#>       X10X82_1_CGGATAGCCACGCT.
#> 12935                891.76077
#> 7116                 123.00149
#> 10455                 62.52575
#> 4618                  43.56303
#> 10502                 37.92546
#> 10512                 28.70035
#> 10507                 29.21285
#> 10489                 25.62531
#> 10451                 21.01275
#> 1792                  25.62531
#> 4617                  25.62531
#> 10416                 21.52526
#> 3745                  25.62531
#> 10450                 22.03777
#> 10424                 21.52526
# ttm = clean.sqrt(obj, cells)
obj = ttm$object

ttm$pca.cell.2

Run this only in the last iteration, instead the previous code, when B cells group has not to be removed.

To colour the pca based on nu_j (so the cells’ efficiency).

The next part is used to remove the cells with efficiency too low.

We can zoom on the smallest values and, if we detect a clear elbow, we can decide to remove the cells.

We also save the defined threshold in the metadata and re-run the estimation.

Repeat the estimation after the cells are removed.

In case we do not want to remove anything, we can run:

Just to check again, we plot the final efficiency coloured PCA.

COTAN analysis: in this part all the contingency tables are computed and used to get the statistics (S) To storage efficiency of all the observed tables only the yes_yes is stored.

Note that if will be necessary re-computing the yes-yes table, the yes-yes table should be cancelled before running cotan_analysis.

COEX evaluation and storing.

Automatic data elaboration (without cleaning)

obj2 = automatic.COTAN.object.creation(df = data, 
    GEO = "GSE104323", sc.method = "10X", 
    cond = "E16_dent_gy", out_dir = "Data/")
#> [1] "Initializing S4 object"
#> [1] "Condition E16_dent_gy"
#> [1] "n cells 2285"
#> [1] "Start estimation mu with linear method"
#> [1] 14131  2285
#>       rowname X10X82_1_TACTGAGGCATGGT.
#> 9692    Ptgds                910.01759
#> 7119   Malat1                144.82597
#> 8232     Nnat                 56.04377
#> 4619     Ftl1                 52.71443
#> 1127     Apoe                 44.94599
#> 1404   Atp1a2                 43.83621
#> 2343   Cdkn1c                 38.84221
#> 2840   Cox4i1                 32.73844
#> 2854    Cox8a                 32.73844
#> 10515   Rps29                 32.18355
#> 3746   Eef1a1                 28.29933
#> 10510   Rps27                 27.18955
#> 13168  Uqcr11                 25.52488
#> 10458   Rpl41                 24.41511
#> 10454  Rpl37a                 23.30533
#> [1] "Cotan analysis function started"
#> [1] "cotan analysis"
#> [1] "mu estimator creation"
#> [1] "save effective constitutive genes"
#> [1] "start a minimization"
#> [1] "Next gene: Btg1 number 1810"
#> [1] "Next gene: Eif2b3 number 3820"
#> [1] "Next gene: Hddc3 number 5830"
#> [1] "Next gene: Myh2 number 7840"
#> [1] "Next gene: Rab14 number 9850"
#> [1] "Next gene: Suv420h1 number 11860"
#> [1] "Next gene: Zfp706 number 13870"
#> [1] "a min: -0.16796875 | a max 881.25 | negative a %: 23.9800995024876"
#> [1] "Only analysis time 2.71626739501953"
#> [1] "Cotan coex estimation started"
#> [1] "coex dataframe creation"
#> [1] "Generating contingency tables for observed data"
#> [1] "creation of observed yes/yes contingency table"
#> [1] "mu estimator creation"
#> [1] "expected contingency tables creation"
#> [1] "The distance between estimated n of zeros and observed number of zero is 0.0633983950493963 over 14070"
#> [1] "Done"
#> [1] "coex estimation"
#> [1] "Diagonal coex values substituted with 0"
#> [1] "Total time 5.0609365383784"
#> [1] "Only coex time 1.99707407156626"
#> [1] "Saving elaborated data locally at Data/E16_dent_gy.cotan.RDS"

1 Analysis of the elaborated data

In this case the test possible macroscopic differences between cleaned and not cleaned data.

The initial cell number was:

While the final number of cells was:

So were removed:

cells

1.1 GDI with cleaning

In the third column of this data frame is reported the percentage of cells expressing the gene.

NPGs = c("Nes", "Vim", "Sox2", "Sox1", "Notch1", 
    "Hes1", "Hes5", "Pax6")  #,'Hes3'
PNGs = c("Map2", "Tubb3", "Neurod1", "Nefm", 
    "Nefl", "Dcx", "Tbr1")
hk = c("Calm1", "Cox6b1", "Ppia", "Rpl18", 
    "Cox7c", "Erh", "H3f3a", "Taf1", "Taf2", 
    "Gapdh", "Actb", "Golph3", "Mtmr12", 
    "Zfr", "Sub1", "Tars", "Amacr")

text.size = 12

quant.p$highlight = with(quant.p, ifelse(rownames(quant.p) %in% 
    NPGs, "NPGs", ifelse(rownames(quant.p) %in% 
    hk, "Constitutive", ifelse(rownames(quant.p) %in% 
    PNGs, "PNGs", "normal"))))

textdf <- quant.p[rownames(quant.p) %in% 
    c(NPGs, hk, PNGs), ]
mycolours <- c(Constitutive = "#00A087FF", 
    NPGs = "#E64B35FF", PNGs = "#F39B7FFF", 
    normal = "#8491B4B2")
f1 = ggplot(subset(quant.p, highlight == 
    "normal"), aes(x = sum.raw.norm, y = GDI)) + 
    geom_point(alpha = 0.1, color = "#8491B4B2", 
        size = 2.5)
GDI_plot = f1 + geom_point(data = subset(quant.p, 
    highlight != "normal"), aes(x = sum.raw.norm, 
    y = GDI, colour = highlight), size = 2.5, 
    alpha = 0.8) + geom_hline(yintercept = quantile(quant.p$GDI)[4], 
    linetype = "dashed", color = "darkblue") + 
    geom_hline(yintercept = quantile(quant.p$GDI)[3], 
        linetype = "dashed", color = "darkblue") + 
    geom_hline(yintercept = 1.5, linetype = "dotted", 
        color = "red", size = 0.5) + scale_color_manual("Status", 
    values = mycolours) + scale_fill_manual("Status", 
    values = mycolours) + xlab("log normalized counts") + 
    ylab("GDI") + geom_label_repel(data = textdf, 
    aes(x = sum.raw.norm, y = GDI, label = rownames(textdf), 
        fill = highlight), label.size = NA, 
    alpha = 0.5, direction = "both", na.rm = TRUE, 
    seed = 1234) + geom_label_repel(data = textdf, 
    aes(x = sum.raw.norm, y = GDI, label = rownames(textdf)), 
    label.size = NA, segment.color = "black", 
    segment.size = 0.5, direction = "both", 
    alpha = 0.8, na.rm = TRUE, fill = NA, 
    seed = 1234) + theme(axis.text.x = element_text(size = text.size, 
    angle = 0, hjust = 0.5, vjust = 0.5, 
    face = "plain", colour = "#3C5488FF"), 
    axis.text.y = element_text(size = text.size, 
        angle = 0, hjust = 0, vjust = 0.5, 
        face = "plain", colour = "#3C5488FF"), 
    axis.title.x = element_text(size = text.size, 
        angle = 0, hjust = 0.5, vjust = 0, 
        face = "plain", colour = "#3C5488FF"), 
    axis.title.y = element_text(size = text.size, 
        angle = 90, hjust = 0.5, vjust = 0.5, 
        face = "plain", colour = "#3C5488FF"), 
    legend.title = element_blank(), legend.text = element_text(color = "#3C5488FF", 
        face = "italic"), legend.position = "bottom")  # titl)
legend <- cowplot::get_legend(GDI_plot)
GDI_plot = GDI_plot + theme(legend.position = "none")
GDI_plot

1.2 GDI without cleaning

In the third column of this data frame is reported the percentage of cells expressing the gene.

NPGs = c("Nes", "Vim", "Sox2", "Sox1", "Notch1", 
    "Hes1", "Hes5", "Pax6")  #,'Hes3'
PNGs = c("Map2", "Tubb3", "Neurod1", "Nefm", 
    "Nefl", "Dcx", "Tbr1")
hk = c("Calm1", "Cox6b1", "Ppia", "Rpl18", 
    "Cox7c", "Erh", "H3f3a", "Taf1", "Taf2", 
    "Gapdh", "Actb", "Golph3", "Mtmr12", 
    "Zfr", "Sub1", "Tars", "Amacr")

text.size = 12

quant.p$highlight = with(quant.p, ifelse(rownames(quant.p) %in% 
    NPGs, "NPGs", ifelse(rownames(quant.p) %in% 
    hk, "Constitutive", ifelse(rownames(quant.p) %in% 
    PNGs, "PNGs", "normal"))))

textdf <- quant.p[rownames(quant.p) %in% 
    c(NPGs, hk, PNGs), ]
mycolours <- c(Constitutive = "#00A087FF", 
    NPGs = "#E64B35FF", PNGs = "#F39B7FFF", 
    normal = "#8491B4B2")
f1 = ggplot(subset(quant.p, highlight == 
    "normal"), aes(x = sum.raw.norm, y = GDI)) + 
    geom_point(alpha = 0.1, color = "#8491B4B2", 
        size = 2.5)
GDI_plot = f1 + geom_point(data = subset(quant.p, 
    highlight != "normal"), aes(x = sum.raw.norm, 
    y = GDI, colour = highlight), size = 2.5, 
    alpha = 0.8) + geom_hline(yintercept = quantile(quant.p$GDI)[4], 
    linetype = "dashed", color = "darkblue") + 
    geom_hline(yintercept = quantile(quant.p$GDI)[3], 
        linetype = "dashed", color = "darkblue") + 
    geom_hline(yintercept = 1.5, linetype = "dotted", 
        color = "red", size = 0.5) + scale_color_manual("Status", 
    values = mycolours) + scale_fill_manual("Status", 
    values = mycolours) + xlab("log normalized counts") + 
    ylab("GDI") + geom_label_repel(data = textdf, 
    aes(x = sum.raw.norm, y = GDI, label = rownames(textdf), 
        fill = highlight), label.size = NA, 
    alpha = 0.5, direction = "both", na.rm = TRUE, 
    seed = 1234) + geom_label_repel(data = textdf, 
    aes(x = sum.raw.norm, y = GDI, label = rownames(textdf)), 
    label.size = NA, segment.color = "black", 
    segment.size = 0.5, direction = "both", 
    alpha = 0.8, na.rm = TRUE, fill = NA, 
    seed = 1234) + theme(axis.text.x = element_text(size = text.size, 
    angle = 0, hjust = 0.5, vjust = 0.5, 
    face = "plain", colour = "#3C5488FF"), 
    axis.text.y = element_text(size = text.size, 
        angle = 0, hjust = 0, vjust = 0.5, 
        face = "plain", colour = "#3C5488FF"), 
    axis.title.x = element_text(size = text.size, 
        angle = 0, hjust = 0.5, vjust = 0, 
        face = "plain", colour = "#3C5488FF"), 
    axis.title.y = element_text(size = text.size, 
        angle = 90, hjust = 0.5, vjust = 0.5, 
        face = "plain", colour = "#3C5488FF"), 
    legend.title = element_blank(), legend.text = element_text(color = "#3C5488FF", 
        face = "italic"), legend.position = "bottom")  # titl)
legend <- cowplot::get_legend(GDI_plot)
GDI_plot = GDI_plot + theme(legend.position = "none")
GDI_plot

1.3 Heatmaps with cleaning

For the Gene Pair Analysis, we can plot an heatmap with the coex values between two genes sets. To do so we need to define, in a list, the different gene sets (list.genes). Then in the function parameter sets we can decide which sets need to be considered (in the example from 1 to 3). In the condition parameter we should insert an array with each file name prefix to be considered (in the example, the file names is “E15_cortex_clean”).

1.4 Heatmaps without cleaning

print(sessionInfo())
#> R version 4.0.4 (2021-02-15)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.5 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] latex2exp_0.5.0   ggrepel_0.9.1     ggplot2_3.3.3     Matrix_1.3-2     
#> [5] data.table_1.14.0 COTAN_0.1.0      
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.6           lattice_0.20-41      circlize_0.4.12     
#>  [4] tidyr_1.1.2          png_0.1-7            assertthat_0.2.1    
#>  [7] digest_0.6.27        utf8_1.2.1           R6_2.5.0            
#> [10] stats4_4.0.4         evaluate_0.14        highr_0.8           
#> [13] pillar_1.5.1         basilisk_1.2.1       GlobalOptions_0.1.2 
#> [16] rlang_0.4.10         jquerylib_0.1.3      S4Vectors_0.28.1    
#> [19] GetoptLong_1.0.5     reticulate_1.18      rmarkdown_2.7       
#> [22] labeling_0.4.2       stringr_1.4.0        munsell_0.5.0       
#> [25] compiler_4.0.4       xfun_0.22            pkgconfig_2.0.3     
#> [28] BiocGenerics_0.36.0  shape_1.4.5          htmltools_0.5.1.1   
#> [31] tidyselect_1.1.0     tibble_3.1.0         IRanges_2.24.1      
#> [34] matrixStats_0.58.0   fansi_0.4.2          withr_2.4.1         
#> [37] crayon_1.4.0         dplyr_1.0.4          rappdirs_0.3.3      
#> [40] basilisk.utils_1.2.2 grid_4.0.4           gtable_0.3.0        
#> [43] jsonlite_1.7.2       lifecycle_1.0.0      DBI_1.1.1           
#> [46] magrittr_2.0.1       formatR_1.8          scales_1.1.1        
#> [49] stringi_1.5.3        farver_2.1.0         bslib_0.2.4         
#> [52] ellipsis_0.3.1       filelock_1.0.2       generics_0.1.0      
#> [55] vctrs_0.3.6          cowplot_1.1.1        rjson_0.2.20        
#> [58] RColorBrewer_1.1-2   tools_4.0.4          Cairo_1.5-12.2      
#> [61] glue_1.4.2           purrr_0.3.4          parallel_4.0.4      
#> [64] yaml_2.2.1           clue_0.3-58          colorspace_2.0-0    
#> [67] cluster_2.1.1        ComplexHeatmap_2.6.2 knitr_1.31          
#> [70] sass_0.3.1