1 Reading and quantifying the data

I load:

Samples table
SRR sample condition group_or_time_point biol_rep tech_rep run library_layout read_length sample_name
none BS_EMSeq_D0-1 WT 1 1 1 1 SINGLE - 150 D0-1
none BS_EMSeq_D0-2 KO 1 1 1 1 SINGLE - 150 D0-2
none BS_EMSeq_D0-3 WT 1 2 1 1 SINGLE - 150 D0-3
none BS_EMSeq_D0-4 KO 1 2 1 1 SINGLE - 150 D0-4
none BS_EMSeq_D14-1 KO 3 1 1 1 SINGLE - 150 D14-1
none BS_EMSeq_D14-3 WT 3 1 1 1 SINGLE - 150 D14-3
none BS_EMSeq_D14-4 KO 3 2 1 1 SINGLE - 150 D14-4
none BS_EMSeq_D14-5 WT 3 2 1 1 SINGLE - 150 D14-5
none BS_EMSeq_D7-1 WT 2 1 1 1 SINGLE - 150 D7-1
none BS_EMSeq_D7-2 KO 2 1 1 1 SINGLE - 150 D7-2
none BS_EMSeq_D7-4 KO 2 2 1 1 SINGLE - 150 D7-4
none BS_EMSeq_D7-5 WT 2 2 1 1 SINGLE - 150 D7-5

2 Data exploration

2.1 Global methylation statistics for 200 CpG tiles

2.2 Checking any presence of batch effects using methylation count over 200 CpG tiles

3 Methylation by genomic features

I annotate tiles with genomic features using genome annotation ../../data/annotations/gencode.vM25.annotation.bed.

3.1 Relationship between CpG density of a tile and its bias towards certain features

I assume that tiles of different width (i.e. different CpG density) will be biased towards certain features because of the organization of CpGs in the genome.

Therefore, I first of all bin the width of tiles and plot the annotation result using different bins of tiles’ width i.e. different bins of CpG density.

It is clear that the bins [0, 1000), [1000, 5000) and > 10000 are more associated to CpG rich promoters, gene body and intergenic regions (and CpG shores), respectively.

3.2 Methylation distribution for tiles annotated by genomic features, High and Low CpG promoters and non-promoters CpG Islands

  • I annotate the 50 CpG tiles with genomic features, except for promoters, using the following precedence: promoter, exon, intron, intergenic, HCG_promoters, LCG_promoters, non_promoter_CGI. From tiles annotated as intergenic, I remove all the ones which also overlap a repeat element used to count over Transposable Elements (i.e. a repeat element in ../../data/annotations/RepeatMasker_RepeatLibrary20140131_mm10.noGenes.noSimple.bed).
  • Separately (in a rule in Snakefile on the cluster), I had counted methylation over promoters (900bp windows around TSS); I load this methylation count and separate promoters in High CpG (HCG) and Low CpG (LCG) density promoters, based on the list of HCG and LCG transcript IDs produced in script ../R/TSSs_CpG.R.
  • Separately (in another rule in Snakefile on the cluster), I had counted methylation over those UCSC CpG Islands (CGI) track for mm10 that do not overlap with the 900-bp promoters.
  • I make violin plots for tiles belonging to the different features, binned based on their width i.e. their CpG density, as well as for promoters and non-promoter CGIs.

For each genotype and time point, the average of the two different biological replicates is shown.

4 A closer look at regions which are more resistant to demethylation

Since most of the tiles/promoters/CGIs are demethylated in KO cells, it is more interesting to look at high CpG genomic regions which are more resistant to methylation loss.

The scripts ../R/CpG_tiles_counts.R, ../R/promoter_counts.R and ../R/nonPromoter_CGI_counts.R, run on the cluster, also perform differential methylation analysis of tiles, promoters and non-promoter CGIs, respectively. For the diff meth analysis I choose logistic regression and Chi-squared test with overdispersion.

I compute Differentially Methylated tiles/promoters/CGIs both:

4.1 Hypermethylated promoters and CGIs

I firstly check whether there are significantly hypermethylated promoters and CGIs.

The test found:

  • 10, 22, 4 promoters out of 88781, 89952, 92243 to be at least 40% hypermethylated below my threshold qvalue < 0.05 between WT and KO at the 0, 14 days and 7 days time points, respectively
  • 2, 1, 0 non-promoter CGIs out of 4802, 4909, 5041 to be at least 40% hypermethylated below my threshold qvalue < 0.05 between WT and KO at the 0, 14 days and 7 days time points, respectively

However, only 0 and 0 hypermeth promoters and CGIs are shared between time points, respectively.

4.2 Different dynamics of demethylation - some regions are indeed more resistant to demethylation

Even though there is no significant hypermethylation, some high CpG genomic regions could loose methylation at later time points / to a lesser extent than others. To investigate this, I plot both a heatmap and a scatterplot.

4.2.1 Heatmap

For the heatmap below, I choose those HCG promoters which have a percentage of methylation in WT condition at time 0 > 48.6474501 (10th percentile of meth percentage in WT condition at time 0) and cluster them based on the difference between WT and KO methylation for the three time points. The heatmap shows that there are clusters of promoters which loose all methylation already at Day0, clusters which loose it at Day7 and clusters which loose it later. Hclustering does not seem to do a good job though, therefore I will use kmean clustering (see below).

4.2.2 Scatterplot

I plot only 50 CpG tiles, HCG promoters and non-promoter CGIs having a perCpg higher than 1/16*100.

4.3 Kmeans clustering of HCG regions with high percentage of WT methylation

4.3.1 Tiles

4.3.2 Promoters

4.4 Writing to external files for continuing investigation on databases

I write to external files:

  1. the coordinates of HCG tiles/HCG promoters/non-promoters CGIs which are in the resistant cluster i.e. the cluster more at the top right at all time points
  2. the coordinates of HCG tiles/HCG promoters/non-promoters CGIs which are in all the other three non-resistant clusters i.e. all highly methylated and HCG ranges, because I will use them as background for motif-searching algorithms
  3. the coordinates, with also info on the associated transcripts, of HCG promoters belonging to two clusters:
  • highly methylated in WT but immediately loosing methylation at D0: D0 loosing cluster
  • highly methylated in WT but loosing methylation to a lesser extent than the D0 loosing cluster: less resistant cluster

4.5 Gene Ontology over-representation test for resistant and non-resistant promoters

I use package ‘clusterProfiler’ to perform both Gene Ontology (GO) Over-representation Test and GO Gene Set Enrichment Test on the list of genes associated with promoters belonging to the resistant cluster and promoters belonging to the D0 loosing and less resistant clusters.

IMPORTANT: since the number of genes is very small, it is more an annotation than a biologically meaningful enrichment.

highly resistant
ID Description GeneRatio BgRatio pvalue p.adjust qvalue geneID Count
GO:2000241 GO:2000241 regulation of reproductive process 3/13 204/22853 0.0001877 0.0356558 0.0198527 Dazl/Tpst2/Runx1 3
less resistant
ID Description GeneRatio BgRatio pvalue p.adjust qvalue geneID Count
GO:0010528 GO:0010528 regulation of transposition 2/8 29/22853 0.0000433 0.0019682 0.0007797 Ddx4/Btbd18 2
GO:0010529 GO:0010529 negative regulation of transposition 2/8 29/22853 0.0000433 0.0019682 0.0007797 Ddx4/Btbd18 2
GO:0007140 GO:0007140 male meiotic nuclear division 2/8 56/22853 0.0001636 0.0050710 0.0020089 Ddx4/Btbd18 2
GO:0061982 GO:0061982 meiosis I cell cycle process 2/8 135/22853 0.0009476 0.0220309 0.0087276 Ddx4/Btbd18 2
GO:0140013 GO:0140013 meiotic nuclear division 2/8 199/22853 0.0020409 0.0421780 0.0167089 Ddx4/Btbd18 2
GO:1903046 GO:1903046 meiotic cell cycle process 2/8 219/22853 0.0024642 0.0446206 0.0176765 Ddx4/Btbd18 2
GO:0051321 GO:0051321 meiotic cell cycle 2/8 318/22853 0.0051134 0.0446206 0.0176765 Ddx4/Btbd18 2
GO:0021819 GO:0021819 layer formation in cerebral cortex 1/8 15/22853 0.0052397 0.0446206 0.0176765 Lrp8 1
GO:0051962 GO:0051962 positive regulation of nervous system development 2/8 358/22853 0.0064376 0.0446206 0.0176765 Mir219a-2/Lrp8 2
GO:0043046 GO:0043046 DNA methylation involved in gamete generation 1/8 19/22853 0.0066329 0.0446206 0.0176765 Ddx4 1
GO:0060749 GO:0060749 mammary gland alveolus development 1/8 22/22853 0.0076767 0.0446206 0.0176765 Hoxa5 1
GO:0061377 GO:0061377 mammary gland lobule development 1/8 22/22853 0.0076767 0.0446206 0.0176765 Hoxa5 1
GO:1900006 GO:1900006 positive regulation of dendrite development 1/8 24/22853 0.0083720 0.0462325 0.0183151 Lrp8 1
GO:0022412 GO:0022412 cellular process involved in reproduction in multicellular organism 2/8 478/22853 0.0112449 0.0462325 0.0183151 Ddx4/Btbd18 2
G0 loosing
ID Description GeneRatio BgRatio pvalue p.adjust qvalue geneID Count
GO:0098598 GO:0098598 learned vocalization behavior or vocal learning 2/9 10/22853 0.0000062 0.0016237 0.0007449 Shank3/Nrxn1 2
GO:0031223 GO:0031223 auditory behavior 2/9 13/22853 0.0000107 0.0016237 0.0007449 Shank3/Nrxn1 2
GO:0050795 GO:0050795 regulation of behavior 2/9 100/22853 0.0006689 0.0092658 0.0042508 Shank3/Nrxn1 2
GO:0022412 GO:0022412 cellular process involved in reproduction in multicellular organism 3/9 478/22853 0.0006954 0.0092882 0.0042611 Meioc/Dmrtb1/Fkbp6 3
GO:0010469 GO:0010469 regulation of signaling receptor activity 2/9 147/22853 0.0014363 0.0167864 0.0077011 Shank3/Nrxn1 2
GO:0009612 GO:0009612 response to mechanical stimulus 2/9 178/22853 0.0020951 0.0233890 0.0107301 Shank3/Nrxn1 2
GO:0007612 GO:0007612 learning 2/9 182/22853 0.0021888 0.0233890 0.0107301 Shank3/Nrxn1 2
GO:0021684 GO:0021684 cerebellar granular layer formation 1/9 10/22853 0.0039320 0.0288348 0.0132285 Nrxn1 1
GO:0090129 GO:0090129 positive regulation of synapse maturation 1/9 10/22853 0.0039320 0.0288348 0.0132285 Nrxn1 1
GO:0021683 GO:0021683 cerebellar granular layer morphogenesis 1/9 12/22853 0.0047168 0.0301874 0.0138490 Nrxn1 1
GO:1903365 GO:1903365 regulation of fear response 1/9 12/22853 0.0047168 0.0301874 0.0138490 Shank3 1
GO:2000822 GO:2000822 regulation of behavioral fear response 1/9 12/22853 0.0047168 0.0301874 0.0138490 Shank3 1
GO:0050803 GO:0050803 regulation of synapse structure or activity 2/9 271/22853 0.0047740 0.0301874 0.0138490 Shank3/Nrxn1 2
GO:0032409 GO:0032409 regulation of transporter activity 2/9 297/22853 0.0057054 0.0323308 0.0148323 Shank3/Nrxn1 2
GO:0021681 GO:0021681 cerebellar granular layer development 1/9 16/22853 0.0062846 0.0335778 0.0154044 Nrxn1 1
GO:0090128 GO:0090128 regulation of synapse maturation 1/9 16/22853 0.0062846 0.0335778 0.0154044 Nrxn1 1
GO:0051321 GO:0051321 meiotic cell cycle 2/9 318/22853 0.0065142 0.0336108 0.0154195 Meioc/Fkbp6 2
GO:0050806 GO:0050806 positive regulation of synaptic transmission 2/9 329/22853 0.0069577 0.0336108 0.0154195 Shank3/Nrxn1 2
GO:0021756 GO:0021756 striatum development 1/9 19/22853 0.0074591 0.0336108 0.0154195 Shank3 1
GO:0035641 GO:0035641 locomotory exploration behavior 1/9 19/22853 0.0074591 0.0336108 0.0154195 Shank3 1
GO:0043046 GO:0043046 DNA methylation involved in gamete generation 1/9 19/22853 0.0074591 0.0336108 0.0154195 Fkbp6 1
GO:0007281 GO:0007281 germ cell development 2/9 357/22853 0.0081475 0.0352145 0.0161553 Meioc/Dmrtb1 2
GO:0051962 GO:0051962 positive regulation of nervous system development 2/9 358/22853 0.0081916 0.0352145 0.0161553 Shank3/Nrxn1 2
GO:0021544 GO:0021544 subpallium development 1/9 25/22853 0.0098043 0.0402945 0.0184858 Shank3 1
GO:0045070 GO:0045070 positive regulation of viral genome replication 1/9 28/22853 0.0109750 0.0432069 0.0198219 Fkbp6 1
GO:0051446 GO:0051446 positive regulation of meiotic cell cycle 1/9 30/22853 0.0117548 0.0448603 0.0205804 Meioc 1
GO:0060074 GO:0060074 synapse maturation 1/9 31/22853 0.0121445 0.0454206 0.0208375 Nrxn1 1
GO:0021697 GO:0021697 cerebellar cortex formation 1/9 32/22853 0.0125341 0.0464134 0.0212930 Nrxn1 1
GO:0042391 GO:0042391 regulation of membrane potential 2/9 466/22853 0.0135847 0.0474495 0.0217683 Shank3/Nrxn1 2
GO:0006884 GO:0006884 cell volume homeostasis 1/9 35/22853 0.0137020 0.0474495 0.0217683 Shank3 1
GO:0001964 GO:0001964 startle response 1/9 37/22853 0.0144799 0.0487796 0.0223785 Nrxn1 1
sessionInfo()
## R version 4.1.3 (2022-03-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=it_IT.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=it_IT.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=it_IT.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=it_IT.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] grid      stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] org.Mm.eg.db_3.14.0   AnnotationDbi_1.56.2  Biobase_2.54.0       
##  [4] clusterProfiler_4.2.2 knitr_1.40            biomaRt_2.50.3       
##  [7] gridExtra_2.3         pheatmap_1.0.12       genomation_1.26.0    
## [10] ggpubr_0.4.0          purrr_0.3.5           rtracklayer_1.54.0   
## [13] data.table_1.14.4     reshape2_1.4.4        ggplot2_3.3.6        
## [16] methylKit_1.20.0      GenomicRanges_1.46.1  GenomeInfoDb_1.30.1  
## [19] IRanges_2.28.0        S4Vectors_0.32.4      BiocGenerics_0.40.0  
## 
## loaded via a namespace (and not attached):
##   [1] utf8_1.2.2                  R.utils_2.12.0             
##   [3] tidyselect_1.2.0            RSQLite_2.2.18             
##   [5] BiocParallel_1.28.3         scatterpie_0.1.8           
##   [7] munsell_0.5.0               codetools_0.2-18           
##   [9] withr_2.5.0                 colorspace_2.0-3           
##  [11] GOSemSim_2.20.0             filelock_1.0.2             
##  [13] rstudioapi_0.14             ggsignif_0.6.4             
##  [15] fastseg_1.40.0              DOSE_3.20.1                
##  [17] MatrixGenerics_1.6.0        bbmle_1.0.25               
##  [19] GenomeInfoDbData_1.2.7      polyclip_1.10-4            
##  [21] seqPattern_1.26.0           bit64_4.0.5                
##  [23] farver_2.1.1                downloader_0.4             
##  [25] coda_0.19-4                 vctrs_0.5.0                
##  [27] treeio_1.18.1               generics_0.1.3             
##  [29] xfun_0.34                   BiocFileCache_2.2.1        
##  [31] R6_2.5.1                    graphlayouts_0.8.3         
##  [33] bitops_1.0-7                cachem_1.0.6               
##  [35] fgsea_1.20.0                gridGraphics_0.5-1         
##  [37] DelayedArray_0.20.0         assertthat_0.2.1           
##  [39] BiocIO_1.4.0                scales_1.2.1               
##  [41] ggraph_2.1.0                enrichplot_1.14.2          
##  [43] gtable_0.3.1                tidygraph_1.2.2            
##  [45] rlang_1.0.6                 splines_4.1.3              
##  [47] rstatix_0.7.0               lazyeval_0.2.2             
##  [49] impute_1.68.0               broom_1.0.1                
##  [51] yaml_2.3.6                  abind_1.4-5                
##  [53] backports_1.4.1             qvalue_2.26.0              
##  [55] tools_4.1.3                 ggplotify_0.1.0            
##  [57] gridBase_0.4-7              ellipsis_0.3.2             
##  [59] jquerylib_0.1.4             RColorBrewer_1.1-3         
##  [61] Rcpp_1.0.9                  plyr_1.8.7                 
##  [63] progress_1.2.2              zlibbioc_1.40.0            
##  [65] RCurl_1.98-1.9              prettyunits_1.1.1          
##  [67] viridis_0.6.2               SummarizedExperiment_1.24.0
##  [69] ggrepel_0.9.1               magrittr_2.0.3             
##  [71] DO.db_2.9                   mvtnorm_1.1-3              
##  [73] matrixStats_0.62.0          hms_1.1.2                  
##  [75] patchwork_1.1.2             evaluate_0.17              
##  [77] XML_3.99-0.11               emdbook_1.3.12             
##  [79] mclust_5.4.10               compiler_4.1.3             
##  [81] bdsmatrix_1.3-6             tibble_3.1.8               
##  [83] shadowtext_0.1.2            KernSmooth_2.23-20         
##  [85] crayon_1.5.2                R.oo_1.25.0                
##  [87] htmltools_0.5.3             ggfun_0.0.7                
##  [89] tzdb_0.3.0                  tidyr_1.2.1                
##  [91] aplot_0.1.8                 DBI_1.1.3                  
##  [93] tweenr_2.0.2                dbplyr_2.2.1               
##  [95] MASS_7.3-55                 rappdirs_0.3.3             
##  [97] Matrix_1.5-1                car_3.1-1                  
##  [99] readr_2.1.3                 cli_3.4.1                  
## [101] R.methodsS3_1.8.2           parallel_4.1.3             
## [103] igraph_1.3.5                pkgconfig_2.0.3            
## [105] prettydoc_0.4.1             GenomicAlignments_1.30.0   
## [107] numDeriv_2016.8-1.1         xml2_1.3.3                 
## [109] ggtree_3.2.1                bslib_0.4.0                
## [111] XVector_0.34.0              yulab.utils_0.0.5          
## [113] stringr_1.4.1               digest_0.6.30              
## [115] Biostrings_2.62.0           rmarkdown_2.17             
## [117] fastmatch_1.1-3             tidytree_0.4.1             
## [119] restfulr_0.0.15             curl_4.3.3                 
## [121] Rsamtools_2.10.0            gtools_3.9.3               
## [123] rjson_0.2.21                lifecycle_1.0.3            
## [125] nlme_3.1-155                jsonlite_1.8.3             
## [127] carData_3.0-5               viridisLite_0.4.1          
## [129] limma_3.50.3                BSgenome_1.62.0            
## [131] fansi_1.0.3                 pillar_1.8.1               
## [133] lattice_0.20-45             KEGGREST_1.34.0            
## [135] fastmap_1.1.0               httr_1.4.4                 
## [137] plotrix_3.8-2               GO.db_3.14.0               
## [139] glue_1.6.2                  png_0.1-7                  
## [141] bit_4.0.4                   ggforce_0.4.1              
## [143] stringi_1.7.8               sass_0.4.2                 
## [145] blob_1.2.3                  memoise_2.0.1              
## [147] dplyr_1.0.10                ape_5.6-2