1 Reading and quantifying the data
I load:
- the methylRawList objects (objects of package methylKit) produced by R scripts ../R/CpG_tiles_counts.R, ../R/promoter_counts.R and ../R/nonPromoter_CGI_counts.R, run in a rule of the Snakefile on the cluster
- the table containing samples’ information
SRR | sample | condition | group_or_time_point | biol_rep | tech_rep | run | library_layout | read_length | sample_name |
---|---|---|---|---|---|---|---|---|---|
none | BS_EMSeq_D0-1 | WT | 1 | 1 | 1 | 1 | SINGLE - | 150 | D0-1 |
none | BS_EMSeq_D0-2 | KO | 1 | 1 | 1 | 1 | SINGLE - | 150 | D0-2 |
none | BS_EMSeq_D0-3 | WT | 1 | 2 | 1 | 1 | SINGLE - | 150 | D0-3 |
none | BS_EMSeq_D0-4 | KO | 1 | 2 | 1 | 1 | SINGLE - | 150 | D0-4 |
none | BS_EMSeq_D14-1 | KO | 3 | 1 | 1 | 1 | SINGLE - | 150 | D14-1 |
none | BS_EMSeq_D14-3 | WT | 3 | 1 | 1 | 1 | SINGLE - | 150 | D14-3 |
none | BS_EMSeq_D14-4 | KO | 3 | 2 | 1 | 1 | SINGLE - | 150 | D14-4 |
none | BS_EMSeq_D14-5 | WT | 3 | 2 | 1 | 1 | SINGLE - | 150 | D14-5 |
none | BS_EMSeq_D7-1 | WT | 2 | 1 | 1 | 1 | SINGLE - | 150 | D7-1 |
none | BS_EMSeq_D7-2 | KO | 2 | 1 | 1 | 1 | SINGLE - | 150 | D7-2 |
none | BS_EMSeq_D7-4 | KO | 2 | 2 | 1 | 1 | SINGLE - | 150 | D7-4 |
none | BS_EMSeq_D7-5 | WT | 2 | 2 | 1 | 1 | SINGLE - | 150 | D7-5 |
2 Data exploration
2.1 Global methylation statistics for 200 CpG tiles
2.2 Checking any presence of batch effects using methylation count over 200 CpG tiles
3 Methylation by genomic features
I annotate tiles with genomic features using genome annotation ../../data/annotations/gencode.vM25.annotation.bed.
3.1 Relationship between CpG density of a tile and its bias towards certain features
I assume that tiles of different width (i.e. different CpG density) will be biased towards certain features because of the organization of CpGs in the genome.
Therefore, I first of all bin the width of tiles and plot the annotation result using different bins of tiles’ width i.e. different bins of CpG density.
It is clear that the bins [0, 1000), [1000, 5000) and > 10000 are more associated to CpG rich promoters, gene body and intergenic regions (and CpG shores), respectively.
3.2 Methylation distribution for tiles annotated by genomic features, High and Low CpG promoters and non-promoters CpG Islands
- I annotate the 50 CpG tiles with genomic features, except for promoters, using the following precedence: promoter, exon, intron, intergenic, HCG_promoters, LCG_promoters, non_promoter_CGI. From tiles annotated as intergenic, I remove all the ones which also overlap a repeat element used to count over Transposable Elements (i.e. a repeat element in ../../data/annotations/RepeatMasker_RepeatLibrary20140131_mm10.noGenes.noSimple.bed).
- Separately (in a rule in Snakefile on the cluster), I had counted methylation over promoters (900bp windows around TSS); I load this methylation count and separate promoters in High CpG (HCG) and Low CpG (LCG) density promoters, based on the list of HCG and LCG transcript IDs produced in script ../R/TSSs_CpG.R.
- Separately (in another rule in Snakefile on the cluster), I had counted methylation over those UCSC CpG Islands (CGI) track for mm10 that do not overlap with the 900-bp promoters.
- I make violin plots for tiles belonging to the different features, binned based on their width i.e. their CpG density, as well as for promoters and non-promoter CGIs.
For each genotype and time point, the average of the two different biological replicates is shown.
4 A closer look at regions which are more resistant to demethylation
Since most of the tiles/promoters/CGIs are demethylated in KO cells, it is more interesting to look at high CpG genomic regions which are more resistant to methylation loss.
The scripts ../R/CpG_tiles_counts.R, ../R/promoter_counts.R and ../R/nonPromoter_CGI_counts.R, run on the cluster, also perform differential methylation analysis of tiles, promoters and non-promoter CGIs, respectively. For the diff meth analysis I choose logistic regression and Chi-squared test with overdispersion.
I compute Differentially Methylated tiles/promoters/CGIs both:
- between WT and KO genotype, for each time point
- between time points, for each genotype
4.1 Hypermethylated promoters and CGIs
I firstly check whether there are significantly hypermethylated promoters and CGIs.
The test found:
- 10, 22, 4 promoters out of 88781, 89952, 92243 to be at least 40% hypermethylated below my threshold qvalue < 0.05 between WT and KO at the 0, 14 days and 7 days time points, respectively
- 2, 1, 0 non-promoter CGIs out of 4802, 4909, 5041 to be at least 40% hypermethylated below my threshold qvalue < 0.05 between WT and KO at the 0, 14 days and 7 days time points, respectively
However, only 0 and 0 hypermeth promoters and CGIs are shared between time points, respectively.
4.2 Different dynamics of demethylation - some regions are indeed more resistant to demethylation
Even though there is no significant hypermethylation, some high CpG genomic regions could loose methylation at later time points / to a lesser extent than others. To investigate this, I plot both a heatmap and a scatterplot.
4.2.1 Heatmap
For the heatmap below, I choose those HCG promoters which have a percentage of methylation in WT condition at time 0 > 48.6474501 (10th percentile of meth percentage in WT condition at time 0) and cluster them based on the difference between WT and KO methylation for the three time points. The heatmap shows that there are clusters of promoters which loose all methylation already at Day0, clusters which loose it at Day7 and clusters which loose it later. Hclustering does not seem to do a good job though, therefore I will use kmean clustering (see below).
4.2.2 Scatterplot
I plot only 50 CpG tiles, HCG promoters and non-promoter CGIs having a perCpg higher than 1/16*100.
4.3 Kmeans clustering of HCG regions with high percentage of WT methylation
4.3.1 Tiles
4.3.2 Promoters
4.4 Writing to external files for continuing investigation on databases
I write to external files:
- the coordinates of HCG tiles/HCG promoters/non-promoters CGIs which are in the resistant cluster i.e. the cluster more at the top right at all time points
- the coordinates of HCG tiles/HCG promoters/non-promoters CGIs which are in all the other three non-resistant clusters i.e. all highly methylated and HCG ranges, because I will use them as background for motif-searching algorithms
- the coordinates, with also info on the associated transcripts, of HCG promoters belonging to two clusters:
- highly methylated in WT but immediately loosing methylation at D0: D0 loosing cluster
- highly methylated in WT but loosing methylation to a lesser extent than the D0 loosing cluster: less resistant cluster
4.5 Gene Ontology over-representation test for resistant and non-resistant promoters
I use package ‘clusterProfiler’ to perform both Gene Ontology (GO) Over-representation Test and GO Gene Set Enrichment Test on the list of genes associated with promoters belonging to the resistant cluster and promoters belonging to the D0 loosing and less resistant clusters.
IMPORTANT: since the number of genes is very small, it is more an annotation than a biologically meaningful enrichment.
ID | Description | GeneRatio | BgRatio | pvalue | p.adjust | qvalue | geneID | Count | |
---|---|---|---|---|---|---|---|---|---|
GO:2000241 | GO:2000241 | regulation of reproductive process | 3/13 | 204/22853 | 0.0001877 | 0.0356558 | 0.0198527 | Dazl/Tpst2/Runx1 | 3 |
ID | Description | GeneRatio | BgRatio | pvalue | p.adjust | qvalue | geneID | Count | |
---|---|---|---|---|---|---|---|---|---|
GO:0010528 | GO:0010528 | regulation of transposition | 2/8 | 29/22853 | 0.0000433 | 0.0019682 | 0.0007797 | Ddx4/Btbd18 | 2 |
GO:0010529 | GO:0010529 | negative regulation of transposition | 2/8 | 29/22853 | 0.0000433 | 0.0019682 | 0.0007797 | Ddx4/Btbd18 | 2 |
GO:0007140 | GO:0007140 | male meiotic nuclear division | 2/8 | 56/22853 | 0.0001636 | 0.0050710 | 0.0020089 | Ddx4/Btbd18 | 2 |
GO:0061982 | GO:0061982 | meiosis I cell cycle process | 2/8 | 135/22853 | 0.0009476 | 0.0220309 | 0.0087276 | Ddx4/Btbd18 | 2 |
GO:0140013 | GO:0140013 | meiotic nuclear division | 2/8 | 199/22853 | 0.0020409 | 0.0421780 | 0.0167089 | Ddx4/Btbd18 | 2 |
GO:1903046 | GO:1903046 | meiotic cell cycle process | 2/8 | 219/22853 | 0.0024642 | 0.0446206 | 0.0176765 | Ddx4/Btbd18 | 2 |
GO:0051321 | GO:0051321 | meiotic cell cycle | 2/8 | 318/22853 | 0.0051134 | 0.0446206 | 0.0176765 | Ddx4/Btbd18 | 2 |
GO:0021819 | GO:0021819 | layer formation in cerebral cortex | 1/8 | 15/22853 | 0.0052397 | 0.0446206 | 0.0176765 | Lrp8 | 1 |
GO:0051962 | GO:0051962 | positive regulation of nervous system development | 2/8 | 358/22853 | 0.0064376 | 0.0446206 | 0.0176765 | Mir219a-2/Lrp8 | 2 |
GO:0043046 | GO:0043046 | DNA methylation involved in gamete generation | 1/8 | 19/22853 | 0.0066329 | 0.0446206 | 0.0176765 | Ddx4 | 1 |
GO:0060749 | GO:0060749 | mammary gland alveolus development | 1/8 | 22/22853 | 0.0076767 | 0.0446206 | 0.0176765 | Hoxa5 | 1 |
GO:0061377 | GO:0061377 | mammary gland lobule development | 1/8 | 22/22853 | 0.0076767 | 0.0446206 | 0.0176765 | Hoxa5 | 1 |
GO:1900006 | GO:1900006 | positive regulation of dendrite development | 1/8 | 24/22853 | 0.0083720 | 0.0462325 | 0.0183151 | Lrp8 | 1 |
GO:0022412 | GO:0022412 | cellular process involved in reproduction in multicellular organism | 2/8 | 478/22853 | 0.0112449 | 0.0462325 | 0.0183151 | Ddx4/Btbd18 | 2 |
ID | Description | GeneRatio | BgRatio | pvalue | p.adjust | qvalue | geneID | Count | |
---|---|---|---|---|---|---|---|---|---|
GO:0098598 | GO:0098598 | learned vocalization behavior or vocal learning | 2/9 | 10/22853 | 0.0000062 | 0.0016237 | 0.0007449 | Shank3/Nrxn1 | 2 |
GO:0031223 | GO:0031223 | auditory behavior | 2/9 | 13/22853 | 0.0000107 | 0.0016237 | 0.0007449 | Shank3/Nrxn1 | 2 |
GO:0050795 | GO:0050795 | regulation of behavior | 2/9 | 100/22853 | 0.0006689 | 0.0092658 | 0.0042508 | Shank3/Nrxn1 | 2 |
GO:0022412 | GO:0022412 | cellular process involved in reproduction in multicellular organism | 3/9 | 478/22853 | 0.0006954 | 0.0092882 | 0.0042611 | Meioc/Dmrtb1/Fkbp6 | 3 |
GO:0010469 | GO:0010469 | regulation of signaling receptor activity | 2/9 | 147/22853 | 0.0014363 | 0.0167864 | 0.0077011 | Shank3/Nrxn1 | 2 |
GO:0009612 | GO:0009612 | response to mechanical stimulus | 2/9 | 178/22853 | 0.0020951 | 0.0233890 | 0.0107301 | Shank3/Nrxn1 | 2 |
GO:0007612 | GO:0007612 | learning | 2/9 | 182/22853 | 0.0021888 | 0.0233890 | 0.0107301 | Shank3/Nrxn1 | 2 |
GO:0021684 | GO:0021684 | cerebellar granular layer formation | 1/9 | 10/22853 | 0.0039320 | 0.0288348 | 0.0132285 | Nrxn1 | 1 |
GO:0090129 | GO:0090129 | positive regulation of synapse maturation | 1/9 | 10/22853 | 0.0039320 | 0.0288348 | 0.0132285 | Nrxn1 | 1 |
GO:0021683 | GO:0021683 | cerebellar granular layer morphogenesis | 1/9 | 12/22853 | 0.0047168 | 0.0301874 | 0.0138490 | Nrxn1 | 1 |
GO:1903365 | GO:1903365 | regulation of fear response | 1/9 | 12/22853 | 0.0047168 | 0.0301874 | 0.0138490 | Shank3 | 1 |
GO:2000822 | GO:2000822 | regulation of behavioral fear response | 1/9 | 12/22853 | 0.0047168 | 0.0301874 | 0.0138490 | Shank3 | 1 |
GO:0050803 | GO:0050803 | regulation of synapse structure or activity | 2/9 | 271/22853 | 0.0047740 | 0.0301874 | 0.0138490 | Shank3/Nrxn1 | 2 |
GO:0032409 | GO:0032409 | regulation of transporter activity | 2/9 | 297/22853 | 0.0057054 | 0.0323308 | 0.0148323 | Shank3/Nrxn1 | 2 |
GO:0021681 | GO:0021681 | cerebellar granular layer development | 1/9 | 16/22853 | 0.0062846 | 0.0335778 | 0.0154044 | Nrxn1 | 1 |
GO:0090128 | GO:0090128 | regulation of synapse maturation | 1/9 | 16/22853 | 0.0062846 | 0.0335778 | 0.0154044 | Nrxn1 | 1 |
GO:0051321 | GO:0051321 | meiotic cell cycle | 2/9 | 318/22853 | 0.0065142 | 0.0336108 | 0.0154195 | Meioc/Fkbp6 | 2 |
GO:0050806 | GO:0050806 | positive regulation of synaptic transmission | 2/9 | 329/22853 | 0.0069577 | 0.0336108 | 0.0154195 | Shank3/Nrxn1 | 2 |
GO:0021756 | GO:0021756 | striatum development | 1/9 | 19/22853 | 0.0074591 | 0.0336108 | 0.0154195 | Shank3 | 1 |
GO:0035641 | GO:0035641 | locomotory exploration behavior | 1/9 | 19/22853 | 0.0074591 | 0.0336108 | 0.0154195 | Shank3 | 1 |
GO:0043046 | GO:0043046 | DNA methylation involved in gamete generation | 1/9 | 19/22853 | 0.0074591 | 0.0336108 | 0.0154195 | Fkbp6 | 1 |
GO:0007281 | GO:0007281 | germ cell development | 2/9 | 357/22853 | 0.0081475 | 0.0352145 | 0.0161553 | Meioc/Dmrtb1 | 2 |
GO:0051962 | GO:0051962 | positive regulation of nervous system development | 2/9 | 358/22853 | 0.0081916 | 0.0352145 | 0.0161553 | Shank3/Nrxn1 | 2 |
GO:0021544 | GO:0021544 | subpallium development | 1/9 | 25/22853 | 0.0098043 | 0.0402945 | 0.0184858 | Shank3 | 1 |
GO:0045070 | GO:0045070 | positive regulation of viral genome replication | 1/9 | 28/22853 | 0.0109750 | 0.0432069 | 0.0198219 | Fkbp6 | 1 |
GO:0051446 | GO:0051446 | positive regulation of meiotic cell cycle | 1/9 | 30/22853 | 0.0117548 | 0.0448603 | 0.0205804 | Meioc | 1 |
GO:0060074 | GO:0060074 | synapse maturation | 1/9 | 31/22853 | 0.0121445 | 0.0454206 | 0.0208375 | Nrxn1 | 1 |
GO:0021697 | GO:0021697 | cerebellar cortex formation | 1/9 | 32/22853 | 0.0125341 | 0.0464134 | 0.0212930 | Nrxn1 | 1 |
GO:0042391 | GO:0042391 | regulation of membrane potential | 2/9 | 466/22853 | 0.0135847 | 0.0474495 | 0.0217683 | Shank3/Nrxn1 | 2 |
GO:0006884 | GO:0006884 | cell volume homeostasis | 1/9 | 35/22853 | 0.0137020 | 0.0474495 | 0.0217683 | Shank3 | 1 |
GO:0001964 | GO:0001964 | startle response | 1/9 | 37/22853 | 0.0144799 | 0.0487796 | 0.0223785 | Nrxn1 | 1 |
sessionInfo()
## R version 4.1.3 (2022-03-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=it_IT.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=it_IT.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=it_IT.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=it_IT.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] grid stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] org.Mm.eg.db_3.14.0 AnnotationDbi_1.56.2 Biobase_2.54.0
## [4] clusterProfiler_4.2.2 knitr_1.40 biomaRt_2.50.3
## [7] gridExtra_2.3 pheatmap_1.0.12 genomation_1.26.0
## [10] ggpubr_0.4.0 purrr_0.3.5 rtracklayer_1.54.0
## [13] data.table_1.14.4 reshape2_1.4.4 ggplot2_3.3.6
## [16] methylKit_1.20.0 GenomicRanges_1.46.1 GenomeInfoDb_1.30.1
## [19] IRanges_2.28.0 S4Vectors_0.32.4 BiocGenerics_0.40.0
##
## loaded via a namespace (and not attached):
## [1] utf8_1.2.2 R.utils_2.12.0
## [3] tidyselect_1.2.0 RSQLite_2.2.18
## [5] BiocParallel_1.28.3 scatterpie_0.1.8
## [7] munsell_0.5.0 codetools_0.2-18
## [9] withr_2.5.0 colorspace_2.0-3
## [11] GOSemSim_2.20.0 filelock_1.0.2
## [13] rstudioapi_0.14 ggsignif_0.6.4
## [15] fastseg_1.40.0 DOSE_3.20.1
## [17] MatrixGenerics_1.6.0 bbmle_1.0.25
## [19] GenomeInfoDbData_1.2.7 polyclip_1.10-4
## [21] seqPattern_1.26.0 bit64_4.0.5
## [23] farver_2.1.1 downloader_0.4
## [25] coda_0.19-4 vctrs_0.5.0
## [27] treeio_1.18.1 generics_0.1.3
## [29] xfun_0.34 BiocFileCache_2.2.1
## [31] R6_2.5.1 graphlayouts_0.8.3
## [33] bitops_1.0-7 cachem_1.0.6
## [35] fgsea_1.20.0 gridGraphics_0.5-1
## [37] DelayedArray_0.20.0 assertthat_0.2.1
## [39] BiocIO_1.4.0 scales_1.2.1
## [41] ggraph_2.1.0 enrichplot_1.14.2
## [43] gtable_0.3.1 tidygraph_1.2.2
## [45] rlang_1.0.6 splines_4.1.3
## [47] rstatix_0.7.0 lazyeval_0.2.2
## [49] impute_1.68.0 broom_1.0.1
## [51] yaml_2.3.6 abind_1.4-5
## [53] backports_1.4.1 qvalue_2.26.0
## [55] tools_4.1.3 ggplotify_0.1.0
## [57] gridBase_0.4-7 ellipsis_0.3.2
## [59] jquerylib_0.1.4 RColorBrewer_1.1-3
## [61] Rcpp_1.0.9 plyr_1.8.7
## [63] progress_1.2.2 zlibbioc_1.40.0
## [65] RCurl_1.98-1.9 prettyunits_1.1.1
## [67] viridis_0.6.2 SummarizedExperiment_1.24.0
## [69] ggrepel_0.9.1 magrittr_2.0.3
## [71] DO.db_2.9 mvtnorm_1.1-3
## [73] matrixStats_0.62.0 hms_1.1.2
## [75] patchwork_1.1.2 evaluate_0.17
## [77] XML_3.99-0.11 emdbook_1.3.12
## [79] mclust_5.4.10 compiler_4.1.3
## [81] bdsmatrix_1.3-6 tibble_3.1.8
## [83] shadowtext_0.1.2 KernSmooth_2.23-20
## [85] crayon_1.5.2 R.oo_1.25.0
## [87] htmltools_0.5.3 ggfun_0.0.7
## [89] tzdb_0.3.0 tidyr_1.2.1
## [91] aplot_0.1.8 DBI_1.1.3
## [93] tweenr_2.0.2 dbplyr_2.2.1
## [95] MASS_7.3-55 rappdirs_0.3.3
## [97] Matrix_1.5-1 car_3.1-1
## [99] readr_2.1.3 cli_3.4.1
## [101] R.methodsS3_1.8.2 parallel_4.1.3
## [103] igraph_1.3.5 pkgconfig_2.0.3
## [105] prettydoc_0.4.1 GenomicAlignments_1.30.0
## [107] numDeriv_2016.8-1.1 xml2_1.3.3
## [109] ggtree_3.2.1 bslib_0.4.0
## [111] XVector_0.34.0 yulab.utils_0.0.5
## [113] stringr_1.4.1 digest_0.6.30
## [115] Biostrings_2.62.0 rmarkdown_2.17
## [117] fastmatch_1.1-3 tidytree_0.4.1
## [119] restfulr_0.0.15 curl_4.3.3
## [121] Rsamtools_2.10.0 gtools_3.9.3
## [123] rjson_0.2.21 lifecycle_1.0.3
## [125] nlme_3.1-155 jsonlite_1.8.3
## [127] carData_3.0-5 viridisLite_0.4.1
## [129] limma_3.50.3 BSgenome_1.62.0
## [131] fansi_1.0.3 pillar_1.8.1
## [133] lattice_0.20-45 KEGGREST_1.34.0
## [135] fastmap_1.1.0 httr_1.4.4
## [137] plotrix_3.8-2 GO.db_3.14.0
## [139] glue_1.6.2 png_0.1-7
## [141] bit_4.0.4 ggforce_0.4.1
## [143] stringi_1.7.8 sass_0.4.2
## [145] blob_1.2.3 memoise_2.0.1
## [147] dplyr_1.0.10 ape_5.6-2