1 Reading the data
The input files are:
- the output tables made by featureCounts (run in rule ‘count_on_TE’ of the Snakefile), which contain read counts of all repetitive elements (i.e. all repNames) for all samples
- the repetitive elements annotation ../../data/annotations/RepeatMasker_RepeatLibrary20140131_mm10.noGenes.noSimple.bed (converted to SAF format in rule of Snakefile) used for featureCounts command, that is used to retrieve repFamily and repClass for each repName. The annotation I chose is the most updated library to Nov2020 of rmsk, from which I removed ‘Simple repeats’ and ‘Low complexity regions’
- a table with STAR input reads per sample
- the samples’ table, containing info on experimental design:
2 Analysis at family level
2.1 FPKM for each family of Repetitive Elements
For each family of Repetitive Elements (in case of elements with no repFamily name or repFamilies belonging to more than one repClass I use repClass) I compute FPKM values as follows: for each sample:
- I compute the sum of counts for all elements belonging to that repFamily
- I divide this sum by the total number of STAR input reads (/2 because with featureCounts I quantify fragments) for that sample and multiply by 10⁶
- I divide this number by the total sum of lengths (in Kb) of the elements belonging to that repFamily –> FPKM
- I subtract from each FPKM the total FPKM of all transposons belonging to the DNA repClass.
2.2 FPKM of specific families per condition
The values are the FPKM corrected for DNA transposons.
2.2.1 testing the difference in the mean FPKM per condition
I use t-test on the FPKM values corrected for DNA transposons.
[[1]] [1] “L1 - F pval = 0.125541125541126 - M pval = 0.536796536796537”
[[2]] [1] “ERVK - F pval = 0.662337662337662 - M pval = 0.125541125541126”
[[3]] [1] “ERVL - F pval = 0.051948051948052 - M pval = 0.0303030303030303”
[[4]] [1] “Satellite - F pval = 0.536796536796537 - M pval = 0.00432900432900433”
3 DE-Seq analysis of RNA transposons
I include the FPKM of DNA transposons as confounding factor in DESeq2 formula.
Before running the Differential Expression analysis, the data are pre-filtered to remove all repetitive elements with < 10 reads among all samples.
3.1 Differentially expressed TEs
## log2 fold change (MMSE): condition M_HEMI vs M_WT
## Wald test p-value: condition M_HEMI vs M_WT
## DataFrame with 4 rows and 5 columns
## baseMean log2FoldChange lfcSE pvalue padj
## <numeric> <numeric> <numeric> <numeric> <numeric>
## IAPLTR2_Mm 179.523 0.316178 0.1519884 0.000104618 0.0101654
## MERVL-int 527.511 0.258159 0.0914886 0.000145426 0.0110614
## L1MdTf_II 393.994 0.286710 0.1144614 0.000151786 0.0110614
## L1MdTf_I 401.114 0.280728 0.1140062 0.000176763 0.0114503
3.2 MA-plots
- The threshold used for a dot to be coloured in the MA-plots is p-value adjusted < 0.05.
- Transposable elements whose mean expression > 20 and abs(log2FoldChange) > 0.2 are labeled.
3.3 FPKM of differentially expressed TEs
I also include two non-DE TEs, one expressed and one non-expressed in murine trophoblast cells (Ext.Fig3g in Weigert et al. 2023 ).
sessionInfo()
## R version 4.1.3 (2022-03-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.6 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=it_IT.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=it_IT.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=it_IT.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=it_IT.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] ggpubr_0.5.0 ggrepel_0.9.2
## [3] gridExtra_2.3 DESeq2_1.34.0
## [5] SummarizedExperiment_1.24.0 Biobase_2.54.0
## [7] MatrixGenerics_1.6.0 matrixStats_0.62.0
## [9] GenomicRanges_1.46.1 GenomeInfoDb_1.30.1
## [11] IRanges_2.28.0 S4Vectors_0.32.4
## [13] BiocGenerics_0.40.0 pheatmap_1.0.12
## [15] data.table_1.14.6 ggplot2_3.4.0
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-7 bit64_4.0.5 RColorBrewer_1.1-3
## [4] httr_1.4.4 tools_4.1.3 backports_1.4.1
## [7] bslib_0.4.1 irlba_2.3.5.1 utf8_1.2.2
## [10] R6_2.5.1 DBI_1.1.3 colorspace_2.0-3
## [13] withr_2.5.0 tidyselect_1.2.0 bit_4.0.5
## [16] compiler_4.1.3 cli_3.4.1 DelayedArray_0.20.0
## [19] labeling_0.4.2 sass_0.4.2 scales_1.2.1
## [22] SQUAREM_2021.1 genefilter_1.76.0 mixsqp_0.3-48
## [25] stringr_1.4.1 digest_0.6.30 rmarkdown_2.18
## [28] XVector_0.34.0 pkgconfig_2.0.3 htmltools_0.5.3
## [31] invgamma_1.1 highr_0.9 fastmap_1.1.0
## [34] rlang_1.0.6 rstudioapi_0.14 RSQLite_2.2.18
## [37] prettydoc_0.4.1 farver_2.1.1 jquerylib_0.1.4
## [40] generics_0.1.3 jsonlite_1.8.3 BiocParallel_1.28.3
## [43] car_3.1-1 dplyr_1.0.10 RCurl_1.98-1.9
## [46] magrittr_2.0.3 GenomeInfoDbData_1.2.7 Matrix_1.5-3
## [49] Rcpp_1.0.9 munsell_0.5.0 fansi_1.0.3
## [52] abind_1.4-5 lifecycle_1.0.3 stringi_1.7.8
## [55] yaml_2.3.6 carData_3.0-5 zlibbioc_1.40.0
## [58] grid_4.1.3 blob_1.2.3 parallel_4.1.3
## [61] crayon_1.5.2 lattice_0.20-45 Biostrings_2.62.0
## [64] splines_4.1.3 annotate_1.72.0 KEGGREST_1.34.0
## [67] locfit_1.5-9.6 knitr_1.40 pillar_1.8.1
## [70] ggsignif_0.6.4 codetools_0.2-18 geneplotter_1.72.0
## [73] XML_3.99-0.12 glue_1.6.2 evaluate_0.18
## [76] png_0.1-7 vctrs_0.5.1 gtable_0.3.1
## [79] purrr_0.3.5 tidyr_1.2.1 assertthat_0.2.1
## [82] ashr_2.2-54 cachem_1.0.6 xfun_0.35
## [85] xtable_1.8-4 broom_1.0.1 rstatix_0.7.1
## [88] survival_3.2-13 truncnorm_1.0-8 tibble_3.1.8
## [91] AnnotationDbi_1.56.2 memoise_2.0.1