Acknowledgement: this report has been inspired by
materials used in the Advanced Bioinformatics course taught
by Igor Ruiz de los Mozos at UPNA.
After completing the nf-core/rnaseq pipeline using
Salmon pseudoalignment for the analysis of 44 colorectal samples from
SRA study SRP479528 will obtain the
salmon.merged.gene.SummarizedExperiment.rds file. This file
contains gene-level quantification matrices (counts, TPMs, effective
lengths) plus sample metadata and annotations produced from the merged
Salmon quantifications.
Let’s load the
salmon.merged.gene.SummarizedExperiment.rds file. The
content of the file has a SummarizedExperiment
representation, which is an R/Bioconductor container that stores assay
matrices (e.g., counts), feature metadata, and sample metadata in one
coherent object.
# Specify a working directory
directory <- "/Volumes/DataWinBackup/00_MASTER UOC BIOINFORMATICA - TEMPORAL/TFM-UOC"
setwd(directory)
list.files()
[1] "dds_DGE_42.rds"
[2] "DE_tumor_vs_normal_42.csv"
[3] "DE_tumor_vs_normal_44.csv"
[4] "Deseq2_42eocrc"
[5] "Deseq2_44crc copy.Rmd"
[6] "Deseq2_44crc_cache"
[7] "Deseq2_44crc_files"
[8] "Deseq2_44crc-copy_cache"
[9] "Deseq2_44crc-copy.Rmd"
[10] "Deseq2_44crc.html"
[11] "Deseq2_44crc.Rmd"
[12] "DESeq2_sig_genes_normal_vs_cancer_44_locrc_annotated.tab"
[13] "DESeq2_sig_genes_normal_vs_cancer_44_locrc.tab"
[14] "DTU44CRC"
[15] "ego_top15_GO_BP.csv"
[16] "enrichment_map_30genes.pdf"
[17] "figures"
[18] "gene_concept_network_plot_30genes.pdf"
[19] "gene.SummarizedExperiment.metadata_44.rds"
[20] "gene.SummarizedExperiment.metadata.rds"
[21] "images"
[22] "RNAstructure"
[23] "salmon.merged.gene.SummarizedExperiment_gencode.rds"
[24] "salmon.merged.gene.SummarizedExperiment.rds"
[25] "salmon.merged.transcript.SummarizedExperiment.rds"
[26] "sraruntable_srp479528_crc44_metadata.csv"
[27] "sraruntable_srp479528_crc44.csv"
[28] "tables"
[29] "TFM_schematic_thesis.drawio"
[30] "TFM_schematic.drawio"
[31] "TFM_schematic.png"
[32] "tumor_vs_normal_42_dds.csv"
# Read 'se' object obtained from nfcore/rnaseq
se <- readRDS("salmon.merged.gene.SummarizedExperiment_gencode.rds")
# Display se object
print(se)
class: SummarizedExperiment
dim: 78428 44
metadata(0):
assays(5): counts counts_length_scaled counts_scaled lengths tpm
rownames(78428): ENSG00000000003.17 ENSG00000000005.6 ...
ENSG00000310592.1 ENSG00000310593.1
rowData names(3): transcript_id gene_id gene_name
colnames(44): srr27320655 srr27320656 ... srr27320697 srr27320698
colData names(4): sample fastq_1 fastq_2 strandedness
We focus on the counts assay matrix and check
information on columns and rows.
# Check the first elements of the counts matrix
head(assay(se, "counts")[, 1:2], 5)
srr27320655 srr27320656
ENSG00000000003.17 1397.945 6337.771
ENSG00000000005.6 6.000 79.000
ENSG00000000419.15 633.000 4413.999
ENSG00000000457.15 434.062 741.001
ENSG00000000460.18 88.000 662.001
# Collect sample information
head(colData(se)[, 2:3], 5)
DataFrame with 5 rows and 2 columns
fastq_1 fastq_2
<character> <character>
srr27320655 data/reads/SRR273206.. data/reads/SRR273206..
srr27320656 data/reads/SRR273206.. data/reads/SRR273206..
srr27320657 data/reads/SRR273206.. data/reads/SRR273206..
srr27320658 data/reads/SRR273206.. data/reads/SRR273206..
srr27320659 data/reads/SRR273206.. data/reads/SRR273206..
# Collect gene information
head(rowData(se), 3)
DataFrame with 3 rows and 3 columns
transcript_id gene_id gene_name
<character> <character> <character>
ENSG00000000003.17 ENST00000373020.9,EN.. ENSG00000000003.17 TSPAN6
ENSG00000000005.6 ENST00000373031.5,EN.. ENSG00000000005.6 TNMD
ENSG00000000419.15 ENST00000858753.1,EN.. ENSG00000000419.15 DPM1
# Collect genomic ranges
head(rowRanges(se), 3)
NULL
Below, we can obtain the name of the 44 samples.
# Display colNames
colData(se)$sample
[1] "srr27320655" "srr27320656" "srr27320657" "srr27320658" "srr27320659"
[6] "srr27320660" "srr27320661" "srr27320662" "srr27320663" "srr27320664"
[11] "srr27320665" "srr27320666" "srr27320667" "srr27320668" "srr27320669"
[16] "srr27320670" "srr27320671" "srr27320672" "srr27320673" "srr27320674"
[21] "srr27320675" "srr27320676" "srr27320677" "srr27320678" "srr27320679"
[26] "srr27320680" "srr27320681" "srr27320682" "srr27320683" "srr27320684"
[31] "srr27320685" "srr27320686" "srr27320687" "srr27320688" "srr27320689"
[36] "srr27320690" "srr27320691" "srr27320692" "srr27320693" "srr27320694"
[41] "srr27320695" "srr27320696" "srr27320697" "srr27320698"
Since the colData field in the se object is
limited to data from the original FASTQ files, it would be useful to
supplement it with metadata from the SRA study, such as patient age and
tissue type.
In this section we describe the steps to add metadata from the SRA
study into the colData field of the
SummarizedExperiment object obtained from the
nf-core/rnaseq pipeline:
SraRunTable.csv file in Google
Sheets.csv file and rename it
sraruntable_srp479528_crc44.csvHere are the first lines of the
sraruntable_srp479528_crc44.csv file.
Run,age_at_surgery,tissue,tumor_stage,Sample Name,SRA Study,Assay Type,AvgSpotLen,Bases,BioProject,BioSample,Bytes,Center Name,Collection_Date,Consent,DATASTORE filetype,DATASTORE provider,DATASTORE region,Experiment,geo_loc_name_country,geo_loc_name_country_continent,geo_loc_name,Instrument,Library Name,LibraryLayout,LibrarySelection,LibrarySource,Organism,Platform,ReleaseDate,create_date,version,Column 1,source_name
SRR27320655,88.75,Rectum,NA,GSM7989032,SRP479528,RNA-Seq,117,2304341839,PRJNA1055547,SAMN39058850,870996594,"BIOCHEMISTRY, COLORECTAL SURGERY, PENN STATE COLLEGE OF MEDICINE",missing,public,"fastq,run.zq,sra","ncbi,gs,s3","s3.us-east-1,ncbi.public,gs.us-east1",SRX22997924,uncalculated,uncalculated,missing,Illumina NovaSeq 6000,GSM7989032,PAIRED,cDNA,TRANSCRIPTOMIC,Homo sapiens,ILLUMINA,2024-04-11T00:00:00Z,2023-12-21T21:33:00Z,1,,Rectum
As follows, a metadata .csv file was created using
information from BioProject
PRJNA1055547. Here are the first lines of the created
sraruntable_srp479528_crc44_metadata.csv file.
key,values
abstract,"The incidence of colorectal cancer (CRC) has been steadily increasing in younger individuals over the past several decades for reasons that are incompletely defined..."
Accession,PRJNA1055547; GEO: GSE251845
Data Type,Transcriptome or Gene expression
Let’s add metadata contained in the previously created
sraruntable_srp479528_crc44.csv and
sraruntable_srp479528_crc44_metadata.csv files into the
se object.
After loading the .csv files, here we can see the samples metadata
represented as a dataframe named df_coldata:
# Load files
df_coldata <- read.csv("sraruntable_srp479528_crc44.csv", header = TRUE, row.names = 1,
check.names = FALSE)
# Display first rows
head(df_coldata[, 1:5], n = 3)
age_at_surgery tissue tumor_stage Sample Name SRA Study
SRR27320655 88.75 Rectum NA GSM7989032 SRP479528
SRR27320656 88.75 Rectum 1 GSM7989031 SRP479528
SRR27320657 68.00 Hepatic flexure NA GSM7989030 SRP479528
and below is the metadata corresponding to the SRA study represented
as a dataframe named df_metadata with two columns:
key and values.
# Load files
df_metadata <- read.csv("sraruntable_srp479528_crc44_metadata.csv", header = TRUE,
check.names = FALSE)
# Display first rows
head(df_metadata, n = 3)
key
1 abstract
2 Accession
3 Data Type
values
1 The incidence of colorectal cancer (CRC) has been steadily increasing in younger individuals over the past several decades for reasons that are incompletely defined. Identifying differences in gene expression profiles, or transcriptomes, in early-onset colorectal cancer (EOCRC, < 50 years old) patients versus later-onset colorectal cancer (LOCRC, > 50 years old) patients is one approach to understanding molecular and genetic features that distinguish EOCRC. Methods: We performed RNA-sequencing (RNA-seq) to characterize the transcriptomes of patient-matched tumors and adjacent, uninvolved (normal) colonic segments from EOCRC (n=21) and LOCRC (n=22) patients. The EOCRC and LOCRC cohorts were matched for demographic and clinical characteristics. We used The Cancer Genome Atlas Colon Adenocarcinoma (TCGA-COAD) database for validation. We used a series of computational and bioinformatic tools to identify EOCRC-specific differentially expressed genes, molecular pathways, predicted cell populations, differential gene splicing events, and predicted neoantigens. Results: We identified an eight-gene signature in EOCRC comprised of ALDOB, FBXL16, IL1RN, MSLN, RAC3, SLC38A11, WBSCR27 and WNT11, from which we developed a score predictive of overall CRC patient survival. On the entire set of genes identified in normal tissues and tumors, cell type deconvolution analysis predicted a differential abundance of immune and non-immune populations in EOCRC versus LOCRC. Gene set enrichment analysis identified increased expression of splicing machinery in EOCRC. We further found differences in alternative splicing (AS) events, including one within the long non-coding RNA, HOTAIRM1. Additional analysis of AS found seven events specific to EOCRC that encode potential neoantigens. Conclusion: Our transcriptome analyses identified genetic and molecular features specific to EOCRC which may inform future screening, development of prognostic indicators, and novel drug targets. Overall design: Gene expression profiles of 22 surgically resected tumors and patient-matched adjacent colonic segments from colorectal cancer patients were generated with RNA-sequencing.
2 PRJNA1055547; GEO: GSE251845
3 Transcriptome or Gene expression
The lower_underscore_nospaces() function will help us
standardize the format of character-type data.
# Define the lower_underscore_nospaces() function
lower_underscore_nospaces <- function(df) {
non_numeric_col <- names(df)[!sapply(df, is.numeric)]
for (col in non_numeric_col) {
df[[col]] <- tolower(df[[col]]) # lowercase
df[[col]] <- gsub("[^a-z0-9]+", "_", df[[col]]) # replace non-alphanumeric with underscores
df[[col]] <- gsub("_$", "", df[[col]]) # remove trailing underscores
}
if (!is.null(names(df))) {
names(df) <- tolower(names(df))
names(df) <- gsub("[^a-z0-9]+", "_", names(df))
names(df) <- gsub("_$", "", names(df))
}
if (!is.null(rownames(df))) {
rownames(df) <- tolower(rownames(df))
rownames(df) <- gsub("[^a-z0-9]+", "_", rownames(df))
rownames(df) <- gsub("_$", "", rownames(df))
}
return(df)
}
# Apply lower_underscore_nospaces to all datasets
df_coldata <- lower_underscore_nospaces(df_coldata)
# Display last changes on dataframes
head(df_coldata[, 1:5], n = 3)
age_at_surgery tissue tumor_stage sample_name sra_study
srr27320655 88.75 rectum NA gsm7989032 srp479528
srr27320656 88.75 rectum 1 gsm7989031 srp479528
srr27320657 68.00 hepatic_flexure NA gsm7989030 srp479528
Since the metadata(se) field in a
SummarizedExperiment object stores data as a named list,
let’s convert df_metadata to this format.
# Convert metadata dataframe into a named list
list_metadata <- as.list(df_metadata$values)
names(list_metadata) <- df_metadata$key
list_metadata
$abstract
[1] "The incidence of colorectal cancer (CRC) has been steadily increasing in younger individuals over the past several decades for reasons that are incompletely defined. Identifying differences in gene expression profiles, or transcriptomes, in early-onset colorectal cancer (EOCRC, < 50 years old) patients versus later-onset colorectal cancer (LOCRC, > 50 years old) patients is one approach to understanding molecular and genetic features that distinguish EOCRC. Methods: We performed RNA-sequencing (RNA-seq) to characterize the transcriptomes of patient-matched tumors and adjacent, uninvolved (normal) colonic segments from EOCRC (n=21) and LOCRC (n=22) patients. The EOCRC and LOCRC cohorts were matched for demographic and clinical characteristics. We used The Cancer Genome Atlas Colon Adenocarcinoma (TCGA-COAD) database for validation. We used a series of computational and bioinformatic tools to identify EOCRC-specific differentially expressed genes, molecular pathways, predicted cell populations, differential gene splicing events, and predicted neoantigens. Results: We identified an eight-gene signature in EOCRC comprised of ALDOB, FBXL16, IL1RN, MSLN, RAC3, SLC38A11, WBSCR27 and WNT11, from which we developed a score predictive of overall CRC patient survival. On the entire set of genes identified in normal tissues and tumors, cell type deconvolution analysis predicted a differential abundance of immune and non-immune populations in EOCRC versus LOCRC. Gene set enrichment analysis identified increased expression of splicing machinery in EOCRC. We further found differences in alternative splicing (AS) events, including one within the long non-coding RNA, HOTAIRM1. Additional analysis of AS found seven events specific to EOCRC that encode potential neoantigens. Conclusion: Our transcriptome analyses identified genetic and molecular features specific to EOCRC which may inform future screening, development of prognostic indicators, and novel drug targets. Overall design: Gene expression profiles of 22 surgically resected tumors and patient-matched adjacent colonic segments from colorectal cancer patients were generated with RNA-sequencing."
$Accession
[1] "PRJNA1055547; GEO: GSE251845"
$`Data Type`
[1] "Transcriptome or Gene expression"
$Scope
[1] "Multiisolate"
$Organism
[1] "Homo sapiens[Taxonomy ID: 9606]Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae; Homo; Homo sapiens"
$Publications
[1] "Marx OM et al., \"Identification of differentially expressed genes and splicing events in early-onset colorectal cancer.\", Front Oncol, 2024;14:1365762"
$Grants
[1] "\"Wnt/beta-catenin signaling in early-onset colorectal cancer\" (Grant ID R03 CA279861, National Cancer Institute)"
$Submission
[1] "Registration date: 21-Dec-2023\nBiochemistry, Colorectal Surgery, Penn State College of Medicine"
$Relevance
[1] "Medical"
Now let’s convert R regular dataframes into Bioconductor S4 DataFrame objects.
# Convert regular R dataframes to Bioconductor S4 DataFrame objects
colData <- DataFrame(df_coldata)
# Visualize data converted to S4 DataFrame S4 type
print(colData)
DataFrame with 44 rows and 32 columns
age_at_surgery tissue tumor_stage sample_name sra_study
<numeric> <character> <integer> <character> <character>
assay_type avgspotlen bases bioproject biosample
<character> <integer> <numeric> <character> <character>
bytes center_name collection_date consent
<integer> <character> <character> <character>
datastore_filetype datastore_provider datastore_region
<character> <character> <character>
experiment geo_loc_name_country geo_loc_name_country_continent
<character> <character> <character>
geo_loc_name instrument library_name librarylayout
<character> <character> <character> <character>
libraryselection librarysource organism platform
<character> <character> <character> <character>
releasedate create_date version source_name
<character> <character> <integer> <character>
[ reached 'max' / getOption("max.print") -- omitted 11 rows ]
and verify that their dimensions match.
# Check dimmensions of each dataset
cat("Dimensions of counts assay:", dim(assay(se, "counts")), "\n", "Dimensions of colData:",
dim(colData), "\n", "Dimensions of metadata:", length(list_metadata), "\n")
Dimensions of counts assay: 78428 44
Dimensions of colData: 44 32
Dimensions of metadata: 9
Finally, let’s update the SummarizedExperiment object to
include information on colData and metadata
fields.
# Update SE object
colData(se) <- colData
# Add metadata
metadata(se) <- list_metadata
se
class: SummarizedExperiment
dim: 78428 44
metadata(9): abstract Accession ... Submission Relevance
assays(5): counts counts_length_scaled counts_scaled lengths tpm
rownames(78428): ENSG00000000003.17 ENSG00000000005.6 ...
ENSG00000310592.1 ENSG00000310593.1
rowData names(3): transcript_id gene_id gene_name
colnames(44): srr27320655 srr27320656 ... srr27320697 srr27320698
colData names(32): age_at_surgery tissue ... version source_name
We’re on the right track. To move on, let’s include a column named
condition, which will convert the tumor_stage
column of colData into a binary factor variable with levels
normal and tumor. A good practice in R is that
the first level be the reference one (e.g. normal, control, or untreated
samples). This will come handy when conducting the differential
expression analysis with the volcano plots.
# Create new colData column named condition
colData(se)$condition <- ifelse(is.na(colData(se)$tumor_stage), "normal_locrc", "tumor_locrc")
colData(se)$condition <- as.factor(colData(se)$condition)
colData(se)$condition <- relevel(colData(se)$condition, ref = "normal_locrc")
colData(se)$condition
[1] normal_locrc tumor_locrc normal_locrc tumor_locrc normal_locrc
[6] tumor_locrc normal_locrc tumor_locrc normal_locrc tumor_locrc
[11] normal_locrc tumor_locrc normal_locrc tumor_locrc normal_locrc
[16] tumor_locrc normal_locrc tumor_locrc normal_locrc tumor_locrc
[21] normal_locrc tumor_locrc normal_locrc tumor_locrc normal_locrc
[26] tumor_locrc normal_locrc tumor_locrc normal_locrc tumor_locrc
[31] normal_locrc tumor_locrc normal_locrc tumor_locrc normal_locrc
[36] tumor_locrc normal_locrc tumor_locrc normal_locrc tumor_locrc
[41] normal_locrc tumor_locrc normal_locrc tumor_locrc
Levels: normal_locrc tumor_locrc
We can now save the se object for reuse, eliminating the
need to rerun this section.
# Save the updated `se` object to a new file
saveRDS(se, file = "gene.SummarizedExperiment.metadata_44.rds")
A DESeqDataSet (dds) object encapsulates all components
needed for DE analysis. It ensures statistical correctness and handles
data integrity by combining:
The first step to convert our se into
DESeqDataSet is to ensure that count data from the assay be
a matrix data type. Currently, assay(se) is dataframe class
containing lists.
# Check class and mode of each assay in se
sapply(assays(se), class)
counts counts_length_scaled counts_scaled
"data.frame" "data.frame" "data.frame"
lengths tpm
"data.frame" "data.frame"
sapply(assays(se), mode)
counts counts_length_scaled counts_scaled
"list" "list" "list"
lengths tpm
"list" "list"
Let’s convert assay(se, "counts") into a numeric matrix
and round count data.
# Extract the counts assay as a matrix
counts_mat <- as.matrix(assay(se, "counts"))
assay(se, "counts") <- round(counts_mat)
# Keep only the 'counts' assay
assays(se) <- assays(se)["counts"]
# Check class and mode of each assay in se
sapply(assays(se), class)
counts
[1,] "matrix"
[2,] "array"
sapply(assays(se), mode)
counts
"numeric"
Now, we’ll remove genes with less than 10 reads on all samples. This
practice will save computational resources when conducting the DE
analysis. It is preferable to run this task before converting from
se to a DESeqDataSet due to memory
constraints.
gene_count_10plus <- rowSums(assay(se)) > 10
se_filtered <- se[gene_count_10plus, ]
se_filtered
class: SummarizedExperiment
dim: 37241 44
metadata(9): abstract Accession ... Submission Relevance
assays(1): counts
rownames(37241): ENSG00000000003.17 ENSG00000000005.6 ...
ENSG00000310592.1 ENSG00000310593.1
rowData names(3): transcript_id gene_id gene_name
colnames(44): srr27320655 srr27320656 ... srr27320697 srr27320698
colData names(33): age_at_surgery tissue ... source_name condition
Finally, we construct DESeqDataSet considering
condition (normal vs. tumor) as part of our experimental
design.
# Create a DESeqDataSet data object
deseq2_raw <- DESeqDataSet(se_filtered, design = ~1)
# Display first lines of DESeqDataSet data object
class(deseq2_raw)
[1] "DESeqDataSet"
attr(,"package")
[1] "DESeq2"
head(counts(deseq2_raw)[, 1:5], n = 3)
srr27320655 srr27320656 srr27320657 srr27320658 srr27320659
ENSG00000000003.17 1398 6338 1824 1224 944
ENSG00000000005.6 6 79 38 3 299
ENSG00000000419.15 633 4414 916 1017 637
In essence, we have the same data but now it is a
DESeqDataSet data object.
In this section, we explore in more detail the content of the
DESeqDataSet data object.
# Display DESeqDataSet data object
deseq2_raw
class: DESeqDataSet
dim: 37241 44
metadata(10): abstract Accession ... Relevance version
assays(1): counts
rownames(37241): ENSG00000000003.17 ENSG00000000005.6 ...
ENSG00000310592.1 ENSG00000310593.1
rowData names(3): transcript_id gene_id gene_name
colnames(44): srr27320655 srr27320656 ... srr27320697 srr27320698
colData names(33): age_at_surgery tissue ... source_name condition
# Show first lines
head(counts(deseq2_raw)[, 1:5], n = 3)
srr27320655 srr27320656 srr27320657 srr27320658 srr27320659
ENSG00000000003.17 1398 6338 1824 1224 944
ENSG00000000005.6 6 79 38 3 299
ENSG00000000419.15 633 4414 916 1017 637
# Validate NAs
X <- counts(deseq2_raw)
X <- as.data.frame(X) # convert to data.frame
print(table(is.na(X)))
FALSE
1638604
The dataset comprises 37,241 genes and 44 samples, for a total of 1,638,604 read counts, and contains no missing values (NAs).
To compare raw counts across 44 samples in a DESeq2 analysis, we need to account for library size differences. This is where normalization by sequencing depth comes handy to correct for differences in total read counts per sample. This is the baseline normalization for differential expression testing.
Let’s calculate the size factor per sample.
# Display raw counts
head(counts(deseq2_raw)[, 1:5], 5)
srr27320655 srr27320656 srr27320657 srr27320658 srr27320659
ENSG00000000003.17 1398 6338 1824 1224 944
ENSG00000000005.6 6 79 38 3 299
ENSG00000000419.15 633 4414 916 1017 637
ENSG00000000457.15 434 741 604 553 274
ENSG00000000460.18 88 662 108 381 47
# Sum all counts per sample counts(deseq2_raw) is equivalent to assay(se,
# 'counts') from SummarizedExperiment print(colSums(counts(deseq2_raw))[1:5])
# Calculate the size factor and add it to the dataset It normalizes for
# sequencing depth differences to compare expression between samples fairly
# DESeq2 will automatically account for these depth differences in all
# downstream analyses using these size factors.
deseq2_raw <- estimateSizeFactors(deseq2_raw)
print(sizeFactors(deseq2_raw)[1:5])
srr27320655 srr27320656 srr27320657 srr27320658 srr27320659
0.6272915 1.2203391 1.1607579 1.0376846 0.7745416
Remember that,
The sizeFactor column was added to the coldData field of
the DESeqDataSet data object.
# Check that the sizeFactor column was added to colData
deseq2_raw
class: DESeqDataSet
dim: 37241 44
metadata(10): abstract Accession ... Relevance version
assays(1): counts
rownames(37241): ENSG00000000003.17 ENSG00000000005.6 ...
ENSG00000310592.1 ENSG00000310593.1
rowData names(3): transcript_id gene_id gene_name
colnames(44): srr27320655 srr27320656 ... srr27320697 srr27320698
colData names(34): age_at_surgery tissue ... condition sizeFactor
Now, all raw counts across 44 samples have accounted for library size differences.
# Now take a look at the sum of the total depth after normalization
# print(colSums(counts(deseq2_raw, normalized=TRUE))[1:5])
# To retrieve normalized read counts
counts_normalized <- counts(deseq2_raw, normalized = TRUE)
# Display normalized counts
head(counts_normalized[, 1:5], 5)
srr27320655 srr27320656 srr27320657 srr27320658 srr27320659
ENSG00000000003.17 2228.629003 5193.63847 1571.38708 1179.549163 1218.78541
ENSG00000000005.6 9.564931 64.73611 32.73723 2.891052 386.03479
ENSG00000000419.15 1009.100257 3617.02749 789.13957 980.066584 822.42194
ENSG00000000457.15 691.863367 607.20828 520.34967 532.917228 353.75763
ENSG00000000460.18 140.285660 542.47218 93.04266 367.163588 60.68105
For RNA-seq counts, the expected variance grows with the mean, making low-count genes contribute excessive noise or making high-count genes dominate. Therefore, it is required to implement variance stabilization so that each gene contributes equally to overall variance patterns making sample relationships and biological signals easier to detect.
It consists of a raw counts transformation using the formula
log2(count + 1). It is useful for reducing the effect of
large counts making distributions more symmetric. Use this
transformation for a quick comparisons across samples and conducting
exploratory visualization with PCA and heatmaps.
DESeq2 offers two transformations to stabilize variance and thus transform data to become approximately homoskedastic (to see a flat trend in the meanSdPlot): the variance stabilizing transformation (VST) for negative binomial data with a dispersion-mean trend (Anders and Huber 2010), and the regularized-logarithm transformation or rlog (Love, Huber, and Anders 2014).
“VST is much faster to compute and is less sensitive to high count outliers than the rlog. Also, VST is recommended for medium-to-large datasets (n > 30). rlog tends to work well on small datasets (n < 30), potentially outperforming VST when there is a wide range of sequencing depth across samples (an order of magnitude difference.)”
# Implement VST
counts_vst <- vst(deseq2_raw, blind = TRUE)
# counts_vst is a DESeq2 object type head(assay(counts_vst), 3)
# Implement rlog
counts_rlog <- rlog(deseq2_raw, blind = TRUE)
# counts_rlog is a DESeq2 object type head(assay(counts_rlog), 3)
We’ll continue an Exploratory Data Analysis (EDA) considering variance-stabilized-transformed counts.
# Plot PCA on condition type plotPCA(counts_vst) plotPCA(counts_vst,
# intgroup=c('tissue','condition'))
# Plot PCA on tissue type plotPCA(counts_vst, intgroup='tissue')
# Plot PCA on both condition and tissue type plotPCA(counts_vst,
# intgroup=c('condition', 'tissue'))
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6
Standard deviation 84.2860 55.65071 49.79084 41.97246 36.31274 34.32710
PC7 PC8 PC9 PC10 PC11 PC12
Standard deviation 32.62438 30.56495 30.29262 30.06734 27.85560 27.13645
PC13 PC14 PC15 PC16 PC17 PC18
Standard deviation 26.33277 26.23311 25.6769 25.48967 24.39372 24.26263
PC19 PC20 PC21 PC22 PC23 PC24
Standard deviation 24.01217 23.82649 23.4009 23.3180 22.82388 22.6720
PC25 PC26 PC27 PC28 PC29 PC30
Standard deviation 22.51250 22.18107 21.92296 21.76032 21.53951 21.10730
PC31 PC32 PC33 PC34 PC35 PC36
Standard deviation 20.83664 20.73212 20.20542 20.08440 19.91107 19.76794
PC37 PC38 PC39 PC40 PC41 PC42
Standard deviation 19.59697 19.25672 19.04298 18.87399 18.77049 18.60127
PC43 PC44
Standard deviation 18.09094 2.234e-13
[ reached 'max' / getOption("max.print") -- omitted 2 rows ]
PC1 and PC2 components explains approximately 27.4% of the total variability. We also identified 3 potential outliers from both conditions, which corresponds to rectum samples.
The Mahalanobis distance (MD) can identify outliers beyond the 99.9th
percentile. According to this criterion, only the srr27320659 sample
appears to be a potential outlier. However, we decided not to remove it
since its paired sample, srr27320660, does not represent an outlier for
the tumor_locrc condition.
# Detect outliers in PCA
pc_scores <- pca$x[, 1:5] # Use first PCs
d <- mahalanobis(pc_scores, colMeans(pc_scores), cov(pc_scores))
# Set threshold at 99.9 percentile
thr <- quantile(d, 0.999)
# Enumerate outliers
outliers <- names(d[d > thr])
outliers
[1] "srr27320659"
Before conducting the DESeq2 analysis, let’s build the design matrix and evaluate which variables are potential confounded ones (condition, tissue, age_at_surgery).
Since each tissue has distinct transcriptional profiles and the
normal vs tumor comparison is made within the same tissue,
tissue is a potential confounded variable.
Now after the data exploration we can run the differential expression pipeline on the raw counts with a single call to the function DESeq. The null hypothesis is that there is no systematic difference between the average read count values of the different conditions for a given gene. We will calculate the fold change of read counts, assuming de differences in sequencing depth and variability. We will test normal tissue versus cancerous tissue (normal used as denominator for the fold change calculation)
# Run DGE DESeq2 pipeline Display se_filtered se_filtered
# Create a DESeqDataSet data object
dds <- DESeqDataSet(se_filtered, design = ~condition)
# Account for sequencing depth differences to compare expression between
# samples fairly
dds <- estimateSizeFactors(dds)
# Run DGE DESeq2 pipeline
dds_DGE <- DESeq(dds)
dds_DGE
class: DESeqDataSet
dim: 37241 44
metadata(10): abstract Accession ... Relevance version
assays(6): counts mu ... replaceCounts replaceCooks
rownames(37241): ENSG00000000003.17 ENSG00000000005.6 ...
ENSG00000310592.1 ENSG00000310593.1
rowData names(26): transcript_id gene_id ... maxCooks replace
colnames(44): srr27320655 srr27320656 ... srr27320697 srr27320698
colData names(35): age_at_surgery tissue ... sizeFactor replaceable
We proceed to extract results obtained from the DESeq2 pipeline to obtain the base means across samples, the log2 fold variation, standard errors, p-value and p-adjusted.
# Get results from the DESeq2 pipeline
dds_DGE_results <- results(dds_DGE)
head(dds_DGE_results, n = 5)
log2 fold change (MLE): condition tumor locrc vs normal locrc
Wald test p-value: condition tumor locrc vs normal locrc
DataFrame with 5 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue
<numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000000003.17 3583.2476 0.534997 0.1701837 3.14364 1.66858e-03
ENSG00000000005.6 36.6452 0.444771 0.4429624 1.00408 3.15339e-01
ENSG00000000419.15 1944.3084 1.410766 0.2075048 6.79872 1.05556e-11
ENSG00000000457.15 564.2597 -0.342297 0.0938915 -3.64567 2.66701e-04
ENSG00000000460.18 259.4372 1.826449 0.1806765 10.10895 5.04227e-24
padj
<numeric>
ENSG00000000003.17 4.82460e-03
ENSG00000000005.6 4.31815e-01
ENSG00000000419.15 1.19006e-10
ENSG00000000457.15 9.23339e-04
ENSG00000000460.18 3.53990e-22
# Inspect the meaning of the columns from the results object
mcols(dds_DGE_results, use.names = TRUE)
DataFrame with 6 rows and 2 columns
type description
<character> <character>
baseMean intermediate mean of normalized c..
log2FoldChange results log2 fold change (ML..
lfcSE results standard error: cond..
stat results Wald statistic: cond..
pvalue results Wald test p-value: c..
padj results BH adjusted p-values
The results object contains the following features: baseMean, log2FoldChange, lfcSE, stat, pvalue, and padj. Let’s get a summary of the results object. Now, let’s obtain a summary of the results object.
# Summary of DGE
summary(dds_DGE_results)
out of 37216 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 9180, 25%
LFC < 0 (down) : 8837, 24%
outliers [1] : 0, 0%
low counts [2] : 2911, 7.8%
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
After filtering those genes with p-adjusted < 0.05, we obtained 16.013 significant genes.
# Filter genes with p-adjusted value < 0.05 The results object behaves like a
# dataframe
table(dds_DGE_results$padj < 0.05)
FALSE TRUE
18317 16013
and when we order those significant genes by p-adjusted value:
# Select significant genes with a p-adjusted value lower than 0.05
resSig <- subset(dds_DGE_results, padj < 0.05)
# Order significant genes by p-adjusted value
resSig <- resSig[order(resSig$padj), ]
head(resSig, n = 5)
log2 fold change (MLE): condition tumor locrc vs normal locrc
Wald test p-value: condition tumor locrc vs normal locrc
DataFrame with 5 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue
<numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000175832.14 777.653 5.90807 0.272208 21.7043 1.86985e-104
ENSG00000164379.7 554.643 6.81300 0.378529 17.9986 1.99725e-72
ENSG00000173898.17 335.155 3.84378 0.224648 17.1103 1.24430e-65
ENSG00000287828.2 117.882 3.74329 0.230517 16.2387 2.68670e-59
ENSG00000111110.13 475.617 2.84100 0.179182 15.8554 1.28977e-56
padj
<numeric>
ENSG00000175832.14 6.41918e-100
ENSG00000164379.7 3.42828e-68
ENSG00000173898.17 1.42389e-61
ENSG00000287828.2 2.30586e-55
ENSG00000111110.13 8.85559e-53
Let’s check those genes with the strongest down-regulation and up-regulation, respectively.
# Order significant genes with the strongest down-regulation
head(resSig[order(resSig$log2FoldChange), ])
log2 fold change (MLE): condition tumor locrc vs normal locrc
Wald test p-value: condition tumor locrc vs normal locrc
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue
<numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000181092.11 37.66107 -8.31119 0.921877 -9.01551 1.95963e-19
ENSG00000034971.18 41.97757 -8.07260 0.646220 -12.49204 8.25098e-36
ENSG00000167916.5 26.97167 -6.97326 0.764260 -9.12420 7.22703e-20
ENSG00000286765.2 17.45882 -6.93208 1.268940 -5.46289 4.68436e-08
ENSG00000241158.7 43.04531 -6.81127 0.723505 -9.41427 4.76401e-21
ENSG00000298053.1 7.38361 -6.48845 0.732932 -8.85272 8.54101e-19
padj
<numeric>
ENSG00000181092.11 6.81602e-18
ENSG00000034971.18 3.88022e-33
ENSG00000167916.5 2.67788e-18
ENSG00000286765.2 3.09496e-07
ENSG00000241158.7 2.12401e-19
ENSG00000298053.1 2.67530e-17
# Order significant genes with the strongest up-regulation
head(resSig[order(resSig$log2FoldChange, decreasing = TRUE), ])
log2 fold change (MLE): condition tumor locrc vs normal locrc
Wald test p-value: condition tumor locrc vs normal locrc
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue
<numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000298768.1 65.6359 8.92971 0.826426 10.80521 3.25199e-27
ENSG00000185269.13 222.6813 8.72954 0.647885 13.47390 2.22783e-41
ENSG00000260573.5 48.3050 8.62087 0.801541 10.75537 5.59091e-27
ENSG00000230316.8 38.2773 8.41513 0.973221 8.64667 5.30218e-18
ENSG00000214039.11 301.1991 8.23944 0.561701 14.66873 1.02235e-48
ENSG00000167755.16 159.7259 8.20499 0.846935 9.68786 3.39555e-22
padj
<numeric>
ENSG00000298768.1 3.75895e-25
ENSG00000185269.13 2.83264e-38
ENSG00000260573.5 6.13214e-25
ENSG00000230316.8 1.47507e-16
ENSG00000214039.11 3.19066e-45
ENSG00000167755.16 1.81008e-20
# Save significant DGE results to file
write.table(resSig, file = "DESeq2_sig_genes_normal_vs_cancer_44_locrc.tab", sep = "\t",
quote = FALSE, row.names = FALSE)
To visualize the counts for ENSG00000175832.14 gene, which showed the
lowest p-adjusted value of 6.41918e-100, we’ll use the plotCounts
function over the feature condition to get the normalized
counts for this gene.
*An MA-plot (Dudoit et al. 2002) provides a useful overview for the distribution of the estimated coefficients in the model, e.g. the comparisons of interest, across all genes. On the y-axis, the “M” stands for “minus” – subtraction of log values is equivalent to the log of the ratio – and on the x-axis, the “A” stands for “average”. This plot allows us to evaluate fold changes and the distribution around the mean expression.
Before making the MA-plot, we use the lfcShrink function to shrink the log2 fold changes for the comparison of dex treated vs untreated samples. There are three types of shrinkage estimators in DESeq2, which are covered in the DESeq2 vignette. log2 fold shrinkage remove the noise preserving the large differences.*
Volcano plot helps to visualize significant differential expressed genes showing the log2 fold change and the significance (p-adjusted).
To export a .tex table having the top significantly differentiated genes of the LOCRC cohort
Gene baseMean log2FoldChange padj
ENSG00000175832 ETV4 777.7 5.908 6.42e-100
ENSG00000164379 FOXQ1 554.6 6.813 3.43e-68
ENSG00000173898 SPTBN2 335.2 3.844 1.42e-61
ENSG00000287828 <NA> 117.9 3.743 2.31e-55
ENSG00000111110 PPM1H 475.6 2.841 8.86e-53
ENSG00000165816 VWA2 447.5 4.612 4.59e-49
ENSG00000164283 ESM1 72.8 5.750 2.78e-48
ENSG00000015413 DPEP1 1834.3 6.596 1.12e-47
ENSG00000120708 TGFBI 8131.2 2.903 1.07e-46
ENSG00000162073 PAQR4 378.3 1.920 1.99e-45
ENSG00000214039 LINC02418 301.2 8.239 3.19e-45
ENSG00000186462 NAP1L2 52.1 -3.451 4.65e-45
[ reached 'max' / getOption("max.print") -- omitted 18 rows ]
In this section, we will add the gene symbol, gene name, KEGG pathway, and Entrez ID to our dataset to provide more informative annotations for the DESeq analysis, which currently uses ENSEMBL gene ID nomenclature (e.g., ENSG00000175832).
# Load modules and libraries
BiocManager::install("AnnotationDbi")
BiocManager::install("org.Hs.eg.db")
library("AnnotationDbi")
library("org.Hs.eg.db")
# Check all potential terms that could be retrieved with AnnotationDbi using
# mapsIds()
columns(org.Hs.eg.db)
[1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
[6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
[11] "GENETYPE" "GO" "GOALL" "IPI" "MAP"
[16] "OMIM" "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM"
[21] "PMID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
[26] "UNIPROT"
# Review the most significant genes from our study Previously, we've saved a in
# a .tab file the significant DGE results file =
# 'DESeq2_sig_genes_normal_vs_cancer_44_locrc.tab'
head(resSig)
log2 fold change (MLE): condition tumor locrc vs normal locrc
Wald test p-value: condition tumor locrc vs normal locrc
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue
<numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000175832.14 777.653 5.90807 0.272208 21.7043 1.86985e-104
ENSG00000164379.7 554.643 6.81300 0.378529 17.9986 1.99725e-72
ENSG00000173898.17 335.155 3.84378 0.224648 17.1103 1.24430e-65
ENSG00000287828.2 117.882 3.74329 0.230517 16.2387 2.68670e-59
ENSG00000111110.13 475.617 2.84100 0.179182 15.8554 1.28977e-56
ENSG00000165816.14 447.514 4.61193 0.301494 15.2969 8.01942e-53
padj
<numeric>
ENSG00000175832.14 6.41918e-100
ENSG00000164379.7 3.42828e-68
ENSG00000173898.17 1.42389e-61
ENSG00000287828.2 2.30586e-55
ENSG00000111110.13 8.85559e-53
ENSG00000165816.14 4.58844e-49
We proceed by adding annotations starting from ENSEMBL to Entrez and from there to KEGG pathways, gene symbols and gene names. This is because, all major annotation databases (KEGG, org.Hs.eg.db, clusterProfiler) are indexed by Entrez, so doing the conversion first dramatically reduces NA values and gives more complete annotations.
# Strip version
rownames(resSig) <- sub("\\..*$", "", rownames(resSig))
# Add gene symbol column
resSig$SYMBOL = mapIds(org.Hs.eg.db, keys = rownames(resSig), column = "SYMBOL",
keytype = "ENSEMBL", multiVals = "first")
# Add gene name column
resSig$GENENAME = mapIds(org.Hs.eg.db, keys = rownames(resSig), column = "GENENAME",
keytype = "ENSEMBL", multiVals = "first")
# Add gene Entrez ID column
resSig$ENTREZID = mapIds(org.Hs.eg.db, keys = rownames(resSig), column = "ENTREZID",
keytype = "ENSEMBL", multiVals = "first")
# Add gene EGG functional pathway column
resSig$PATH = mapIds(org.Hs.eg.db, keys = rownames(resSig), column = "PATH", keytype = "ENSEMBL",
multiVals = "first")
# Add gene ENSEMBL ID column
resSig$ENSEMBL <- rownames(resSig)
# Convert to dataframe
resSig_df <- as.data.frame(resSig)
print(head(resSig_df), 5)
baseMean log2FoldChange lfcSE stat pvalue
ENSG00000175832 777.6527 5.908067 0.2722077 21.70427 1.869846e-104
ENSG00000164379 554.6433 6.813000 0.3785290 17.99862 1.997251e-72
ENSG00000173898 335.1553 3.843781 0.2246476 17.11027 1.244300e-65
ENSG00000287828 117.8818 3.743293 0.2305172 16.23867 2.686703e-59
padj SYMBOL GENENAME
ENSG00000175832 6.419180e-100 ETV4 ETS variant transcription factor 4
ENSG00000164379 3.428282e-68 FOXQ1 forkhead box Q1
ENSG00000173898 1.423894e-61 SPTBN2 spectrin beta, non-erythrocytic 2
ENSG00000287828 2.305863e-55 <NA> <NA>
ENTREZID PATH ENSEMBL
ENSG00000175832 2118 <NA> ENSG00000175832
ENSG00000164379 94234 <NA> ENSG00000164379
ENSG00000173898 6712 <NA> ENSG00000173898
ENSG00000287828 <NA> <NA> ENSG00000287828
[ reached 'max' / getOption("max.print") -- omitted 2 rows ]
# Save data frame including annotations
write.table(resSig_df, file = ("DESeq2_sig_genes_normal_vs_cancer_44_locrc_annotated.tab"),
row.names = F, col.names = TRUE, sep = "\t")
They often reflect: biological heterogeneity, batch effects, cell-type composition differences, high expression strength, and technical noise.
# Load modules and libraries
BiocManager::install("clusterProfiler")
BiocManager::install("ggnewscale")
library("clusterProfiler")
library("enrichplot")
library("ggnewscale")
[1] 16013 11
ID Description Count
GO:0050789 GO:0050789 regulation of biological process 6611
GO:0008152 GO:0008152 metabolic process 6460
GO:0050794 GO:0050794 regulation of cellular process 6372
GO:0051716 GO:0051716 cellular response to stimulus 4242
GO:0071840 GO:0071840 cellular component organization or biogenesis 4223
GO:0048518 GO:0048518 positive regulation of biological process 3765
GO:0007154 GO:0007154 cell communication 3693
GO:0023052 GO:0023052 signaling 3680
GO:0048856 GO:0048856 anatomical structure development 3535
GO:0048522 GO:0048522 positive regulation of cellular process 3533
GO:0007165 GO:0007165 signal transduction 3388
GO:0048519 GO:0048519 negative regulation of biological process 3263
GeneRatio
GO:0050789 6611/12698
GO:0008152 6460/12698
GO:0050794 6372/12698
GO:0051716 4242/12698
GO:0071840 4223/12698
GO:0048518 3765/12698
GO:0007154 3693/12698
GO:0023052 3680/12698
GO:0048856 3535/12698
GO:0048522 3533/12698
GO:0007165 3388/12698
GO:0048519 3263/12698
[ reached 'max' / getOption("max.print") -- omitted 13 rows ]
# Convert to dataframe
ego_df <- as.data.frame(ego)
# Visualize the last columns
head(ego_df, 25)
ID Description GeneRatio BgRatio
GO:0042254 GO:0042254 ribosome biogenesis 262/9826 318/18860
GO:0022613 GO:0022613 ribonucleoprotein complex biogenesis 373/9826 496/18860
GO:0006364 GO:0006364 rRNA processing 186/9826 223/18860
GO:0016072 GO:0016072 rRNA metabolic process 216/9826 270/18860
RichFactor FoldEnrichment zScore pvalue p.adjust
GO:0042254 0.8238994 1.581390 10.904627 4.319790e-30 2.754730e-26
GO:0022613 0.7520161 1.443418 10.437057 7.951835e-27 2.535443e-23
GO:0006364 0.8340807 1.600932 9.414511 5.695650e-23 1.210705e-19
GO:0016072 0.8000000 1.535518 9.243234 8.029223e-22 1.280059e-18
qvalue
GO:0042254 2.114878e-26
GO:0022613 1.946525e-23
GO:0006364 9.294902e-20
GO:0016072 9.827346e-19
geneID
GO:0042254 DKC1/EXOSC2/UTP25/HEATR1/NLE1/NOP58/DDX55/RPL7L1/BOP1/DDX31/NAT10/RIOK3/LYAR/NOL8/GTF3A/WDR75/DDX10/WDR43/EXOSC5/WDR12/RPP40/WDR3/UTP14A/UTP20/BRIX1/PRKDC/PA2G4/GAR1/SUV39H1/RRP15/KAT2B/SPOUT1/NOP14/DDX27/URB1/LAS1L/NPM3/DCAF13/TMA16/MTERF3/RRP9/NOB1/NSUN5/PDCD11/DDX18/WDR36/PNO1/RIOK1/GTPBP4/LTV1/RRS1/AIRIM/RNASEL/EXOSC8/UTP18/BYSL/SRFBP1/EXOSC3/EXOSC7/WDR46/NPM1/NOL6/RPF2/DDX56/FBL/ERCC2/RIOX2/UTP4/RPUSD4/EIF1AX/NOLC1/GTPBP10/DHX37/RAN/NOP2/UTP23/PAK1IP1/SDAD1/NHP2/DDX21/MYBBP1A/MPHOSPH10/NOL10/TSR1/XPO1/URB2/BMS1/PES1/MTG2/RRP1B/RCL1/NOP56/UTP15/MRM2/SIRT7/RPUSD1/BUD23/DIS3/GEMIN4/WDR74/EBNA1BP2/CUL4A/PPAN/TRMT61B/CHD7/DDX28/GNL3L/NOM1/TFB1M/MRTO4/ZNHIT6/DDX52/DIMT1/SURF6/MAK16/NOC4L/NOL11/RRP1/NOC2L/AATF/POP7/RPS21/HEATR3/EXOSC9/IMP4/FASTKD2/C1QBP/UTP6/PWP1/NIP7/DDX54/XRN2/EMG1/EIF5B/DHX30/KRI1/NUDT16/NUP88/DDX51/ZCCHC4/REXO4/TRMT2B/METTL5/TRMT112/WDR18/KRR1/NAF1/NMD3/NOP16/MDN1/RRP8/RPS19/AFG2A/TFB2M/METTL16/NOL7/MRPS2/MALSU1/LSM6/XRCC5/ISG20/REXO5/RPS4X/RPL10L/RPS7/GNL2/ZNF593/ISG20L2/DDX47/EIF2A/GRWD1/EIF5/RPL7A/RPL35/RBIS/ESF1/PELP1/CUL4B/TBL3/NSUN3/UTP3/RPS6/RPL14/UTP11/ABT1/RPS5/ERI1/RPLP0/MYG1/SNU13/RPL26L1/EFL1/RPLP0P6/RPL7/RRP7BP/RBFA/USP36/EXOSC4/PIN4/RPUSD2/ERAL1/RPS15/FTSJ3/FSAF1/RPL38/DDX49/RPS16/MPHOSPH6/EIF6/MTREX/LSG1/MRM1/RPSA/DNTTIP2/CDKN2A/POP5/GLUL/RRP7A/RCC1L/EIF4A3/PIH1D2/RPL27/SBDS/RPS9/RPS11/RPS12/RPL24/RPL35A/RPS17/DROSHA/RPS27A/RRN3/ZNF622/MCAT/EXOSC6/EXOSC1/RPL5/RPS15A/RPS24/TSR2/AFG2B/RPS8/NSA2/RPS23/RPS14/WBP11/NGDN/NOA1/FCF1/RBM34/ZNHIT3/RSL24D1/RIOK2/RPP38/TENT4B/SLX9/RPS25/RRP36/RPS28/RPP30/RPSA2/MRM3
GO:0022613 RUVBL1/DKC1/EXOSC2/UTP25/NUFIP1/PRPF19/WDR77/HEATR1/NLE1/NOP58/DDX55/RPL7L1/BOP1/DDX31/NAT10/RIOK3/LYAR/NOL8/GTF3A/WDR75/DDX10/WDR43/EXOSC5/WDR12/RPP40/WDR3/SRPK1/UTP14A/UTP20/BRIX1/PRKDC/MCTS2/PA2G4/GAR1/SUV39H1/GEMIN5/RRP15/KAT2B/SPOUT1/RUVBL2/NOP14/DDX27/URB1/LAS1L/NPM3/DCAF13/EIF3B/EIF2S3/TMA16/MTERF3/HSP90AB1/RRP9/NOB1/NSUN5/DENR/SF3B3/PDCD11/SNRPF/DDX18/WDR36/PNO1/ATR/RIOK1/GTPBP4/LTV1/RRS1/AIRIM/RNASEL/EXOSC8/UTP18/BYSL/SRFBP1/EXOSC3/EXOSC7/WDR46/NPM1/NOL6/RPF2/DDX56/FBL/ERCC2/RIOX2/SRSF5/UTP4/CLNS1A/RPUSD4/EIF1AX/NOLC1/GTPBP10/DHX37/RAN/NOP2/EIF2S2/UTP23/PAK1IP1/SDAD1/NHP2/RBMX2/DDX21/SNRPB/MYBBP1A/MPHOSPH10/NOL10/TGS1/TSR1/XPO1/URB2/BMS1/PES1/MTG2/NOPCHAP1/GEMIN6/CELF4/RRP1B/RCL1/NOP56/DDX20/UTP15/MRM2/SIRT7/RPUSD1/BUD23/TARBP2/DIS3/GEMIN4/SRSF1/WDR74/EBNA1BP2/CUL4A/PPAN/PRPF6/TRMT61B/CHD7/DDX28/SRSF6/GNL3L/PTGES3/NOM1/TFB1M/MRTO4/GEMIN8/ZNHIT6/DDX52/USP39/DIMT1/SURF6/AAR2/MAK16/NOC4L/NOL11/RRP1/NOC2L/AATF/POP7/RPS21/HEATR3/PRPF31/EXOSC9/IMP4/FASTKD2/C1QBP/UTP6/SNRPA1/PWP1/PRPF3/NIP7/DDX54/XRN2/SNRPE/EMG1/SF3A3/EIF5B/DHX30/LSM2/KRI1/STRAP/NUDT16/NUP88/DDX51/SNRNP200/ZCCHC4/REXO4/SNRPC/TRMT2B/METTL5/SNRPB2/TRMT112/SNRPD2/WDR18/KRR1/SRPK2/NAF1/HSP90AB3P/U2AF2/NMD3/SNRPG/NOP16/MDN1/RRP8/EIF3M/SNRPD1/RPS19/MCTS1/AFG2A/TFB2M/METTL16/NOL7/DHX9/SRSF9/MRPS2/MALSU1/LSM6/XRCC5/ISG20/DDX39B/EIF3J/REXO5/RPS4X/RPL10L/RPS7/GNL2/HSP90AA1/PRKRA/SRSF10/ZNF593/KHDC4/ISG20L2/DDX47/EIF2A/COIL/GRWD1/SMN1/EIF3G/DDX1/LUC7L2/EIF5/CELF6/EIF3I/RPL7A/GCFC2/RPL35/RBIS/ESF1/PELP1/CUL4B/DDX46/EIF3H/TBL3/EIF3D/EIF3E/NSUN3/UTP3/AGO2/ZRSR2/PRMT5/RPS6/RPL14/EIF4B/ISY1/UTP11/ABT1/RPS5/SNRPD3/ERI1/RPLP0/MYG1/SNU13/SART3/HTATSF1/RPL26L1/EFL1/SHQ1/RPLP0P6/RPL7/NOL3/PRMT7/SNIP1/RRP7BP/RBFA/USP36/XAB2/NCBP1/EXOSC4/EIF3C/PIN4/RPUSD2/ERAL1/RPS15/FTSJ3/FSAF1/RPL38/DDX49/RPS16/MPHOSPH6/EIF6/TAF9/MTREX/LSG1/MRM1/SF3A2/RPSA/DNTTIP2/CDKN2A/POP5/GLUL/RRP7A/RCC1L/EIF4A3/CELF3/GEMIN2/PIH1D2/RPL27/SBDS/RPS9/RPS11/RPS12/RPL24/RPL35A/SLU7/RPS17/DROSHA/SRPK3/PRP4K/RPL13A/RPS27A/RRN3/ZNF622/MCAT/EXOSC6/EXOSC1/DDX23/RPL5/RPS15A/RPS24/TSR2/AFG2B/RPS8/LUC7L/LUC7L3/NSA2/GEMIN7/PHF5A/RSRP1/PUF60/EIF3A/SF3B6/RPS23/RPS14/BUD13/WBP11/NGDN/NOA1/FCF1/DDX42/RBM34/ZNHIT3/SFSWAP/RSL24D1/DICER1/PRPF39/RIOK2/PSIP1/LSM4/RPP38/TENT4B/CELF5/SLX9/RPS25/RRP36/RPS28/RPP30/RPSA2/MRM3
GO:0006364 DKC1/EXOSC2/UTP25/HEATR1/NOP58/DDX55/RPL7L1/BOP1/NAT10/RIOK3/LYAR/NOL8/WDR75/DDX10/WDR43/EXOSC5/WDR12/RPP40/WDR3/UTP14A/UTP20/BRIX1/PRKDC/PA2G4/GAR1/SUV39H1/RRP15/KAT2B/SPOUT1/NOP14/DDX27/URB1/LAS1L/NPM3/DCAF13/RRP9/NOB1/NSUN5/PDCD11/DDX18/WDR36/RIOK1/GTPBP4/RRS1/RNASEL/EXOSC8/UTP18/BYSL/SRFBP1/EXOSC3/EXOSC7/WDR46/NOL6/RPF2/DDX56/FBL/ERCC2/UTP4/RPUSD4/NOLC1/DHX37/NOP2/UTP23/PAK1IP1/NHP2/DDX21/MPHOSPH10/NOL10/TSR1/BMS1/PES1/RRP1B/RCL1/NOP56/UTP15/MRM2/SIRT7/RPUSD1/BUD23/DIS3/GEMIN4/WDR74/EBNA1BP2/PPAN/TRMT61B/CHD7/TFB1M/MRTO4/ZNHIT6/DDX52/DIMT1/MAK16/NOC4L/NOL11/RRP1/POP7/RPS21/EXOSC9/IMP4/UTP6/PWP1/DDX54/XRN2/EMG1/KRI1/NUDT16/DDX51/ZCCHC4/REXO4/TRMT2B/METTL5/TRMT112/WDR18/KRR1/NAF1/RRP8/RPS19/TFB2M/METTL16/NOL7/LSM6/ISG20/REXO5/RPS7/DDX47/RPL7A/RPL35/ESF1/PELP1/TBL3/NSUN3/UTP3/RPS6/RPL14/UTP11/ABT1/ERI1/MYG1/SNU13/RPL7/RRP7BP/RBFA/USP36/EXOSC4/PIN4/RPUSD2/RPS15/FTSJ3/DDX49/RPS16/MPHOSPH6/EIF6/MTREX/MRM1/CDKN2A/POP5/RRP7A/EIF4A3/PIH1D2/RPL27/SBDS/RPL35A/RPS17/DROSHA/EXOSC6/EXOSC1/RPL5/RPS24/TSR2/RPS8/NSA2/RPS14/WBP11/NGDN/FCF1/RBM34/ZNHIT3/RIOK2/RPP38/TENT4B/SLX9/RPS25/RRP36/RPS28/RPP30/MRM3
GO:0016072 DKC1/EXOSC2/UTP25/POLR1G/HEATR1/POLR1B/NOP58/DDX55/RPL7L1/BOP1/NAT10/RIOK3/LYAR/NOL8/GTF3A/WDR75/DDX10/WDR43/EXOSC5/WDR12/RPP40/WDR3/UTP14A/UTP20/BRIX1/PRKDC/PA2G4/GAR1/SUV39H1/RRP15/KAT2B/SPOUT1/NOP14/DDX27/URB1/LAS1L/NPM3/POLR1F/DCAF13/RRP9/NOB1/NSUN5/TCOF1/PDCD11/DDX18/WDR36/RIOK1/GTPBP4/RRS1/POLR1A/NCL/RNASEL/EXOSC8/UTP18/BYSL/SRFBP1/EXOSC3/EXOSC7/WDR46/NOL6/RPF2/DDX56/FBL/ERCC2/UTP4/RPUSD4/NIFK/NOLC1/DHX37/NOP2/UTP23/PAK1IP1/NHP2/DDX21/MPHOSPH10/NOL10/TSR1/BMS1/PES1/TAF1B/DDX11/RRP1B/RCL1/NOP56/UTP15/MRM2/SIRT7/POLR1D/GTF3C3/RPUSD1/BUD23/DIS3/GEMIN4/WDR74/EBNA1BP2/PPAN/TRMT61B/CHD7/TFB1M/MRTO4/ZNHIT6/DDX52/GTF3C4/DIMT1/MAK16/NOC4L/NOL11/RRP1/POP7/RPS21/EXOSC9/IMP4/UTP6/PWP1/DDX54/XRN2/EMG1/KRI1/MYC/NUDT16/DDX51/ZCCHC4/REXO4/TP53/TRMT2B/METTL5/POLR1E/TRMT112/WDR18/KRR1/NAF1/RRP8/RPS19/TFB2M/METTL16/NOL7/LSM6/ISG20/REXO5/RPS7/MARS1/XRN1/DDX47/RPL7A/CAVIN1/RPL35/ESF1/PELP1/TBL3/NSUN3/UTP3/RPS6/RPL14/UTP11/ABT1/ERI1/MYG1/SNU13/MACROH2A2/RPL7/SPIN1/RRP7BP/MACROH2A1/RBFA/USP36/EXOSC4/SMARCA4/PIN4/RPUSD2/PELO/RPS15/FTSJ3/MAPT/DDX49/RPS16/MPHOSPH6/EIF6/ERN2/MTREX/DEDD2/MRM1/GTF3C5/CDKN2A/POP5/RRP7A/EIF4A3/PIH1D2/RPL27/SBDS/RPL35A/RPS17/DROSHA/EXOSC6/EXOSC1/DEDD/RPL5/RPS24/TSR2/RPS8/NSA2/RPS14/WBP11/NGDN/FCF1/RBM34/ZNHIT3/RIOK2/RPP38/TENT4B/SLX9/RPS25/RRP36/MTOR/RPS28/RPP30/MRM3
Count
GO:0042254 262
GO:0022613 373
GO:0006364 186
GO:0016072 216
[ reached 'max' / getOption("max.print") -- omitted 21 rows ]
# Get dataframe dimmensions
dim(ego_df)
[1] 869 12
The Gene-Concept Network displays which genes are associated with which biological processes, making it easy to see gene-function relationships. Meanwhile, the enrichment map groups similar terms together. Terms that share genes are connected, making related functions easy to spot.
# Load modules and libraries
library(DOSE)
# Convert gene ID to Symbol
egox <- setReadable(ego, "org.Hs.eg.db", "ENTREZID")
# Generate geneList with Log2FoldChange data
geneList <- resSig$log2FoldChange
names(geneList) <- as.character(unique(resSig$ENTREZID))
geneList <- sort(geneList, decreasing = TRUE)
head(geneList)
5333 147111 4857 10004 100190940 401093
8.929714 8.729543 8.620869 8.415127 8.239437 8.204987
# Network plot of enriched terms CategorySize can be scaled by 'pvalue' or
# 'geneNum'
p1 <- cnetplot(egox, categorySize = "pvalue", foldChange = geneList)
# Display the plot
print(p1)
# Calculate pairwise similarities of enriched terms
egox_pairwise <- pairwise_termsim(egox)
# Creates an enrichment map visualization from pairwise enrichment analysis
# results
p2 <- emapplot(egox_pairwise, layout = "kk")
# Display the plot
print(p2)
# Plot Gene-Concept Network and Enrichment Map in a PDF file
# Plot 1
pdf("gene_concept_network_plot.pdf", width = 12, height = 10)
print(p1)
dev.off()
quartz_off_screen
2
# Plot 2
pdf("enrichment_map.pdf", width = 12, height = 10)
print(p2)
dev.off()
quartz_off_screen
2
# # Start PDF device pdf(file=paste('Merge_GO_Gene_Set_Enrichment_Analysis',
# 'CC', '.pdf', sep='_'), width=25, height=10) # Create the plot
# cowplot::plot_grid(p1, p2, ncol=2, labels=LETTERS[1:3], rel_widths=c(.8, .8,
# 1.2)) # Close PDF device dev.off()
# Get results from the DESeq2 pipeline by contrast
tumor_vs_normal_44_dds <- results(dds_DGE, contrast = c("condition", "tumor_locrc",
"normal_locrc"))
summary(tumor_vs_normal_44_dds)
out of 37216 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 9180, 25%
LFC < 0 (down) : 8837, 24%
outliers [1] : 0, 0%
low counts [2] : 2911, 7.8%
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
Get differentially expressed genes by contrast
tumor_locrc_vs_normal_locrc:
log2 fold change (MLE): condition tumor_locrc vs normal_locrc
Wald test p-value: condition tumor locrc vs normal locrc
DataFrame with 5 rows and 5 columns
log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000175832.14 5.90807 0.272208 21.7043 1.86985e-104 6.41918e-100
ENSG00000164379.7 6.81300 0.378529 17.9986 1.99725e-72 3.42828e-68
ENSG00000173898.17 3.84378 0.224648 17.1103 1.24430e-65 1.42389e-61
ENSG00000287828.2 3.74329 0.230517 16.2387 2.68670e-59 2.30586e-55
ENSG00000111110.13 2.84100 0.179182 15.8554 1.28977e-56 8.85559e-53
Get differentially expressed genes by contrast
tumor_eocrc_vs_normal_eocrc:
baseMean log2FoldChange lfcSE stat pvalue
ENSG00000105464.5 140.90415 4.172698 0.3545970 11.76744 5.744397e-32
ENSG00000164283.13 109.78264 5.232688 0.4532636 11.54447 7.872463e-31
ENSG00000175832.14 726.72663 4.370810 0.3983460 10.97239 5.188033e-28
ENSG00000107159.15 388.76790 5.636863 0.5534101 10.18569 2.297276e-24
ENSG00000163347.6 625.01831 4.503009 0.4417764 10.19296 2.131718e-24
ENSG00000170373.9 396.55271 7.034988 0.6893200 10.20569 1.869817e-24
ENSG00000172031.8 75.79835 4.432616 0.4332061 10.23212 1.423717e-24
ENSG00000236081.2 137.80967 4.789244 0.4705045 10.17895 2.461911e-24
padj
ENSG00000105464.5 1.920180e-27
ENSG00000164283.13 1.315764e-26
ENSG00000175832.14 5.780679e-24
ENSG00000107159.15 1.028679e-20
ENSG00000163347.6 1.028679e-20
ENSG00000170373.9 1.028679e-20
ENSG00000172031.8 1.028679e-20
ENSG00000236081.2 1.028679e-20
[ reached 'max' / getOption("max.print") -- omitted 6208 rows ]
baseMean log2FoldChange lfcSE stat pvalue
ENSG00000000003.17 2857.21111 0.1616879 0.2714131 0.5957264 5.513580e-01
ENSG00000000005.6 51.21913 1.3275579 0.4465390 2.9729944 2.949099e-03
ENSG00000000419.15 1670.06587 1.2283879 0.2211246 5.5551835 2.773202e-08
ENSG00000000457.15 485.04283 -0.3559857 0.1470883 -2.4202180 1.551120e-02
ENSG00000000460.18 223.44756 1.3877793 0.2065736 6.7180869 1.841258e-11
ENSG00000000938.14 259.37230 0.2577540 0.3653517 0.7054955 4.805019e-01
ENSG00000000971.18 1250.84010 -1.1620404 0.2475734 -4.6937199 2.682813e-06
ENSG00000001036.15 1874.82106 0.2042679 0.1358121 1.5040486 1.325689e-01
padj
ENSG00000000003.17 6.900129e-01
ENSG00000000005.6 1.271338e-02
ENSG00000000419.15 7.869256e-07
ENSG00000000457.15 4.838946e-02
ENSG00000000460.18 1.667961e-09
ENSG00000000938.14 6.286395e-01
ENSG00000000971.18 3.666328e-05
ENSG00000001036.15 2.519977e-01
[ reached 'max' / getOption("max.print") -- omitted 37857 rows ]
class: DESeqDataSet
dim: 37865 42
metadata(9): abstract Accession ... Relevance version
assays(6): counts mu ... replaceCounts replaceCooks
rownames(37865): ENSG00000000003.17 ENSG00000000005.6 ...
ENSG00000310592.1 ENSG00000310593.1
rowData names(26): transcript_id gene_id ... maxCooks replace
colnames(42): srr17866817 srr17866818 ... srr17866857 srr17866858
colData names(34): age_at_surgery tissue ... sizeFactor replaceable
EOCRC_UP LOCRC_NS Counts
1 0 0 30699
2 0 1 3541
3 1 0 1274
4 1 1 19
attr(,"class")
[1] "VennCounts"
EOCRC_DOWN LOCRC_NS Counts
1 0 0 30804
2 0 1 3539
3 1 0 1169
4 1 1 21
attr(,"class")
[1] "VennCounts"
Extract genes significantly differentially expressed in EOCRC
[1] 19
[1] "ENSG00000085741.15" "ENSG00000099194.7" "ENSG00000102854.17"
[4] "ENSG00000105825.15" "ENSG00000105835.14" "ENSG00000106236.5"
[7] "ENSG00000136689.20" "ENSG00000140279.13" "ENSG00000146054.19"
[10] "ENSG00000150337.15" "ENSG00000159167.12" "ENSG00000165171.12"
[13] "ENSG00000167774.2" "ENSG00000176971.4" "ENSG00000181458.11"
[16] "ENSG00000185201.18" "ENSG00000205609.14" "ENSG00000243081.3"
[19] "ENSG00000274585.2"
[1] 21
[1] "ENSG00000047648.23" "ENSG00000110076.21" "ENSG00000118523.7"
[4] "ENSG00000126353.5" "ENSG00000126803.10" "ENSG00000127954.14"
[7] "ENSG00000128709.13" "ENSG00000161405.18" "ENSG00000169507.10"
[10] "ENSG00000184304.17" "ENSG00000185905.5" "ENSG00000198910.15"
[13] "ENSG00000211650.2" "ENSG00000211935.3" "ENSG00000211942.3"
[16] "ENSG00000211959.2" "ENSG00000214548.20" "ENSG00000224650.2"
[19] "ENSG00000243290.3" "ENSG00000274576.2" "ENSG00000276775.1"
Create a table with the genes significantly differentially expressed in EOCRC
ENSEMBL_ID EOCRC_baseMean EOCRC_log2FC EOCRC_padj LOCRC_baseMean
1 ENSG00000085741 182.214 1.820 0.001 111.907
2 ENSG00000099194 4216.099 1.099 0.025 6498.227
3 ENSG00000102854 760.166 2.326 0.000 287.297
4 ENSG00000105825 73.805 1.257 0.023 211.037
LOCRC_log2FC LOCRC_padj Direction Gene ENTREZID PATH
1 0.544 0.257 Up WNT11 7481 04310
2 0.657 0.238 Up SCD 6319 01040
3 0.497 0.309 Up MSLN 10232 <NA>
4 0.668 0.382 Up TFPI2 7980 <NA>
[ reached 'max' / getOption("max.print") -- omitted 36 rows ]
baseMean gives the mean counts for the gene in each cohort. log2FC is the log2 fold change in tumors compared to normal samples. padj is the adjusted p-value. Values rounded to 3 decimal points.
ID Description Count
GO:0050789 GO:0050789 regulation of biological process 19
GO:0050794 GO:0050794 regulation of cellular process 18
GO:0008152 GO:0008152 metabolic process 16
GO:0006955 GO:0006955 immune response 14
GO:0048518 GO:0048518 positive regulation of biological process 14
GO:0007154 GO:0007154 cell communication 13
GO:0048522 GO:0048522 positive regulation of cellular process 13
GO:0071840 GO:0071840 cellular component organization or biogenesis 13
GO:0023052 GO:0023052 signaling 13
GO:0007165 GO:0007165 signal transduction 12
GO:0051716 GO:0051716 cellular response to stimulus 12
GO:0048856 GO:0048856 anatomical structure development 12
GeneRatio
GO:0050789 19/40
GO:0050794 18/40
GO:0008152 16/40
GO:0006955 14/40
GO:0048518 14/40
GO:0007154 13/40
GO:0048522 13/40
GO:0071840 13/40
GO:0023052 13/40
GO:0007165 12/40
GO:0051716 12/40
GO:0048856 12/40
[ reached 'max' / getOption("max.print") -- omitted 13 rows ]
ID
GO:0016064 GO:0016064
GO:0019724 GO:0019724
GO:0002449 GO:0002449
GO:0002460 GO:0002460
Description
GO:0016064 immunoglobulin mediated immune response
GO:0019724 B cell mediated immunity
GO:0002449 lymphocyte mediated immunity
GO:0002460 adaptive immune response based on somatic recombination of immune receptors built from immunoglobulin superfamily domains
GeneRatio BgRatio RichFactor FoldEnrichment zScore pvalue
GO:0016064 7/35 203/18860 0.03448276 18.581281 10.859341 7.860296e-08
GO:0019724 7/35 206/18860 0.03398058 18.310680 10.771783 8.689659e-08
GO:0002449 7/35 388/18860 0.01804124 9.721649 7.484855 6.043876e-06
GO:0002460 7/35 402/18860 0.01741294 9.383085 7.325722 7.619376e-06
p.adjust qvalue
GO:0016064 4.001588e-05 3.338659e-05
GO:0019724 4.001588e-05 3.338659e-05
GO:0002449 1.754361e-03 1.463722e-03
GO:0002460 1.754361e-03 1.463722e-03
geneID Count
GO:0016064 FCGR1A/IGHV1-3/IGHV3-13/IGHV4-39/IGHV3-74/IGHV2-70/IGHV4-4 7
GO:0019724 FCGR1A/IGHV1-3/IGHV3-13/IGHV4-39/IGHV3-74/IGHV2-70/IGHV4-4 7
GO:0002449 FCGR1A/IGHV1-3/IGHV3-13/IGHV4-39/IGHV3-74/IGHV2-70/IGHV4-4 7
GO:0002460 FCGR1A/IGHV1-3/IGHV3-13/IGHV4-39/IGHV3-74/IGHV2-70/IGHV4-4 7
[ reached 'max' / getOption("max.print") -- omitted 7 rows ]
# Convert to dataframe
ego_df <- as.data.frame(ego)
# Visualize the top cateogories
ego_head15 <- head(ego_df, 15)
# Save file
write.csv(ego_head15, file = "ego_top15_GO_BP.csv", row.names = FALSE)
The Gene-Concept Network displays which genes are associated with which biological processes, making it easy to see gene-function relationships. Meanwhile, the enrichment map groups similar terms together. Terms that share genes are connected, making related functions easy to spot.
# Load modules and libraries
library(DOSE)
# Convert gene ID to Symbol
egox <- setReadable(ego, "org.Hs.eg.db", "ENTREZID")
# Generate geneList with Log2FoldChange data
geneList <- final_table$EOCRC_log2FC
names(geneList) <- as.character(unique(final_table$ENTREZID))
geneList <- sort(geneList, decreasing = TRUE)
head(geneList)
4885 10232 7481 6781 6066 50506
2.551 2.326 1.820 1.735 1.648 1.513
# Network plot of enriched terms CategorySize can be scaled by 'pvalue' or
# 'geneNum'
p1 <- cnetplot(egox, categorySize = "pvalue", foldChange = geneList)
# Display the plot
print(p1)
# Calculate pairwise similarities of enriched terms
egox_pairwise <- pairwise_termsim(egox)
# Creates an enrichment map visualization from pairwise enrichment analysis
# results
p2 <- emapplot(egox_pairwise, layout = "kk", node_label_size = 3)
# Display the plot
print(p2)
# Plot Gene-Concept Network and Enrichment Map in a PDF file
# Plot 1
pdf("gene_concept_network_plot_30genes.pdf", width = 6, height = 4)
print(p1)
dev.off()
quartz_off_screen
2
# Plot 2
pdf("enrichment_map_30genes.pdf", width = 6, height = 4)
print(p2)
dev.off()
quartz_off_screen
2
# # Start PDF device pdf(file=paste('Merge_GO_Gene_Set_Enrichment_Analysis',
# 'CC', '.pdf', sep='_'), width=25, height=10) # Create the plot
# cowplot::plot_grid(p1, p2, ncol=2, labels=LETTERS[1:3], rel_widths=c(.8, .8,
# 1.2)) # Close PDF device dev.off()
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.5.1 (2025-06-13)
os macOS Ventura 13.7.3
system x86_64, darwin20
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz Europe/Madrid
date 2026-02-03
pandoc 3.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/x86_64/ (via rmarkdown)
quarto 1.5.57 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/quarto
─ Packages ───────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
abind 1.4-8 2024-09-12 [1] CRAN (R 4.5.0)
affy 1.88.0 2025-10-29 [1] Bioconductor 3.22 (R 4.5.1)
affyio 1.80.0 2025-10-29 [1] Bioconductor 3.22 (R 4.5.1)
AnnotationDbi * 1.72.0 2025-10-29 [1] Bioconductor 3.22 (R 4.5.1)
ape 5.8-1 2024-12-16 [1] CRAN (R 4.5.0)
apeglm * 1.32.0 2025-10-29 [1] Bioconductor 3.22 (R 4.5.1)
aplot 0.2.9 2025-09-12 [1] CRAN (R 4.5.1)
base64enc 0.1-3 2015-07-28 [1] CRAN (R 4.5.0)
bbmle 1.0.25.1 2023-12-09 [1] CRAN (R 4.5.0)
bdsmatrix 1.3-7 2024-03-02 [1] CRAN (R 4.5.0)
beeswarm 0.4.0 2021-06-01 [1] CRAN (R 4.5.0)
Biobase * 2.70.0 2025-10-29 [1] Bioconductor 3.22 (R 4.5.1)
BiocGenerics * 0.56.0 2025-10-29 [1] Bioconductor 3.22 (R 4.5.1)
BiocManager * 1.30.27 2025-11-14 [1] CRAN (R 4.5.1)
BiocParallel 1.44.0 2025-10-30 [1] Bioconductor 3.22 (R 4.5.1)
Biostrings 2.78.0 2025-10-29 [1] Bioconductor 3.22 (R 4.5.1)
bit 4.6.0 2025-03-06 [1] CRAN (R 4.5.0)
bit64 4.6.0-1 2025-01-16 [1] CRAN (R 4.5.0)
blob 1.2.4 2023-03-17 [1] CRAN (R 4.5.0)
bslib 0.9.0 2025-01-30 [1] CRAN (R 4.5.0)
cachem 1.1.0 2024-05-16 [1] CRAN (R 4.5.0)
cli 3.6.5 2025-04-23 [1] CRAN (R 4.5.0)
cluster 2.1.8.1 2025-03-12 [1] CRAN (R 4.5.1)
clusterProfiler * 4.18.4 2025-12-15 [1] Bioconductor 3.22 (R 4.5.2)
coda 0.19-4.1 2024-01-31 [1] CRAN (R 4.5.0)
codetools 0.2-20 2024-03-31 [1] CRAN (R 4.5.1)
corrplot * 0.95 2024-10-14 [1] CRAN (R 4.5.0)
cowplot 1.2.0 2025-07-07 [1] CRAN (R 4.5.1)
crayon 1.5.3 2024-06-20 [1] CRAN (R 4.5.0)
data.table 1.17.8 2025-07-10 [1] CRAN (R 4.5.1)
DBI 1.2.3 2024-06-02 [1] CRAN (R 4.5.0)
DelayedArray 0.36.0 2025-10-29 [1] Bioconductor 3.22 (R 4.5.1)
DESeq2 * 1.50.2 2025-11-17 [1] Bioconductor 3.22 (R 4.5.2)
devtools 2.4.6 2025-10-03 [1] CRAN (R 4.5.1)
digest 0.6.39 2025-11-19 [1] CRAN (R 4.5.1)
DOSE * 4.4.0 2025-10-29 [1] Bioconductor 3.22 (R 4.5.1)
dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.5.0)
ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.5.0)
emdbook 1.3.14 2025-07-23 [1] CRAN (R 4.5.1)
enrichplot * 1.30.4 2025-12-01 [1] Bioconductor 3.22 (R 4.5.2)
evaluate 1.0.5 2025-08-27 [1] CRAN (R 4.5.1)
factoextra * 1.0.7 2020-04-01 [1] CRAN (R 4.5.0)
farver 2.1.2 2024-05-13 [1] CRAN (R 4.5.0)
fastmap 1.2.0 2024-05-15 [1] CRAN (R 4.5.0)
fastmatch 1.1-6 2024-12-23 [1] CRAN (R 4.5.0)
fgsea 1.36.0 2025-10-29 [1] Bioconductor 3.22 (R 4.5.1)
fontBitstreamVera 0.1.1 2017-02-01 [1] CRAN (R 4.5.0)
fontLiberation 0.1.0 2016-10-15 [1] CRAN (R 4.5.0)
fontquiver 0.2.1 2017-02-01 [1] CRAN (R 4.5.0)
formatR 1.14 2023-01-17 [1] CRAN (R 4.5.0)
fs 1.6.6 2025-04-12 [1] CRAN (R 4.5.0)
gdtools 0.4.4 2025-10-06 [1] CRAN (R 4.5.1)
generics * 0.1.4 2025-05-09 [1] CRAN (R 4.5.0)
GenomicRanges * 1.62.1 2025-12-08 [1] Bioconductor 3.22 (R 4.5.2)
ggbeeswarm * 0.7.3 2025-11-29 [1] CRAN (R 4.5.1)
ggforce 0.5.0 2025-06-18 [1] CRAN (R 4.5.0)
ggfortify * 0.4.19 2025-07-27 [1] CRAN (R 4.5.1)
ggfun 0.2.0 2025-07-15 [1] CRAN (R 4.5.1)
ggiraph 0.9.2 2025-10-07 [1] CRAN (R 4.5.1)
ggnewscale * 0.5.2 2025-06-20 [1] CRAN (R 4.5.0)
ggplot2 * 4.0.1 2025-11-14 [1] CRAN (R 4.5.1)
ggplotify 0.1.3 2025-09-20 [1] CRAN (R 4.5.1)
ggrepel * 0.9.6 2024-09-07 [1] CRAN (R 4.5.0)
ggtangle 0.0.9 2025-11-30 [1] CRAN (R 4.5.1)
ggtree 4.0.3 2025-12-18 [1] Bioconductor 3.22 (R 4.5.2)
glue 1.8.0 2024-09-30 [1] CRAN (R 4.5.0)
GO.db 3.22.0 2025-11-21 [1] Bioconductor
GOSemSim 2.36.0 2025-10-29 [1] Bioconductor 3.22 (R 4.5.1)
gridExtra * 2.3 2017-09-09 [1] CRAN (R 4.5.0)
gridGraphics 0.5-1 2020-12-13 [1] CRAN (R 4.5.0)
gson 0.1.0 2023-03-07 [1] CRAN (R 4.5.0)
gtable 0.3.6 2024-10-25 [1] CRAN (R 4.5.0)
hexbin * 1.28.5 2024-11-13 [1] CRAN (R 4.5.0)
htmltools 0.5.9 2025-12-04 [1] CRAN (R 4.5.1)
htmlwidgets 1.6.4 2023-12-06 [1] CRAN (R 4.5.0)
httr 1.4.7 2023-08-15 [1] CRAN (R 4.5.0)
igraph 2.2.1 2025-10-27 [1] CRAN (R 4.5.1)
IRanges * 2.44.0 2025-10-29 [1] Bioconductor 3.22 (R 4.5.1)
jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.5.0)
jsonlite 2.0.0 2025-03-27 [1] CRAN (R 4.5.0)
KEGGREST 1.50.0 2025-10-29 [1] Bioconductor 3.22 (R 4.5.1)
knitr * 1.51 2025-12-20 [1] CRAN (R 4.5.1)
labeling 0.4.3 2023-08-29 [1] CRAN (R 4.5.0)
lattice 0.22-7 2025-04-02 [1] CRAN (R 4.5.1)
lazyeval 0.2.2 2019-03-15 [1] CRAN (R 4.5.0)
lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.5.0)
limma * 3.66.0 2025-10-29 [1] Bioconductor 3.22 (R 4.5.1)
locfit 1.5-9.12 2025-03-05 [1] CRAN (R 4.5.0)
magrittr 2.0.4 2025-09-12 [1] CRAN (R 4.5.1)
MASS 7.3-65 2025-02-28 [1] CRAN (R 4.5.1)
Matrix 1.7-4 2025-08-28 [1] CRAN (R 4.5.1)
MatrixGenerics * 1.22.0 2025-10-29 [1] Bioconductor 3.22 (R 4.5.1)
matrixStats * 1.5.0 2025-01-07 [1] CRAN (R 4.5.0)
memoise 2.0.1 2021-11-26 [1] CRAN (R 4.5.0)
mvtnorm 1.3-3 2025-01-10 [1] CRAN (R 4.5.0)
nlme 3.1-168 2025-03-31 [1] CRAN (R 4.5.1)
numDeriv 2016.8-1.1 2019-06-06 [1] CRAN (R 4.5.0)
org.Hs.eg.db * 3.22.0 2025-11-20 [1] Bioconductor
otel 0.2.0 2025-08-29 [1] CRAN (R 4.5.1)
patchwork 1.3.2 2025-08-25 [1] CRAN (R 4.5.1)
pheatmap * 1.0.13 2025-06-05 [1] CRAN (R 4.5.0)
pillar 1.11.1 2025-09-17 [1] CRAN (R 4.5.1)
pkgbuild 1.4.8 2025-05-26 [1] CRAN (R 4.5.0)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.5.0)
pkgload 1.4.1 2025-09-23 [1] CRAN (R 4.5.1)
plyr 1.8.9 2023-10-02 [1] CRAN (R 4.5.0)
png 0.1-8 2022-11-29 [1] CRAN (R 4.5.0)
polyclip 1.10-7 2024-07-23 [1] CRAN (R 4.5.0)
preprocessCore 1.72.0 2025-10-29 [1] Bioconductor 3.22 (R 4.5.1)
purrr 1.2.0 2025-11-04 [1] CRAN (R 4.5.1)
qvalue 2.42.0 2025-10-29 [1] Bioconductor 3.22 (R 4.5.1)
R.methodsS3 1.8.2 2022-06-13 [1] CRAN (R 4.5.0)
R.oo 1.27.1 2025-05-02 [1] CRAN (R 4.5.0)
R.utils 2.13.0 2025-02-24 [1] CRAN (R 4.5.0)
R6 2.6.1 2025-02-15 [1] CRAN (R 4.5.0)
rappdirs 0.3.3 2021-01-31 [1] CRAN (R 4.5.0)
RColorBrewer * 1.1-3 2022-04-03 [1] CRAN (R 4.5.0)
Rcpp 1.1.0 2025-07-02 [1] CRAN (R 4.5.1)
remotes 2.5.0 2024-03-17 [1] CRAN (R 4.5.0)
repr 1.1.7 2024-03-22 [1] CRAN (R 4.5.0)
reshape2 * 1.4.5 2025-11-12 [1] CRAN (R 4.5.1)
rlang 1.1.6 2025-04-11 [1] CRAN (R 4.5.0)
rmarkdown 2.30 2025-09-28 [1] CRAN (R 4.5.1)
RSQLite 2.4.5 2025-11-30 [1] CRAN (R 4.5.1)
rstudioapi 0.17.1 2024-10-22 [1] CRAN (R 4.5.0)
S4Arrays 1.10.1 2025-12-01 [1] Bioconductor 3.22 (R 4.5.2)
S4Vectors * 0.48.0 2025-10-29 [1] Bioconductor 3.22 (R 4.5.1)
S7 0.2.1 2025-11-14 [1] CRAN (R 4.5.1)
sass 0.4.10 2025-04-11 [1] CRAN (R 4.5.0)
scales * 1.4.0 2025-04-24 [1] CRAN (R 4.5.0)
scatterpie 0.2.6 2025-09-12 [1] CRAN (R 4.5.1)
Seqinfo * 1.0.0 2025-10-29 [1] Bioconductor 3.22 (R 4.5.1)
sessioninfo 1.2.3 2025-02-05 [1] CRAN (R 4.5.0)
skimr * 2.2.2 2026-01-10 [1] CRAN (R 4.5.1)
SparseArray 1.10.8 2025-12-18 [1] Bioconductor 3.22 (R 4.5.2)
statmod 1.5.1 2025-10-09 [1] CRAN (R 4.5.1)
stringi 1.8.7 2025-03-27 [1] CRAN (R 4.5.0)
stringr 1.6.0 2025-11-04 [1] CRAN (R 4.5.1)
SummarizedExperiment * 1.40.0 2025-10-29 [1] Bioconductor 3.22 (R 4.5.1)
systemfonts 1.3.1 2025-10-01 [1] CRAN (R 4.5.1)
tibble 3.3.0 2025-06-08 [1] CRAN (R 4.5.0)
tidydr 0.0.6 2025-07-25 [1] CRAN (R 4.5.1)
tidyr * 1.3.2 2025-12-19 [1] CRAN (R 4.5.1)
tidyselect 1.2.1 2024-03-11 [1] CRAN (R 4.5.0)
tidytree 0.4.6 2023-12-12 [1] CRAN (R 4.5.0)
treeio 1.34.0 2025-10-30 [1] Bioconductor 3.22 (R 4.5.1)
tweenr 2.0.3 2024-02-26 [1] CRAN (R 4.5.0)
usethis 3.2.1 2025-09-06 [1] CRAN (R 4.5.1)
vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.5.0)
vipor 0.4.7 2023-12-18 [1] CRAN (R 4.5.0)
vsn * 3.78.1 2026-01-13 [1] https://bioc-release.r-universe.dev (R 4.5.2)
withr 3.0.2 2024-10-28 [1] CRAN (R 4.5.0)
xfun 0.55 2025-12-16 [1] CRAN (R 4.5.1)
XVector 0.50.0 2025-10-29 [1] Bioconductor 3.22 (R 4.5.1)
yaml 2.3.12 2025-12-10 [1] CRAN (R 4.5.1)
yulab.utils 0.2.3 2025-12-15 [1] CRAN (R 4.5.1)
[1] /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/library
* ── Packages attached to the search path.
──────────────────────────────────────────────────────────────────────────────