Acknowledgement: this report has been inspired by materials used in the Advanced Bioinformatics course taught by Igor Ruiz de los Mozos at UPNA.

1 Data preparation

1.1 Load dataset

After completing the nf-core/rnaseq pipeline using Salmon pseudoalignment for the analysis of 44 colorectal samples from SRA study SRP479528 will obtain the salmon.merged.gene.SummarizedExperiment.rds file. This file contains gene-level quantification matrices (counts, TPMs, effective lengths) plus sample metadata and annotations produced from the merged Salmon quantifications.

Let’s load the salmon.merged.gene.SummarizedExperiment.rds file. The content of the file has a SummarizedExperiment representation, which is an R/Bioconductor container that stores assay matrices (e.g., counts), feature metadata, and sample metadata in one coherent object.

# Specify a working directory
directory <- "/Volumes/DataWinBackup/00_MASTER UOC BIOINFORMATICA - TEMPORAL/TFM-UOC"
setwd(directory)
list.files()
 [1] "dds_DGE_42.rds"                                          
 [2] "DE_tumor_vs_normal_42.csv"                               
 [3] "DE_tumor_vs_normal_44.csv"                               
 [4] "Deseq2_42eocrc"                                          
 [5] "Deseq2_44crc copy.Rmd"                                   
 [6] "Deseq2_44crc_cache"                                      
 [7] "Deseq2_44crc_files"                                      
 [8] "Deseq2_44crc-copy_cache"                                 
 [9] "Deseq2_44crc-copy.Rmd"                                   
[10] "Deseq2_44crc.html"                                       
[11] "Deseq2_44crc.Rmd"                                        
[12] "DESeq2_sig_genes_normal_vs_cancer_44_locrc_annotated.tab"
[13] "DESeq2_sig_genes_normal_vs_cancer_44_locrc.tab"          
[14] "DTU44CRC"                                                
[15] "ego_top15_GO_BP.csv"                                     
[16] "enrichment_map_30genes.pdf"                              
[17] "figures"                                                 
[18] "gene_concept_network_plot_30genes.pdf"                   
[19] "gene.SummarizedExperiment.metadata_44.rds"               
[20] "gene.SummarizedExperiment.metadata.rds"                  
[21] "images"                                                  
[22] "RNAstructure"                                            
[23] "salmon.merged.gene.SummarizedExperiment_gencode.rds"     
[24] "salmon.merged.gene.SummarizedExperiment.rds"             
[25] "salmon.merged.transcript.SummarizedExperiment.rds"       
[26] "sraruntable_srp479528_crc44_metadata.csv"                
[27] "sraruntable_srp479528_crc44.csv"                         
[28] "tables"                                                  
[29] "TFM_schematic_thesis.drawio"                             
[30] "TFM_schematic.drawio"                                    
[31] "TFM_schematic.png"                                       
[32] "tumor_vs_normal_42_dds.csv"                              
# Read 'se' object obtained from nfcore/rnaseq
se <- readRDS("salmon.merged.gene.SummarizedExperiment_gencode.rds")

# Display se object
print(se)
class: SummarizedExperiment 
dim: 78428 44 
metadata(0):
assays(5): counts counts_length_scaled counts_scaled lengths tpm
rownames(78428): ENSG00000000003.17 ENSG00000000005.6 ...
  ENSG00000310592.1 ENSG00000310593.1
rowData names(3): transcript_id gene_id gene_name
colnames(44): srr27320655 srr27320656 ... srr27320697 srr27320698
colData names(4): sample fastq_1 fastq_2 strandedness

We focus on the counts assay matrix and check information on columns and rows.

# Check the first elements of the counts matrix
head(assay(se, "counts")[, 1:2], 5)
                   srr27320655 srr27320656
ENSG00000000003.17    1397.945    6337.771
ENSG00000000005.6        6.000      79.000
ENSG00000000419.15     633.000    4413.999
ENSG00000000457.15     434.062     741.001
ENSG00000000460.18      88.000     662.001
# Collect sample information
head(colData(se)[, 2:3], 5)
DataFrame with 5 rows and 2 columns
                           fastq_1                fastq_2
                       <character>            <character>
srr27320655 data/reads/SRR273206.. data/reads/SRR273206..
srr27320656 data/reads/SRR273206.. data/reads/SRR273206..
srr27320657 data/reads/SRR273206.. data/reads/SRR273206..
srr27320658 data/reads/SRR273206.. data/reads/SRR273206..
srr27320659 data/reads/SRR273206.. data/reads/SRR273206..
# Collect gene information
head(rowData(se), 3)
DataFrame with 3 rows and 3 columns
                            transcript_id            gene_id   gene_name
                              <character>        <character> <character>
ENSG00000000003.17 ENST00000373020.9,EN.. ENSG00000000003.17      TSPAN6
ENSG00000000005.6  ENST00000373031.5,EN..  ENSG00000000005.6        TNMD
ENSG00000000419.15 ENST00000858753.1,EN.. ENSG00000000419.15        DPM1
# Collect genomic ranges
head(rowRanges(se), 3)
NULL

Below, we can obtain the name of the 44 samples.

# Display colNames
colData(se)$sample
 [1] "srr27320655" "srr27320656" "srr27320657" "srr27320658" "srr27320659"
 [6] "srr27320660" "srr27320661" "srr27320662" "srr27320663" "srr27320664"
[11] "srr27320665" "srr27320666" "srr27320667" "srr27320668" "srr27320669"
[16] "srr27320670" "srr27320671" "srr27320672" "srr27320673" "srr27320674"
[21] "srr27320675" "srr27320676" "srr27320677" "srr27320678" "srr27320679"
[26] "srr27320680" "srr27320681" "srr27320682" "srr27320683" "srr27320684"
[31] "srr27320685" "srr27320686" "srr27320687" "srr27320688" "srr27320689"
[36] "srr27320690" "srr27320691" "srr27320692" "srr27320693" "srr27320694"
[41] "srr27320695" "srr27320696" "srr27320697" "srr27320698"

Since the colData field in the se object is limited to data from the original FASTQ files, it would be useful to supplement it with metadata from the SRA study, such as patient age and tissue type.

1.2 Preparation of .csv files

In this section we describe the steps to add metadata from the SRA study into the colData field of the SummarizedExperiment object obtained from the nf-core/rnaseq pipeline:

  • Go to SRA study SRP479528
  • Download metadata of 44 runs
  • Open the downloaded SraRunTable.csv file in Google Sheets
  • Convert the spreadsheet into a table
  • Order data by run number SRR27320655, SRR27320656, SRR27320657…
  • Download it as .csv file and rename it sraruntable_srp479528_crc44.csv

Here are the first lines of the sraruntable_srp479528_crc44.csv file.

Run,age_at_surgery,tissue,tumor_stage,Sample Name,SRA Study,Assay Type,AvgSpotLen,Bases,BioProject,BioSample,Bytes,Center Name,Collection_Date,Consent,DATASTORE filetype,DATASTORE provider,DATASTORE region,Experiment,geo_loc_name_country,geo_loc_name_country_continent,geo_loc_name,Instrument,Library Name,LibraryLayout,LibrarySelection,LibrarySource,Organism,Platform,ReleaseDate,create_date,version,Column 1,source_name
SRR27320655,88.75,Rectum,NA,GSM7989032,SRP479528,RNA-Seq,117,2304341839,PRJNA1055547,SAMN39058850,870996594,"BIOCHEMISTRY, COLORECTAL SURGERY, PENN STATE COLLEGE OF MEDICINE",missing,public,"fastq,run.zq,sra","ncbi,gs,s3","s3.us-east-1,ncbi.public,gs.us-east1",SRX22997924,uncalculated,uncalculated,missing,Illumina NovaSeq 6000,GSM7989032,PAIRED,cDNA,TRANSCRIPTOMIC,Homo sapiens,ILLUMINA,2024-04-11T00:00:00Z,2023-12-21T21:33:00Z,1,,Rectum

As follows, a metadata .csv file was created using information from BioProject PRJNA1055547. Here are the first lines of the created sraruntable_srp479528_crc44_metadata.csv file.

key,values
abstract,"The incidence of colorectal cancer (CRC) has been steadily increasing in younger individuals over the past several decades for reasons that are incompletely defined..."
Accession,PRJNA1055547; GEO: GSE251845
Data Type,Transcriptome or Gene expression

1.3 Update SummarizedExperiment object

Let’s add metadata contained in the previously created sraruntable_srp479528_crc44.csv and sraruntable_srp479528_crc44_metadata.csv files into the se object.

After loading the .csv files, here we can see the samples metadata represented as a dataframe named df_coldata:

# Load files
df_coldata <- read.csv("sraruntable_srp479528_crc44.csv", header = TRUE, row.names = 1,
    check.names = FALSE)

# Display first rows
head(df_coldata[, 1:5], n = 3)
            age_at_surgery          tissue tumor_stage Sample Name SRA Study
SRR27320655          88.75          Rectum          NA  GSM7989032 SRP479528
SRR27320656          88.75          Rectum           1  GSM7989031 SRP479528
SRR27320657          68.00 Hepatic flexure          NA  GSM7989030 SRP479528

and below is the metadata corresponding to the SRA study represented as a dataframe named df_metadata with two columns: key and values.

# Load files
df_metadata <- read.csv("sraruntable_srp479528_crc44_metadata.csv", header = TRUE,
    check.names = FALSE)

# Display first rows
head(df_metadata, n = 3)
        key
1  abstract
2 Accession
3 Data Type
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      values
1 The incidence of colorectal cancer (CRC) has been steadily increasing in younger individuals over the past several decades for reasons that are incompletely defined. Identifying differences in gene expression profiles, or transcriptomes, in early-onset colorectal cancer (EOCRC, < 50 years old) patients versus later-onset colorectal cancer (LOCRC, > 50 years old) patients is one approach to understanding molecular and genetic features that distinguish EOCRC. Methods: We performed RNA-sequencing (RNA-seq) to characterize the transcriptomes of patient-matched tumors and adjacent, uninvolved (normal) colonic segments from EOCRC (n=21) and LOCRC (n=22) patients. The EOCRC and LOCRC cohorts were matched for demographic and clinical characteristics. We used The Cancer Genome Atlas Colon Adenocarcinoma (TCGA-COAD) database for validation. We used a series of computational and bioinformatic tools to identify EOCRC-specific differentially expressed genes, molecular pathways, predicted cell populations, differential gene splicing events, and predicted neoantigens. Results: We identified an eight-gene signature in EOCRC comprised of ALDOB, FBXL16, IL1RN, MSLN, RAC3, SLC38A11, WBSCR27 and WNT11, from which we developed a score predictive of overall CRC patient survival. On the entire set of genes identified in normal tissues and tumors, cell type deconvolution analysis predicted a differential abundance of immune and non-immune populations in EOCRC versus LOCRC. Gene set enrichment analysis identified increased expression of splicing machinery in EOCRC. We further found differences in alternative splicing (AS) events, including one within the long non-coding RNA, HOTAIRM1. Additional analysis of AS found seven events specific to EOCRC that encode potential neoantigens. Conclusion: Our transcriptome analyses identified genetic and molecular features specific to EOCRC which may inform future screening, development of prognostic indicators, and novel drug targets. Overall design: Gene expression profiles of 22 surgically resected tumors and patient-matched adjacent colonic segments from colorectal cancer patients were generated with RNA-sequencing.
2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               PRJNA1055547; GEO: GSE251845
3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           Transcriptome or Gene expression

The lower_underscore_nospaces() function will help us standardize the format of character-type data.

# Define the lower_underscore_nospaces() function
lower_underscore_nospaces <- function(df) {
    non_numeric_col <- names(df)[!sapply(df, is.numeric)]
    for (col in non_numeric_col) {
        df[[col]] <- tolower(df[[col]])  # lowercase
        df[[col]] <- gsub("[^a-z0-9]+", "_", df[[col]])  # replace non-alphanumeric with underscores
        df[[col]] <- gsub("_$", "", df[[col]])  # remove trailing underscores
    }
    if (!is.null(names(df))) {
        names(df) <- tolower(names(df))
        names(df) <- gsub("[^a-z0-9]+", "_", names(df))
        names(df) <- gsub("_$", "", names(df))
    }
    if (!is.null(rownames(df))) {
        rownames(df) <- tolower(rownames(df))
        rownames(df) <- gsub("[^a-z0-9]+", "_", rownames(df))
        rownames(df) <- gsub("_$", "", rownames(df))
    }
    return(df)
}

# Apply lower_underscore_nospaces to all datasets
df_coldata <- lower_underscore_nospaces(df_coldata)

# Display last changes on dataframes
head(df_coldata[, 1:5], n = 3)
            age_at_surgery          tissue tumor_stage sample_name sra_study
srr27320655          88.75          rectum          NA  gsm7989032 srp479528
srr27320656          88.75          rectum           1  gsm7989031 srp479528
srr27320657          68.00 hepatic_flexure          NA  gsm7989030 srp479528

Since the metadata(se) field in a SummarizedExperiment object stores data as a named list, let’s convert df_metadata to this format.

# Convert metadata dataframe into a named list
list_metadata <- as.list(df_metadata$values)
names(list_metadata) <- df_metadata$key
list_metadata
$abstract
[1] "The incidence of colorectal cancer (CRC) has been steadily increasing in younger individuals over the past several decades for reasons that are incompletely defined. Identifying differences in gene expression profiles, or transcriptomes, in early-onset colorectal cancer (EOCRC, < 50 years old) patients versus later-onset colorectal cancer (LOCRC, > 50 years old) patients is one approach to understanding molecular and genetic features that distinguish EOCRC. Methods: We performed RNA-sequencing (RNA-seq) to characterize the transcriptomes of patient-matched tumors and adjacent, uninvolved (normal) colonic segments from EOCRC (n=21) and LOCRC (n=22) patients. The EOCRC and LOCRC cohorts were matched for demographic and clinical characteristics. We used The Cancer Genome Atlas Colon Adenocarcinoma (TCGA-COAD) database for validation. We used a series of computational and bioinformatic tools to identify EOCRC-specific differentially expressed genes, molecular pathways, predicted cell populations, differential gene splicing events, and predicted neoantigens. Results: We identified an eight-gene signature in EOCRC comprised of ALDOB, FBXL16, IL1RN, MSLN, RAC3, SLC38A11, WBSCR27 and WNT11, from which we developed a score predictive of overall CRC patient survival. On the entire set of genes identified in normal tissues and tumors, cell type deconvolution analysis predicted a differential abundance of immune and non-immune populations in EOCRC versus LOCRC. Gene set enrichment analysis identified increased expression of splicing machinery in EOCRC. We further found differences in alternative splicing (AS) events, including one within the long non-coding RNA, HOTAIRM1. Additional analysis of AS found seven events specific to EOCRC that encode potential neoantigens. Conclusion: Our transcriptome analyses identified genetic and molecular features specific to EOCRC which may inform future screening, development of prognostic indicators, and novel drug targets. Overall design: Gene expression profiles of 22 surgically resected tumors and patient-matched adjacent colonic segments from colorectal cancer patients were generated with RNA-sequencing."

$Accession
[1] "PRJNA1055547; GEO: GSE251845"

$`Data Type`
[1] "Transcriptome or Gene expression"

$Scope
[1] "Multiisolate"

$Organism
[1] "Homo sapiens[Taxonomy ID: 9606]Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae; Homo; Homo sapiens"

$Publications
[1] "Marx OM et al., \"Identification of differentially expressed genes and splicing events in early-onset colorectal cancer.\", Front Oncol, 2024;14:1365762"

$Grants
[1] "\"Wnt/beta-catenin signaling in early-onset colorectal cancer\" (Grant ID R03 CA279861, National Cancer Institute)"

$Submission
[1] "Registration date: 21-Dec-2023\nBiochemistry, Colorectal Surgery, Penn State College of Medicine"

$Relevance
[1] "Medical"

Now let’s convert R regular dataframes into Bioconductor S4 DataFrame objects.

# Convert regular R dataframes to Bioconductor S4 DataFrame objects
colData <- DataFrame(df_coldata)

# Visualize data converted to S4 DataFrame S4 type
print(colData)
DataFrame with 44 rows and 32 columns
            age_at_surgery          tissue tumor_stage sample_name   sra_study
                 <numeric>     <character>   <integer> <character> <character>
             assay_type avgspotlen      bases   bioproject    biosample
            <character>  <integer>  <numeric>  <character>  <character>
                 bytes            center_name collection_date     consent
             <integer>            <character>     <character> <character>
            datastore_filetype datastore_provider       datastore_region
                   <character>        <character>            <character>
             experiment geo_loc_name_country geo_loc_name_country_continent
            <character>          <character>                    <character>
            geo_loc_name            instrument library_name librarylayout
             <character>           <character>  <character>   <character>
            libraryselection  librarysource     organism    platform
                 <character>    <character>  <character> <character>
                     releasedate          create_date   version     source_name
                     <character>          <character> <integer>     <character>
 [ reached 'max' / getOption("max.print") -- omitted 11 rows ]

and verify that their dimensions match.

# Check dimmensions of each dataset
cat("Dimensions of counts assay:", dim(assay(se, "counts")), "\n", "Dimensions of colData:",
    dim(colData), "\n", "Dimensions of metadata:", length(list_metadata), "\n")
Dimensions of counts assay: 78428 44 
 Dimensions of colData: 44 32 
 Dimensions of metadata: 9 

Finally, let’s update the SummarizedExperiment object to include information on colData and metadata fields.

# Update SE object
colData(se) <- colData

# Add metadata
metadata(se) <- list_metadata
se
class: SummarizedExperiment 
dim: 78428 44 
metadata(9): abstract Accession ... Submission Relevance
assays(5): counts counts_length_scaled counts_scaled lengths tpm
rownames(78428): ENSG00000000003.17 ENSG00000000005.6 ...
  ENSG00000310592.1 ENSG00000310593.1
rowData names(3): transcript_id gene_id gene_name
colnames(44): srr27320655 srr27320656 ... srr27320697 srr27320698
colData names(32): age_at_surgery tissue ... version source_name

We’re on the right track. To move on, let’s include a column named condition, which will convert the tumor_stage column of colData into a binary factor variable with levels normal and tumor. A good practice in R is that the first level be the reference one (e.g. normal, control, or untreated samples). This will come handy when conducting the differential expression analysis with the volcano plots.

# Create new colData column named condition
colData(se)$condition <- ifelse(is.na(colData(se)$tumor_stage), "normal_locrc", "tumor_locrc")
colData(se)$condition <- as.factor(colData(se)$condition)
colData(se)$condition <- relevel(colData(se)$condition, ref = "normal_locrc")
colData(se)$condition
 [1] normal_locrc tumor_locrc  normal_locrc tumor_locrc  normal_locrc
 [6] tumor_locrc  normal_locrc tumor_locrc  normal_locrc tumor_locrc 
[11] normal_locrc tumor_locrc  normal_locrc tumor_locrc  normal_locrc
[16] tumor_locrc  normal_locrc tumor_locrc  normal_locrc tumor_locrc 
[21] normal_locrc tumor_locrc  normal_locrc tumor_locrc  normal_locrc
[26] tumor_locrc  normal_locrc tumor_locrc  normal_locrc tumor_locrc 
[31] normal_locrc tumor_locrc  normal_locrc tumor_locrc  normal_locrc
[36] tumor_locrc  normal_locrc tumor_locrc  normal_locrc tumor_locrc 
[41] normal_locrc tumor_locrc  normal_locrc tumor_locrc 
Levels: normal_locrc tumor_locrc

We can now save the se object for reuse, eliminating the need to rerun this section.

# Save the updated `se` object to a new file
saveRDS(se, file = "gene.SummarizedExperiment.metadata_44.rds")

1.4 Create a DESeqDataSet data object

A DESeqDataSet (dds) object encapsulates all components needed for DE analysis. It ensures statistical correctness and handles data integrity by combining:

  • Count data (from assays)
  • Sample metadata (from colData)
  • Experimental design (formula)
  • Gene metadata (from rowData)

The first step to convert our se into DESeqDataSet is to ensure that count data from the assay be a matrix data type. Currently, assay(se) is dataframe class containing lists.

# Check class and mode of each assay in se
sapply(assays(se), class)
              counts counts_length_scaled        counts_scaled 
        "data.frame"         "data.frame"         "data.frame" 
             lengths                  tpm 
        "data.frame"         "data.frame" 
sapply(assays(se), mode)
              counts counts_length_scaled        counts_scaled 
              "list"               "list"               "list" 
             lengths                  tpm 
              "list"               "list" 

Let’s convert assay(se, "counts") into a numeric matrix and round count data.

# Extract the counts assay as a matrix
counts_mat <- as.matrix(assay(se, "counts"))
assay(se, "counts") <- round(counts_mat)

# Keep only the 'counts' assay
assays(se) <- assays(se)["counts"]

# Check class and mode of each assay in se
sapply(assays(se), class)
     counts  
[1,] "matrix"
[2,] "array" 
sapply(assays(se), mode)
   counts 
"numeric" 

Now, we’ll remove genes with less than 10 reads on all samples. This practice will save computational resources when conducting the DE analysis. It is preferable to run this task before converting from se to a DESeqDataSet due to memory constraints.

gene_count_10plus <- rowSums(assay(se)) > 10
se_filtered <- se[gene_count_10plus, ]
se_filtered
class: SummarizedExperiment 
dim: 37241 44 
metadata(9): abstract Accession ... Submission Relevance
assays(1): counts
rownames(37241): ENSG00000000003.17 ENSG00000000005.6 ...
  ENSG00000310592.1 ENSG00000310593.1
rowData names(3): transcript_id gene_id gene_name
colnames(44): srr27320655 srr27320656 ... srr27320697 srr27320698
colData names(33): age_at_surgery tissue ... source_name condition

Finally, we construct DESeqDataSet considering condition (normal vs. tumor) as part of our experimental design.

# Create a DESeqDataSet data object
deseq2_raw <- DESeqDataSet(se_filtered, design = ~1)

# Display first lines of DESeqDataSet data object
class(deseq2_raw)
[1] "DESeqDataSet"
attr(,"package")
[1] "DESeq2"
head(counts(deseq2_raw)[, 1:5], n = 3)
                   srr27320655 srr27320656 srr27320657 srr27320658 srr27320659
ENSG00000000003.17        1398        6338        1824        1224         944
ENSG00000000005.6            6          79          38           3         299
ENSG00000000419.15         633        4414         916        1017         637

In essence, we have the same data but now it is a DESeqDataSet data object.

2 Exploratory Data Analyisis (EDA)

In this section, we explore in more detail the content of the DESeqDataSet data object.

# Display DESeqDataSet data object
deseq2_raw
class: DESeqDataSet 
dim: 37241 44 
metadata(10): abstract Accession ... Relevance version
assays(1): counts
rownames(37241): ENSG00000000003.17 ENSG00000000005.6 ...
  ENSG00000310592.1 ENSG00000310593.1
rowData names(3): transcript_id gene_id gene_name
colnames(44): srr27320655 srr27320656 ... srr27320697 srr27320698
colData names(33): age_at_surgery tissue ... source_name condition
# Show first lines
head(counts(deseq2_raw)[, 1:5], n = 3)
                   srr27320655 srr27320656 srr27320657 srr27320658 srr27320659
ENSG00000000003.17        1398        6338        1824        1224         944
ENSG00000000005.6            6          79          38           3         299
ENSG00000000419.15         633        4414         916        1017         637
# Validate NAs
X <- counts(deseq2_raw)
X <- as.data.frame(X)  # convert to data.frame 
print(table(is.na(X)))

  FALSE 
1638604 

The dataset comprises 37,241 genes and 44 samples, for a total of 1,638,604 read counts, and contains no missing values (NAs).

2.1 Normalization by sequencing depth

To compare raw counts across 44 samples in a DESeq2 analysis, we need to account for library size differences. This is where normalization by sequencing depth comes handy to correct for differences in total read counts per sample. This is the baseline normalization for differential expression testing.

Let’s calculate the size factor per sample.

# Display raw counts
head(counts(deseq2_raw)[, 1:5], 5)
                   srr27320655 srr27320656 srr27320657 srr27320658 srr27320659
ENSG00000000003.17        1398        6338        1824        1224         944
ENSG00000000005.6            6          79          38           3         299
ENSG00000000419.15         633        4414         916        1017         637
ENSG00000000457.15         434         741         604         553         274
ENSG00000000460.18          88         662         108         381          47
# Sum all counts per sample counts(deseq2_raw) is equivalent to assay(se,
# 'counts') from SummarizedExperiment print(colSums(counts(deseq2_raw))[1:5])

# Calculate the size factor and add it to the dataset It normalizes for
# sequencing depth differences to compare expression between samples fairly
# DESeq2 will automatically account for these depth differences in all
# downstream analyses using these size factors.
deseq2_raw <- estimateSizeFactors(deseq2_raw)
print(sizeFactors(deseq2_raw)[1:5])
srr27320655 srr27320656 srr27320657 srr27320658 srr27320659 
  0.6272915   1.2203391   1.1607579   1.0376846   0.7745416 

Remember that,

  • Size factor = 1.0: Average sequencing depth
  • Size factor < 1.0: Lower than average sequencing depth
  • Size factor > 1.0: Higher than average sequencing depth

The sizeFactor column was added to the coldData field of the DESeqDataSet data object.

# Check that the sizeFactor column was added to colData
deseq2_raw
class: DESeqDataSet 
dim: 37241 44 
metadata(10): abstract Accession ... Relevance version
assays(1): counts
rownames(37241): ENSG00000000003.17 ENSG00000000005.6 ...
  ENSG00000310592.1 ENSG00000310593.1
rowData names(3): transcript_id gene_id gene_name
colnames(44): srr27320655 srr27320656 ... srr27320697 srr27320698
colData names(34): age_at_surgery tissue ... condition sizeFactor

Now, all raw counts across 44 samples have accounted for library size differences.

# Now take a look at the sum of the total depth after normalization
# print(colSums(counts(deseq2_raw, normalized=TRUE))[1:5])

# To retrieve normalized read counts
counts_normalized <- counts(deseq2_raw, normalized = TRUE)

# Display normalized counts
head(counts_normalized[, 1:5], 5)
                   srr27320655 srr27320656 srr27320657 srr27320658 srr27320659
ENSG00000000003.17 2228.629003  5193.63847  1571.38708 1179.549163  1218.78541
ENSG00000000005.6     9.564931    64.73611    32.73723    2.891052   386.03479
ENSG00000000419.15 1009.100257  3617.02749   789.13957  980.066584   822.42194
ENSG00000000457.15  691.863367   607.20828   520.34967  532.917228   353.75763
ENSG00000000460.18  140.285660   542.47218    93.04266  367.163588    60.68105

2.2 Data transformation

For RNA-seq counts, the expected variance grows with the mean, making low-count genes contribute excessive noise or making high-count genes dominate. Therefore, it is required to implement variance stabilization so that each gene contributes equally to overall variance patterns making sample relationships and biological signals easier to detect.

2.2.1 Log2 transformation

It consists of a raw counts transformation using the formula log2(count + 1). It is useful for reducing the effect of large counts making distributions more symmetric. Use this transformation for a quick comparisons across samples and conducting exploratory visualization with PCA and heatmaps.

2.2.2 Variance Stabilizing Transformation (VST) and Regularized-logarithm Transformation (rlog)

DESeq2 offers two transformations to stabilize variance and thus transform data to become approximately homoskedastic (to see a flat trend in the meanSdPlot): the variance stabilizing transformation (VST) for negative binomial data with a dispersion-mean trend (Anders and Huber 2010), and the regularized-logarithm transformation or rlog (Love, Huber, and Anders 2014).

“VST is much faster to compute and is less sensitive to high count outliers than the rlog. Also, VST is recommended for medium-to-large datasets (n > 30). rlog tends to work well on small datasets (n < 30), potentially outperforming VST when there is a wide range of sequencing depth across samples (an order of magnitude difference.)”

# Implement VST
counts_vst <- vst(deseq2_raw, blind = TRUE)
# counts_vst is a DESeq2 object type head(assay(counts_vst), 3)

# Implement rlog
counts_rlog <- rlog(deseq2_raw, blind = TRUE)
# counts_rlog is a DESeq2 object type head(assay(counts_rlog), 3)

2.3 PCA & Hierarchical Clustering

We’ll continue an Exploratory Data Analysis (EDA) considering variance-stabilized-transformed counts.

# Plot PCA on condition type plotPCA(counts_vst) plotPCA(counts_vst,
# intgroup=c('tissue','condition'))

# Plot PCA on tissue type plotPCA(counts_vst, intgroup='tissue')

# Plot PCA on both condition and tissue type plotPCA(counts_vst,
# intgroup=c('condition', 'tissue'))
Importance of components:
                           PC1      PC2      PC3      PC4      PC5      PC6
Standard deviation     84.2860 55.65071 49.79084 41.97246 36.31274 34.32710
                            PC7      PC8      PC9     PC10     PC11     PC12
Standard deviation     32.62438 30.56495 30.29262 30.06734 27.85560 27.13645
                           PC13     PC14    PC15     PC16     PC17     PC18
Standard deviation     26.33277 26.23311 25.6769 25.48967 24.39372 24.26263
                           PC19     PC20    PC21    PC22     PC23    PC24
Standard deviation     24.01217 23.82649 23.4009 23.3180 22.82388 22.6720
                           PC25     PC26     PC27     PC28     PC29     PC30
Standard deviation     22.51250 22.18107 21.92296 21.76032 21.53951 21.10730
                           PC31     PC32     PC33     PC34     PC35     PC36
Standard deviation     20.83664 20.73212 20.20542 20.08440 19.91107 19.76794
                           PC37     PC38     PC39     PC40     PC41     PC42
Standard deviation     19.59697 19.25672 19.04298 18.87399 18.77049 18.60127
                           PC43      PC44
Standard deviation     18.09094 2.234e-13
 [ reached 'max' / getOption("max.print") -- omitted 2 rows ]

PC1 and PC2 components explains approximately 27.4% of the total variability. We also identified 3 potential outliers from both conditions, which corresponds to rectum samples.

The Mahalanobis distance (MD) can identify outliers beyond the 99.9th percentile. According to this criterion, only the srr27320659 sample appears to be a potential outlier. However, we decided not to remove it since its paired sample, srr27320660, does not represent an outlier for the tumor_locrc condition.

# Detect outliers in PCA
pc_scores <- pca$x[, 1:5]  # Use first PCs
d <- mahalanobis(pc_scores, colMeans(pc_scores), cov(pc_scores))

# Set threshold at 99.9 percentile
thr <- quantile(d, 0.999)

# Enumerate outliers
outliers <- names(d[d > thr])
outliers
[1] "srr27320659"

Before conducting the DESeq2 analysis, let’s build the design matrix and evaluate which variables are potential confounded ones (condition, tissue, age_at_surgery).

Since each tissue has distinct transcriptional profiles and the normal vs tumor comparison is made within the same tissue, tissue is a potential confounded variable.

3 Differential gene expression with DESeq2

Now after the data exploration we can run the differential expression pipeline on the raw counts with a single call to the function DESeq. The null hypothesis is that there is no systematic difference between the average read count values of the different conditions for a given gene. We will calculate the fold change of read counts, assuming de differences in sequencing depth and variability. We will test normal tissue versus cancerous tissue (normal used as denominator for the fold change calculation)

# Run DGE DESeq2 pipeline Display se_filtered se_filtered

# Create a DESeqDataSet data object
dds <- DESeqDataSet(se_filtered, design = ~condition)

# Account for sequencing depth differences to compare expression between
# samples fairly
dds <- estimateSizeFactors(dds)

# Run DGE DESeq2 pipeline
dds_DGE <- DESeq(dds)
dds_DGE
class: DESeqDataSet 
dim: 37241 44 
metadata(10): abstract Accession ... Relevance version
assays(6): counts mu ... replaceCounts replaceCooks
rownames(37241): ENSG00000000003.17 ENSG00000000005.6 ...
  ENSG00000310592.1 ENSG00000310593.1
rowData names(26): transcript_id gene_id ... maxCooks replace
colnames(44): srr27320655 srr27320656 ... srr27320697 srr27320698
colData names(35): age_at_surgery tissue ... sizeFactor replaceable

We proceed to extract results obtained from the DESeq2 pipeline to obtain the base means across samples, the log2 fold variation, standard errors, p-value and p-adjusted.

# Get results from the DESeq2 pipeline
dds_DGE_results <- results(dds_DGE)
head(dds_DGE_results, n = 5)
log2 fold change (MLE): condition tumor locrc vs normal locrc 
Wald test p-value: condition tumor locrc vs normal locrc 
DataFrame with 5 rows and 6 columns
                    baseMean log2FoldChange     lfcSE      stat      pvalue
                   <numeric>      <numeric> <numeric> <numeric>   <numeric>
ENSG00000000003.17 3583.2476       0.534997 0.1701837   3.14364 1.66858e-03
ENSG00000000005.6    36.6452       0.444771 0.4429624   1.00408 3.15339e-01
ENSG00000000419.15 1944.3084       1.410766 0.2075048   6.79872 1.05556e-11
ENSG00000000457.15  564.2597      -0.342297 0.0938915  -3.64567 2.66701e-04
ENSG00000000460.18  259.4372       1.826449 0.1806765  10.10895 5.04227e-24
                          padj
                     <numeric>
ENSG00000000003.17 4.82460e-03
ENSG00000000005.6  4.31815e-01
ENSG00000000419.15 1.19006e-10
ENSG00000000457.15 9.23339e-04
ENSG00000000460.18 3.53990e-22
# Inspect the meaning of the columns from the results object
mcols(dds_DGE_results, use.names = TRUE)
DataFrame with 6 rows and 2 columns
                       type            description
                <character>            <character>
baseMean       intermediate mean of normalized c..
log2FoldChange      results log2 fold change (ML..
lfcSE               results standard error: cond..
stat                results Wald statistic: cond..
pvalue              results Wald test p-value: c..
padj                results   BH adjusted p-values

The results object contains the following features: baseMean, log2FoldChange, lfcSE, stat, pvalue, and padj. Let’s get a summary of the results object. Now, let’s obtain a summary of the results object.

# Summary of DGE
summary(dds_DGE_results)

out of 37216 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 9180, 25%
LFC < 0 (down)     : 8837, 24%
outliers [1]       : 0, 0%
low counts [2]     : 2911, 7.8%
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

After filtering those genes with p-adjusted < 0.05, we obtained 16.013 significant genes.

# Filter genes with p-adjusted value < 0.05 The results object behaves like a
# dataframe
table(dds_DGE_results$padj < 0.05)

FALSE  TRUE 
18317 16013 

and when we order those significant genes by p-adjusted value:

# Select significant genes with a p-adjusted value lower than 0.05
resSig <- subset(dds_DGE_results, padj < 0.05)

# Order significant genes by p-adjusted value
resSig <- resSig[order(resSig$padj), ]

head(resSig, n = 5)
log2 fold change (MLE): condition tumor locrc vs normal locrc 
Wald test p-value: condition tumor locrc vs normal locrc 
DataFrame with 5 rows and 6 columns
                    baseMean log2FoldChange     lfcSE      stat       pvalue
                   <numeric>      <numeric> <numeric> <numeric>    <numeric>
ENSG00000175832.14   777.653        5.90807  0.272208   21.7043 1.86985e-104
ENSG00000164379.7    554.643        6.81300  0.378529   17.9986  1.99725e-72
ENSG00000173898.17   335.155        3.84378  0.224648   17.1103  1.24430e-65
ENSG00000287828.2    117.882        3.74329  0.230517   16.2387  2.68670e-59
ENSG00000111110.13   475.617        2.84100  0.179182   15.8554  1.28977e-56
                           padj
                      <numeric>
ENSG00000175832.14 6.41918e-100
ENSG00000164379.7   3.42828e-68
ENSG00000173898.17  1.42389e-61
ENSG00000287828.2   2.30586e-55
ENSG00000111110.13  8.85559e-53

Let’s check those genes with the strongest down-regulation and up-regulation, respectively.

# Order significant genes with the strongest down-regulation
head(resSig[order(resSig$log2FoldChange), ])
log2 fold change (MLE): condition tumor locrc vs normal locrc 
Wald test p-value: condition tumor locrc vs normal locrc 
DataFrame with 6 rows and 6 columns
                    baseMean log2FoldChange     lfcSE      stat      pvalue
                   <numeric>      <numeric> <numeric> <numeric>   <numeric>
ENSG00000181092.11  37.66107       -8.31119  0.921877  -9.01551 1.95963e-19
ENSG00000034971.18  41.97757       -8.07260  0.646220 -12.49204 8.25098e-36
ENSG00000167916.5   26.97167       -6.97326  0.764260  -9.12420 7.22703e-20
ENSG00000286765.2   17.45882       -6.93208  1.268940  -5.46289 4.68436e-08
ENSG00000241158.7   43.04531       -6.81127  0.723505  -9.41427 4.76401e-21
ENSG00000298053.1    7.38361       -6.48845  0.732932  -8.85272 8.54101e-19
                          padj
                     <numeric>
ENSG00000181092.11 6.81602e-18
ENSG00000034971.18 3.88022e-33
ENSG00000167916.5  2.67788e-18
ENSG00000286765.2  3.09496e-07
ENSG00000241158.7  2.12401e-19
ENSG00000298053.1  2.67530e-17
# Order significant genes with the strongest up-regulation
head(resSig[order(resSig$log2FoldChange, decreasing = TRUE), ])
log2 fold change (MLE): condition tumor locrc vs normal locrc 
Wald test p-value: condition tumor locrc vs normal locrc 
DataFrame with 6 rows and 6 columns
                    baseMean log2FoldChange     lfcSE      stat      pvalue
                   <numeric>      <numeric> <numeric> <numeric>   <numeric>
ENSG00000298768.1    65.6359        8.92971  0.826426  10.80521 3.25199e-27
ENSG00000185269.13  222.6813        8.72954  0.647885  13.47390 2.22783e-41
ENSG00000260573.5    48.3050        8.62087  0.801541  10.75537 5.59091e-27
ENSG00000230316.8    38.2773        8.41513  0.973221   8.64667 5.30218e-18
ENSG00000214039.11  301.1991        8.23944  0.561701  14.66873 1.02235e-48
ENSG00000167755.16  159.7259        8.20499  0.846935   9.68786 3.39555e-22
                          padj
                     <numeric>
ENSG00000298768.1  3.75895e-25
ENSG00000185269.13 2.83264e-38
ENSG00000260573.5  6.13214e-25
ENSG00000230316.8  1.47507e-16
ENSG00000214039.11 3.19066e-45
ENSG00000167755.16 1.81008e-20
# Save significant DGE results to file
write.table(resSig, file = "DESeq2_sig_genes_normal_vs_cancer_44_locrc.tab", sep = "\t",
    quote = FALSE, row.names = FALSE)

3.0.1 Visualization of results

3.0.1.1 Individual gene count plot

To visualize the counts for ENSG00000175832.14 gene, which showed the lowest p-adjusted value of 6.41918e-100, we’ll use the plotCounts function over the feature condition to get the normalized counts for this gene.

3.0.1.2 Histogram of p-value frecuencies

3.0.1.3 MA-plot

*An MA-plot (Dudoit et al. 2002) provides a useful overview for the distribution of the estimated coefficients in the model, e.g. the comparisons of interest, across all genes. On the y-axis, the “M” stands for “minus” – subtraction of log values is equivalent to the log of the ratio – and on the x-axis, the “A” stands for “average”. This plot allows us to evaluate fold changes and the distribution around the mean expression.

Before making the MA-plot, we use the lfcShrink function to shrink the log2 fold changes for the comparison of dex treated vs untreated samples. There are three types of shrinkage estimators in DESeq2, which are covered in the DESeq2 vignette. log2 fold shrinkage remove the noise preserving the large differences.*

3.0.1.4 Volcano plot

Volcano plot helps to visualize significant differential expressed genes showing the log2 fold change and the significance (p-adjusted).

To export a .tex table having the top significantly differentiated genes of the LOCRC cohort

                     Gene baseMean log2FoldChange      padj
ENSG00000175832      ETV4    777.7          5.908 6.42e-100
ENSG00000164379     FOXQ1    554.6          6.813  3.43e-68
ENSG00000173898    SPTBN2    335.2          3.844  1.42e-61
ENSG00000287828      <NA>    117.9          3.743  2.31e-55
ENSG00000111110     PPM1H    475.6          2.841  8.86e-53
ENSG00000165816      VWA2    447.5          4.612  4.59e-49
ENSG00000164283      ESM1     72.8          5.750  2.78e-48
ENSG00000015413     DPEP1   1834.3          6.596  1.12e-47
ENSG00000120708     TGFBI   8131.2          2.903  1.07e-46
ENSG00000162073     PAQR4    378.3          1.920  1.99e-45
ENSG00000214039 LINC02418    301.2          8.239  3.19e-45
ENSG00000186462    NAP1L2     52.1         -3.451  4.65e-45
 [ reached 'max' / getOption("max.print") -- omitted 18 rows ]

4 Annotation of results

In this section, we will add the gene symbol, gene name, KEGG pathway, and Entrez ID to our dataset to provide more informative annotations for the DESeq analysis, which currently uses ENSEMBL gene ID nomenclature (e.g., ENSG00000175832).

# Load modules and libraries
BiocManager::install("AnnotationDbi")
BiocManager::install("org.Hs.eg.db")
library("AnnotationDbi")
library("org.Hs.eg.db")
# Check all potential terms that could be retrieved with AnnotationDbi using
# mapsIds()
columns(org.Hs.eg.db)
 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
 [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
[11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MAP"         
[16] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"        
[21] "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
[26] "UNIPROT"     
# Review the most significant genes from our study Previously, we've saved a in
# a .tab file the significant DGE results file =
# 'DESeq2_sig_genes_normal_vs_cancer_44_locrc.tab'
head(resSig)
log2 fold change (MLE): condition tumor locrc vs normal locrc 
Wald test p-value: condition tumor locrc vs normal locrc 
DataFrame with 6 rows and 6 columns
                    baseMean log2FoldChange     lfcSE      stat       pvalue
                   <numeric>      <numeric> <numeric> <numeric>    <numeric>
ENSG00000175832.14   777.653        5.90807  0.272208   21.7043 1.86985e-104
ENSG00000164379.7    554.643        6.81300  0.378529   17.9986  1.99725e-72
ENSG00000173898.17   335.155        3.84378  0.224648   17.1103  1.24430e-65
ENSG00000287828.2    117.882        3.74329  0.230517   16.2387  2.68670e-59
ENSG00000111110.13   475.617        2.84100  0.179182   15.8554  1.28977e-56
ENSG00000165816.14   447.514        4.61193  0.301494   15.2969  8.01942e-53
                           padj
                      <numeric>
ENSG00000175832.14 6.41918e-100
ENSG00000164379.7   3.42828e-68
ENSG00000173898.17  1.42389e-61
ENSG00000287828.2   2.30586e-55
ENSG00000111110.13  8.85559e-53
ENSG00000165816.14  4.58844e-49

We proceed by adding annotations starting from ENSEMBL to Entrez and from there to KEGG pathways, gene symbols and gene names. This is because, all major annotation databases (KEGG, org.Hs.eg.db, clusterProfiler) are indexed by Entrez, so doing the conversion first dramatically reduces NA values and gives more complete annotations.

# Strip version
rownames(resSig) <- sub("\\..*$", "", rownames(resSig))

# Add gene symbol column
resSig$SYMBOL = mapIds(org.Hs.eg.db, keys = rownames(resSig), column = "SYMBOL",
    keytype = "ENSEMBL", multiVals = "first")

# Add gene name column
resSig$GENENAME = mapIds(org.Hs.eg.db, keys = rownames(resSig), column = "GENENAME",
    keytype = "ENSEMBL", multiVals = "first")

# Add gene Entrez ID column
resSig$ENTREZID = mapIds(org.Hs.eg.db, keys = rownames(resSig), column = "ENTREZID",
    keytype = "ENSEMBL", multiVals = "first")

# Add gene EGG functional pathway column
resSig$PATH = mapIds(org.Hs.eg.db, keys = rownames(resSig), column = "PATH", keytype = "ENSEMBL",
    multiVals = "first")

# Add gene ENSEMBL ID column
resSig$ENSEMBL <- rownames(resSig)

# Convert to dataframe
resSig_df <- as.data.frame(resSig)
print(head(resSig_df), 5)
                baseMean log2FoldChange     lfcSE     stat        pvalue
ENSG00000175832 777.6527       5.908067 0.2722077 21.70427 1.869846e-104
ENSG00000164379 554.6433       6.813000 0.3785290 17.99862  1.997251e-72
ENSG00000173898 335.1553       3.843781 0.2246476 17.11027  1.244300e-65
ENSG00000287828 117.8818       3.743293 0.2305172 16.23867  2.686703e-59
                         padj SYMBOL                           GENENAME
ENSG00000175832 6.419180e-100   ETV4 ETS variant transcription factor 4
ENSG00000164379  3.428282e-68  FOXQ1                    forkhead box Q1
ENSG00000173898  1.423894e-61 SPTBN2  spectrin beta, non-erythrocytic 2
ENSG00000287828  2.305863e-55   <NA>                               <NA>
                ENTREZID PATH         ENSEMBL
ENSG00000175832     2118 <NA> ENSG00000175832
ENSG00000164379    94234 <NA> ENSG00000164379
ENSG00000173898     6712 <NA> ENSG00000173898
ENSG00000287828     <NA> <NA> ENSG00000287828
 [ reached 'max' / getOption("max.print") -- omitted 2 rows ]
# Save data frame including annotations
write.table(resSig_df, file = ("DESeq2_sig_genes_normal_vs_cancer_44_locrc_annotated.tab"),
    row.names = F, col.names = TRUE, sep = "\t")

They often reflect: biological heterogeneity, batch effects, cell-type composition differences, high expression strength, and technical noise.

4.1 Functional Gene Ontology (GO) Analysis

# Load modules and libraries
BiocManager::install("clusterProfiler")
BiocManager::install("ggnewscale")

library("clusterProfiler")
library("enrichplot")
library("ggnewscale")
[1] 16013    11
                   ID                                   Description Count
GO:0050789 GO:0050789              regulation of biological process  6611
GO:0008152 GO:0008152                             metabolic process  6460
GO:0050794 GO:0050794                regulation of cellular process  6372
GO:0051716 GO:0051716                 cellular response to stimulus  4242
GO:0071840 GO:0071840 cellular component organization or biogenesis  4223
GO:0048518 GO:0048518     positive regulation of biological process  3765
GO:0007154 GO:0007154                            cell communication  3693
GO:0023052 GO:0023052                                     signaling  3680
GO:0048856 GO:0048856              anatomical structure development  3535
GO:0048522 GO:0048522       positive regulation of cellular process  3533
GO:0007165 GO:0007165                           signal transduction  3388
GO:0048519 GO:0048519     negative regulation of biological process  3263
            GeneRatio
GO:0050789 6611/12698
GO:0008152 6460/12698
GO:0050794 6372/12698
GO:0051716 4242/12698
GO:0071840 4223/12698
GO:0048518 3765/12698
GO:0007154 3693/12698
GO:0023052 3680/12698
GO:0048856 3535/12698
GO:0048522 3533/12698
GO:0007165 3388/12698
GO:0048519 3263/12698
 [ reached 'max' / getOption("max.print") -- omitted 13 rows ]

# Convert to dataframe
ego_df <- as.data.frame(ego)

# Visualize the last columns
head(ego_df, 25)
                   ID                          Description GeneRatio   BgRatio
GO:0042254 GO:0042254                  ribosome biogenesis  262/9826 318/18860
GO:0022613 GO:0022613 ribonucleoprotein complex biogenesis  373/9826 496/18860
GO:0006364 GO:0006364                      rRNA processing  186/9826 223/18860
GO:0016072 GO:0016072               rRNA metabolic process  216/9826 270/18860
           RichFactor FoldEnrichment    zScore       pvalue     p.adjust
GO:0042254  0.8238994       1.581390 10.904627 4.319790e-30 2.754730e-26
GO:0022613  0.7520161       1.443418 10.437057 7.951835e-27 2.535443e-23
GO:0006364  0.8340807       1.600932  9.414511 5.695650e-23 1.210705e-19
GO:0016072  0.8000000       1.535518  9.243234 8.029223e-22 1.280059e-18
                 qvalue
GO:0042254 2.114878e-26
GO:0022613 1.946525e-23
GO:0006364 9.294902e-20
GO:0016072 9.827346e-19
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     geneID
GO:0042254                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       DKC1/EXOSC2/UTP25/HEATR1/NLE1/NOP58/DDX55/RPL7L1/BOP1/DDX31/NAT10/RIOK3/LYAR/NOL8/GTF3A/WDR75/DDX10/WDR43/EXOSC5/WDR12/RPP40/WDR3/UTP14A/UTP20/BRIX1/PRKDC/PA2G4/GAR1/SUV39H1/RRP15/KAT2B/SPOUT1/NOP14/DDX27/URB1/LAS1L/NPM3/DCAF13/TMA16/MTERF3/RRP9/NOB1/NSUN5/PDCD11/DDX18/WDR36/PNO1/RIOK1/GTPBP4/LTV1/RRS1/AIRIM/RNASEL/EXOSC8/UTP18/BYSL/SRFBP1/EXOSC3/EXOSC7/WDR46/NPM1/NOL6/RPF2/DDX56/FBL/ERCC2/RIOX2/UTP4/RPUSD4/EIF1AX/NOLC1/GTPBP10/DHX37/RAN/NOP2/UTP23/PAK1IP1/SDAD1/NHP2/DDX21/MYBBP1A/MPHOSPH10/NOL10/TSR1/XPO1/URB2/BMS1/PES1/MTG2/RRP1B/RCL1/NOP56/UTP15/MRM2/SIRT7/RPUSD1/BUD23/DIS3/GEMIN4/WDR74/EBNA1BP2/CUL4A/PPAN/TRMT61B/CHD7/DDX28/GNL3L/NOM1/TFB1M/MRTO4/ZNHIT6/DDX52/DIMT1/SURF6/MAK16/NOC4L/NOL11/RRP1/NOC2L/AATF/POP7/RPS21/HEATR3/EXOSC9/IMP4/FASTKD2/C1QBP/UTP6/PWP1/NIP7/DDX54/XRN2/EMG1/EIF5B/DHX30/KRI1/NUDT16/NUP88/DDX51/ZCCHC4/REXO4/TRMT2B/METTL5/TRMT112/WDR18/KRR1/NAF1/NMD3/NOP16/MDN1/RRP8/RPS19/AFG2A/TFB2M/METTL16/NOL7/MRPS2/MALSU1/LSM6/XRCC5/ISG20/REXO5/RPS4X/RPL10L/RPS7/GNL2/ZNF593/ISG20L2/DDX47/EIF2A/GRWD1/EIF5/RPL7A/RPL35/RBIS/ESF1/PELP1/CUL4B/TBL3/NSUN3/UTP3/RPS6/RPL14/UTP11/ABT1/RPS5/ERI1/RPLP0/MYG1/SNU13/RPL26L1/EFL1/RPLP0P6/RPL7/RRP7BP/RBFA/USP36/EXOSC4/PIN4/RPUSD2/ERAL1/RPS15/FTSJ3/FSAF1/RPL38/DDX49/RPS16/MPHOSPH6/EIF6/MTREX/LSG1/MRM1/RPSA/DNTTIP2/CDKN2A/POP5/GLUL/RRP7A/RCC1L/EIF4A3/PIH1D2/RPL27/SBDS/RPS9/RPS11/RPS12/RPL24/RPL35A/RPS17/DROSHA/RPS27A/RRN3/ZNF622/MCAT/EXOSC6/EXOSC1/RPL5/RPS15A/RPS24/TSR2/AFG2B/RPS8/NSA2/RPS23/RPS14/WBP11/NGDN/NOA1/FCF1/RBM34/ZNHIT3/RSL24D1/RIOK2/RPP38/TENT4B/SLX9/RPS25/RRP36/RPS28/RPP30/RPSA2/MRM3
GO:0022613 RUVBL1/DKC1/EXOSC2/UTP25/NUFIP1/PRPF19/WDR77/HEATR1/NLE1/NOP58/DDX55/RPL7L1/BOP1/DDX31/NAT10/RIOK3/LYAR/NOL8/GTF3A/WDR75/DDX10/WDR43/EXOSC5/WDR12/RPP40/WDR3/SRPK1/UTP14A/UTP20/BRIX1/PRKDC/MCTS2/PA2G4/GAR1/SUV39H1/GEMIN5/RRP15/KAT2B/SPOUT1/RUVBL2/NOP14/DDX27/URB1/LAS1L/NPM3/DCAF13/EIF3B/EIF2S3/TMA16/MTERF3/HSP90AB1/RRP9/NOB1/NSUN5/DENR/SF3B3/PDCD11/SNRPF/DDX18/WDR36/PNO1/ATR/RIOK1/GTPBP4/LTV1/RRS1/AIRIM/RNASEL/EXOSC8/UTP18/BYSL/SRFBP1/EXOSC3/EXOSC7/WDR46/NPM1/NOL6/RPF2/DDX56/FBL/ERCC2/RIOX2/SRSF5/UTP4/CLNS1A/RPUSD4/EIF1AX/NOLC1/GTPBP10/DHX37/RAN/NOP2/EIF2S2/UTP23/PAK1IP1/SDAD1/NHP2/RBMX2/DDX21/SNRPB/MYBBP1A/MPHOSPH10/NOL10/TGS1/TSR1/XPO1/URB2/BMS1/PES1/MTG2/NOPCHAP1/GEMIN6/CELF4/RRP1B/RCL1/NOP56/DDX20/UTP15/MRM2/SIRT7/RPUSD1/BUD23/TARBP2/DIS3/GEMIN4/SRSF1/WDR74/EBNA1BP2/CUL4A/PPAN/PRPF6/TRMT61B/CHD7/DDX28/SRSF6/GNL3L/PTGES3/NOM1/TFB1M/MRTO4/GEMIN8/ZNHIT6/DDX52/USP39/DIMT1/SURF6/AAR2/MAK16/NOC4L/NOL11/RRP1/NOC2L/AATF/POP7/RPS21/HEATR3/PRPF31/EXOSC9/IMP4/FASTKD2/C1QBP/UTP6/SNRPA1/PWP1/PRPF3/NIP7/DDX54/XRN2/SNRPE/EMG1/SF3A3/EIF5B/DHX30/LSM2/KRI1/STRAP/NUDT16/NUP88/DDX51/SNRNP200/ZCCHC4/REXO4/SNRPC/TRMT2B/METTL5/SNRPB2/TRMT112/SNRPD2/WDR18/KRR1/SRPK2/NAF1/HSP90AB3P/U2AF2/NMD3/SNRPG/NOP16/MDN1/RRP8/EIF3M/SNRPD1/RPS19/MCTS1/AFG2A/TFB2M/METTL16/NOL7/DHX9/SRSF9/MRPS2/MALSU1/LSM6/XRCC5/ISG20/DDX39B/EIF3J/REXO5/RPS4X/RPL10L/RPS7/GNL2/HSP90AA1/PRKRA/SRSF10/ZNF593/KHDC4/ISG20L2/DDX47/EIF2A/COIL/GRWD1/SMN1/EIF3G/DDX1/LUC7L2/EIF5/CELF6/EIF3I/RPL7A/GCFC2/RPL35/RBIS/ESF1/PELP1/CUL4B/DDX46/EIF3H/TBL3/EIF3D/EIF3E/NSUN3/UTP3/AGO2/ZRSR2/PRMT5/RPS6/RPL14/EIF4B/ISY1/UTP11/ABT1/RPS5/SNRPD3/ERI1/RPLP0/MYG1/SNU13/SART3/HTATSF1/RPL26L1/EFL1/SHQ1/RPLP0P6/RPL7/NOL3/PRMT7/SNIP1/RRP7BP/RBFA/USP36/XAB2/NCBP1/EXOSC4/EIF3C/PIN4/RPUSD2/ERAL1/RPS15/FTSJ3/FSAF1/RPL38/DDX49/RPS16/MPHOSPH6/EIF6/TAF9/MTREX/LSG1/MRM1/SF3A2/RPSA/DNTTIP2/CDKN2A/POP5/GLUL/RRP7A/RCC1L/EIF4A3/CELF3/GEMIN2/PIH1D2/RPL27/SBDS/RPS9/RPS11/RPS12/RPL24/RPL35A/SLU7/RPS17/DROSHA/SRPK3/PRP4K/RPL13A/RPS27A/RRN3/ZNF622/MCAT/EXOSC6/EXOSC1/DDX23/RPL5/RPS15A/RPS24/TSR2/AFG2B/RPS8/LUC7L/LUC7L3/NSA2/GEMIN7/PHF5A/RSRP1/PUF60/EIF3A/SF3B6/RPS23/RPS14/BUD13/WBP11/NGDN/NOA1/FCF1/DDX42/RBM34/ZNHIT3/SFSWAP/RSL24D1/DICER1/PRPF39/RIOK2/PSIP1/LSM4/RPP38/TENT4B/CELF5/SLX9/RPS25/RRP36/RPS28/RPP30/RPSA2/MRM3
GO:0006364                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              DKC1/EXOSC2/UTP25/HEATR1/NOP58/DDX55/RPL7L1/BOP1/NAT10/RIOK3/LYAR/NOL8/WDR75/DDX10/WDR43/EXOSC5/WDR12/RPP40/WDR3/UTP14A/UTP20/BRIX1/PRKDC/PA2G4/GAR1/SUV39H1/RRP15/KAT2B/SPOUT1/NOP14/DDX27/URB1/LAS1L/NPM3/DCAF13/RRP9/NOB1/NSUN5/PDCD11/DDX18/WDR36/RIOK1/GTPBP4/RRS1/RNASEL/EXOSC8/UTP18/BYSL/SRFBP1/EXOSC3/EXOSC7/WDR46/NOL6/RPF2/DDX56/FBL/ERCC2/UTP4/RPUSD4/NOLC1/DHX37/NOP2/UTP23/PAK1IP1/NHP2/DDX21/MPHOSPH10/NOL10/TSR1/BMS1/PES1/RRP1B/RCL1/NOP56/UTP15/MRM2/SIRT7/RPUSD1/BUD23/DIS3/GEMIN4/WDR74/EBNA1BP2/PPAN/TRMT61B/CHD7/TFB1M/MRTO4/ZNHIT6/DDX52/DIMT1/MAK16/NOC4L/NOL11/RRP1/POP7/RPS21/EXOSC9/IMP4/UTP6/PWP1/DDX54/XRN2/EMG1/KRI1/NUDT16/DDX51/ZCCHC4/REXO4/TRMT2B/METTL5/TRMT112/WDR18/KRR1/NAF1/RRP8/RPS19/TFB2M/METTL16/NOL7/LSM6/ISG20/REXO5/RPS7/DDX47/RPL7A/RPL35/ESF1/PELP1/TBL3/NSUN3/UTP3/RPS6/RPL14/UTP11/ABT1/ERI1/MYG1/SNU13/RPL7/RRP7BP/RBFA/USP36/EXOSC4/PIN4/RPUSD2/RPS15/FTSJ3/DDX49/RPS16/MPHOSPH6/EIF6/MTREX/MRM1/CDKN2A/POP5/RRP7A/EIF4A3/PIH1D2/RPL27/SBDS/RPL35A/RPS17/DROSHA/EXOSC6/EXOSC1/RPL5/RPS24/TSR2/RPS8/NSA2/RPS14/WBP11/NGDN/FCF1/RBM34/ZNHIT3/RIOK2/RPP38/TENT4B/SLX9/RPS25/RRP36/RPS28/RPP30/MRM3
GO:0016072                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  DKC1/EXOSC2/UTP25/POLR1G/HEATR1/POLR1B/NOP58/DDX55/RPL7L1/BOP1/NAT10/RIOK3/LYAR/NOL8/GTF3A/WDR75/DDX10/WDR43/EXOSC5/WDR12/RPP40/WDR3/UTP14A/UTP20/BRIX1/PRKDC/PA2G4/GAR1/SUV39H1/RRP15/KAT2B/SPOUT1/NOP14/DDX27/URB1/LAS1L/NPM3/POLR1F/DCAF13/RRP9/NOB1/NSUN5/TCOF1/PDCD11/DDX18/WDR36/RIOK1/GTPBP4/RRS1/POLR1A/NCL/RNASEL/EXOSC8/UTP18/BYSL/SRFBP1/EXOSC3/EXOSC7/WDR46/NOL6/RPF2/DDX56/FBL/ERCC2/UTP4/RPUSD4/NIFK/NOLC1/DHX37/NOP2/UTP23/PAK1IP1/NHP2/DDX21/MPHOSPH10/NOL10/TSR1/BMS1/PES1/TAF1B/DDX11/RRP1B/RCL1/NOP56/UTP15/MRM2/SIRT7/POLR1D/GTF3C3/RPUSD1/BUD23/DIS3/GEMIN4/WDR74/EBNA1BP2/PPAN/TRMT61B/CHD7/TFB1M/MRTO4/ZNHIT6/DDX52/GTF3C4/DIMT1/MAK16/NOC4L/NOL11/RRP1/POP7/RPS21/EXOSC9/IMP4/UTP6/PWP1/DDX54/XRN2/EMG1/KRI1/MYC/NUDT16/DDX51/ZCCHC4/REXO4/TP53/TRMT2B/METTL5/POLR1E/TRMT112/WDR18/KRR1/NAF1/RRP8/RPS19/TFB2M/METTL16/NOL7/LSM6/ISG20/REXO5/RPS7/MARS1/XRN1/DDX47/RPL7A/CAVIN1/RPL35/ESF1/PELP1/TBL3/NSUN3/UTP3/RPS6/RPL14/UTP11/ABT1/ERI1/MYG1/SNU13/MACROH2A2/RPL7/SPIN1/RRP7BP/MACROH2A1/RBFA/USP36/EXOSC4/SMARCA4/PIN4/RPUSD2/PELO/RPS15/FTSJ3/MAPT/DDX49/RPS16/MPHOSPH6/EIF6/ERN2/MTREX/DEDD2/MRM1/GTF3C5/CDKN2A/POP5/RRP7A/EIF4A3/PIH1D2/RPL27/SBDS/RPL35A/RPS17/DROSHA/EXOSC6/EXOSC1/DEDD/RPL5/RPS24/TSR2/RPS8/NSA2/RPS14/WBP11/NGDN/FCF1/RBM34/ZNHIT3/RIOK2/RPP38/TENT4B/SLX9/RPS25/RRP36/MTOR/RPS28/RPP30/MRM3
           Count
GO:0042254   262
GO:0022613   373
GO:0006364   186
GO:0016072   216
 [ reached 'max' / getOption("max.print") -- omitted 21 rows ]
# Get dataframe dimmensions
dim(ego_df)
[1] 869  12

4.2 Gene-Concept Network and Enrichment Map

The Gene-Concept Network displays which genes are associated with which biological processes, making it easy to see gene-function relationships. Meanwhile, the enrichment map groups similar terms together. Terms that share genes are connected, making related functions easy to spot.

# Load modules and libraries
library(DOSE)
# Convert gene ID to Symbol
egox <- setReadable(ego, "org.Hs.eg.db", "ENTREZID")

# Generate geneList with Log2FoldChange data
geneList <- resSig$log2FoldChange
names(geneList) <- as.character(unique(resSig$ENTREZID))
geneList <- sort(geneList, decreasing = TRUE)

head(geneList)
     5333    147111      4857     10004 100190940    401093 
 8.929714  8.729543  8.620869  8.415127  8.239437  8.204987 
# Network plot of enriched terms CategorySize can be scaled by 'pvalue' or
# 'geneNum'
p1 <- cnetplot(egox, categorySize = "pvalue", foldChange = geneList)

# Display the plot
print(p1)

# Calculate pairwise similarities of enriched terms
egox_pairwise <- pairwise_termsim(egox)

# Creates an enrichment map visualization from pairwise enrichment analysis
# results
p2 <- emapplot(egox_pairwise, layout = "kk")

# Display the plot
print(p2)

# Plot Gene-Concept Network and Enrichment Map in a PDF file

# Plot 1
pdf("gene_concept_network_plot.pdf", width = 12, height = 10)
print(p1)
dev.off()
quartz_off_screen 
                2 
# Plot 2
pdf("enrichment_map.pdf", width = 12, height = 10)
print(p2)
dev.off()
quartz_off_screen 
                2 
# # Start PDF device pdf(file=paste('Merge_GO_Gene_Set_Enrichment_Analysis',
# 'CC', '.pdf', sep='_'), width=25, height=10) # Create the plot
# cowplot::plot_grid(p1, p2, ncol=2, labels=LETTERS[1:3], rel_widths=c(.8, .8,
# 1.2)) # Close PDF device dev.off()

5 Comparison of results using Venn diagrams

# Get results from the DESeq2 pipeline by contrast
tumor_vs_normal_44_dds <- results(dds_DGE, contrast = c("condition", "tumor_locrc",
    "normal_locrc"))

summary(tumor_vs_normal_44_dds)

out of 37216 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)       : 9180, 25%
LFC < 0 (down)     : 8837, 24%
outliers [1]       : 0, 0%
low counts [2]     : 2911, 7.8%
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

Get differentially expressed genes by contrast tumor_locrc_vs_normal_locrc:

log2 fold change (MLE): condition tumor_locrc vs normal_locrc 
Wald test p-value: condition tumor locrc vs normal locrc 
DataFrame with 5 rows and 5 columns
                   log2FoldChange     lfcSE      stat       pvalue         padj
                        <numeric> <numeric> <numeric>    <numeric>    <numeric>
ENSG00000175832.14        5.90807  0.272208   21.7043 1.86985e-104 6.41918e-100
ENSG00000164379.7         6.81300  0.378529   17.9986  1.99725e-72  3.42828e-68
ENSG00000173898.17        3.84378  0.224648   17.1103  1.24430e-65  1.42389e-61
ENSG00000287828.2         3.74329  0.230517   16.2387  2.68670e-59  2.30586e-55
ENSG00000111110.13        2.84100  0.179182   15.8554  1.28977e-56  8.85559e-53

Get differentially expressed genes by contrast tumor_eocrc_vs_normal_eocrc:

                    baseMean log2FoldChange     lfcSE     stat       pvalue
ENSG00000105464.5  140.90415       4.172698 0.3545970 11.76744 5.744397e-32
ENSG00000164283.13 109.78264       5.232688 0.4532636 11.54447 7.872463e-31
ENSG00000175832.14 726.72663       4.370810 0.3983460 10.97239 5.188033e-28
ENSG00000107159.15 388.76790       5.636863 0.5534101 10.18569 2.297276e-24
ENSG00000163347.6  625.01831       4.503009 0.4417764 10.19296 2.131718e-24
ENSG00000170373.9  396.55271       7.034988 0.6893200 10.20569 1.869817e-24
ENSG00000172031.8   75.79835       4.432616 0.4332061 10.23212 1.423717e-24
ENSG00000236081.2  137.80967       4.789244 0.4705045 10.17895 2.461911e-24
                           padj
ENSG00000105464.5  1.920180e-27
ENSG00000164283.13 1.315764e-26
ENSG00000175832.14 5.780679e-24
ENSG00000107159.15 1.028679e-20
ENSG00000163347.6  1.028679e-20
ENSG00000170373.9  1.028679e-20
ENSG00000172031.8  1.028679e-20
ENSG00000236081.2  1.028679e-20
 [ reached 'max' / getOption("max.print") -- omitted 6208 rows ]
                     baseMean log2FoldChange     lfcSE       stat       pvalue
ENSG00000000003.17 2857.21111      0.1616879 0.2714131  0.5957264 5.513580e-01
ENSG00000000005.6    51.21913      1.3275579 0.4465390  2.9729944 2.949099e-03
ENSG00000000419.15 1670.06587      1.2283879 0.2211246  5.5551835 2.773202e-08
ENSG00000000457.15  485.04283     -0.3559857 0.1470883 -2.4202180 1.551120e-02
ENSG00000000460.18  223.44756      1.3877793 0.2065736  6.7180869 1.841258e-11
ENSG00000000938.14  259.37230      0.2577540 0.3653517  0.7054955 4.805019e-01
ENSG00000000971.18 1250.84010     -1.1620404 0.2475734 -4.6937199 2.682813e-06
ENSG00000001036.15 1874.82106      0.2042679 0.1358121  1.5040486 1.325689e-01
                           padj
ENSG00000000003.17 6.900129e-01
ENSG00000000005.6  1.271338e-02
ENSG00000000419.15 7.869256e-07
ENSG00000000457.15 4.838946e-02
ENSG00000000460.18 1.667961e-09
ENSG00000000938.14 6.286395e-01
ENSG00000000971.18 3.666328e-05
ENSG00000001036.15 2.519977e-01
 [ reached 'max' / getOption("max.print") -- omitted 37857 rows ]
class: DESeqDataSet 
dim: 37865 42 
metadata(9): abstract Accession ... Relevance version
assays(6): counts mu ... replaceCounts replaceCooks
rownames(37865): ENSG00000000003.17 ENSG00000000005.6 ...
  ENSG00000310592.1 ENSG00000310593.1
rowData names(26): transcript_id gene_id ... maxCooks replace
colnames(42): srr17866817 srr17866818 ... srr17866857 srr17866858
colData names(34): age_at_surgery tissue ... sizeFactor replaceable

5.1 Generate Venn Diagrams

  EOCRC_UP LOCRC_NS Counts
1        0        0  30699
2        0        1   3541
3        1        0   1274
4        1        1     19
attr(,"class")
[1] "VennCounts"

  EOCRC_DOWN LOCRC_NS Counts
1          0        0  30804
2          0        1   3539
3          1        0   1169
4          1        1     21
attr(,"class")
[1] "VennCounts"

Extract genes significantly differentially expressed in EOCRC

[1] 19
 [1] "ENSG00000085741.15" "ENSG00000099194.7"  "ENSG00000102854.17"
 [4] "ENSG00000105825.15" "ENSG00000105835.14" "ENSG00000106236.5" 
 [7] "ENSG00000136689.20" "ENSG00000140279.13" "ENSG00000146054.19"
[10] "ENSG00000150337.15" "ENSG00000159167.12" "ENSG00000165171.12"
[13] "ENSG00000167774.2"  "ENSG00000176971.4"  "ENSG00000181458.11"
[16] "ENSG00000185201.18" "ENSG00000205609.14" "ENSG00000243081.3" 
[19] "ENSG00000274585.2" 
[1] 21
 [1] "ENSG00000047648.23" "ENSG00000110076.21" "ENSG00000118523.7" 
 [4] "ENSG00000126353.5"  "ENSG00000126803.10" "ENSG00000127954.14"
 [7] "ENSG00000128709.13" "ENSG00000161405.18" "ENSG00000169507.10"
[10] "ENSG00000184304.17" "ENSG00000185905.5"  "ENSG00000198910.15"
[13] "ENSG00000211650.2"  "ENSG00000211935.3"  "ENSG00000211942.3" 
[16] "ENSG00000211959.2"  "ENSG00000214548.20" "ENSG00000224650.2" 
[19] "ENSG00000243290.3"  "ENSG00000274576.2"  "ENSG00000276775.1" 

Create a table with the genes significantly differentially expressed in EOCRC

       ENSEMBL_ID EOCRC_baseMean EOCRC_log2FC EOCRC_padj LOCRC_baseMean
1 ENSG00000085741        182.214        1.820      0.001        111.907
2 ENSG00000099194       4216.099        1.099      0.025       6498.227
3 ENSG00000102854        760.166        2.326      0.000        287.297
4 ENSG00000105825         73.805        1.257      0.023        211.037
  LOCRC_log2FC LOCRC_padj Direction  Gene ENTREZID  PATH
1        0.544      0.257        Up WNT11     7481 04310
2        0.657      0.238        Up   SCD     6319 01040
3        0.497      0.309        Up  MSLN    10232  <NA>
4        0.668      0.382        Up TFPI2     7980  <NA>
 [ reached 'max' / getOption("max.print") -- omitted 36 rows ]

baseMean gives the mean counts for the gene in each cohort. log2FC is the log2 fold change in tumors compared to normal samples. padj is the adjusted p-value. Values rounded to 3 decimal points.

6 Revamped GO terms and Enrichment plots

                   ID                                   Description Count
GO:0050789 GO:0050789              regulation of biological process    19
GO:0050794 GO:0050794                regulation of cellular process    18
GO:0008152 GO:0008152                             metabolic process    16
GO:0006955 GO:0006955                               immune response    14
GO:0048518 GO:0048518     positive regulation of biological process    14
GO:0007154 GO:0007154                            cell communication    13
GO:0048522 GO:0048522       positive regulation of cellular process    13
GO:0071840 GO:0071840 cellular component organization or biogenesis    13
GO:0023052 GO:0023052                                     signaling    13
GO:0007165 GO:0007165                           signal transduction    12
GO:0051716 GO:0051716                 cellular response to stimulus    12
GO:0048856 GO:0048856              anatomical structure development    12
           GeneRatio
GO:0050789     19/40
GO:0050794     18/40
GO:0008152     16/40
GO:0006955     14/40
GO:0048518     14/40
GO:0007154     13/40
GO:0048522     13/40
GO:0071840     13/40
GO:0023052     13/40
GO:0007165     12/40
GO:0051716     12/40
GO:0048856     12/40
 [ reached 'max' / getOption("max.print") -- omitted 13 rows ]

                   ID
GO:0016064 GO:0016064
GO:0019724 GO:0019724
GO:0002449 GO:0002449
GO:0002460 GO:0002460
                                                                                                                         Description
GO:0016064                                                                                   immunoglobulin mediated immune response
GO:0019724                                                                                                  B cell mediated immunity
GO:0002449                                                                                              lymphocyte mediated immunity
GO:0002460 adaptive immune response based on somatic recombination of immune receptors built from immunoglobulin superfamily domains
           GeneRatio   BgRatio RichFactor FoldEnrichment    zScore       pvalue
GO:0016064      7/35 203/18860 0.03448276      18.581281 10.859341 7.860296e-08
GO:0019724      7/35 206/18860 0.03398058      18.310680 10.771783 8.689659e-08
GO:0002449      7/35 388/18860 0.01804124       9.721649  7.484855 6.043876e-06
GO:0002460      7/35 402/18860 0.01741294       9.383085  7.325722 7.619376e-06
               p.adjust       qvalue
GO:0016064 4.001588e-05 3.338659e-05
GO:0019724 4.001588e-05 3.338659e-05
GO:0002449 1.754361e-03 1.463722e-03
GO:0002460 1.754361e-03 1.463722e-03
                                                               geneID Count
GO:0016064 FCGR1A/IGHV1-3/IGHV3-13/IGHV4-39/IGHV3-74/IGHV2-70/IGHV4-4     7
GO:0019724 FCGR1A/IGHV1-3/IGHV3-13/IGHV4-39/IGHV3-74/IGHV2-70/IGHV4-4     7
GO:0002449 FCGR1A/IGHV1-3/IGHV3-13/IGHV4-39/IGHV3-74/IGHV2-70/IGHV4-4     7
GO:0002460 FCGR1A/IGHV1-3/IGHV3-13/IGHV4-39/IGHV3-74/IGHV2-70/IGHV4-4     7
 [ reached 'max' / getOption("max.print") -- omitted 7 rows ]
# Convert to dataframe
ego_df <- as.data.frame(ego)

# Visualize the top cateogories
ego_head15 <- head(ego_df, 15)

# Save file
write.csv(ego_head15, file = "ego_top15_GO_BP.csv", row.names = FALSE)

6.1 Gene-Concept Network and Enrichment Map

The Gene-Concept Network displays which genes are associated with which biological processes, making it easy to see gene-function relationships. Meanwhile, the enrichment map groups similar terms together. Terms that share genes are connected, making related functions easy to spot.

# Load modules and libraries
library(DOSE)
# Convert gene ID to Symbol
egox <- setReadable(ego, "org.Hs.eg.db", "ENTREZID")

# Generate geneList with Log2FoldChange data
geneList <- final_table$EOCRC_log2FC
names(geneList) <- as.character(unique(final_table$ENTREZID))
geneList <- sort(geneList, decreasing = TRUE)

head(geneList)
 4885 10232  7481  6781  6066 50506 
2.551 2.326 1.820 1.735 1.648 1.513 
# Network plot of enriched terms CategorySize can be scaled by 'pvalue' or
# 'geneNum'
p1 <- cnetplot(egox, categorySize = "pvalue", foldChange = geneList)

# Display the plot
print(p1)

# Calculate pairwise similarities of enriched terms
egox_pairwise <- pairwise_termsim(egox)

# Creates an enrichment map visualization from pairwise enrichment analysis
# results
p2 <- emapplot(egox_pairwise, layout = "kk", node_label_size = 3)

# Display the plot
print(p2)

# Plot Gene-Concept Network and Enrichment Map in a PDF file

# Plot 1
pdf("gene_concept_network_plot_30genes.pdf", width = 6, height = 4)
print(p1)
dev.off()
quartz_off_screen 
                2 
# Plot 2
pdf("enrichment_map_30genes.pdf", width = 6, height = 4)
print(p2)
dev.off()
quartz_off_screen 
                2 
# # Start PDF device pdf(file=paste('Merge_GO_Gene_Set_Enrichment_Analysis',
# 'CC', '.pdf', sep='_'), width=25, height=10) # Create the plot
# cowplot::plot_grid(p1, p2, ncol=2, labels=LETTERS[1:3], rel_widths=c(.8, .8,
# 1.2)) # Close PDF device dev.off()

7 R Session Information

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.5.1 (2025-06-13)
 os       macOS Ventura 13.7.3
 system   x86_64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Europe/Madrid
 date     2026-02-03
 pandoc   3.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/x86_64/ (via rmarkdown)
 quarto   1.5.57 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/quarto

─ Packages ───────────────────────────────────────────────────────────────────
 package              * version    date (UTC) lib source
 abind                  1.4-8      2024-09-12 [1] CRAN (R 4.5.0)
 affy                   1.88.0     2025-10-29 [1] Bioconductor 3.22 (R 4.5.1)
 affyio                 1.80.0     2025-10-29 [1] Bioconductor 3.22 (R 4.5.1)
 AnnotationDbi        * 1.72.0     2025-10-29 [1] Bioconductor 3.22 (R 4.5.1)
 ape                    5.8-1      2024-12-16 [1] CRAN (R 4.5.0)
 apeglm               * 1.32.0     2025-10-29 [1] Bioconductor 3.22 (R 4.5.1)
 aplot                  0.2.9      2025-09-12 [1] CRAN (R 4.5.1)
 base64enc              0.1-3      2015-07-28 [1] CRAN (R 4.5.0)
 bbmle                  1.0.25.1   2023-12-09 [1] CRAN (R 4.5.0)
 bdsmatrix              1.3-7      2024-03-02 [1] CRAN (R 4.5.0)
 beeswarm               0.4.0      2021-06-01 [1] CRAN (R 4.5.0)
 Biobase              * 2.70.0     2025-10-29 [1] Bioconductor 3.22 (R 4.5.1)
 BiocGenerics         * 0.56.0     2025-10-29 [1] Bioconductor 3.22 (R 4.5.1)
 BiocManager          * 1.30.27    2025-11-14 [1] CRAN (R 4.5.1)
 BiocParallel           1.44.0     2025-10-30 [1] Bioconductor 3.22 (R 4.5.1)
 Biostrings             2.78.0     2025-10-29 [1] Bioconductor 3.22 (R 4.5.1)
 bit                    4.6.0      2025-03-06 [1] CRAN (R 4.5.0)
 bit64                  4.6.0-1    2025-01-16 [1] CRAN (R 4.5.0)
 blob                   1.2.4      2023-03-17 [1] CRAN (R 4.5.0)
 bslib                  0.9.0      2025-01-30 [1] CRAN (R 4.5.0)
 cachem                 1.1.0      2024-05-16 [1] CRAN (R 4.5.0)
 cli                    3.6.5      2025-04-23 [1] CRAN (R 4.5.0)
 cluster                2.1.8.1    2025-03-12 [1] CRAN (R 4.5.1)
 clusterProfiler      * 4.18.4     2025-12-15 [1] Bioconductor 3.22 (R 4.5.2)
 coda                   0.19-4.1   2024-01-31 [1] CRAN (R 4.5.0)
 codetools              0.2-20     2024-03-31 [1] CRAN (R 4.5.1)
 corrplot             * 0.95       2024-10-14 [1] CRAN (R 4.5.0)
 cowplot                1.2.0      2025-07-07 [1] CRAN (R 4.5.1)
 crayon                 1.5.3      2024-06-20 [1] CRAN (R 4.5.0)
 data.table             1.17.8     2025-07-10 [1] CRAN (R 4.5.1)
 DBI                    1.2.3      2024-06-02 [1] CRAN (R 4.5.0)
 DelayedArray           0.36.0     2025-10-29 [1] Bioconductor 3.22 (R 4.5.1)
 DESeq2               * 1.50.2     2025-11-17 [1] Bioconductor 3.22 (R 4.5.2)
 devtools               2.4.6      2025-10-03 [1] CRAN (R 4.5.1)
 digest                 0.6.39     2025-11-19 [1] CRAN (R 4.5.1)
 DOSE                 * 4.4.0      2025-10-29 [1] Bioconductor 3.22 (R 4.5.1)
 dplyr                * 1.1.4      2023-11-17 [1] CRAN (R 4.5.0)
 ellipsis               0.3.2      2021-04-29 [1] CRAN (R 4.5.0)
 emdbook                1.3.14     2025-07-23 [1] CRAN (R 4.5.1)
 enrichplot           * 1.30.4     2025-12-01 [1] Bioconductor 3.22 (R 4.5.2)
 evaluate               1.0.5      2025-08-27 [1] CRAN (R 4.5.1)
 factoextra           * 1.0.7      2020-04-01 [1] CRAN (R 4.5.0)
 farver                 2.1.2      2024-05-13 [1] CRAN (R 4.5.0)
 fastmap                1.2.0      2024-05-15 [1] CRAN (R 4.5.0)
 fastmatch              1.1-6      2024-12-23 [1] CRAN (R 4.5.0)
 fgsea                  1.36.0     2025-10-29 [1] Bioconductor 3.22 (R 4.5.1)
 fontBitstreamVera      0.1.1      2017-02-01 [1] CRAN (R 4.5.0)
 fontLiberation         0.1.0      2016-10-15 [1] CRAN (R 4.5.0)
 fontquiver             0.2.1      2017-02-01 [1] CRAN (R 4.5.0)
 formatR                1.14       2023-01-17 [1] CRAN (R 4.5.0)
 fs                     1.6.6      2025-04-12 [1] CRAN (R 4.5.0)
 gdtools                0.4.4      2025-10-06 [1] CRAN (R 4.5.1)
 generics             * 0.1.4      2025-05-09 [1] CRAN (R 4.5.0)
 GenomicRanges        * 1.62.1     2025-12-08 [1] Bioconductor 3.22 (R 4.5.2)
 ggbeeswarm           * 0.7.3      2025-11-29 [1] CRAN (R 4.5.1)
 ggforce                0.5.0      2025-06-18 [1] CRAN (R 4.5.0)
 ggfortify            * 0.4.19     2025-07-27 [1] CRAN (R 4.5.1)
 ggfun                  0.2.0      2025-07-15 [1] CRAN (R 4.5.1)
 ggiraph                0.9.2      2025-10-07 [1] CRAN (R 4.5.1)
 ggnewscale           * 0.5.2      2025-06-20 [1] CRAN (R 4.5.0)
 ggplot2              * 4.0.1      2025-11-14 [1] CRAN (R 4.5.1)
 ggplotify              0.1.3      2025-09-20 [1] CRAN (R 4.5.1)
 ggrepel              * 0.9.6      2024-09-07 [1] CRAN (R 4.5.0)
 ggtangle               0.0.9      2025-11-30 [1] CRAN (R 4.5.1)
 ggtree                 4.0.3      2025-12-18 [1] Bioconductor 3.22 (R 4.5.2)
 glue                   1.8.0      2024-09-30 [1] CRAN (R 4.5.0)
 GO.db                  3.22.0     2025-11-21 [1] Bioconductor
 GOSemSim               2.36.0     2025-10-29 [1] Bioconductor 3.22 (R 4.5.1)
 gridExtra            * 2.3        2017-09-09 [1] CRAN (R 4.5.0)
 gridGraphics           0.5-1      2020-12-13 [1] CRAN (R 4.5.0)
 gson                   0.1.0      2023-03-07 [1] CRAN (R 4.5.0)
 gtable                 0.3.6      2024-10-25 [1] CRAN (R 4.5.0)
 hexbin               * 1.28.5     2024-11-13 [1] CRAN (R 4.5.0)
 htmltools              0.5.9      2025-12-04 [1] CRAN (R 4.5.1)
 htmlwidgets            1.6.4      2023-12-06 [1] CRAN (R 4.5.0)
 httr                   1.4.7      2023-08-15 [1] CRAN (R 4.5.0)
 igraph                 2.2.1      2025-10-27 [1] CRAN (R 4.5.1)
 IRanges              * 2.44.0     2025-10-29 [1] Bioconductor 3.22 (R 4.5.1)
 jquerylib              0.1.4      2021-04-26 [1] CRAN (R 4.5.0)
 jsonlite               2.0.0      2025-03-27 [1] CRAN (R 4.5.0)
 KEGGREST               1.50.0     2025-10-29 [1] Bioconductor 3.22 (R 4.5.1)
 knitr                * 1.51       2025-12-20 [1] CRAN (R 4.5.1)
 labeling               0.4.3      2023-08-29 [1] CRAN (R 4.5.0)
 lattice                0.22-7     2025-04-02 [1] CRAN (R 4.5.1)
 lazyeval               0.2.2      2019-03-15 [1] CRAN (R 4.5.0)
 lifecycle              1.0.4      2023-11-07 [1] CRAN (R 4.5.0)
 limma                * 3.66.0     2025-10-29 [1] Bioconductor 3.22 (R 4.5.1)
 locfit                 1.5-9.12   2025-03-05 [1] CRAN (R 4.5.0)
 magrittr               2.0.4      2025-09-12 [1] CRAN (R 4.5.1)
 MASS                   7.3-65     2025-02-28 [1] CRAN (R 4.5.1)
 Matrix                 1.7-4      2025-08-28 [1] CRAN (R 4.5.1)
 MatrixGenerics       * 1.22.0     2025-10-29 [1] Bioconductor 3.22 (R 4.5.1)
 matrixStats          * 1.5.0      2025-01-07 [1] CRAN (R 4.5.0)
 memoise                2.0.1      2021-11-26 [1] CRAN (R 4.5.0)
 mvtnorm                1.3-3      2025-01-10 [1] CRAN (R 4.5.0)
 nlme                   3.1-168    2025-03-31 [1] CRAN (R 4.5.1)
 numDeriv               2016.8-1.1 2019-06-06 [1] CRAN (R 4.5.0)
 org.Hs.eg.db         * 3.22.0     2025-11-20 [1] Bioconductor
 otel                   0.2.0      2025-08-29 [1] CRAN (R 4.5.1)
 patchwork              1.3.2      2025-08-25 [1] CRAN (R 4.5.1)
 pheatmap             * 1.0.13     2025-06-05 [1] CRAN (R 4.5.0)
 pillar                 1.11.1     2025-09-17 [1] CRAN (R 4.5.1)
 pkgbuild               1.4.8      2025-05-26 [1] CRAN (R 4.5.0)
 pkgconfig              2.0.3      2019-09-22 [1] CRAN (R 4.5.0)
 pkgload                1.4.1      2025-09-23 [1] CRAN (R 4.5.1)
 plyr                   1.8.9      2023-10-02 [1] CRAN (R 4.5.0)
 png                    0.1-8      2022-11-29 [1] CRAN (R 4.5.0)
 polyclip               1.10-7     2024-07-23 [1] CRAN (R 4.5.0)
 preprocessCore         1.72.0     2025-10-29 [1] Bioconductor 3.22 (R 4.5.1)
 purrr                  1.2.0      2025-11-04 [1] CRAN (R 4.5.1)
 qvalue                 2.42.0     2025-10-29 [1] Bioconductor 3.22 (R 4.5.1)
 R.methodsS3            1.8.2      2022-06-13 [1] CRAN (R 4.5.0)
 R.oo                   1.27.1     2025-05-02 [1] CRAN (R 4.5.0)
 R.utils                2.13.0     2025-02-24 [1] CRAN (R 4.5.0)
 R6                     2.6.1      2025-02-15 [1] CRAN (R 4.5.0)
 rappdirs               0.3.3      2021-01-31 [1] CRAN (R 4.5.0)
 RColorBrewer         * 1.1-3      2022-04-03 [1] CRAN (R 4.5.0)
 Rcpp                   1.1.0      2025-07-02 [1] CRAN (R 4.5.1)
 remotes                2.5.0      2024-03-17 [1] CRAN (R 4.5.0)
 repr                   1.1.7      2024-03-22 [1] CRAN (R 4.5.0)
 reshape2             * 1.4.5      2025-11-12 [1] CRAN (R 4.5.1)
 rlang                  1.1.6      2025-04-11 [1] CRAN (R 4.5.0)
 rmarkdown              2.30       2025-09-28 [1] CRAN (R 4.5.1)
 RSQLite                2.4.5      2025-11-30 [1] CRAN (R 4.5.1)
 rstudioapi             0.17.1     2024-10-22 [1] CRAN (R 4.5.0)
 S4Arrays               1.10.1     2025-12-01 [1] Bioconductor 3.22 (R 4.5.2)
 S4Vectors            * 0.48.0     2025-10-29 [1] Bioconductor 3.22 (R 4.5.1)
 S7                     0.2.1      2025-11-14 [1] CRAN (R 4.5.1)
 sass                   0.4.10     2025-04-11 [1] CRAN (R 4.5.0)
 scales               * 1.4.0      2025-04-24 [1] CRAN (R 4.5.0)
 scatterpie             0.2.6      2025-09-12 [1] CRAN (R 4.5.1)
 Seqinfo              * 1.0.0      2025-10-29 [1] Bioconductor 3.22 (R 4.5.1)
 sessioninfo            1.2.3      2025-02-05 [1] CRAN (R 4.5.0)
 skimr                * 2.2.2      2026-01-10 [1] CRAN (R 4.5.1)
 SparseArray            1.10.8     2025-12-18 [1] Bioconductor 3.22 (R 4.5.2)
 statmod                1.5.1      2025-10-09 [1] CRAN (R 4.5.1)
 stringi                1.8.7      2025-03-27 [1] CRAN (R 4.5.0)
 stringr                1.6.0      2025-11-04 [1] CRAN (R 4.5.1)
 SummarizedExperiment * 1.40.0     2025-10-29 [1] Bioconductor 3.22 (R 4.5.1)
 systemfonts            1.3.1      2025-10-01 [1] CRAN (R 4.5.1)
 tibble                 3.3.0      2025-06-08 [1] CRAN (R 4.5.0)
 tidydr                 0.0.6      2025-07-25 [1] CRAN (R 4.5.1)
 tidyr                * 1.3.2      2025-12-19 [1] CRAN (R 4.5.1)
 tidyselect             1.2.1      2024-03-11 [1] CRAN (R 4.5.0)
 tidytree               0.4.6      2023-12-12 [1] CRAN (R 4.5.0)
 treeio                 1.34.0     2025-10-30 [1] Bioconductor 3.22 (R 4.5.1)
 tweenr                 2.0.3      2024-02-26 [1] CRAN (R 4.5.0)
 usethis                3.2.1      2025-09-06 [1] CRAN (R 4.5.1)
 vctrs                  0.6.5      2023-12-01 [1] CRAN (R 4.5.0)
 vipor                  0.4.7      2023-12-18 [1] CRAN (R 4.5.0)
 vsn                  * 3.78.1     2026-01-13 [1] https://bioc-release.r-universe.dev (R 4.5.2)
 withr                  3.0.2      2024-10-28 [1] CRAN (R 4.5.0)
 xfun                   0.55       2025-12-16 [1] CRAN (R 4.5.1)
 XVector                0.50.0     2025-10-29 [1] Bioconductor 3.22 (R 4.5.1)
 yaml                   2.3.12     2025-12-10 [1] CRAN (R 4.5.1)
 yulab.utils            0.2.3      2025-12-15 [1] CRAN (R 4.5.1)

 [1] /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/library
 * ── Packages attached to the search path.

──────────────────────────────────────────────────────────────────────────────