Abstract

Differential gene expression analysis of RNA Sequencing (RNA-Seq) data is crucial for understanding key patterns of gene regulation and enhancing our knowledge of biological processes and diseases. The workflow of this analysis comprises quality control, filtering of low-quality data, alignment, read counting, and final differential analysis. In practice, users often need to manually combine several tools and write multiple scripts to cover the entire pipeline. This fragmented approach is time-consuming and not user-friendly, especially for non-expert users. There is a need for an integrated, automated and accessible solution that unifies the entire analysis process within a single, easy-to-use platform. To address this need, we developed SeqExpressionAnalyser, an R package that provides a web application for interactive differential gene-expression analysis of RNA-Seq data, making it accessible to R users for the first time. Built on the Shiny framework, SeqExpressionAnalyser enables users to read FASTQ files and perform analyses, including quality control, filtering, alignment, read counting, and differential expression analysis. The tool generates multiple outputs, including data tables, an HTML report and visualisations. The source code is available on GitHub (https://github.com/sanaeesskhayry/SeqExpressionAnalyser) and is licensed under the GPLv3 license. The tool is also available as a Docker image at https://hub.docker.com/repository/docker/biomix/seq-expression-analyser/general.

Keywords

RNA-Seq, differential expression, interactive data analysis, R, web application

Introduction

Analysing gene expression provides valuable insights into cellular pathways, disease mechanisms, and potential therapeutic targets, thereby advancing precision medicine, biomarker discovery, and drug development.^1,2 Over time, the techniques for detecting and quantifying gene expression have evolved from low-throughput methods to high-throughput approaches, such as next-generation sequencing, which can efficiently sequence hundreds or even thousands of genes or entire genomes.^1,3,4 RNA-Seq is one of the most transformative methods in this field, allowing researchers to examine a cell or tissue’s transcriptome with exceptional detail and accuracy.^4-7 A primary application of RNA-Seq data analysis is differential gene expression (DGE), which aims to identify genes whose expression levels differ significantly across samples.⁸ For instance, by performing comparative analyses of gene expression levels in both healthy and diseased tissues, researchers can identify genes that are either upregulated or downregulated in pathological conditions. This insight is crucial for developing pharmacological interventions targeting these differentially expressed genes.^8,9 The workflow for DGE analysis generally entails several steps, including data quality control, data filtering and trimming, read mapping, read counting, performing differential analysis and finally visualising and interpreting the results.¹⁰ This analysis can be challenging, requiring careful planning, meticulous execution and advanced bioinformatics tools.^7,11 In addition, integrating various tools throughout the process increases complexity. 
In computational biology, the R language, with its specialised Bioconductor packages, has become an essential resource for researchers engaged in such data processing.¹² R facilitates the management of large datasets, the execution of complex statistical analyses and, significantly, the clear visualisation and interpretation of results, all of which support informed decision-making in computational biology.¹³ However, existing technologies often lack comprehensiveness and user-friendliness, necessitating expertise across different analytical procedures. To address this gap, this work aims to develop a comprehensive, user-friendly tool built in R using the Shiny framework. The SeqExpressionAnalyser tool provides an accessible solution for conducting interactive, thorough DGE analysis of RNA-Seq data.

Materials and Methods

The development of SeqExpressionAnalyser aims to provide a user-friendly solution for automated end-to-end DGE analysis of RNA-Seq data. The tool streamlines analysis by integrating widely used Bioconductor packages within a single web-based interface. It is implemented in R using the Shiny framework, which enables dynamic user interactions through reactive programming. The web application and all its features are launched by the runAnalyser() function.

The overall workflow adopted by the tool is illustrated in Figure 1. The user interface is constructed using the shinydashboard package (https://rstudio.github.io/shinydashboard/), featuring a sidebar organised into tabs that align with each step of the differential expression analysis workflow, ranging from data preparation to DGE evaluation (Figure 1). Users can skip stages or run only selected stages of the workflow, rather than following a strict linear progression from data setup to completion. They can initiate the process at any stage, such as alignment or differential expression analysis, depending on the preprocessing status and availability of their input data. The central panel is designed to support robust data processing across all phases, ensuring a streamlined and efficient user experience. Interactivity is enhanced by the shinyWidgets package (https://github.com/dreamRs/shinyWidgets), which provides a wide array of custom input widgets, while the DT package (https://rstudio.github.io/DT/) provides dynamic, sortable data tables for improved exploration of datasets. In addition, rintrojs (https://github.com/carlganz/rintrojs) supplies guided navigation with tooltips and interactive tutorials to aid users. For high-quality visualisations, the ggplot2 package (https://ggplot2.tidyverse.org/) is employed. SeqExpressionAnalyser uses the Bioconductor Rqc package (https://bioconductor.org/packages/release/bioc/html/Rqc.html) to import FASTQ files and assess their quality, thereby identifying potential issues prior to analysis.¹⁴ Rqc is specifically designed for assessing the quality of high-throughput sequencing data and utilises parallel computing to manage large datasets from various sequencing platforms efficiently. 
It also includes visualisation tools to assist in detecting patterns that may influence subsequent analyses.¹⁴ The QuasR Bioconductor package (https://www.bioconductor.org/packages/release/bioc/html/QuasR.html) is utilised to trim and filter low-quality reads according to user-defined parameters using the preprocessReads function.¹⁵ Cleaned reads are then aligned to a reference genome using the Rsubread tool (https://bioconductor.org/packages/release/bioc/html/Rsubread.html),¹⁶ which is renowned for its speed, low memory footprint and accurate read count summaries, particularly when compared to other alignment tools like TopHat2,¹⁷ STAR,¹⁸ and HISAT2.¹⁹ The SeqExpressionAnalyser package utilises featureCounts,²⁰ a reliable read summarisation tool available via Rsubread, to obtain raw read counts for each gene across various samples in CSV format, which is critical for the differential expression analysis that follows.²⁰ For DGE analysis, SeqExpressionAnalyser is built on the DESeq2 package (https://bioconductor.org/packages/release/bioc/html/DESeq2.html),²¹ which is widely used in the research community for its robust statistical methodologies and solid results across diverse types of high-throughput sequencing datasets. DESeq2 utilises negative binomial generalised linear models to test for differential expression, incorporating data-driven prior distributions to estimate dispersion and fold change. This approach effectively addresses challenges such as small sample sizes, data discreteness, and outliers through shrinkage estimation, yielding findings that are both consistent and interpretable. Furthermore, DESeq2 normalises the dataset by calculating a size factor for each sample.^21,22
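The size-factor normalisation mentioned above can be illustrated with a minimal sketch of the median-of-ratios idea. It is written in plain Python for brevity and is not the DESeq2 implementation; the function name, the toy counts and the gene labels are hypothetical.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors (illustrative sketch only).

    counts: dict mapping gene -> list of raw counts, one per sample.
    Genes with a zero count in any sample are skipped, because their
    geometric mean is zero and the ratio is undefined.
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts.values():
        if any(c <= 0 for c in row):
            continue
        # Log of the geometric mean of this gene across samples.
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    if not log_ratios[0]:
        raise ValueError("no gene has non-zero counts in every sample")
    # One factor per sample: the median ratio to the per-gene reference.
    return [math.exp(median(r)) for r in log_ratios]

# Hypothetical toy counts: sample 2 was sequenced twice as deeply as sample 1.
toy = {"geneA": [10, 20], "geneB": [100, 200], "geneC": [5, 10], "geneD": [0, 7]}
factors = size_factors(toy)
normalised = {g: [c / f for c, f in zip(row, factors)] for g, row in toy.items()}
```

In this toy case, dividing each count by its sample's factor gives geneA the same normalised value in both samples, which is the intended effect of correcting for library depth.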

Figure 1.

Overview of the SeqExpressionAnalyser workflow.

Our current pipeline provides partial support for long-read sequencing data through the integrated tools Rsubread and featureCounts, both of which support long-read alignments and read counting. Other components, including Rqc for quality control and QuasR for filtering and trimming, were initially developed to analyse short-read data. Furthermore, the SeqExpressionAnalyser software is compatible with any operating system that supports R, ensuring consistent results and making it a versatile and reliable tool for bioinformatics research. As SeqExpressionAnalyser is typically installed on local systems, its performance may vary based on hardware specifications. Based on our experience, a typical modern laptop or workstation with at least 8 to 16 GB of RAM is sufficient to run the tool on various datasets. The resource requirements are expected to scale approximately linearly with the number of samples.

Results

We developed the SeqExpressionAnalyser, a web-based software package for interactive DGE analysis of RNA-Seq data. To thoroughly evaluate the functionality of the SeqExpressionAnalyser, we conducted tests using data from a study that examined the impact of L-cysteine (Cys) on antimicrobial resistance (AR) in the bacterium Escherichia coli W (GEO entry: GSE215167). RNA-Seq was performed on E. coli W under 2 distinct conditions: minimal M9 medium (with 2 samples: GSM6625092, GSM6625093) and minimal M9 medium supplemented with L-cysteine (with 2 samples: GSM6625094, GSM6625095). The reference genome and its annotation were sourced from the NCBI (National Center for Biotechnology Information) database (https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000184185.1/). A detailed use case for the SeqExpressionAnalyser package, using this dataset, is provided in Supplementary File 1.

Data setup

The Data Setup tab in the SeqExpressionAnalyser tool is designed specifically for uploading RNA-Seq data. Users are required to provide a directory path containing only the relevant FASTQ files, in formats such as .fq, .fastq, .fq.gz, or .fastq.gz. In addition, it is essential to upload a study metadata file in CSV format. This file should include critical information on the study design and experimental details, including growth conditions (e.g., whether the bacteria were cultured in M9 medium with or without L-cysteine). The column names from the metadata will be displayed in a selection list, allowing users to select the appropriate grouping column to organise the uploaded data according to the study design. Users must also specify whether the uploaded data is paired-end or single-end, and indicate the number of parallel workers for sequence read processing. Once all parameters have been set, users can click the ‘Upload FASTQ Files’ button. After a successful upload, a summary will appear in a data table, including the file name, pair, format, group, base directory, number of sampled reads and total reads for each uploaded file (Figure 2). The Data Summary and Insights tab within the Data Setup phase offers a quick review of RNA-Seq data quality through several key representations (Supplementary File 1).
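As an illustration of the inputs described above, the following Python sketch selects files by the accepted FASTQ suffixes and reads the column names of a metadata CSV, from which a grouping column could then be chosen. It is not the package's code; the file names, the metadata contents, the column labels and the case-insensitive suffix match are assumptions made for the example.

```python
import csv
import io

# Suffixes listed in the Data Setup description.
FASTQ_SUFFIXES = (".fq", ".fastq", ".fq.gz", ".fastq.gz")

def select_fastq(filenames):
    """Keep only names ending in one of the accepted FASTQ suffixes."""
    # Lower-casing the name is an assumption of this sketch.
    return [name for name in filenames if name.lower().endswith(FASTQ_SUFFIXES)]

def metadata_columns(csv_text):
    """Return the header row of a metadata CSV as a list of column names."""
    reader = csv.reader(io.StringIO(csv_text))
    return next(reader)

# Hypothetical directory listing and metadata, for illustration only.
listing = ["GSM6625092.fastq.gz", "GSM6625094.fq", "notes.txt", "metadata.csv"]
fastq_files = select_fastq(listing)

metadata_csv = "sample,condition\nGSM6625092,M9\nGSM6625094,M9_Cys\n"
columns = metadata_columns(metadata_csv)
```

The example also shows why the directory should contain only the relevant FASTQ files: anything else would have to be filtered out before processing.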

Figure 2.

Data setup tab in the SeqExpressionAnalyser tool.

For demonstration purposes, we provide a dataset comparing gene expression in luminal and basal cells collected from the mammary glands of virgin mice, 18.5-day pregnant mice and 2-day lactating mice. This dataset is identified by accession number GSE60450 in the GEO database (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60450.2).²³ This feature aims to help users become familiar with the data processing pipeline and understand the required input formats.

Quality control

Upon completing the data setup, the user can assess the quality of the uploaded data using various quality-control plots to evaluate the reliability of the sequencing results (Figure 3).¹⁴ The Average Quality plot evaluates overall data quality by displaying the proportion of reads that surpass the PHRED threshold. Typically, high-quality data should have most reads scoring above 30. Conversely, if a significant proportion of reads has a PHRED score below 20, this indicates poor quality. Cycle-specific plots, including the Cycle-specific Average Quality and Cycle-specific Quality Distribution, highlight variations in quality across different base positions. High-quality data tend to show stable scores, whereas sharp declines observed toward the end or at specific cycles indicate areas of poor sequencing quality that may require trimming. In addition, the Cycle-specific Base Call Proportion and GC Content plots assess the nucleotide composition (Figure 3). High-quality data generally displays balanced base calls, with a GC content that aligns with expected ranges. Significant biases or unusual distributions may suggest the presence of artefacts or contamination. The Per Read Mean Quality Distribution illustrates the consistency of the data. In high-quality datasets, most reads have mean scores above 30, whereas in poor-quality datasets, the frequency of low-scoring reads is higher. Finally, the Read Length Distribution helps identify potential sequencing issues by presenting consistent read lengths in high-quality data. A skewed or wide distribution of read lengths may indicate problems such as incomplete sequencing or improper trimming. Collectively, these plots provide a thorough assessment, enabling us to pinpoint poor-quality data that may require further processing or re-sequencing.¹⁴
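To make the thresholds above concrete, the sketch below converts PHRED scores to error probabilities (Q = -10 log10 P, so Q30 corresponds to an error probability of 0.001 and Q20 to 0.01) and computes the fraction of reads whose mean score reaches a threshold. It is a Python illustration, not the Rqc implementation; it assumes Phred+33 quality encoding and uses the arithmetic mean of scores per read, and the example quality strings are made up.

```python
def phred_to_error_prob(q):
    """Probability that a base call with PHRED score q is wrong."""
    return 10 ** (-q / 10)

def decode_phred33(quality_string):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quality_string]

def mean_quality(quality_string):
    """Arithmetic mean PHRED score of one read (a simplification)."""
    scores = decode_phred33(quality_string)
    return sum(scores) / len(scores)

def fraction_at_or_above(quality_strings, threshold=30):
    """Fraction of reads whose mean score is at least `threshold`."""
    means = [mean_quality(q) for q in quality_strings]
    return sum(m >= threshold for m in means) / len(means)

# Hypothetical quality strings: 'I' encodes Q40, '5' encodes Q20, '#' encodes Q2.
reads = ["IIIIIIIIII", "IIIII55555", "##########"]
share_q30 = fraction_at_or_above(reads, threshold=30)
```

Here two of the three toy reads have a mean score of at least 30, so a dataset resembling this one would fall short of the pattern expected for high-quality data.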

Figure 3.

Quality control assessment tab in the SeqExpressionAnalyser (e.g., Cycle-specific GC content plot).¹⁴

Filtering and trimming

To ensure the integrity of data for downstream analyses, the SeqExpressionAnalyser employs the QuasR package from Bioconductor, allowing for customisable filtering and trimming of input RNA-Seq data.¹⁵ The preprocessReads function allows users to remove low-quality reads and trim adapter sequences from both the 5′ and 3′ ends of each read. The tool offers options for truncating reads and filtering out those that contain an excessive number of ambiguous ‘N’ bases and those that are shorter than a specified length. In addition, low-complexity reads, identified by dinucleotide entropy, can be excluded to prevent keeping uninformative sequences. In paired-end experiments, if either read in a pair does not meet the filtering criteria, both reads will be discarded. This approach ensures consistent data quality throughout the process. Although adapter trimming is supported only for single-read experiments, these preprocessing steps ensure that only high-quality, informative reads are retained for alignment, thereby enhancing the accuracy and reliability of the analysis. The filtering process generates a summary statistics matrix that provides an overview of the preprocessing results. This matrix includes 1 column for each input sequence file (or pair of files) and displays metrics such as the total number of reads (totalSequences), the number of reads matching the 5′ or 3′ adapters (matchTo5pAdapter and matchTo3pAdapter), as well as counts for reads that were too short (tooShort), contained an excessive number of nonbase characters (tooManyN), or were classified as low complexity (lowComplexity). Finally, the matrix concludes with the total number of reads that successfully passed the filtering steps (totalPassed) (Figure 4).
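The filtering logic described above can be sketched as follows. This is a simplified Python illustration rather than QuasR's preprocessReads: the threshold values are hypothetical, complexity is measured as the Shannon entropy (in bits) of overlapping dinucleotide frequencies, and QuasR's own definition and defaults may differ. The labels tooShort, tooManyN and lowComplexity follow the summary matrix described in the text.

```python
import math
from collections import Counter

def dinucleotide_entropy(seq):
    """Shannon entropy (bits) of overlapping dinucleotide frequencies."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    if not pairs:
        return 0.0
    total = len(pairs)
    counts = Counter(pairs)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def classify_read(seq, min_length=14, max_n=2, min_entropy=1.5):
    """Return 'passed' or the name of the first filter the read fails.

    All three thresholds are illustrative values, not package defaults.
    """
    seq = seq.upper()
    if len(seq) < min_length:
        return "tooShort"
    if seq.count("N") > max_n:
        return "tooManyN"
    if dinucleotide_entropy(seq) < min_entropy:
        return "lowComplexity"
    return "passed"

def keep_pair(read1, read2, **thresholds):
    """Paired-end rule from the text: discard both mates if either fails."""
    return (classify_read(read1, **thresholds) == "passed"
            and classify_read(read2, **thresholds) == "passed")

good_read = "ACGTTGCAAGCTTCGATCGA"
homopolymer = "A" * 20
pair_kept = keep_pair(good_read, homopolymer)
```

A homopolymer run has a single dinucleotide type and therefore zero entropy, so the toy pair above is discarded even though its first mate passes every filter.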

Figure 4.

Filtering and trimming reads using the SeqExpressionAnalyser tool.

Alignment to the reference genome

The SeqExpressionAnalyser uses the Rsubread package to align sequencing reads to a user-provided reference genome.¹⁶ To perform this alignment, the user must provide key input parameters, including the input readfile1 (for single-end reads, this is the FASTQ file; for paired-end data, it refers to the forward file). If applicable, users should also specify readfile2 for mate reads. In addition, users must specify the desired output format for the aligned data, choosing between SAM and BAM, depending on how they intend to store the alignment results and the type of sequencing data being analysed. Once the alignment process is complete, the tool generates output files containing the aligned reads and saves them in the same directory as the input data. Along with these files, a results table is generated that summarises key mapping statistics for each sequencing library (Figure 5). This table presents essential information, including the total number of reads, the number of successfully mapped reads, the count of uniquely mapped reads, the number of multi-mapped reads, and the number of detected insertions and deletions. These metrics provide valuable insights into the quality of alignment and help evaluate the efficiency of the sequencing and mapping processes. A well-executed alignment typically yields a high percentage of uniquely mapped reads, ideally 70%-80% or higher. It should have a low proportion of multi-mapped reads, preferably under 10%, along with an acceptable number of insertions and deletions. In addition, the total number of mapped reads should be substantial, ideally constituting at least 80% of the total reads. A lower mapping rate may indicate quality issues in the sequencing data or in the reference genome. Reducing the occurrence of multi-mapped reads is essential to eliminate ambiguity and ensure accurate assignment of reads to their genomic locations.
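The rules of thumb above can be turned into a simple check on the mapping statistics of one library. The Python sketch below is illustrative only: the function and field names are hypothetical, the read counts are invented, and all percentages are taken relative to the total number of reads, which is an assumption of the example.

```python
def mapping_summary(total_reads, uniquely_mapped, multi_mapped,
                    min_mapped_pct=80.0, min_unique_pct=70.0, max_multi_pct=10.0):
    """Summarise mapping rates and flag values outside the suggested ranges."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    unique_pct = 100.0 * uniquely_mapped / total_reads
    multi_pct = 100.0 * multi_mapped / total_reads
    mapped_pct = unique_pct + multi_pct
    flags = []
    if mapped_pct < min_mapped_pct:
        flags.append("low overall mapping rate")
    if unique_pct < min_unique_pct:
        flags.append("low uniquely mapped rate")
    if multi_pct >= max_multi_pct:
        flags.append("high multi-mapping rate")
    return {"mapped_pct": mapped_pct, "unique_pct": unique_pct,
            "multi_pct": multi_pct, "flags": flags}

# Hypothetical libraries of one million reads each.
good = mapping_summary(1_000_000, uniquely_mapped=850_000, multi_mapped=50_000)
poor = mapping_summary(1_000_000, uniquely_mapped=500_000, multi_mapped=150_000)
```

The first library raises no flags, whereas the second is flagged on all three criteria and would warrant a closer look at read quality or at the choice of reference genome.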

Figure 5.

Outputs of the mapping reads to the reference genome using SeqExpressionAnalyser.

Read counting

The SeqExpressionAnalyser utilises the featureCounts function from the Rsubread package to generate a count matrix from mapped RNA-Seq reads, a crucial step in quantifying DGE.²⁰ This process requires BAM (or SAM) files that store RNA-Seq reads aligned to a reference genome, as well as a GTF annotation file that provides detailed information about the genomic locations of genes, exons and transcripts. BAM files contain reads that represent fragments of RNA from the original biological samples, and their alignment to the reference genome indicates their origin. The GTF file serves as a guide, ensuring that each read is accurately assigned to a corresponding gene. During quantification, featureCounts assigns each mapped read to a specific gene based on its alignment position relative to the gene annotations. It then counts the number of reads assigned to each gene, yielding a count matrix in which rows represent genes, columns correspond to samples, and values indicate the number of reads per gene in each sample (Figure 6). This count matrix is essential for downstream analyses, particularly DGE analyses, in which statistical methods determine which genes are significantly upregulated or downregulated under different experimental conditions. The accuracy of this step is crucial, as errors in read assignment or gene annotation can directly impact the biological conclusions drawn from the data.
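The assignment step can be pictured with a deliberately simplified Python sketch. It is not featureCounts: genes are represented as single intervals rather than exons from a GTF file, strand is ignored, and a read that overlaps no gene or more than one gene is left unassigned, which is one conservative convention assumed here. All identifiers and coordinates are hypothetical.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end

def count_reads(genes, alignments_by_sample):
    """Build a gene-by-sample count matrix from aligned read intervals.

    genes: dict gene_id -> (chrom, start, end), 1-based inclusive.
    alignments_by_sample: dict sample -> list of (chrom, start, end).
    Returns (samples, matrix) where matrix[gene_id][j] is the count
    for the j-th sample.
    """
    samples = list(alignments_by_sample)
    matrix = {gene_id: [0] * len(samples) for gene_id in genes}
    for j, sample in enumerate(samples):
        for chrom, start, end in alignments_by_sample[sample]:
            hits = [gene_id for gene_id, (g_chrom, g_start, g_end) in genes.items()
                    if g_chrom == chrom and overlaps(start, end, g_start, g_end)]
            # Ambiguous or unannotated reads are not counted in this sketch.
            if len(hits) == 1:
                matrix[hits[0]][j] += 1
    return samples, matrix

genes = {"geneA": ("chr1", 100, 200), "geneB": ("chr1", 180, 300)}
alignments = {
    "s1": [("chr1", 110, 150), ("chr1", 190, 195), ("chr1", 250, 290)],
    "s2": [("chr1", 120, 160), ("chr1", 130, 170), ("chr2", 110, 150)],
}
samples, matrix = count_reads(genes, alignments)
```

In the toy data, the read at positions 190-195 falls in the region shared by both genes and is therefore not counted, which illustrates how annotation overlaps can affect the final counts.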

Figure 6.

Feature counting results using SeqExpressionAnalyser.

DGE analysis

The SeqExpressionAnalyser uses DESeq2 to perform DGE analysis.²¹ To carry out this analysis, users need to provide a count matrix, which should be generated during the quantification step, along with a metadata file. The metadata file must include a column that indicates the experimental conditions for each sample. The column names of the count matrix must exactly match the row names of the metadata and be in the same order to ensure proper mapping of sample information. This alignment is crucial because DESeq2 relies on the metadata to assign experimental conditions to each sample, and any discrepancies in names or order could lead to errors or inaccurate analyses. The process begins with DESeq2 normalising the raw counts by estimating size factors to account for variations in library depth. It then estimates gene-wise dispersion values to capture biological variability, fits a generalised linear model for each gene and conducts tests for differential expression. This final step involves calculating log₂ fold changes, standard errors, Wald test statistics, raw P values, and adjusted P values and presenting them in a comprehensive results table (Figure 7).²¹ In addition to numerical outputs, DESeq2 provides various diagnostic and visualisation tools to help users evaluate model fit, data quality and sample clustering. These assessments contribute to a robust biological interpretation of the differential expression results. For instance, the MA plot, along with its shrunken log₂ fold-change variant, allows users to examine the relationship between average gene expression and fold change, highlighting potential biases and stabilising effect-size estimates for lowly expressed genes. The dispersion plot provides insight into gene-expression variability across replicates, thereby ensuring that the negative binomial model is appropriately fitted. 
Furthermore, the Principal Component Analysis (PCA) plot and heatmap of pairwise sample distances enable rapid identification of batch effects and sample outliers, while the volcano plot summarises genes that are both statistically significant and exhibit large expression differences. Finally, a heatmap of Z-scores for the top 20 most variable genes provides an intuitive overview of expression patterns across conditions.
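The requirement that count-matrix columns and metadata rows refer to the same samples in the same order can be checked before any modelling. The Python sketch below shows one possible way to do so and is not part of the package; the function name and the condition labels M9 and M9_Cys are hypothetical, while the sample identifiers are those of the use case.

```python
def align_metadata(count_columns, metadata_rows):
    """Return metadata reordered to match the count-matrix columns.

    count_columns: sample names in count-matrix column order.
    metadata_rows: list of (sample_name, condition) in metadata row order.
    Raises ValueError if the two sets of sample names differ.
    """
    meta = dict(metadata_rows)
    if len(meta) != len(metadata_rows):
        raise ValueError("duplicate sample names in metadata")
    known = set(count_columns)
    missing = [s for s in count_columns if s not in meta]
    extra = [s for s in meta if s not in known]
    if missing or extra:
        raise ValueError(f"sample mismatch: missing={missing}, extra={extra}")
    return [(s, meta[s]) for s in count_columns]

count_columns = ["GSM6625092", "GSM6625093", "GSM6625094", "GSM6625095"]
# Metadata deliberately listed in a different order.
metadata_rows = [("GSM6625094", "M9_Cys"), ("GSM6625092", "M9"),
                 ("GSM6625095", "M9_Cys"), ("GSM6625093", "M9")]
aligned = align_metadata(count_columns, metadata_rows)
conditions = [condition for _, condition in aligned]
```

Reordering the metadata explicitly, rather than assuming that it already matches, avoids silently assigning a condition to the wrong sample.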

Figure 7.

Differential gene expression analysis tab.

Discussion

The growing demand for an automated, end-to-end, interactive platform for RNA-Seq data analysis drove the development of the SeqExpressionAnalyser tool. Traditional tools and pipelines often require advanced computational skills, the integration of various tools and fragmented scripts, and familiarity with command-line interfaces. This can pose significant barriers for researchers without extensive training in bioinformatics. Our solution addresses this challenge by offering an intuitive web-based interface that seamlessly integrates all essential steps for DGE analysis of RNA-Seq datasets, from read processing to differential expression, while maintaining the robustness and transparency of established R packages. A variety of software options are available for analysing summarised expression data, each differing in features, implementation strategies (including R/Shiny, Python, and JavaScript), distribution formats (such as standalone applications, local web apps, and web services), and compatibility with existing workflows. However, none of these options fully integrates the DGE analysis workflow for R users. A comprehensive comparison of these tools is available in Supplementary File 2. A significant advantage of the application is its ability to automate and streamline a workflow that has traditionally been fragmented, often requiring multiple standalone tools, making it especially beneficial for R users. By incorporating quality control, sequence trimming, read alignment, read counting and differential expression analysis within a single framework, we significantly reduce both the time and complexity associated with conducting a thorough RNA-Seq study. This integration ensures analytical consistency across various steps, addressing a common issue found in multi-tool workflows. Furthermore, each component generates detailed reports and visual outputs, enhancing interpretability. 
Another notable advantage of our tool is its strong emphasis on data visualisation.²² By integrating exploratory plots such as MA plots, volcano plots and heatmaps, users can rapidly identify differentially expressed genes and discern underlying patterns in their datasets. These visualisation tools are crucial for both hypothesis generation and validation, particularly for researchers who may be less familiar with raw sequencing data outputs. In summary, SeqExpressionAnalyser addresses the computational challenges associated with bioinformatics analysis of DGE in RNA-Seq data. It provides a fully integrated, interactive workflow that streamlines the entire pipeline, making RNA-Seq data analysis more accessible and efficient. Reducing technical barriers enables researchers to focus on biological interpretation rather than on complex bioinformatics issues. As RNA-Seq technologies continue to advance, tools such as SeqExpressionAnalyser will play a crucial role in advancing genomic research and its clinical applications. The current version of the package emphasises preprocessing and DGE analysis. However, it lacks comprehensive downstream functional analyses, such as pathway enrichment, gene ontology (GO) analysis and gene set enrichment analysis (GSEA).^24-27 The tool’s integration with widely used Bioconductor packages may not be sufficient for users handling highly customised RNA-Seq datasets, which could require additional modifications to adapt the analysis to their specific needs. Currently, the workflow supports only a single R-based method per core step, thereby limiting the range of analytical approaches available. While there is partial compatibility with long-read sequencing data through components like Rsubread¹⁶ and featureCounts,²⁰ most modules are primarily designed for short-read applications.

To address these limitations, future enhancements will prioritise the integration of containerised versions of popular tools (e.g., STAR,¹⁸ HISAT2,¹⁹ Salmon,²⁸ RSEM²⁹) through Docker, creating a flexible and reproducible modular framework. This strategy will enable researchers to select the most appropriate methods for their experimental designs, rather than being confined to a single workflow paradigm. We also aim to enhance support for long-read sequencing by incorporating specialised tools in a modular way. In addition, integrating popular R/Bioconductor packages for functional analysis, such as clusterProfiler³⁰ and DOSE,³¹ directly into the application will enable a smooth transition from DGE outcomes to pathway-level insights within a single interface. These improvements aim to expand the application’s capabilities while ensuring it remains user-friendly and accessible.

Conclusion

SeqExpressionAnalyser is a user-friendly web application designed to simplify RNA-Seq differential expression analysis, making it accessible to researchers of all levels of bioinformatics expertise. It integrates essential Bioconductor packages into a comprehensive, end-to-end workflow that addresses every step, from processing raw sequencing data to conducting differential expression analysis. The platform provides interactive diagnostic outputs that enhance data interpretability and facilitate troubleshooting at each stage of the analysis. Unlike command-line tools that require programming knowledge, SeqExpressionAnalyser minimises technical barriers, allowing life scientists and researchers to perform robust analyses with ease. Future enhancements will emphasise flexibility and the integration of advanced techniques, such as downstream functional analyses, alongside other external tools. By continually evolving, SeqExpressionAnalyser aims to be a comprehensive and intuitive tool for RNA-Seq data analysis, enabling researchers to derive meaningful biological insights from their studies.

Footnotes

Sanae Esskhayry expresses her gratitude for the support of the CNRST (National Centre for Scientific and Technical Research) in Morocco for the PhD Associate Scholarship PASS.

ORCID iDs

Sanae Esskhayry

Ayoub Karret

Ethical Considerations

Not applicable.

Consent to Participate

Not applicable.

Consent for Publication

Not applicable.

Author Contributions

Sanae Esskhayry: Conceptualisation; Writing – original draft; Methodology; Writing – review & editing; Software.

Ouafae Kaissi: Conceptualisation; Writing – original draft; Methodology; Writing – review & editing; Software.

Fouzia Radouani: Writing – original draft; Writing – review & editing.

Jaouhara Maamar: Writing – original draft.

Ayoub Karret: Writing – review & editing.

Rajaa Chahboune: Writing – review & editing.

Rachida Fissoune: Supervision; Writing – review & editing.

Afaf Lamzouri: Writing – review & editing; Supervision; Validation.

Funding

The authors received no financial support for the research, authorship and/or publication of this article.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Data Availability Statement

The data used in the described use cases (as a demo dataset and in Additional file 1) are available from the following article: Rodionova IA, Lim HG, Gao Y, Rodionov DA, Hutchison Y, Szubin R, et al. CyuR is a dual regulator for L-cysteine-dependent antimicrobial resistance in Escherichia coli. Communications Biology. 2024 Sep 17;7(1). https://pmc.ncbi.nlm.nih.gov/articles/PMC11408624/. Data were retrieved from the Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/geo) under accession GSE215167 and from BioProject PRJNA889026. The reference genome and its annotation in GTF format can be accessed from the NCBI database (https://www.ncbi.nlm.nih.gov/) at (https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000184185.1/). In addition, the SeqExpressionAnalyser package is available for download from its GitHub repository (

Supplementary File 1: Complete use case for the SeqExpressionAnalyser package, based on the bacterium Escherichia coli W dataset (the bacteria were cultured in an M9 medium with or without L-cysteine)

Supplementary File 2: Comparison of software for interactive RNA-Seq data analysis, including links to publications and source code repositories. Evaluation criteria are in a dedicated sheet:

Availability and Requirements

Project name: SeqExpressionAnalyser

Project home page:

(GitHub).

Docker image:

Operating system(s): Linux, Mac OS, Windows.

Programming language: R.

Other requirements: R 4.4.1 or higher, Bioconductor 3.19 or higher.

Licence: GNU GPLv3.

Any restrictions on use by nonacademics: none.

Supplemental Material

Supplemental material for this article is available online.

References

Singh

Miaskowski

Dhruva

Flowers

Kober

KM.

Mechanisms and measurement of changes in gene expression. Biol Res Nurs. 2018;20:369-382. doi:10.1177/1099800418772161

Kondapuram

Coumar

MS.

Pan-cancer gene expression analysis: identification of deregulated autophagy genes and drugs to target them. Gene. 2022;844:146821. doi:10.1016/j.gene.2022.146821

Qin

Next-generation sequencing and its clinical application. Cancer Biol Med. 2019;16:4. doi:10.20892/j.issn.2095-3941.2018.0055

Costa-Silva

Domingues

Menotti

Hungria

Lopes

FM.

Temporal progress of gene expression analysis with RNA-Seq data: a review on the relationship between computational methods. Comput Struct Biotechnol J. 2022;21:86-98. doi:10.1016/j.csbj.2022.11.051

Lowe

Shirley

Bleackley

Dolan

Shafee

Transcriptomics technologies. PLoS Comput Biol. 2017;13:e1005457. doi:10.1371/journal.pcbi.1005457

Wang

Gerstein

Snyder

RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10:57-63. doi:10.1038/nrg2484

Conesa

Madrigal

Tarazona

, et al. A survey of best practices for RNA-seq data analysis. Genome Biol. 2016;17:13. doi:10.1186/s13059-016-0881-8

Chowdhury

Bhattacharyya

Kalita

JK.

Differential expression analysis of RNA-seq reads: overview, taxonomy, and tools. IEEE/ACM Trans Comput Biol Bioinform. 2020;17:566-586. doi:10.1109/TCBB.2018.2873010

Naldurtiker

Batchu

Kouakou

Terrill

McCommon

Kannan

Differential gene expression analysis using RNA-seq in the blood of goats exposed to transportation stress. Sci Rep. 2023;13:1984. doi:10.1038/s41598-023-29224-5

10.

Rosati

Palmieri

Brunelli

, et al. Differential gene expression analysis pipelines and bioinformatic tools for the identification of specific biomarkers: a review. Comput Struct Biotechnol J. 2024;23:1154-1168. doi:10.1016/j.csbj.2024.02.018

11. Pereira, Oliveira, Sousa. Bioinformatics and computational tools for next-generation sequencing analysis in clinical genetics. J Clin Med. 2020;9:132. doi:10.3390/jcm9010132

12. Sepulveda JL. Using R and Bioconductor in clinical genomics and transcriptomics. J Mol Diagn. 2020;22:3-20. doi:10.1016/j.jmoldx.2019.08.006

13. Huber, Carey, Gentleman, et al. Orchestrating high-throughput genomic analysis with Bioconductor. Nat Methods. 2015;12:115-121. doi:10.1038/nmeth.3252

14. Souza, de Carvalho, Lopes-Cendes. Rqc: a Bioconductor package for quality control of high-throughput sequencing data. J Stat Softw. 2018;87:1-14. doi:10.18637/jss.v087.c02

15. Gaidatzis, Lerch, Hahne, Stadler MB. QuasR: quantification and annotation of short reads in R. Bioinformatics. 2015;31:1130-1132. doi:10.1093/bioinformatics/btu781

16. Liao, Smyth, Shi. The R package Rsubread is easier, faster, cheaper and better for alignment and quantification of RNA sequencing reads. Nucleic Acids Res. 2019;47:e47. doi:10.1093/nar/gkz114

17. Kim, Pertea, Trapnell, Pimentel, Kelley, Salzberg SL. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 2013;14:R36. doi:10.1186/gb-2013-14-4-r36

18. Dobin, Davis, Schlesinger, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29:15-21. doi:10.1093/bioinformatics/bts635

19. Kim, Paggi, Park, Bennett, Salzberg SL. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol. 2019;37:907-915. doi:10.1038/s41587-019-0201-4

20. Liao, Smyth, Shi. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 2014;30:923-930. doi:10.1093/bioinformatics/btt656

21. Love, Huber, Anders. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15:550. doi:10.1186/s13059-014-0550-8

22. Rutter, Moran Lauter, Graham, Cook. Visualization methods for differential expression analysis. BMC Bioinformatics. 2019;20:458. doi:10.1186/s12859-019-2968-1

23. Rios, Pal, et al. EGF-mediated induction of Mcl-1 at the switch to lactation is essential for alveolar cell survival. Nat Cell Biol. 2015;17:365-375. doi:10.1038/ncb3110

24. Reimand, Isserlin, Voisin, et al. Pathway enrichment analysis and visualization of omics data using g:Profiler, GSEA, Cytoscape and EnrichmentMap. Nat Protoc. 2019;14:482-517. doi:10.1038/s41596-018-0103-9

25. Mubeen, Kodamullil, Hofmann-Apitius, Domingo-Fernández. On the influence of several factors on pathway enrichment analysis. Brief Bioinform. 2022;23:bbac143. doi:10.1093/bib/bbac143

26. Zhao, Wang, Chen, Zhang, Guo. A literature review of gene function prediction by modeling gene ontology. Front Genet. 2020;11:400. doi:10.3389/fgene.2020.00400

27. Candia, Ferrucci. Assessment of gene set enrichment analysis using curated RNA-seq-based benchmarks. PLoS ONE. 2024;19:e0302696. doi:10.1371/journal.pone.0302696

28. Patro, Duggal, Love, Irizarry, Kingsford. Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods. 2017;14:417-419. doi:10.1038/nmeth.4197

29. Li, Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics. 2011;12:323. doi:10.1186/1471-2105-12-323

30. Yu, Wang, Han, He QY. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS. 2012;16:284-287. doi:10.1089/omi.2011.0118

31. Yu, Wang, Yan, He QY. DOSE: an R/Bioconductor package for disease ontology semantic and enrichment analysis. Bioinformatics. 2015;31:608-609. doi:10.1093/bioinformatics/btu684

SeqExpressionAnalyser: An R Package for Automated End-to-End RNA-Seq Analysis From Reads to Differential Expression
