This webinar provides a comprehensive overview of RNA sequencing (RNA-seq) analysis, detailing its applications, a standard workflow from data processing to interpretation, and available Illumina solutions to facilitate these analyses.
welcome to the webinar fundamentals of
rna sequencing analysis
in this webinar you'll be first
introduced to applications of rna sequencing
followed by rna-seq analysis workflow
and lastly illumina solutions that
facilitate these analyses
the power of rna-seq allows us to
perform expression profiling
of a gene or transcript in a single condition
if we have samples from different
conditions we can measure
relative changes in gene or transcript abundance
this is useful when we want to identify
disease biomarkers
we can also conduct transcript analysis
by assembling the whole transcriptome
and detect known or novel rna species
these species
include splice variants fusions as well
let's dive into what a typical rna-seq
say that your lab is interested in
identifying biomarkers for a cancer
and has sequenced the rna of tumor
samples and non-malignant samples
there are five main steps in an
end-to-end rna-seq analysis workflow
starting from the transcriptome
profiling of individual samples
differential expression analysis of
samples between two conditions
functional annotation and visualization
of interesting candidates from the analysis
and lastly integrating with other data
sets as additional evidence to further support
these interesting candidates for
experimental validation
the first three steps belong to the
core section of rna-seq analysis
while the last two steps are more
let's start with transcriptome profiling
of the sample
to profile the expression of known genes
or transcripts in the sample
we generally perform reference guided mapping
we first map the reads in the fastq file
to a reference genome
and gene annotation file to obtain the
raw counts of each gene or transcript
in the bam file are computed to generate
raw counts
this is in turn normalized for
downstream analysis
i will dive into details on each of
dragen rna pipeline an illumina solution
generates outputs such as the bam file
information and splice junctions
it also provides qc metrics of the fastq files
for rna-seq studies of human data or
organisms with well characterized
reference genomes
we can start with the mapping step where
the reads in the fastq files are mapped
to the reference genome or transcriptome
the inputs are fastq files
a reference and a gene annotation file
this is a peek at the format of a gene
annotation file
where it contains transcript and
exon level annotations
that match the reference genome being
mapped against
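To make the annotation format more concrete, here is a minimal Python sketch that parses one record. It assumes the annotation is a GTF file (nine tab-separated columns with `key "value";` attributes); the webinar does not name the format, and the example line, gene ID, and transcript ID are made up for illustration.

```python
def parse_gtf_line(line):
    """Split one GTF record into named fields (assumes the 9-column GTF layout)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns")
    seqname, source, feature, start, end, score, strand, frame, attr_text = fields

    # The last column holds 'key "value";' pairs such as gene_id and transcript_id.
    attributes = {}
    for item in attr_text.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip('"')

    return {
        "seqname": seqname,
        "feature": feature,      # e.g. gene, transcript, exon
        "start": int(start),     # 1-based, inclusive
        "end": int(end),
        "strand": strand,
        "attributes": attributes,
    }


# Hypothetical exon-level record, not taken from a real annotation.
example = (
    "chr1\texample\texon\t1000\t1200\t.\t+\t.\t"
    'gene_id "GENE_A"; transcript_id "TX_A1"; exon_number "1";'
)
record = parse_gtf_line(example)
print(record["feature"], record["attributes"]["transcript_id"])
```

The chromosome names in the annotation need to match those in the reference being mapped against, which is why the two files are chosen as a pair.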
genome mapping and transcriptome mapping
are performed for different use cases
we perform genome mapping to identify
novel genes or transcripts
in addition to known ones we use spliced aligners
such as tophat2 star and dragen rna
in contrast the transcriptome mapping is
useful when you're interested
in profiling known transcripts and the
transcript annotation is comprehensive enough
it uses an unspliced aligner such as bowtie
after obtaining the aligned bam files
we want to measure the abundance of a
particular genomic feature in each sample
using the bam file as input we use a
counting tool
together with the gene annotation file
to generate a table of abundance values
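As a rough illustration of what a counting step does, the sketch below assigns each aligned read to a gene by coordinate overlap. It is a simplified toy under stated assumptions, not how any particular counting tool works: genes are single intervals rather than exon models, strand is ignored, and reads overlapping zero or several genes are skipped. All gene names and coordinates are hypothetical.

```python
from collections import Counter


def count_reads_per_gene(genes, reads):
    """Count reads per gene by interval overlap.

    genes: {gene_id: (chrom, start, end)}
    reads: list of (chrom, start, end), coordinates 1-based and inclusive
    """
    counts = Counter({gene_id: 0 for gene_id in genes})
    for chrom, r_start, r_end in reads:
        hits = [
            gene_id
            for gene_id, (g_chrom, g_start, g_end) in genes.items()
            if g_chrom == chrom and r_start <= g_end and r_end >= g_start
        ]
        # Only count reads that overlap exactly one gene; skip the rest.
        if len(hits) == 1:
            counts[hits[0]] += 1
    return dict(counts)


# Hypothetical annotation and alignments for one sample.
genes = {"GENE_A": ("chr1", 100, 500), "GENE_B": ("chr1", 450, 900)}
reads = [
    ("chr1", 120, 170),  # GENE_A
    ("chr1", 300, 350),  # GENE_A
    ("chr1", 460, 510),  # overlaps both genes, skipped
    ("chr1", 700, 750),  # GENE_B
    ("chr2", 10, 60),    # no gene, skipped
]
counts = count_reads_per_gene(genes, reads)
print(counts)  # {'GENE_A': 2, 'GENE_B': 1}
```

Running this per sample and joining the results gives the table of raw counts, with genes as rows and samples as columns.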
depending on the application there are
different approaches to quantify
different genomic features as summarized
by this table
note that all three approaches require a
gene annotation
file the dragen rna pipeline is able to
measure abundance
at the gene and transcript levels which
are useful for gene and transcript
expression analysis respectively
for this webinar i'll focus on gene expression
if we obtain raw counts of each gene in
each sample
we introduce normalization to ensure
that we can compare
the expression of each gene across
the input is a table of raw counts where
rows represent the gene and columns
represent the samples the output here is
there are multiple ways of normalizing
we can normalize gene expression by the
library size
different samples have different
sequencing depth which influences the
we can also normalize by gene length as
different genes have different lengths
as a result longer genes would be
counted to have more reads than shorter
the most commonly reported measures that
normalize by library size and gene length are
fpkm which stands for fragments per kilobase
of exon model per million mapped reads
rpkm which stands for reads per kilobase of
exon model per million mapped reads
and tpm transcripts per million
different tools can adopt different
normalization methods
the dragen rna pipeline generates tpm counts
fpkm and rpkm are traditional means of
performing normalization
rpkm is used for single-end rna-seq
whereas fpkm is used for paired-end rna-seq
here we normalize
for the sequencing depth followed by the
gene length
tpm is a more recent normalization metric
where the counts are normalized for the gene
length followed by the sequencing depth
using a per million scaling factor
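The difference in the order of operations can be shown with a small worked example. This is a minimal sketch with made-up counts and gene lengths; the function names are illustrative. FPKM follows the same formula as RPKM with fragments counted instead of reads.

```python
def rpkm(counts, lengths_bp):
    """Depth first, then length: reads per kilobase per million mapped reads."""
    per_million = sum(counts.values()) / 1e6
    return {g: counts[g] / per_million / (lengths_bp[g] / 1e3) for g in counts}


def tpm(counts, lengths_bp):
    """Length first, then depth: transcripts per million."""
    # Step 1: reads per kilobase for each gene.
    rate = {g: counts[g] / (lengths_bp[g] / 1e3) for g in counts}
    # Step 2: per million scaling factor computed from the length-normalized rates.
    per_million = sum(rate.values()) / 1e6
    return {g: rate[g] / per_million for g in counts}


# Hypothetical raw counts and gene lengths (in base pairs) for one sample.
counts = {"GENE_A": 100, "GENE_B": 300, "GENE_C": 600}
lengths = {"GENE_A": 1000, "GENE_B": 2000, "GENE_C": 3000}

rpkm_values = rpkm(counts, lengths)
tpm_values = tpm(counts, lengths)
print(rpkm_values)
print(tpm_values)
print(sum(tpm_values.values()))
```

Because the per million factor in TPM is computed after length normalization, TPM values sum to one million in every sample, which is not generally true of RPKM or FPKM.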
normalization by distribution is a third method
the assumption here is that most
genes are not differentially expressed
the rna composition bias arises when
only a small number of genes are very
highly expressed
in one sample but not in the other
to handle this bias the geometric mean
for each gene is computed across all samples
and the expression values of each gene
in each sample are then normalized by
that mean
deseq2 and dragen differential
expression pipeline
employ this normalization method before
performing differential expression analysis
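The geometric-mean idea described above can be sketched as a median-of-ratios calculation. This is a minimal illustration, assuming the published median-of-ratios scheme in which each sample's size factor is the median of its gene-wise ratios to the geometric mean; it is not the code of DESeq2 or of the DRAGEN pipeline, and the counts are hypothetical.

```python
import math
from statistics import median


def size_factors(counts):
    """counts: {gene: [raw count in sample 0, sample 1, ...]}"""
    n_samples = len(next(iter(counts.values())))
    ratios = [[] for _ in range(n_samples)]
    for values in counts.values():
        if min(values) <= 0:
            continue  # geometric mean would be zero, so skip this gene
        # Geometric mean of the gene across all samples, computed in log space.
        geo_mean = math.exp(sum(math.log(v) for v in values) / len(values))
        for i, v in enumerate(values):
            ratios[i].append(v / geo_mean)
    # The median ratio per sample is robust to a few very highly expressed genes.
    return [median(r) for r in ratios]


def normalize(counts):
    factors = size_factors(counts)
    return {g: [v / f for v, f in zip(vals, factors)] for g, vals in counts.items()}


# Hypothetical counts: sample 1 is sequenced twice as deeply as sample 0,
# and GENE_D is very highly expressed in sample 1 only.
counts = {
    "GENE_A": [10, 20],
    "GENE_B": [100, 200],
    "GENE_C": [50, 100],
    "GENE_D": [5, 1000],
}
factors = size_factors(counts)
normalized = normalize(counts)
print(factors)
print(normalized["GENE_A"])
```

In this toy data the size factors differ by a factor of two, matching the depth difference, and GENE_D does not distort them because the median ignores its extreme ratio.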
after obtaining the normalized counts we
pause for a qc check
to check the quality of the sequencing
data at the count level
to assess biological reproducibility
or batch effects this is important
before we proceed to downstream analysis
the input we use here would be a table
of normalized counts
of all genes across all samples
we first use a dimensionality reduction technique
called principal component analysis that
captures the most variation in that data set
in this case it could be the tumor
versus non-malignant
variation across the samples we can
check if the replicates of multiple samples
are clustering as biologically expected
in this case tumor cell replicates are
clustering together
and not mixing with the non-malignant sample replicates
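The replicate check can be sketched in a few lines of Python. This is a minimal illustration that finds only the first principal component by power iteration, using the standard library; the sample names and log-scale expression values are hypothetical, and a real analysis would typically use an established PCA implementation and plot the first two components.

```python
import math


def first_pc_scores(matrix, n_iter=200):
    """Project samples onto the first principal component.

    matrix: rows are samples, columns are genes (normalized, log-scale values).
    """
    n, p = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in matrix]

    # Power iteration on X^T X converges to the direction of greatest variance.
    start = [float(j + 1) for j in range(p)]
    norm = math.sqrt(sum(x * x for x in start))
    v = [x / norm for x in start]
    for _ in range(n_iter):
        scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
        new_v = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(x * x for x in new_v))
        if norm == 0:
            break
        v = [x / norm for x in new_v]

    return [sum(x * w for x, w in zip(row, v)) for row in centered]


# Hypothetical normalized log expression for three genes in six samples.
samples = ["tumor_1", "tumor_2", "tumor_3", "normal_1", "normal_2", "normal_3"]
log_expr = [
    [8.1, 2.0, 5.0],
    [7.9, 2.2, 5.1],
    [8.0, 1.9, 4.9],
    [3.0, 6.1, 5.0],
    [3.2, 5.9, 5.2],
    [2.9, 6.0, 4.8],
]
scores = first_pc_scores(log_expr)
for name, score in zip(samples, scores):
    print(name, round(score, 2))
```

In this toy data the tumor replicates land on one side of the first component and the non-malignant replicates on the other, which is the separation to look for before moving on. The sign of the component is arbitrary, so only the grouping matters.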