Skip to content
GitLab
Projects
Groups
Snippets
Help
Loading...
Help
Help
Support
Community forum
Keyboard shortcuts
?
Submit feedback
Contribute to GitLab
Sign in
Toggle navigation
S
stat_methods_bioinf
Project overview
Project overview
Details
Activity
Releases
Repository
Repository
Files
Commits
Branches
Tags
Contributors
Graph
Compare
Issues
0
Issues
0
List
Boards
Labels
Service Desk
Milestones
Merge Requests
0
Merge Requests
0
CI / CD
CI / CD
Pipelines
Jobs
Schedules
Operations
Operations
Incidents
Environments
Packages & Registries
Packages & Registries
Container Registry
Analytics
Analytics
CI / CD
Repository
Value Stream
Wiki
Wiki
Members
Members
Collapse sidebar
Close sidebar
Activity
Graph
Create a new issue
Jobs
Commits
Issue Boards
Open sidebar
Bernd Klaus
stat_methods_bioinf
Commits
a3f4143f
Commit
a3f4143f
authored
Aug 16, 2017
by
Bernd Klaus
Browse files
Options
Browse Files
Download
Email Patches
Plain Diff
added bulk RNA-Seq example data
parent
92831134
Changes
4
Expand all
Hide whitespace changes
Inline
Side-by-side
Showing
4 changed files
with
344 additions
and
74 deletions
+344
-74
.gitignore
.gitignore
+2
-0
graphics_bioinf.R
graphics_bioinf.R
+34
-0
graphics_bioinf.Rmd
graphics_bioinf.Rmd
+108
-2
graphics_bioinf.html
graphics_bioinf.html
+200
-72
No files found.
.gitignore
View file @
a3f4143f
...
...
@@ -3,3 +3,5 @@ material_stat_methods/
data/
.Rhistory
*_files/
SRP022054/
rse_gene.Rdata
graphics_bioinf.R
View file @
a3f4143f
...
...
@@ -39,6 +39,7 @@ library("tibble")
library
(
"broom"
)
library
(
"scran"
)
library
(
"locfit"
)
library
(
"recount"
)
theme_set
(
theme_gray
(
base_size
=
18
))
...
...
@@ -171,6 +172,39 @@ dataL$locFit <- predict(locfit(y~lp(a, nn=0.5, deg=1), data=dataL),
+
geom_line
(
aes
(
x
=
a
,
y
=
LinReg
),
color
=
"coral3"
))
## ----import_crc----------------------------------------------------------
download_study
(
"SRP022054"
,
type
=
"rse-gene"
)
load
(
file.path
(
"SRP022054"
,
"rse_gene.Rdata"
))
crc_data
<-
rse_gene
crc_data
## ----sumexp, echo=FALSE--------------------------------------------------
par
(
mar
=
c
(
0
,
0
,
0
,
0
))
plot
(
1
,
1
,
xlim
=
c
(
0
,
100
),
ylim
=
c
(
0
,
100
),
bty
=
"n"
,
type
=
"n"
,
xlab
=
""
,
ylab
=
""
,
xaxt
=
"n"
,
yaxt
=
"n"
)
polygon
(
c
(
45
,
80
,
80
,
45
),
c
(
10
,
10
,
70
,
70
),
col
=
rgb
(
1
,
0
,
0
,
.5
),
border
=
NA
)
polygon
(
c
(
45
,
80
,
80
,
45
),
c
(
68
,
68
,
70
,
70
),
col
=
rgb
(
1
,
0
,
0
,
.5
),
border
=
NA
)
text
(
62.5
,
40
,
"assay(s)"
)
text
(
62.5
,
30
,
"e.g. 'counts'"
)
polygon
(
c
(
20
,
40
,
40
,
20
),
c
(
10
,
10
,
70
,
70
),
col
=
rgb
(
0
,
0
,
1
,
.5
),
border
=
NA
)
polygon
(
c
(
20
,
40
,
40
,
20
),
c
(
68
,
68
,
70
,
70
),
col
=
rgb
(
0
,
0
,
1
,
.5
),
border
=
NA
)
text
(
30
,
40
,
"rowData"
)
polygon
(
c
(
45
,
80
,
80
,
45
),
c
(
75
,
75
,
90
,
90
),
col
=
rgb
(
.5
,
0
,
.5
,
.5
),
border
=
NA
)
polygon
(
c
(
45
,
47
,
47
,
45
),
c
(
75
,
75
,
90
,
90
),
col
=
rgb
(
.5
,
0
,
.5
,
.5
),
border
=
NA
)
text
(
62.5
,
82.5
,
"colData"
)
## ----pre_crc-------------------------------------------------------------
counts_crc
<-
assay
(
crc_data
)
counts_crc
[
1
:
5
,
1
:
5
]
counts_p4_meta
<-
counts_crc
[,
c
(
"SRR837829"
,
"SRR837847"
)]
View
(
counts_p4_meta
)
## ----pre_crc_genes-------------------------------------------------------
colnames
(
colData
(
crc_data
))
colData
(
crc_data
)[
1
:
5
,
c
(
"title"
,
"characteristics"
,
"mapped_read_count"
)]
nrow
(
colData
(
crc_data
))
## ----session_info, cache = FALSE-----------------------------------------
sessionInfo
()
graphics_bioinf.Rmd
View file @
a3f4143f
...
...
@@ -67,6 +67,7 @@ library("tibble")
library
(
"broom"
)
library
(
"scran"
)
library
(
"locfit"
)
library
(
"recount"
)
theme_set
(
theme_gray
(
base_size
=
18
))
...
...
@@ -385,10 +386,115 @@ dataL$locFit <- predict(locfit(y~lp(a, nn=0.5, deg=1), data=dataL),
#
Normalization
and
confounding
factors
.
#
Normalization
and
variance
stabilization
of
bulk
and
single
cell
data
##
Normalization
of
single
cell
data
##
Size
factors
for
bulk
RNA
--
Seq
Before
exploring
normalization
of
single
cell
data
,
we
look
at
this
issue
in
bulk
RNA
--
Seq
:
Next
generation
sequencing
data
is
often
analyzed
in
the
form
of
read
counts
assigned
to
genomic
bins
.
Such
bins
typically
correspond
to
all
the
exons
of
a
gene
,
or
an
isoform
for
RNA
--
Seq
data
.
The
aim
of
normalization
is
to
remove
between
--
sample
differences
realted
to
library
sizes
and
other
systematic
technical
effects
.
A
popular
method
for
the
normalization
in
sequencing
data
is
global
scaling
:
all
counts
for
a
given
sample
are
scaled
by
factor
such
that
the
scaled
counts
are
comparable
across
samples
.
Such
a
factor
is
commonly
called
a
"size factor"
and
the
normalized
data
then
computed
as
:
__normalized
sample
counts__
=
__raw
sample
counts__
/
__
sample
size
factor__
##
A
Colectoral
Cancer
exmaple
data
set
We
will
illustrate
this
using
a
dataset
from
the
[
recount2
resource
](
http
://
dx
.
doi
.
org
/
10.1038
/
nbt
.3838
).
[
L
ö
hr
et
.
al
.,
2013
](
https
://
doi
.
org
/
10.1371
/
journal
.
pone
.0067461
)
have
applied
RNA
-
Seq
on
colorectal
normal
,
tumor
and
metastasis
tissues
from
8
patients
to
generate
gene
expression
patterns
.
They
were
particularly
interested
in
differential
expression
of
miRNAs
between
tumor
and
metastasis
and
tumor
/
normal
tissues
.
The
package
`
r
Biocpkg
(
"recount"
)
`
is
part
of
Bioconductor
and
we
use
it
to
conveniently
download
and
import
[
the
data
](
http
://
www
.
ebi
.
ac
.
uk
/
ena
/
data
/
view
/
PRJNA201245
)
into
R
:
```{
r
import_crc
}
download_study
(
"SRP022054"
,
type
=
"rse-gene"
)
load
(
file
.
path
(
"SRP022054"
,
"rse_gene.Rdata"
))
crc_data
<-
rse_gene
crc_data
```
The
data
come
in
the
form
of
a
_RangedSummarizedExperiment_
.
`
SummarizedExperiment
`
is
a
matrix
-
like
structure
where
rows
represent
features
of
interest
(
e
.
g
.
genes
,
transcripts
,
exons
,
etc
.)
and
columns
represent
samples
.
One
or
more
`
assay
(
s
)`,
contain
a
matrix
of
the
actual
experimental
data
:
The
rows
of
an
assay
represent
features
of
interest
.
Annotation
of
these
features
is
retrieved
by
using
the
function
`
rowData
()`
while
metadata
on
the
samples
can
be
obtained
by
a
call
to
`
colData
()`.
The
assay
(
s
)
and
the
row
and
column
data
tables
are
interconnected
,
removing
a
feature
or
sample
from
the
assay
will
also
remove
it
from
the
rowData
and
colData
.
```{
r
sumexp
,
echo
=
FALSE
}
par
(
mar
=
c
(
0
,
0
,
0
,
0
))
plot
(
1
,
1
,
xlim
=
c
(
0
,
100
),
ylim
=
c
(
0
,
100
),
bty
=
"n"
,
type
=
"n"
,
xlab
=
""
,
ylab
=
""
,
xaxt
=
"n"
,
yaxt
=
"n"
)
polygon
(
c
(
45
,
80
,
80
,
45
),
c
(
10
,
10
,
70
,
70
),
col
=
rgb
(
1
,
0
,
0
,
.5
),
border
=
NA
)
polygon
(
c
(
45
,
80
,
80
,
45
),
c
(
68
,
68
,
70
,
70
),
col
=
rgb
(
1
,
0
,
0
,
.5
),
border
=
NA
)
text
(
62.5
,
40
,
"assay(s)"
)
text
(
62.5
,
30
,
"e.g. 'counts'"
)
polygon
(
c
(
20
,
40
,
40
,
20
),
c
(
10
,
10
,
70
,
70
),
col
=
rgb
(
0
,
0
,
1
,
.5
),
border
=
NA
)
polygon
(
c
(
20
,
40
,
40
,
20
),
c
(
68
,
68
,
70
,
70
),
col
=
rgb
(
0
,
0
,
1
,
.5
),
border
=
NA
)
text
(
30
,
40
,
"rowData"
)
polygon
(
c
(
45
,
80
,
80
,
45
),
c
(
75
,
75
,
90
,
90
),
col
=
rgb
(
.5
,
0
,
.5
,
.5
),
border
=
NA
)
polygon
(
c
(
45
,
47
,
47
,
45
),
c
(
75
,
75
,
90
,
90
),
col
=
rgb
(
.5
,
0
,
.5
,
.5
),
border
=
NA
)
text
(
62.5
,
82.5
,
"colData"
)
```
Therefore
,
we
can
preview
the
actual
count
data
matrix
like
so
:
```{
r
pre_crc
}
counts_crc
<-
assay
(
crc_data
)
counts_crc
[
1
:
5
,
1
:
5
]
counts_p4_meta
<-
counts_crc
[,
c
(
"SRR837829"
,
"SRR837847"
)]
View
(
counts_p4_meta
)
```
The
annotation
for
the
genes
looks
like
this
:
```{
r
pre_crc_genes
}
colnames
(
colData
(
crc_data
))
colData
(
crc_data
)[
1
:
5
,
c
(
"title"
,
"characteristics"
,
"mapped_read_count"
)]
nrow
(
colData
(
crc_data
))
```
Where
we
subseted
on
the
interesting
columns
giving
us
information
on
the
samples
.
We
have
various
normal
,
tumor
and
metastasis
tissues
from
8
donors
.
The
authors
had
one
pipelinr
for
the
quantification
of
miRNA
and
one
for
the
quantification
of
"ordinary"
genes
.
But
as
the
recount2
ressource
uses
a
unified
pipeline
with
annotation
data
from
the
[
genecode
project
](
https
://
www
.
gencodegenes
.
org
/
stats
/
archive
.
html
),
which
contains
non
coding
features
as
well
,
we
can
focus
on
the
samples
annotated
as
"mRNA"
.
<
!-- the median gene expression per sample relative to a reference -->
<
!-- sample formed by the geometric mean of each single gene across samples. -->
<
!-- Thus it, gives a scaling factor which makes the gene expression measurements -->
<
!-- (counts) comparable across samples. Dividing the -->
<
!-- counts by the sizefactors will normalize them. -->
<
!-- Note that the size factor normalization will only work if most of the -->
<
!-- genes are not differentially expressed between the samples. We use the function -->
<
!-- \Rfunction{estimateSizeFactorsForMatrix} from the \Rpackage{DESeq2} in order -->
<
!-- to calculate the size factors for each sample. -->
##
sc
data
*
use
scran
size
factors
...
...
graphics_bioinf.html
View file @
a3f4143f
This diff is collapsed.
Click to expand it.
Write
Preview
Markdown
is supported
0%
Try again
or
attach a new file
.
Attach a file
Cancel
You are about to add
0
people
to the discussion. Proceed with caution.
Finish editing this message first!
Cancel
Please
register
or
sign in
to comment