Commit e91e2a02 authored by Bernd Klaus

some fixes and additions from the Dec16 course

parent 151ca43a
......@@ -386,7 +386,7 @@ and have a levels attribute:
```{r factors}
x <- factor(c("ab", "cd", "ab"), levels = c("ab", "cd", "ef"))
x <- factor(c("wt", "wt", "mut", "mut"), levels = c("wt", "mut"))
x
```
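As a minimal sketch of what the levels attribute does, the base R example below inspects a factor like the one in this hunk. The variable names `genotype` and `unused` are illustrative and not part of the course material.

```r
# Factor with an explicit level order (same values as in the hunk above)
genotype <- factor(c("wt", "wt", "mut", "mut"), levels = c("wt", "mut"))

levels(genotype)      # the levels attribute, in the order given
as.integer(genotype)  # underlying integer codes: "wt" is 1, "mut" is 2

# A level that never occurs in the data is still kept and counted as 0
unused <- factor(c("ab", "cd", "ab"), levels = c("ab", "cd", "ef"))
table(unused)
```

Setting `levels` explicitly fixes the order used by tables and plots, which would otherwise default to alphabetical order.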
......@@ -526,7 +526,7 @@ glimpse(pat)
It has weight, height and gender of three people.
We can also use the function `read.xlsx` from the `r CRANpkg("opnxlsx")`
We can also use the function `read.xlsx` from the `r CRANpkg("openxlsx")`
package to import the data from an excel sheet. Here, we have to use the
function `as_tibble` to turn the data.frame into an equivalent tibble.
......@@ -625,6 +625,61 @@ pat[["Gender"]]
More on lists can be found in the respective chapter of "R for data science"
[here](http://r4ds.had.co.nz/vectors.html#lists).
## Summary: data access in R
We prepare a simple vector to illustrate the access options again:
```{r acces_recap}
sample_vector <- c("Alice" = 5.4, "Bob" = 3.7, "Claire" = 8.8)
sample_vector
```
### Access by index
The simplest way to access the elements in a vector is via their indices.
Specifically, you provide a vector of indices to say which
elements from the vector you want to retrieve. A minus sign excludes the respective
positions:
```{r access_index, dependson="acces_recap"}
sample_vector[1:2]
sample_vector[-(1:2)]
```
### Access by boolean
If you generate a boolean vector of the same length as your actual vector, you can use
the positions of the `TRUE` values to pull out the corresponding elements from the full set.
You can also use shorter boolean vectors, which will be recycled (repeated)
to match all of the positions in the vector, but this is less common.
```{r access_boolean, dependson="acces_recap"}
sample_vector[c(TRUE, FALSE, TRUE)]
```
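The less common case of a shorter boolean vector can be sketched as follows. R repeats (recycles) the short vector until it covers every position. The name `recycled` is illustrative.

```r
sample_vector <- c("Alice" = 5.4, "Bob" = 3.7, "Claire" = 8.8)

# c(TRUE, FALSE) has length 2, so it is recycled to c(TRUE, FALSE, TRUE)
# before subsetting; positions 1 and 3 are kept
recycled <- sample_vector[c(TRUE, FALSE)]
recycled
```

Because the recycling happens silently, a boolean vector of the full length is usually easier to read and check.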
This can also be used in conjunction with logical tests
which generate a boolean result. Boolean vectors can be combined with logical operators
to create more complex filters.
```{r access_boolean2, dependson="acces_recap"}
sample_vector[sample_vector < 6]
```
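A short sketch of combining logical tests with the operators `&` (and), `|` (or) and `!` (not). The thresholds and the names `in_range`, `extreme` and `not_small` are chosen for illustration only.

```r
sample_vector <- c("Alice" = 5.4, "Bob" = 3.7, "Claire" = 8.8)

# Both conditions must hold: values strictly between 4 and 6
in_range <- sample_vector[sample_vector > 4 & sample_vector < 6]

# At least one condition must hold: values below 4 or above 8
extreme <- sample_vector[sample_vector < 4 | sample_vector > 8]

# Negation: everything that is not below 4
not_small <- sample_vector[!(sample_vector < 4)]

in_range
extreme
not_small
```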
### Access by name
If names are present, such as
column names (note that row names are not preserved in the tidyverse),
you can access elements by name as well:
```{r access_name}
sample_vector[c("Alice", "Claire")]
```
## Applying a function to elements of a data structure
R encourages the use of functions for programming. Instead of e.g. looping through
......@@ -685,7 +740,7 @@ function_name <- function(argument_1, argument_2,
```
As you can see, the source code of the function has to be in curly brackets, while
the arguments are defined in the parantheses. Arguments without a default value
the arguments are defined in the parentheses. Arguments without a default value
are mandatory, and default values are specified with an equals sign.
By default R returns the result of the last computation performed within the
......@@ -717,9 +772,9 @@ head(map_df(bodyfat, ~ (.x - median(.x)) / mad(.x)), 3)
## Computing variables from existing ones and predicate functions
Often, we want to use variables stored in our data set to compute derived quanities.
Often, we want to use variables stored in our data set to compute derived quantities.
For example, we might be interested in the weight in kilograms instead of pounds
and the hight in meters instead of inches. The function `mutate` allows us to
and the height in meters instead of inches. The function `mutate` allows us to
do this.
```{r transform_example}
......@@ -733,7 +788,7 @@ select(bodyfat, height, height_m, weight, weight_kg)
```
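The unit conversion described above could look roughly like the sketch below. The tibble `measurements` and its values are a made-up stand-in for the `bodyfat` data, assumed to hold weight in pounds and height in inches; the column names `weight_kg` and `height_m` follow the `select` call shown in this hunk.

```r
library(dplyr)

# Hypothetical stand-in for the bodyfat data (values are made up)
measurements <- tibble(weight = c(154.25, 173.25),
                       height = c(67.75, 72.25))

# mutate() adds new columns computed from existing ones
converted <- mutate(measurements,
                    weight_kg = weight * 0.45359237,  # pounds to kilograms
                    height_m  = height * 0.0254)      # inches to meters
converted
```

`mutate` keeps the original columns, so the converted values can be checked against the source values side by side.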
We often want to apply our function only to variables in the data set that are
of a specific type, e.g. numeric, we kann use simple predicate functions that
of a specific type, e.g. numeric, we can use simple predicate functions that
return `TRUE` or `FALSE` in combination with `discard` or `keep` to perform
appropriate selections. For example, we can exclude the id column of the patients
data set, before computing the variable--wise means.
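A minimal sketch of this pattern, assuming a small made-up tibble `patients` in place of the course data. `keep` retains the columns for which the predicate returns `TRUE`, and `discard` drops them.

```r
library(purrr)
library(tibble)

# Hypothetical patients table (values are made up for illustration)
patients <- tibble(PatientId = c("P1", "P2", "P3"),
                   Height    = c(1.65, 1.80, 1.75),
                   Weight    = c(60, 82, 71))

# Keep only the numeric columns, then compute the variable-wise means
numeric_cols <- keep(patients, is.numeric)
col_means <- map_dbl(numeric_cols, mean)
col_means

# Equivalent selection, phrased as dropping the character id column
no_id <- discard(patients, is.character)
```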
......@@ -762,11 +817,11 @@ __Exercise: Handling a small data set__
# Simple plotting in R: qplot of `r CRANpkg("ggplot2")`
The package `r CRANpkg("ggplot2")` allows very flexible plotting in R,
but takes a while to get acquainted with the underlying grammer
but takes a while to get acquainted with the underlying grammar
of graphics. Thus, we will use its function `qplot()` for "quick plotting",
which requires no knowledge of the underlying advanced features and behaves
much like R's default `plot` function.
However, it offers advanced options like facetting or coloring by condition
However, it offers advanced options like faceting or coloring by condition
as well.
......@@ -806,7 +861,7 @@ qplot(abdomen.circum, percent.fat,
```
We can (unsurprisingly) see that abdomen circumference, weight and bodyfat are highly
correlated to each other. We can also produce a facetted plot split by weight.
correlated to each other. We can also produce a faceted plot split by weight.
```{r qplot_example_facets}
......@@ -822,7 +877,7 @@ __Exercise: Plotting the EMBL logo__
The code below plots the EMBL logo. The plus sign adds additional
"layers" to the ggplot object modifying any given plot and we use it here
to make the axes dissapear.
to make the axes disappear.
However, the colors are not quite right. Can you fix that?
Check out the [ggplot2 docs](http://docs.ggplot2.org/) or try googling!
......@@ -864,7 +919,7 @@ d
```
If you want perfom a computation for every entry of a list,
If you want to perform a computation for every entry of a list,
you usually do the computations for one time step and then for
the next and the next, etc. Because nobody wants
to type the same commands over and over again,
......@@ -904,7 +959,7 @@ __Exercise: Base calling errors__
The function `readError(noBases)` simulates the base calling
errors of a sequencing machine. The parameter
\texttt{noBases} represents the number of positions in the genome sequenced
__noBases__ represents the number of positions in the genome sequenced
and the function will return a vector, which has the entry "error" if a base
calling error occurs at a certain position and "correct" if the base is
read correctly. It can be obtained with the command
......@@ -987,6 +1042,13 @@ y_mat <- t(matrix(y, nrow = 3, ncol = 5, byrow = FALSE))
y_tibble <- as_tibble(y_mat)
names(y_tibble) <- c("Shop_1", "Shop_2", "Shop_3")
map(y_tibble, summary)
# summary by days
days <- matrix(y, nrow = 3, ncol = 5, byrow = FALSE)
days <- as_tibble(days)
names(days) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
map(days, summary)
```
......@@ -998,7 +1060,7 @@ __Exercise: Handling a small data set__
* Which variables are stored in the data frame and what are their values?
* Is there a missing weight value? If yes, replace it by the mean of the other weight values.
* Calculate the mean weight and height of all the patients.
* Calculate the mean weight and height across all the patients.
* Calculate the `BMI = Weight / Height^2` of all the patients.
......@@ -1008,11 +1070,11 @@ __Solution: Handling a small data set__
pat <- read_csv("http://www-huber.embl.de/users/klaus/BasicR/Patients.csv")
pat
map_dbl(keep(pat, is_double), mean)
pat$Weight[2] <- mean(pat$Weight, na.rm = TRUE)
pat
map_dbl(keep(pat, is_double), mean, na.rm = TRUE)
pat <- mutate(pat, BMI = Weight / Height^2)
```
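The solution above replaces the missing weight by its position. A possible variant, sketched here in base R on a made-up vector `weight`, locates the missing entries with `is.na` so that the index does not need to be known in advance.

```r
# Hypothetical weights with one missing value (values are made up)
weight <- c(60, NA, 71)

# Replace every NA by the mean of the observed values
weight[is.na(weight)] <- mean(weight, na.rm = TRUE)
weight
```

The right-hand side is evaluated before the assignment, so the mean is computed from the observed values only.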
......
......@@ -203,7 +203,7 @@ There are a couple of operators useful for comparisons:
```{r df_access_old}
pat[2, c("PatientId", "Height")]
pat["P2", c(1, 2)]
pat[2, c(1, 2)]
```
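Since this hunk replaces the row-name lookup with a positional one, a row can alternatively be selected by a condition on the id column. The sketch below uses a made-up base R data frame standing in for `pat`; the variable name `second` is illustrative.

```r
# Hypothetical stand-in for the patients data (values are made up)
pat <- data.frame(PatientId = c("P1", "P2", "P3"),
                  Height    = c(1.65, 1.80, 1.75),
                  stringsAsFactors = FALSE)

# Select the row by a condition on the id column instead of a row name
second <- pat[pat$PatientId == "P2", c("PatientId", "Height")]
second
```

This form does not depend on row names being present or on the row order.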
......@@ -245,6 +245,57 @@ pat[["Gender"]]
```
## Summary: data access in R
We prepare a simple vector to illustrate the access options again:
```{r acces_recap}
sample_vector <- c("Alice" = 5.4, "Bob" = 3.7, "Claire" = 8.8)
sample_vector
```
## Access by index
* provide a vector of indices to indicate the elements to retrieve
* a minus sign excludes the respective positions
```{r access_index, dependson="acces_recap"}
sample_vector[1:2]
sample_vector[-(1:2)]
```
## Access by boolean
* Using a boolean (TRUE / FALSE) vector that has the same length as
your data will pull out the elements for which the boolean vector is TRUE.
```{r access_boolean, dependson="acces_recap"}
sample_vector[c(TRUE, FALSE, TRUE)]
```
* can also be used in conjunction with logical tests which generate a boolean result
* Boolean vectors can be combined with logical operators to create more complex filters
```{r access_boolean2, dependson="acces_recap"}
sample_vector[sample_vector < 6]
```
## Access by name
If names are present, such as
column names (note that row names are not preserved in the tidyverse),
you can access elements by name as well:
```{r access_name}
sample_vector[c("Alice", "Claire")]
```
## Applying a function to elements of a data structure
* instead of looping through a data structure, apply a function to each element
......@@ -325,7 +376,7 @@ select(bodyfat, height, height_m, weight, weight_kg)
* The package **ggplot2** allows very flexible plotting in R
* however, it takes a while to get acquainted with the underlying "grammar of graphics"
* we will introduce its function **`qplot()`** for "quick plotting"
* we will introduce its function **qplot()** for "quick plotting"
```{r qplot, eval=FALSE}
......