worked on mapping/apply functions

parent 735ef363
 ... ... @@ -87,6 +87,7 @@ summary(x[7:12]) ## ----subscr_2, echo = TRUE--------------------------------------------- x[c(2,4,9)] x[-(1:6)] head(x) ## ----sort-rank, echo = TRUE-------------------------------------------- x <- c(1.3, 3.5, 2.7, 6.3, 6.3) ... ... @@ -95,6 +96,12 @@ order(x) x[order(x)] rank(x) ## ----factors------------------------------------------------------------- x <- factor(c("ab", "cd", "ab"), levels = c("ab", "cd", "ef")) x ## ----object-examples, echo = TRUE-------------------------------------- a <- 9 # is a a string? ... ... @@ -178,12 +185,30 @@ L pat$Height pat[] ## ----apply-example, echo = TRUE---------------------------------------- # Calculate mean for each of the first two columns sapply(X = pat[,1:2], FUN = mean, na.rm = TRUE) # Mean height separately for each gender tapply(X = pat$Height, FUN = mean, INDEX = pat$Ge) ## ----loadBodyfat, echo = TRUE------------------------------------------ load(url("http://www-huber.embl.de/users/klaus/BasicR/bodyfat.rda")) bodyfat <- as_tibble(bodyfat) bodyfat ## ----bodyfat_map, dependson = "loadBodyfat"------------------------------ head(map_dbl(bodyfat, mean)) ## ----function_template, eval=FALSE--------------------------------------- ## function_name <- function(arguments, options) ## { ## return(...) ## } ## ----robust_z------------------------------------------------------------ robust_z <- function(x){ (x - median(x)) / mad(x) } head(map_df(bodyfat, robust_z), 3) ## ----robust_z_implicit--------------------------------------------------- head(map_df(bodyfat, ~ (.x - median(.x)) / mad(.x)), 3) ## ----plot-example, echo = TRUE----------------------------------------- #pdf(file="plot-example.pdf", width=12, height=6) ... ...  ... ... @@ -321,9 +321,11 @@ the elements: {r subscr_2, echo = TRUE} x[c(2,4,9)] x[-(1:6)] head(x)  Additionally, there are some useful commands to order and sort vectors The function head provides a preview of the vector. 
There are also useful functions to order and sort vectors: * sort: sort in increasing order * order: orders the indexes is such a way that the elements ... ... @@ -377,11 +379,21 @@ There are the following elementary types or ("modes"): * numeric: real number * character: chain of characters, text * factor: String or numbers, describing certain categories * factor: categorical data that takes a fixed set of values * logical: TRUE, FALSE * special values: NA (missing value), NULL ("empty object"), Inf, -Inf (infinity), NaN (not a number) Factors are designed to represent categorical data that can take a fixed set of possible values. Factors are built on top of integers, and have a levels attribute: {r factors} x <- factor(c("ab", "cd", "ab"), levels = c("ab", "cd", "ef")) x  Data storage types includes matrices, lists, data frames (tibbles), which will be introduced in the next section. Certain types can have different subtypes, e.g. numeric ... ... @@ -613,64 +625,112 @@ pat$Height pat[]  More on lists can be found in the respective chapter of "R for data science" [here](http://r4ds.had.co.nz/vectors.html#lists). ## Apply, mapping and custom functions R encourages the use of functions for programming, instead of e.g. looping through a vector or data frame, you would call a function on your data directly. These kinds of functions are called apply functions. Here, we will use the map familiy of functions from the r CRANpkg("purr") package instead of the base R functions. An apply / map call applies another to a vector or list and returns the result in another vector/list. Thus, each step consists of "mapping" a list value to a result. We will introduce the map functions by looking at a typical data set in a tabular format, where the rows reprsent the samples and the columns the variables measured. The data set bodyfat contains various body measures for 252 men. We turn it into a tibble by using the function as_tibble(). Let's inspect it a bit. 
The first thing we notice is that tibbles prints only the first 10 rows by default. Tibbles are designed so that you don’t accidentally overwhelm your console when you print large data frames. Additionally, we get a nice summary of the variables available in our data set. {r loadBodyfat, echo = TRUE} load(url("http://www-huber.embl.de/users/klaus/BasicR/bodyfat.rda")) bodyfat <- as_tibble(bodyfat) bodyfat  ## Apply functions As data frames are just a special kind of list, namely a list that is composed of vectors of equal length, we can use a map function to compute the mean value for every variable in our data set. A very useful class of functions in R are apply commands, which allows to apply a function to every row or column of a data matrix, data frame or list: {r bodyfat_map, dependson = "loadBodyfat"} head(map_dbl(bodyfat, mean))  \begin{center}  apply(X, MARGIN, FUN, ...)  \end{center} Here we use map_dbl, to ensure that we get a double value back. There are specialized mapping functions for many data types but you can always use the default map() function as a fallback when there is no specialized equivalent available. *  MARGIN: 1 (row-wise) or 2 (column-wise) *  FUN: The function to apply The map functions are really useful for applying your custom functions, for example we can compute a robust z--score by subtracting the median and deviding by the mean absolute deviation for each variable. This will bring all the variables in the data set to a common scale and make them directly comparable. These kinds of transformations are often performed before clustering or dimensionality reduction. You can create your own functions very easily by adhering to the following template: The dots argument allows you to specify additional arguments that will be passed to FUN. {r function_template, eval=FALSE} function_name <- function(arguments, options) { return(...) 
}  Special apply are functions include: lapply (lists), sapply (lapply wrapper trying to convert the result into a vector or matrix), tapply and aggregate (apply according to factor groups). As you can see, the source code of the function has to be in curly brackets By default R returns the result of the last computation performed within the curly brackets (often, this will be the last line of the function). However, you can always specify the return value directly with return(). If you want to return multiple values, you can return a list. We can illustrate this again using the patients data set: We can now easily define our function and apply it to the data set. {r robust_z} robust_z <- function(x){ (x - median(x)) / mad(x) } {r apply-example, echo = TRUE} # Calculate mean for each of the first two columns sapply(X = pat[,1:2], FUN = mean, na.rm = TRUE) # Mean height separately for each gender tapply(X = pat$Height, FUN = mean, INDEX = pat$Ge) head(map_df(bodyfat, robust_z), 3)  Here, we used the function map_df to make sure that we get a data frame back. There is an even simpler way to achieve the same goal. Using a tilde (~) to create an R formula, the map functions allow you to define anonymous functions with a default argument .x. With this, we do not need to define our robust z--score function explicitly. Data handling can be much more elegantly performed by the \CRANpkg{plyr} and \CRANpkg{dplyr} packages, which will be introduced in another lab. 
{r robust_z_implicit} head(map_df(bodyfat, ~ (.x - median(.x)) / mad(.x)), 3)  ### Exercise: Handling a small data set ## Computing variables from existing ones and predicate functions * Read in the data set \file{Patients.csv} from the website __Exercise: Handling a small data set__ \url{http://www-huber.embl.de/users/klaus/BasicR/Patients.csv} * Read in the data set Patients.csv from the website [http://www-huber.embl.de/users/klaus/BasicR/Patients.csv](http://www-huber.embl.de/users/klaus/BasicR/Patients.csv) * Check whether the read in data is actually a data.frame. * Check whether the read in data is actually a data.frame. Make sure that it is a tibble! * Which variables are stored in the data frame and what are their values? * Is there a missing weight value? If yes, replace it by the mean of the other weight values. * Calculate the mean weight and height of all the patients. * Calculate the $\text{BMI}= \text{Weight} / \text{Height}^2$ of all the patients. Attach the BMI vector to the data frame using the function cbind. * Calculate the BMI = Weight / Height^2 of all the patients. * Attach the BMI vector to the data frame using the function cbind. ## Plotting in R ... ... @@ -896,26 +956,6 @@ i.e. if you do not change them, their default values are used. ## Creating your own functions You can create your own functions very easily by adhering to the following template \ function.name<-function(arguments, options)  { return(...) } * The source code of the function has to be in curly brackets * By default R returns the result of the last line of the function, you can specify the return value directly with return(). If you want to return multiple values, you can return a list. ... ...
 ... ... @@ -423,7 +423,10 @@ x + 2
 8.2 5.6 9.3
x[-(1:6)]
  6.5  7.0  9.3  1.2 14.5  6.2

Additionally, there are some useful commands to order and sort vectors

 7.5 8.2 3.1 5.6 8.2 9.3

The function head provides a preview of the vector. There are also
useful functions to order and sort vectors:

• sort: sort in increasing order
• order: orders the indexes in such a way that the elements of the vector are sorted, i.e. sort(v) = v[order(v)]
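The relation between sort and order can be checked directly. The following is a minimal sketch in base R; the vector reuses the values from the sort-rank chunk and serves only as an illustration.

```r
# Illustrative vector (same values as in the sort-rank chunk)
x <- c(1.3, 3.5, 2.7, 6.3, 6.3)

sorted_x  <- sort(x)      # values in increasing order
idx       <- order(x)     # index positions that put x in increasing order
reordered <- x[idx]       # indexing with order() reproduces sort()

identical(sorted_x, reordered)

# rank() returns the position of each element in the sorted vector;
# tied values receive the average of their ranks by default
ranks <- rank(x)
ranks
```

Using order() instead of sort() is handy when a second vector or a whole data frame has to be rearranged according to the values of one variable.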

• ... ... @@ -459,10 +462,15 @@ x + 2
• numeric: real number
• character: chain of characters, text
• factor: String or numbers, describing certain categories
• factor: categorical data that takes a fixed set of values
• logical: TRUE, FALSE
• special values: NA (missing value), NULL (“empty object”), Inf, -Inf (infinity), NaN (not a number)

Factors are designed to represent categorical data that can take a fixed set of possible values. Factors are built on top of integers, and have a levels attribute:

x <- factor(c("ab", "cd", "ab"), levels = c("ab", "cd", "ef"))
x
 ab cd ab
Levels: ab cd ef
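To see what "built on top of integers" means in practice, the short base R sketch below inspects the same factor. It only uses standard functions (levels, as.integer, table).

```r
x <- factor(c("ab", "cd", "ab"), levels = c("ab", "cd", "ef"))

lv    <- levels(x)        # the fixed set of allowed values
codes <- as.integer(x)    # the integer codes stored internally
counts <- table(x)        # counts per level, including the unused level "ef"

lv
codes
counts
```

Note that the level "ef" is kept even though no element takes that value, which is why it shows up with a count of zero.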

Data storage types include matrices, lists and data frames (tibbles), which will be introduced in the next section. Certain types can have different subtypes, e.g. numeric can be further subdivided into the integer, single and double types. Types can be checked by the is.* and changed (“cast”) by the as.* functions. Furthermore, the function str is very useful in order to obtain an overview of a (possibly complex) object at hand. The following examples will make this clear. We first assign the value 9 to an object and then perform various operations on it.

a <- 9
# is a a string?
...  ...  @@ -663,52 +671,97 @@ L
 1.65 1.30 1.20
pat[]
 1.65 1.30 1.20

More on lists can be found in the respective chapter of “R for data science” here.

6.3 Apply functions

A very useful class of functions in R are the
apply commands, which allow you to apply a function to every row or column of a data matrix, data frame or list:

6.3 Apply, mapping and custom functions

R encourages the use of functions for programming: instead of, e.g., looping through a vector or data frame, you call a function on your data directly. These kinds of functions are called apply functions. Here, we will use the map family of functions from the purrr package instead of the base R functions. An apply / map call applies another function to each element of a vector or list and returns the results in another vector/list.
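As a rough illustration of the difference between looping and mapping, the sketch below computes the same result twice. The list `values` is a made-up toy example and not part of the lab data; `map_dbl` is the purrr function that returns a double vector.

```r
library(purrr)

# Hypothetical toy data: a named list of numeric vectors
values <- list(a = 1:3, b = 4:6, c = 7:9)

# Variant 1: explicit loop with a pre-allocated result vector
loop_means <- numeric(length(values))
for (i in seq_along(values)) {
  loop_means[i] <- mean(values[[i]])
}
names(loop_means) <- names(values)

# Variant 2: map the function over the list in a single call
map_means <- map_dbl(values, mean)

all.equal(loop_means, map_means)
```

The map call states what is computed for each element and leaves the bookkeeping (allocation, indexing, naming) to the function.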

Thus, each step consists of “mapping” a list value to a result. We will introduce the map functions by looking at a typical data set in a tabular format, where the rows represent the samples and the columns the variables measured. The data set bodyfat contains various body measures for 252 men. We turn it into a tibble by using the function as_tibble().

Let’s inspect it a bit. The first thing we notice is that tibbles print only the first 10 rows by default. Tibbles are designed so that you don’t accidentally overwhelm your console when you print large data frames. Additionally, we get a nice summary of the variables available in our data set.

bodyfat <- as_tibble(bodyfat)
bodyfat
# A tibble: 252 × 15
density percent.fat   age weight height neck.circum chest.circum
<dbl>       <dbl> <int>  <dbl>  <dbl>       <dbl>        <dbl>
1     1.07        12.3    23    154   67.8        36.2         93.1
2     1.09         6.1    22    173   72.2        38.5         93.6
3     1.04        25.3    22    154   66.2        34.0         95.8
4     1.08        10.4    26    185   72.2        37.4        101.8
5     1.03        28.7    24    184   71.2        34.4         97.3
6     1.05        20.9    24    210   74.8        39.0        104.5
7     1.05        19.2    26    181   69.8        36.4        105.1
8     1.07        12.4    25    176   72.5        37.8         99.6
9     1.09         4.1    25    191   74.0        38.1        100.9
10    1.07        11.7    23    198   73.5        42.1         99.6
# ... with 242 more rows, and 8 more variables: abdomen.circum <dbl>,
#   hip.circum <dbl>, thigh.circum <dbl>, knee.circum <dbl>,
#   ankle.circum <dbl>, bicep.circum <dbl>, forearm.circum <dbl>,
#   wrist.circum <dbl>

As data frames are just a special kind of list, namely a list that is composed of vectors of equal length, we can use a map function to compute the mean value for every variable in our data set.

head(map_dbl(bodyfat, mean))
density percent.fat         age      weight      height neck.circum
1.06       19.15       44.88      178.92       70.15       37.99
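That a data frame is a list of equally long vectors can be verified in base R. The small data frame `df` below is a hypothetical stand-in so that the sketch runs without downloading bodyfat; `sapply` is used here as the base R counterpart of a map call.

```r
# Hypothetical stand-in for a tabular data set
df <- data.frame(height = c(1.65, 1.30, 1.20),
                 weight = c(60, 25, 20))

is.list(df)                 # a data frame is a list ...
length(df) == ncol(df)      # ... with one element per column
lengths(df)                 # ... and all elements have the same length

# Because of this, a function can be applied column by column
col_means <- sapply(df, mean)
col_means
```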

Here we use map_dbl, to ensure that we get a double value back. There are specialized mapping functions for many data types but you can always use the default map() function as a fallback when there is no specialized equivalent available.
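The typed variants differ only in the kind of vector they return. The sketch below assumes a small made-up data frame `df` and shows `map_chr`, `map_lgl` and the generic `map`, which always returns a list.

```r
library(purrr)

# Hypothetical data frame with columns of different types
df <- data.frame(x = c(1.5, 2.5),
                 n = c(1L, 2L),
                 s = c("a", "b"),
                 stringsAsFactors = FALSE)

col_classes <- map_chr(df, class)       # character vector, one entry per column
is_num      <- map_lgl(df, is.numeric)  # logical vector, one entry per column
col_ranges  <- map(df, range)           # list, since the results differ in type

col_classes
is_num
col_ranges
```

Choosing the typed variant documents the expected result type, while map() is the safe choice when the results do not fit into a single atomic vector.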

The map functions are really useful for applying your custom functions. For example, we can compute a robust z–score by subtracting the median and dividing by the median absolute deviation (computed by mad()) for each variable.

This will bring all the variables in the data set to a common scale and make them directly comparable. These kinds of transformations are often performed before clustering or dimensionality reduction.

You can create your own functions very easily by adhering to the following template:

function_name <- function(arguments, options)
{
return(...)
}

As you can see, the source code of the function has to be in curly brackets. By default, R returns the result of the last computation performed within the curly brackets (often, this will be the last line of the function). However, you can always specify the return value directly with return(). If you want to return multiple values, you can return a list.
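A function that returns several values via a list might look as follows. The name `summarise_vector` and its arguments are hypothetical and chosen only for this illustration.

```r
# Hypothetical helper: returns two robust summary statistics as a named list
summarise_vector <- function(x, na.rm = FALSE) {
  centre <- median(x, na.rm = na.rm)
  spread <- mad(x, na.rm = na.rm)
  # The list is the last expression evaluated, so it is returned implicitly;
  # return(list(...)) would be equivalent
  list(centre = centre, spread = spread)
}

res <- summarise_vector(c(1, 2, 3, 4, 100))
res$centre
res$spread
```

The elements of the returned list can be accessed by name, which keeps the call site readable when a function produces more than one result.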

We can now easily define our function and apply it to the data set.

robust_z <- function(x){
  (x - median(x)) / mad(x)
}

head(map_df(bodyfat, robust_z), 3)

# A tibble: 3 × 15
density percent.fat   age weight height neck.circum chest.circum
<dbl>       <dbl> <dbl>  <dbl>  <dbl>       <dbl>        <dbl>
1   0.763      -0.745 -1.69 -0.775 -0.759      -0.759       -0.782
2   1.459      -1.414 -1.77 -0.113  0.759       0.211       -0.722
3  -0.648       0.658 -1.77 -0.783 -1.265      -1.686       -0.460
# ... with 8 more variables: abdomen.circum <dbl>, hip.circum <dbl>,
#   thigh.circum <dbl>, knee.circum <dbl>, ankle.circum <dbl>,
#   bicep.circum <dbl>, forearm.circum <dbl>, wrist.circum <dbl>
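A quick sanity check of the transformation: after applying the robust z-score, a variable should have median 0 and median absolute deviation 1. The sketch below repeats the function definition so that it runs on its own, and uses a made-up vector with one outlier.

```r
robust_z <- function(x) {
  (x - median(x)) / mad(x)
}

# Hypothetical measurements with one extreme value
x <- c(2, 4, 4, 5, 7, 9, 50)
z <- robust_z(x)

median(z)   # centred at zero
mad(z)      # scaled to unit median absolute deviation
```

Because median and mad are hardly affected by the outlier, the bulk of the data ends up on a comparable scale, which is the reason for preferring this over the classical z-score here.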

Here, we used the function map_df to make sure that we get a data frame back. There is an even simpler way to achieve the same goal. Using a tilde (~) to create an R formula, the map functions allow you to define anonymous functions with a default argument .x.

With this, we do not need to define our robust z–score function explicitly.

head(map_df(bodyfat, ~ (.x - median(.x)) / mad(.x)), 3)
# A tibble: 3 × 15
density percent.fat   age weight height neck.circum chest.circum
<dbl>       <dbl> <dbl>  <dbl>  <dbl>       <dbl>        <dbl>
1   0.763      -0.745 -1.69 -0.775 -0.759      -0.759       -0.782
2   1.459      -1.414 -1.77 -0.113  0.759       0.211       -0.722
3  -0.648       0.658 -1.77 -0.783 -1.265      -1.686       -0.460
# ... with 8 more variables: abdomen.circum <dbl>, hip.circum <dbl>,
#   thigh.circum <dbl>, knee.circum <dbl>, ankle.circum <dbl>,
#   bicep.circum <dbl>, forearm.circum <dbl>, wrist.circum <dbl>
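The formula shorthand and an explicitly written anonymous function give the same result. The sketch below checks this on a small hypothetical list instead of the bodyfat data.

```r
library(purrr)

# Hypothetical toy data
values <- list(a = c(1, 2, 3, 10), b = c(5, 6, 7, 20))

# Explicit anonymous function
explicit <- map(values, function(x) (x - median(x)) / mad(x))

# Formula shorthand: .x refers to the current element
shorthand <- map(values, ~ (.x - median(.x)) / mad(.x))

identical(explicit, shorthand)
```

The shorthand is convenient for one-off transformations; a named function is usually the better choice once the same computation is needed in more than one place.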

6.4 Computing variables from existing ones and predicate functions

Exercise: Handling a small data set

• MARGIN: 1 (row-wise) or 2 (column-wise)
• FUN: The function to apply

The dots argument allows you to specify additional arguments that will be passed to FUN.

Special apply functions include: lapply (lists), sapply (lapply wrapper trying to convert the result into a vector or matrix), tapply and aggregate (apply according to factor groups).

We can illustrate this again using the patients data set:

# Calculate mean for each of the first two columns
sapply(X = pat[,1:2], FUN = mean, na.rm = TRUE)
Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
returning NA
PatientId    Height
NA      1.38
# Mean height separately for each gender
tapply(X = pat$Height, FUN = mean, INDEX = pat$Ge)
f    m
1.42 1.30

Data handling can be much more elegantly performed by the plyr and dplyr packages, which will be introduced in another lab.

6.3.1 Exercise: Handling a small data set

• Read in the data set from the website
• Read in the data set Patients.csv from the website

http://www-huber.embl.de/users/klaus/BasicR/Patients.csv

• Check whether the read in data is actually a data.frame.
• Check whether the read in data is actually a data.frame. Make sure that it is a tibble!
• Which variables are stored in the data frame and what are their values?
• Is there a missing weight value? If yes, replace it by the mean of the other weight values.
• Calculate the mean weight and height of all the patients.
• Calculate the $$\text{BMI}= \text{Weight} / \text{Height}^2$$ of all the patients. Attach the BMI vector to the data frame using the function cbind.
• Calculate the BMI = Weight / Height^2 of all the patients.
• Attach the BMI vector to the data frame using the function cbind.

6.4 Plotting in R

6.5 Plotting in R

6.5 Plotting in base R

6.6 Plotting in base R

The default command for plotting is plot(), there are other specialized commands like hist() or pie(). A collection of such specialized commands (e.g. heatmaps and CI plots) can be found in the package . Another useful visualization package is , which includes a heat–scatterplot. The general plot command looks like this:

... ... @@ -732,7 +785,7 @@ x <- seq(-3<
#dev.off()

6.6 and

6.7 and

There’s a quick plotting function in ggplot2 called qplot() which is meant to be similar to the plot() function from base graphics. You can do a lot with qplot(), including splitting plots by factors, but in order to understand how ggplot2 works, it is better to approach it from the layering syntax.

All plots begin with the function ggplot(). ggplot() takes two primary arguments, data is the data frame containing the data to be plotted and aes( ) are the aesthetic mappings to pass on to the plot elements.

As you can see, the second argument, aes(), isn’t a normal argument, but another function. Since we’ll never use aes() as a separate function, it might be best to think of it as a special way to pass a list of arguments to the plot.
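A minimal sketch of the layering syntax, assuming the ggplot2 package is installed: the data frame `df` and its columns are invented for this example, and `geom_point()` and `geom_line()` are standard ggplot2 layers.

```r
library(ggplot2)

# Hypothetical data to plot
df <- data.frame(x = 1:10, y = (1:10)^2)

# ggplot() receives the data and the aesthetic mappings;
# layers are then added with "+"
p <- ggplot(data = df, mapping = aes(x = x, y = y)) +
  geom_point() +
  geom_line()

print(p)   # draws the plot on the active graphics device
```

The mappings given in aes() are inherited by every layer, so both the points and the line use the same x and y columns.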

... ... @@ -776,7 +829,7 @@ ggsmooth
• xlab, ylab, xlim, ylim set the x–/y–axis parameters

6.6.1 Exercise: Plotting the normal density

6.7.1 Exercise: Plotting the normal density

The density of the normal distribution with expected value $$\mu$$ and variance $$\sigma^2$$ is given by: \[ f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp \left(- \frac{1}{2} \left(\frac{x- \mu}{\sigma}\right)^2 \right) ... ... @@ -790,10 +843,10 @@ f(x)

6.7 Calling functions and programming

6.8 Calling functions and programming

6.8 Calling functions

6.9 Calling functions

Every R function follows the pattern below:
... ... @@ -810,16 +863,6 @@ Every –function is following the pattern below:
• {na.rm = FALSE}: Remove missing values?

Here, x (usually a vector) has to be given in order to run the function, while the other arguments such as trim are optional, i.e. if you do not change them, their default values are used.
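The effect of default arguments can be seen with mean(), whose optional arguments trim and na.rm are described above. The vector below is a made-up example containing a missing value.

```r
# Hypothetical input with one missing value and one extreme value
x <- c(1, 2, 3, 4, 100, NA)

m_default <- mean(x)                            # na.rm defaults to FALSE
m_narm    <- mean(x, na.rm = TRUE)              # missing value removed
m_trimmed <- mean(x, trim = 0.2, na.rm = TRUE)  # also drops 20% from each end

m_default
m_narm
m_trimmed
```

Only x has to be supplied; the other arguments keep their defaults unless they are set explicitly by name.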

You can create your own functions very easily by adhering to the following template

function.name<-function(arguments, options) { return(…) }

• The source code of the function has to be in curly brackets
• By default R returns the result of the last line of the function, you can specify the return value directly with return(). If you want to return multiple values, you can return a list.

As an example, we look at the following currency converter function

euro.calc<-function(x, currency="US") {
## currency has a default argument "US"
...  ...  @@ -997,7 +1040,7 @@ sc.B <- as.matrix( sc.B

6.11.9 Exercise: Handling a small data set

• Read in the data set from the website