Introduction

This document is meant to give the user an introduction to R, illustrate the use of some of the major uwIntroStats functions, and go through an example data analysis. The examples will be shown with R code and output. For more options, see the help files provided with the package.

We also assume that the user is using RStudio, found here - however, most of the output should be the same regardless of the graphical user interface. Here we introduce the format for each section and subsection below:

Goal: Load the required packages (first install them if you have not already done so).

Function(s): The library() and attach() functions (and install.packages() if we need to install a package).

Interpretation: uwIntroStats uses some functions from other packages, and builds upon others. The packages that these functions come from are: Exact, plyr, geepack, sandwich, and survival. They must be installed for uwIntroStats to work. However, you do not need to load them for the functions in uwIntroStats to work. Some, though, are not at their full potential unless one of these is loaded.

library(uwIntroStats)

## 
## Attaching package: 'uwIntroStats'
## 
## The following object is masked from 'package:base':
## 
##     tabulate

You must load uwIntroStats each time that you start a new R session, but you only have to install packages once. If you do not install one of the packages, R will give you errors.

The `fev` dataset

Reading in the data

Goal: We want to load the fev dataset into R for our analysis.

Functions: We will be working with the fev dataset. To prepare that (we will see why we name this fevDat later):

fevDat <- read.table("http://www.emersonstatistics.com/Datasets/fev.txt", header = TRUE)
attach(fevDat)

We use the read.table() function, the attach() function, and the View() function. Interpretation: We have now loaded the data into R, and can use it in an analysis. The reference manual for the fev dataset can be found on here. Make sure that we have read in the data correctly by using the View(fev) function. This will bring up a window showing the data. Everything looks good, so we can continue. Note: Generally it is good practice to clean up the data immediately upon reading it in. However, since we are introducing methods, we will go through data cleanup in the following sections.

Standard univariate descriptive statistics

Goal: Describe the variables in the dataset.

Function(s): The descrip() function will do all of this for us.

Interpretation: If we wanted to na"ively get a description of the whole dataset, we could type

descrip(fevDat)

##           N     Msng  Mean      Std Dev    Min       25%       Mdn     
## seqnbr:     654     0   327.5     188.9     1.000     164.2     327.5  
## subjid:     654     0   37170     23691     201.0     15811     36071  
##    age:     654     0   9.931     2.954     3.000     8.000     10.00  
##    fev:     654     0   2.637     0.8671    0.7910    1.981     2.547  
## height:     654     0   61.14     5.704     46.00     57.00     61.50  
##    sex:     654     0   1.486     0.5002    1.000     1.000     1.000  
##  smoke:     654     0   1.901     0.2994    1.000     2.000     2.000  
##            75%       Max     
## seqnbr:     490.8     654.0  
## subjid:     53638     90001  
##    age:     12.00     19.00  
##    fev:     3.119     5.793  
## height:     65.50     74.00  
##    sex:     2.000     2.000  
##  smoke:     2.000     2.000

Notice that we get output for seqnbr and subjid. If we read the documentation for the FEV dataset, we would know that these two variables are label variables, and thus none of the descriptives are of interest. For example, the standard deviation for subjid is 23691, because we have 654 patients with different subject ids. Thus we can ignore these two labelling variables for descriprive purposes. However they, and all of the other variables, show us that there are no missing data as well. Now if we are interested in seeing that all subjects were unique:

cnt <- length(unique(subjid))
print(cnt)

## [1] 654

cnt

## [1] 654

We already knew that there were 654 observations, and now that we know cnt is 654, we know that all of the observations are unique (in other words, no subject was counted twice). Notice that here we typed two different things to display cnt. This illustrates that when we type cnt into R, it actuall calls print(cnt). That is why R is called a functional language.

We might also want to check for outliers within the distribution of a single variable. Look at the distribution of age:

descrip(age)

##        N     Msng  Mean      Std Dev    Min       25%       Mdn     
## age:     654     0   9.931     2.954     3.000     8.000     10.00  
##         75%       Max     
## age:     12.00     19.00

The minimum value is 3. This might be odd, since why would a 3 year old be enrolled in a smoking study? A 3 year old is incredibly unlikely to smoke. How can we tell if this is an outlier? If we look at the mean and the median, we see that they are quite close together - 9.931 and 10, respectively. A large difference between the mean and median would indicate the presence of an outlier, because the mean is heavily influenced by outliers. To illustrate, notice that the formula for the mean \[ \frac{1}{n}\sum_{i=1}^n X_i, \] where the \(X_i\) are each person’s age in this example, is heavily influenced by a single point. If we made one age 100, then the mean would be pulled to the right. The median, however, is robust to outliers. The middle value will not change if we add in an age of 100.

Now let’s look closer at sex and smoke.

descrip(sex, smoke)

##          N     Msng  Mean      Std Dev    Min       25%       Mdn     
##   sex:     654     0   1.486     0.5002    1.000     1.000     1.000  
## smoke:     654     0   1.901     0.2994    1.000     2.000     2.000  
##           75%       Max     
##   sex:     2.000     2.000  
## smoke:     2.000     2.000

Each of these variables takes on either a 1 or a 2. Notice that the mean of sex is 1.486. The interpretation of this value is that 48.6% of the values in sex are 2. Similarly, 90.1% of the values in smoke are 2. This is easy to see, because if all of the values were 1 then the mean would be 1, and if one value is 2 it pulls the mean up. However, having the values coded as 1’s and 2’s is not very useful to us, because we have to remember the documentation each time we look at the data to remember whether sex = 1 denotes a male or a female. Thus good practice is to create indicator variables to reflect the actual data. Usually we say that a 1 is the trait that we want (i.e. if we make a variable called male, then a 1 will reflect a male) while a 0 reflects the opposite. Now in the FEV dataset, we know from the documentation that sex is coded with 1 = male, so we can recode as follows:

male <- ifelse(sex == 1, 1, 0)

which is simply saying that if sex is a 1 at a given position, set male at that position equal to 1. If sex = 2, set male = 0. We want to create a similar indicator variable out of smoke. In this case we want to know if someone smokes or not, so our indicator variable will take on a 1 if they smoke and a 0 if not. In this data, smoke = 2 for a nonsmoker, so we have to be careful when recoding:

smoker <- ifelse(smoke == 2, 0, 1)

Our data now have a much more intuitive interpretation. It is always good practice to clean up your data before running any analysis. If we run descrip() again we can see this:

descrip(male, smoker)

##           N     Msng  Mean      Std Dev    Min       25%       Mdn     
##   male:     654     0   0.5138    0.5002    0.0000    0.0000    1.000  
## smoker:     654     0  0.09939    0.2994    0.0000    0.0000    0.0000 
##            75%       Max     
##   male:     1.000     1.000  
## smoker:     0.0000    1.000

See now that the means have changed. In fact, we see that 51% of the data in sex are males, which is a much more informative description than before. However, in practice, the only appropriate descriptives from this output are the count and the number of missing observations.

Histograms

Goal: We want to check for bimodality in any of the variables.

Function(s): We use histograms, which plot the frequency of each unique value in the data (or places the data into bins decided by the function). The hist() function does all of this.

Interpretation: We can add the arguments main="maintitle", xlab="x label", and ylab="y label" to the hist() function to set a title, x-axis label, and y-axis label, respectively. By default R will set the x and y-axis labels to the variable(s) entered. Notice that we are allowed to use the name fev here because we have named the dataset fevDat. If we had named it fev, we would be in trouble because R would attempt to create a histogram of the whole dataset.

hist(age, main = "Age")

hist(height, main = "Height")

hist(fev, main = "FEV")

There is no bimodality in the data (each variable has only one peak) and thus we do not need to account for this in our analysis.

Stratified descriptive statistics

Goal: Sometimes we believe that a variable has an effect on another. For example, we believe that sex has an effect on height. To see this effect, we need to stratify the height variable - that is, examine height in each sex individually.

Function(s): The descrip() function allows us to do this easily. We can stratify on any number of variables by adding them to the strata argument.

Interpretation:

descrip(age, height, strata = male)

##                   N     Msng  Mean      Std Dev    Min       25%     
##    age:  All        654     0   9.931     2.954     3.000     8.000  
##    age:    Str  0   318     0   9.843     2.933     3.000     8.000  
##    age:    Str  1   336     0   10.01     2.976     3.000     8.000  
## height:  All        654     0   61.14     5.704     46.00     57.00  
## height:    Str  0   318     0   60.21     4.792     46.00     57.50  
## height:    Str  1   336     0   62.03     6.331     47.00     57.00  
##                    Mdn       75%       Max     
##    age:  All        10.00     12.00     19.00  
##    age:    Str  0   10.00     12.00     19.00  
##    age:    Str  1   10.00     12.00     19.00  
## height:  All        61.50     65.50     74.00  
## height:    Str  0   61.00     63.50     71.00  
## height:    Str  1   62.00     67.50     74.00

This produces the same descriptive statistics as our call to descrip() above, but this time breaks down by sex category.

Box Plots

Goal: We want a graphical representation of the data with the mean, standard deviation, range, and potential outliers displayed.

Function(s): The bplot() function builds on the base R function boxplot(). Now we can overlay data, means, and standard deviations onto the boxplot.

Interpretation First, the na"ive box plot for both age and height by sex - which means use the male variable (we only present code for the last plot, with the default values):

However, we can add jittered data to make this plot more informative. Jittered data are created by taking the raw data and adding a tiny bit of random “noise” in both the x and y dimension, so that the points are more easily distinguished from each other. The bplot() function does this by default, though we can turn off one of the dimensions if we choose. We can also set the x-axis labels:

We can also add the mean and standard deviation onto our box plot (bplot() does this by default as well) to get the maximum information from the plot:

bplot(age, strata = male, main = "Age by Sex", xlab = "Sex (0 denotes Female, 1 denotes Male)")

bplot(height, strata = male, main = "Height by Sex", xlab = "Sex (0 denotes Female, 1 denotes Male)")

Scatterplots

Similar to the bplot() function, the scatter() function defaults to include some useful information. However, we will go through the derivation of the default values, so the only code shown will be the last (with defaults). The most basic version of the scatterplot does not differentiate between ties in the data (and thus we see much less than the 654 points on the plot):

However, we like to jitter the data so that we can see all of the points:

This allows us to break the ties - the places where multiple data points lie on the same (x,y) coordinate point - and see all of the data, but still see the discreteness in the sampling. If we wanted to see a line of best fit, we could create one with the plotLsfit argument to our scatter() function:

If we look closely at the graph, the data doesn’t look quite linear. We can solve this issue by plotting a lowess smooth, which will draw attention to the curves in the data:

scatter(height, age)

Confounding

There might be potential confounders in the data. For instance, if smoking stunts growth, and only older kids smoke, then at higher ages we will have two populations separated by some vertical distance, while at lower ages we have only one population. It is generally useful to plot a stratified scatterplot with lowess smooths to check for confounders. By default, scatter() will use different colors for the strata. We can also label the axes and/or the graph:

scatter(height, age, strata = smoker, main = "Height vs Age by Smoking Status")

The legend gives us the color and type of point plotted for each stratum. Now if we wanted to see the lowess smooths for males and females, we can add them as a stratification variable:

scatter(height, age, strata = cbind(smoker, male), main = "Height vs Age by Smoking Status")

Now notice that we can compare groups both across sex (sex 1 is male and sex 2 is female) or across smoking.

Trends and Transforming the Data

Convinced that smoking is not too much of a confounder, we now look at age vs fev. If we do so first with no lowess smooth, it is hard to tell what the trend in the data is:

scatter(fev, age, plotLowess = FALSE, main = "FEV vs Age")

Any pattern is incredibly hard to find. We might think that the data is linear, but putting a lowess smooth on the plot shows us differently:

scatter(fev, age, main = "FEV vs Age")

Now we can see a hint of an S-shaped trend in the data. We believe this, because if we believe that FEV is actually best predicted by height, then at young ages height is roughly linear with age but at older ages height is flat. In order to really understand what is going on, we must look at the relationship between FEV and height:

scatter(fev, height, main = "FEV vs Height")

This is definitely not a linear relationship. In fact, it looks like a cubic relationship. Now if we look at the plot stratified by sex, we don’t see too much of a sex effect:

scatter(fev, height, strata = male, main = "FEV vs Height by Sex")

If we create a new variable called htcub (for height cubed) we can see that there is more of a straight line relationship between cubed height and FEV:

htcub <- height ^ 3
scatter(fev, htcub, main = "FEV by Cubed Height")

There is still heteroscedasticity, however. To remedy this, we can try plotting the cube root of FEV vs Height:

cubrtfev <- fev ^ (1/3)
scatter(cubrtfev, height, main = "Cube Root of FEV by Height")

The `fev` dataset revisited

Now that we know all of these methods, what would we do if presented with the fev dataset? All of our data cleanup would happen immediately when we read in the data, based on the documentation and what we can see in R.

Goal: Read in the fev dataset and make it ready for use and analysis.

Function(s): We first read in the dataset, renaming it as fevDat because we know that one of the variables in the data is named fev and we don’t want confusion when we attach the data:

fevDat <- read.table("http://www.emersonstatistics.com/Datasets/fev.txt", header = TRUE)
attach(fevDat)

## The following objects are masked from fevDat (pos = 3):
## 
##     age, fev, height, seqnbr, sex, smoke, subjid

Now we view the data:

View(fevDat)

After noticing that sex and smoke are not coded as indicator variables, we recode them (using the documentation online for reference):

male <- ifelse(sex == 1, 1, 0)
smoker <- ifelse(smoke == 2, 0, 1)

We also would know which variables are likely confounders or effect modifiers before starting the data analysis. Good practice in statistics dictates the choice of these variables before analysis, so that we are not using our data for exploratory purposes when we want to do inference.

Interpretation: Now we have loaded the dataset, cleaned it up, and know what our confounding variables and effect modifiers are. Now we are ready to create plots and run our analysis.

Example Data Analysis: FEV

University of Washington Department of Biostatistics

Brian D. Williamson and Scott S. Emerson, MD

23 July 2015

Introduction

The `fev` dataset

Reading in the data

Standard univariate descriptive statistics

Histograms

Stratified descriptive statistics

Box Plots

Scatterplots

Confounding

Trends and Transforming the Data

The `fev` dataset revisited

Example Data Analysis: FEV

University of Washington Department of Biostatistics

Brian D. Williamson and Scott S. Emerson, MD

23 July 2015

Introduction

The fev dataset

Reading in the data

Standard univariate descriptive statistics

Histograms

Stratified descriptive statistics

Box Plots

Scatterplots

Confounding

Trends and Transforming the Data

The fev dataset revisited

The `fev` dataset

The `fev` dataset revisited