**Data Visualization Using an Open Source Statistical Program – RStudio**

Author: Shaley Valentine

Grade Level: 10th – 12th grade

Group Size: 1-2 per station; full class

Setting: Computer lab, with the program installed prior to the lesson

Time Needed: 2 x 50 minute sessions

Equipment Needed: Computer; installed version of R/RStudio (most recent; installation instructions in lesson)

__Objectives__

**Composite Learning Objective**: Students will gain a basic understanding of R statistical software and develop basic statistical and graphing skills using R. Students will use data collected by researchers to (a) compute and visualize simple summary statistics comparing groups of Lake Sturgeon; and (b) assess statistical relationships between different variables.

**Knowledge Outcomes:**

- Students will learn how to install and navigate the R/RStudio environment.
- Students will learn how to use basic code input to conduct simple data visualization and computation.

**Skills Outcomes:**

- Students will compute and visualize summary statistics for comparing two or more groups by their sample average (x̄).
- Students will use R/RStudio to assess statistical relationships between different variables.

**Disposition Outcomes:**

- Students will have an adequate understanding of how fisheries professionals use R/RStudio to visualize different fish population characteristics, using a threatened fish species to provide context.
- Students will understand the importance of open source programming to answer basic fisheries questions.

__Summary__

R is a free, open-source statistics program that is widely used by scientists. Statistical analyses, summary statistics, and data visualizations are important tools for students to learn and use. These tools help to simplify relationships or differences between variables. Students will go through a tutorial using R to learn how to compute basic summary statistics and conduct simple analyses, and will use graphing functions to compare groups. Students will use data collected by MSU/MiDNR Black Lake Stream-side Facility researchers to evaluate relationships between different variables.

__Background__

R is free statistical software that is widely used by scientists. It is free to download from the internet, and all of its code is open source, meaning that you can see how an analysis was written and replicate that code. A simple way of explaining code is that it is a language. R is an interpreted language rather than a compiled one. Simply put, this means that each command you type is read and carried out by the computer right away, rather than first being translated (compiled) into a separate, finished program. So if you tell the computer to conduct an analysis, that’s what it does. By contrast, an application on your iPhone is typically a compiled program: its code was translated into a finished program before you installed it. Because R analyses are written as scripts of commands, they are easy to share and reproduce. Additionally, other statistical programs like SPSS and SAS are proprietary and can be costly. As such, it may be difficult to replicate analyses conducted by researchers who use them.
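To illustrate what "interpreted" means in practice, here is a minimal sketch of commands entered one at a time at the R console. The object names `lengths_cm` and `mean_length` and the values are made up for illustration; they are not from the lesson dataset.

```r
# Each line is run as soon as it is entered; there is no separate compile step.
# `lengths_cm` is a hypothetical object holding made-up values.
lengths_cm <- c(130, 135, 140)    # store three fish lengths (cm) in a vector
mean_length <- mean(lengths_cm)   # compute the average of those lengths
print(mean_length)                # prints 135
```

Saving these same lines in a script file lets another person rerun the analysis exactly as written.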

We can use R to calculate **summary statistics** including the **sample mean**, **median**, **mode**, **quantiles**, and **standard deviation**. These statistics can tell us general trends in data and among groups within data. For example, we could look at the average amount of money spent on lunch and dinner by males and females to get a general idea of differences between sexes (Figure 1 below). What we see is that females appear to spend less money on both lunch and dinner, and both males and females tend to spend less on lunch than on dinner. Graphs like Figure 1 and summary statistics allow researchers to make generalizations about data and compare two or more groups to each other. We can also use data visualization to evaluate other relationships, including those which are **predictive**. Figure 2 shows an example of a predictive relationship, where a **scatterplot** is used to show the relationship between height (in) and speed (mi/hour). In this case, we can use data visualization to infer how the value of a predictor variable can affect the magnitude of a **dependent variable**.
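As a rough sketch of how summary statistics and a simple group comparison might look in R, the example below uses a small made-up table of meal costs. The data frame `meals`, its column names, and all values are assumptions for illustration only; they are not the data behind Figure 1.

```r
# Hypothetical meal costs in dollars; every value here is invented.
meals <- data.frame(
  sex  = c("F", "F", "F", "M", "M", "M"),
  cost = c(8, 10, 12, 11, 13, 15)
)

# Summary statistics for the whole sample
mean(meals$cost)       # sample mean
median(meals$cost)     # middle value
sd(meals$cost)         # standard deviation
quantile(meals$cost)   # minimum, quartiles, and maximum

# Mean cost within each group
group_means <- tapply(meals$cost, meals$sex, mean)
print(group_means)     # F = 10, M = 13

# Bar chart comparing the two group means
barplot(group_means, xlab = "Sex", ylab = "Mean cost ($)")
```

The same pattern applies to any grouping column, such as comparing male and female fish in a dataset.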

Researchers use **statistical analyses** to evaluate whether differences in data are “significant.” For example, the mean cost of lunch may appear to be different between males and females in Figure 1, but this may not be meaningfully different based on a statistical analysis. Simple analyses include correlations (looking at how variables are related to each other) and t-tests (comparing the mean values of two groups). Analyses can be powerful, but they do have pitfalls. For example, correlations show how strongly two variables are related to one another, yet two variables with no real connection may still be correlated, and that correlation may not be of consequence. For example, the price of pickles may increase as your age increases. While this may be true, it’s not your age that causes the price of pickles to increase. It’s therefore important to remember that correlation does not equal causation, and that all statistics must be used with care.
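The two analyses mentioned above can be sketched in R with the built-in functions `t.test()` and `cor()`. All object names and values below are made up for illustration, and the 0.05 threshold in the comment is a common convention rather than something specified by this lesson.

```r
# Hypothetical data; all values are invented for illustration only.
lunch_f <- c(8, 10, 12, 9, 11)     # lunch cost for females ($)
lunch_m <- c(11, 13, 15, 12, 14)   # lunch cost for males ($)

# t-test: do the two groups differ in their mean lunch cost?
result <- t.test(lunch_f, lunch_m)
result$p.value   # a small value (commonly below 0.05) suggests a significant difference

# Correlation: how strongly are two variables related?
age          <- c(10, 20, 30, 40, 50)        # a person's age (years)
pickle_price <- c(1.0, 1.5, 2.1, 2.4, 3.0)   # price of pickles ($)
r <- cor(age, pickle_price)
r   # a value close to 1 indicates a strong positive relationship
```

Even though `r` is close to 1 in this made-up example, age does not cause the price of pickles to rise; both simply increase over time.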

In this lesson, we will use RStudio to visualize and analyze data collected by researchers at the Black River Streamside Rearing Facility, where researchers have studied threatened Lake Sturgeon since 2001.

__Definitions__

**Summary statistics**: information that gives a quick and simple description of data. Can include mean, median, mode, minimum value, maximum value, range, standard deviation.

**Sample mean:** the average of n observations from the sample. The sum of all values in a sample divided by the number of values in the sample.

**Dependent variable:** a variable (often denoted by y) whose value depends on that of another.

**Median:** The middle value of a sample once its values are sorted from smallest to largest. Where the number of observations is an even number, the median is the average of the two middle values.

**Mode:** The value in a sample which appears most frequently.
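As a brief sketch, the median and mode can be found in R as follows. The vector `lengths_cm` and its values are made up for illustration. Note that base R has no built-in function for the statistical mode (its `mode()` function reports how an object is stored), so this example counts values with `table()` instead.

```r
# Hypothetical sample of fish lengths (cm); values are invented.
lengths_cm <- c(130, 130, 135, 140, 145, 150)

# Median: with an even number of observations, R averages the two middle values
sample_median <- median(lengths_cm)
sample_median   # (135 + 140) / 2 = 137.5

# Mode: count how often each value appears, then keep the most frequent one(s)
counts <- table(lengths_cm)
sample_mode <- as.numeric(names(counts)[counts == max(counts)])
sample_mode     # 130, which appears twice
```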

**Quantile:** A cut point that divides a sorted sample into groups of equal size. If a sample has values of 1, 2, 3, 4, 5, and 6, dividing it into three equal groups gives [1,2], [3,4], and [5,6]; the values that separate these groups are the quantiles.
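A minimal sketch of quantiles in R for the values 1 through 6 is shown below. The object names are made up for illustration. R's `quantile()` function offers several calculation methods, and this sketch relies on its default, so the exact cut points could differ slightly under another method.

```r
values <- c(1, 2, 3, 4, 5, 6)

# Quartiles: cut points that split the sorted values into four equal parts
quantile(values)

# Cut points that split the values into three equal-sized groups
cuts <- quantile(values, probs = c(1/3, 2/3))

# Assign each value to its group and count the members of each group
groups <- cut(values, breaks = c(-Inf, cuts, Inf), labels = c("low", "middle", "high"))
table(groups)   # each group contains two values
```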

**Predictive (independent) variable:** a variable (often denoted by x) whose variation does not depend on that of another variable, and whose value can be used to infer the magnitude of a dependent variable.

**Scatter plot:** a graph in which the values of two variables are plotted along two axes, the pattern of the resulting points revealing any correlation present.
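A scatter plot can be drawn in R with the built-in `plot()` function. The sketch below uses made-up height and speed values; the object names and numbers are assumptions for illustration and are not the data behind any figure in this lesson. The straight trend line is an optional extra, added here with `lm()` and `abline()`.

```r
# Hypothetical measurements; all values are invented for illustration.
height_in <- c(60, 62, 65, 68, 70, 72)         # predictor (x): height in inches
speed_mph <- c(5.0, 5.4, 5.9, 6.3, 6.8, 7.1)   # dependent variable (y): speed in mi/hour

# Scatter plot with the predictor on the x-axis and the dependent variable on the y-axis
plot(height_in, speed_mph,
     xlab = "Height (in)", ylab = "Speed (mi/hour)",
     main = "Speed versus height")

# Optional: fit and draw a straight trend line through the points
fit <- lm(speed_mph ~ height_in)
abline(fit)
```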

**Standard deviation:** A value used to quantify the amount of variation of individual observations around the sample mean. For example, one might have a sample of fish sizes of 130 cm, 135 cm, and 140 cm, and a second sample of fish sizes of 130 cm, 130 cm, and 145 cm. Both groups have a mean of 135 cm. Standard deviation tells us how much individuals typically differ from the sample mean. In this case, group 2 has a standard deviation of 8.66 cm, whereas group 1 has a standard deviation of 5 cm.
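The fish-size example above can be checked in R with the built-in `mean()` and `sd()` functions. The object names `group1` and `group2` are chosen here for illustration; `sd()` computes the sample standard deviation, which divides by n - 1.

```r
group1 <- c(130, 135, 140)   # first sample of fish sizes (cm)
group2 <- c(130, 130, 145)   # second sample of fish sizes (cm)

# Both groups have the same mean
mean(group1)   # 135
mean(group2)   # 135

# The standard deviations differ
sd(group1)     # 5
sd(group2)     # about 8.66
```

Although the means are identical, the larger standard deviation shows that the sizes in the second group are more spread out.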

**Statistical analyses:** mathematical comparisons between two or more discrete groups and their variability, or assessments of how variables are related.

**Lesson Sources:**

**Lesson: Introduction to R / RStudio** (PDF)

**Dataset**: 2018 Adult Lake Sturgeon Dataset (CSV)

**Supporting Resources:**

**R Download:** http://cran.mtu.edu/

**RStudio Download:** https://www.rstudio.com/products/rstudio/download/#download

**Online Learning, RStudio:** https://www.rstudio.com/online-learning/

**RStudio Tutorial for Beginners:** https://www.youtube.com/watch?v=mcYcjH-1giM