R Data Exploration and Visualisation (2020–2021)

Lieven Clement · Universiteit Gent

In this interactive course you will learn basic data visualisation principles and how to apply them using ggplot2. The course is based on the second part of the e-book Introduction to Data Science, authored by Prof. Rafael Irizarry of the Department of Data Sciences at the Dana-Farber Cancer Institute and the Department of Biostatistics at the Harvard School of Public Health.

6. Introduction to data visualization

Exploratory data visualization is perhaps the greatest strength of R. One can quickly go from idea to data to plot with a unique balance of flexibility and ease. For example, Excel may be easier than R for some plots, but it is nowhere near as flexible. D3.js may be more flexible and powerful than R, but it takes much longer to generate a plot.

Throughout the book, we will be creating plots using the ggplot2 package.
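
As a rough illustration of what a ggplot2 plot looks like in code, the sketch below combines a data frame, an aesthetic mapping, and a geometry. The `toy` data frame and its column names are invented for this example and do not come from the course material.

```r
library(ggplot2)

# Hypothetical toy data, invented for illustration only
toy <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2.1, 3.9, 6.2, 8.1, 9.8),
  group = c("a", "a", "b", "b", "b")
)

# Data, aesthetic mapping, and geometry are combined with `+`
p <- ggplot(toy, aes(x = x, y = y, color = group)) +
  geom_point(size = 3) +
  labs(title = "Toy scatterplot", x = "x", y = "y")

print(p)  # draws the plot on the active graphics device
```

Storing the plot in `p` before printing makes it easy to add further layers later.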

Title Group progress Status
7.0 Introduction
7.1 The components of a graph
7.2 ggplot objects
7.3 Geometries
7.4 Aesthetic mappings
7.5 Layers
7.6 Global versus local aesthetic
7.7 Scales
7.8 Labels and titles
7.9 Categories as colors
7.10 Annotation, shapes, and adjustments
7.11 Add-on packages
7.12 Putting it all together
7.13 Quick plots with qplot
7.14 Grids of plots
7.15.1 GGPlot Exercise 1
7.15.2 Printing a ggplot
7.15.3 GGPlot Exercise 3
7.15.4 GGPlot Exercise 4
7.15.5 GGPlot Exercise 5
7.15.6 GGPlot Exercise 6
7.15.7 GGPlot Exercise 7

8. Visualizing data distributions

You may have noticed that numerical data is often summarized with the average value. For example, the quality of a high school is sometimes summarized with one number: the average score on a standardized test. Occasionally, a second number is reported: the standard deviation. For example, you might read a report stating that scores were 680 plus or minus 50 (the standard deviation). The report has summarized an entire vector of scores with just two numbers. Is this appropriate? Is there any important piece of information that we are missing by only looking at this summary rather than the entire list?
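
To see why the two-number summary can hide information, the following sketch builds two simulated score vectors with the same average and standard deviation but different shapes. The data, the seed, and the helper `rescale` are assumptions made for this illustration only.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical helper: rescale a vector to a chosen mean and standard deviation
rescale <- function(x, m, s) (x - mean(x)) / sd(x) * s + m

# Simulated scores: one bell-shaped group, one made of two clusters
bell    <- rescale(rnorm(1000), 680, 50)
bimodal <- rescale(c(rnorm(500, -2), rnorm(500, 2)), 680, 50)

# Both vectors report the same two-number summary...
c(mean = mean(bell), sd = sd(bell))
c(mean = mean(bimodal), sd = sd(bimodal))

# ...yet the share of scores within 20 points of the average differs
mean(abs(bell - 680) < 20)
mean(abs(bimodal - 680) < 20)
```

The average and standard deviation alone cannot distinguish these two vectors, which is the motivation for looking at the full distribution.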

Our first data visualization building block is learning to summarize lists of factors or numeric vectors. More often than not, the best way to share or explore this summary is through data visualization. The most basic statistical summary of a list of objects or numbers is its distribution. Once a vector has been summarized as a distribution, there are several data visualization techniques to effectively relay this information.
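
A minimal base R sketch of what "summarizing as a distribution" can mean: proportions per category for a factor, and an empirical cumulative distribution function for a numeric vector. The `region` and `heights` values are made up for the example.

```r
# Hypothetical categorical data: the distribution is the proportion per category
region <- factor(c("North", "South", "South", "West", "North", "South"))
props <- prop.table(table(region))
props

# Hypothetical numeric data: the empirical CDF at a is the proportion of values <= a
heights <- c(61, 64, 66, 67, 68, 69, 70, 71, 73, 75)
cdf <- ecdf(heights)
cdf(68)            # proportion of heights at or below 68
cdf(72) - cdf(65)  # proportion of heights in the interval (65, 72]
```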

In this chapter, we first discuss properties of a variety of distributions and how to visualize distributions using a motivating example of student heights. We then discuss the ggplot2 geometries for these visualizations in Section 8.16.

Title Group progress Status
8.1 Variable types
8.2 Case study: describing student heights
8.3 Distribution function
8.4 Cumulative distribution functions
8.5 Histograms
8.6 Smoothed density
8.7.1 Proportion in North Central
8.7.2 On histograms and barplots
8.7.3 75 or lower
8.7.4 The average height
8.7.5 Murder states
8.7.6 Analyzing the murders
8.7.7 Analyzing heights
8.7.9 Large states
8.7.10 Smooth
8.8 The normal distribution
8.9 Standard units
8.10 Quantile-quantile plots
8.11 Percentiles
8.12 Boxplots
8.13 Stratification
8.14 Case study: describing student heights (continued)
8.15.1. Percentiles
8.15.2. Interpreting a Boxplot
8.15.3. The Normal Distribution
8.15.4. NBA
8.15.5. Interpret the results
8.16 ggplot2 geometries
8.17.1. Heights Data
8.17.2. Histogram
8.17.3. Smooth Density

9. Data visualization in practice

In this chapter, we will demonstrate how relatively simple ggplot2 code can create insightful and aesthetically pleasing plots. As motivation we will create plots that help us better understand trends in world health and economics. We will implement what we learned in Chapter 7 and Section 8.16 and learn how to augment the code to perfect the plots. As we go through our case study, we will describe relevant general data visualization principles and learn concepts such as faceting, time series plots, transformations, and ridge plots.
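
As a small preview of two of these concepts, the sketch below applies a log transformation to one axis and facets the plot by year. The `toy` data frame is invented for illustration and is not the data set used in the case study.

```r
library(ggplot2)

# Hypothetical country-year records, invented for illustration only
toy <- data.frame(
  year      = rep(c(1960, 2010), each = 4),
  income    = c(500, 2000, 8000, 15000, 1500, 6000, 20000, 40000),
  life_exp  = c(45, 55, 68, 71, 60, 70, 78, 82),
  continent = rep(c("Africa", "Asia", "Americas", "Europe"), times = 2)
)

p <- ggplot(toy, aes(x = income, y = life_exp, color = continent)) +
  geom_point(size = 3) +
  scale_x_log10() +      # transformation: income shown on a log scale
  facet_wrap(~ year) +   # faceting: one panel per year
  labs(x = "Income (log scale)", y = "Life expectancy")

print(p)
```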

Title Group progress Status
9.1. Case study: new insights on poverty
9.2. Scatterplots
9.3. Faceting
9.4. Time series plots
9.5. Data transformations
9.6. Visualizing multiple distributions with boxplots and ridge plots
9.7. The ecological fallacy and importance of showing the data

10. Data visualization principles

We have already provided some rules to follow as we created plots for our examples. Here, we aim to provide some general principles we can use as a guide for effective data visualization. Much of this section is based on a talk by Karl Broman titled “Creating Effective Figures and Tables” and includes some of the figures which were made with code that Karl makes available on his GitHub repository, as well as class notes from Peter Aldhous’ Introduction to Data Visualization course. Following Karl’s approach, we show some examples of plot styles we should avoid, explain how to improve them, and use these as motivation for a list of principles. We compare and contrast plots that follow these principles to those that don’t.

The principles are mostly based on research related to how humans detect patterns and make visual comparisons. The preferred approaches are those that best fit the way our brains process visual information. When deciding on a visualization approach, it is also important to keep our goal in mind. We may be comparing a viewable number of quantities, describing distributions for categories or numeric values, comparing the data from two groups, or describing the relationship between two variables. As a final note, we want to emphasize that for a data scientist it is important to adapt and optimize graphs to the audience. For example, an exploratory plot made for ourselves will be different than a chart intended to communicate a finding to a general audience.

We will be using these libraries:

Title Group progress Status
10.1. Encoding data using visual cues
10.2. Know when to include 0
10.3. Do not distort quantities
10.4. Order categories by a meaningful value
10.5. Show the data
10.6. Ease comparisons
10.7. Think of the color blind
10.8. Plots for two variables
10.9. Encoding a third variable
10.10. Avoid pseudo-three-dimensional plots
10.11. Avoid too many significant digits
10.12. Know your audience
10.13.1. Bad Plots
10.13.2. Reordering
10.13.3. Boxplot of Murder Rates
10.13.4. 3D Plots
10.14. Case study: vaccines and infectious diseases
10.15.1. Smallpox Tileplot
10.15.2. Smallpox Time Series
10.15.3. Comparing Diseases
Title Group progress Status
The inter quartile range (IQR)
Tukey's definition of an outlier
Median absolute deviation
Case study: self-reported student heights
11.06.01 Summarizing Exercise
11.06.02 Influence of mistakes exercise
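
The robust summaries named in the list above can be sketched in base R as follows. The `heights` vector is hypothetical, and the 1.5 multiplier is the conventional choice for Tukey's rule, assumed here rather than taken from this page.

```r
# Hypothetical self-reported heights in inches; 180 mimics an entry made in cm
heights <- c(62, 64, 65, 66, 67, 68, 69, 70, 72, 180)

q   <- quantile(heights, c(0.25, 0.75))  # first and third quartiles
iqr <- IQR(heights)                      # interquartile range, q[2] - q[1]

# Tukey's rule (assumed multiplier 1.5): flag values far outside the quartiles
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
outliers <- heights[heights < lower | heights > upper]
outliers

# The mean and sd are pulled by the mistaken entry; the median and MAD barely move.
# Note that mad() scales by 1.4826 by default so it is comparable to sd for normal data.
c(mean = mean(heights), median = median(heights))
c(sd = sd(heights), mad = mad(heights))
```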