Name: Data Science Applications in Agriculture
Author: Miel Hostens

Prologue
28 January 2025 06:00

			Preface
			Acknowledgments
			Introduction

Installing R and RStudio
28 January 2025 06:00

The instructions below include screenshots from the installation process, for which we used the Chrome browser. Chrome is not required, but you can download and install it for free from https://www.google.com/chrome/.

			36.1 Installing R
			36.2 Installing RStudio

1A. Getting started with R and RStudio
28 January 2025 06:00

		1.1. Why R?
		1.2. The R console
		1.3. Scripts
		1.4. RStudio
		1.5. Installing R packages

1B. Getting started with Google Colab
28 January 2025 06:00

		Google Colab

2. R Basics
04 February 2025 06:00

In this book, we will be using the R software environment for all our analyses. You will learn R and data analysis techniques simultaneously. To follow along, you will therefore need access to R. We also recommend using an integrated development environment (IDE), such as RStudio, to save your work. Note that it is common for a course or workshop to offer access to an R environment and an IDE through your web browser, as done by RStudio Cloud. If you have access to such a resource, you don’t need to install R and RStudio. However, if you intend to become an advanced data analyst, we highly recommend installing these tools on your computer. Both R and RStudio are free and available online. We suggest developing your code for the exercises in RStudio and pasting your script into Dodona to have it evaluated.

		2.1 Case study: US Gun Murders
		2.2 The very basics
		2.3.1. Sum of integers 1,...,100
		2.3.2. Sum of integers 1,...,1000
		2.3.3. Interpret code
		2.3.4. Nested functions
		2.3.5. Interpret code
		2.4 Data types
		2.5.1. Variables in a data frame
		2.5.2 Variable names
		2.5.3 Examining variables
		2.5.4 Multiple ways to access a variable
		2.5.5 Factors
		2.5.6 Tables
		2.6 Vectors
		2.7 Coercion
		2.8.1-5. Vectors
		2.8.6. Vector of numbers 12..73
		2.8.7. Odd numbers
		2.8.8. Length of a sequence
		2.8.9. Class of seq(1, 10, 0.5)
		2.8.10. Class of seq(1, 10)
		2.8.11. 1 vs 1L
		2.8.12. Vector cast
		2.9 Sorting
		2.10.1-4. Dataframes 1
		2.10.5-6. Dataframes 2
		2.10.7-8. NA
		2.11 Vector arithmetics
		2.12.1. Convert Temperatures
		2.12.2. Vector Sum
		2.12.3. Vector Mean
		2.13 Indexing
		2.14.1-5. Dataframe operations
		2.14.6. Match function
		2.14.7-8. Match operator
		2.15 Basic plots
		2.16.1. Scatter Plot
		2.16.2. Histogram
		2.16.3. Boxplot

3. Importing Data
25 February 2025 06:00

		5.0 Introduction
		5.1 Paths and the working directory
		5.2 The readr and readxl packages
		5.3 Lost column headers
		5.4 Downloading files
		5.5 R-base importing functions
		5.6 Text versus binary files
		5.7 Unicode versus ASCII
		5.8 Organizing data with spreadsheets
		5.9.1 Spreadsheets 1
		5.9.2 Spreadsheets 2

4. Programming basics
11 February 2025 06:00

We teach R because it greatly facilitates data analysis, the main topic of this book. By coding in R, we can efficiently perform exploratory data analysis, build data analysis pipelines, and prepare data visualization to communicate results. However, R is not just a data analysis environment but a programming language. Advanced R programmers can develop complex packages and even improve R itself, but we do not cover advanced programming in this book. Nonetheless, in this section, we introduce three key programming concepts: conditional expressions, for-loops, and functions. These are not just key building blocks for advanced programming, but are sometimes useful during data analysis. We also note that there are several functions that are widely used to program in R but that we will not cover in this book. These include split, cut, do.call, and Reduce, as well as the data.table package. These are worth learning if you plan to become an expert R programmer.
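
As a rough illustration of how the three concepts named above fit together, the following sketch defines a function that uses a for-loop to add up the integers 1 through n, and then uses a conditional expression to compare the result with the closed-form value n(n + 1)/2. The function name compute_s_n and the choice n = 100 are illustrative assumptions, not something prescribed by the text.

```r
# Function: sum of the integers 1..n, computed with a for-loop
compute_s_n <- function(n) {
  total <- 0
  for (i in seq_len(n)) {
    total <- total + i
  }
  total
}

# Conditional expression: compare the loop result with the closed form
n <- 100
s_n <- compute_s_n(n)
if (s_n == n * (n + 1) / 2) {
  message("Loop result matches the closed-form formula")
} else {
  message("Loop result differs from the closed-form formula")
}
```

In everyday analysis, the vectorized call sum(1:n) returns the same value without an explicit loop; the loop is written out here only to show the construct.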

		3.0 Introduction
		3.1 Conditional expressions
		3.2 Defining functions
		3.3 Namespaces
		3.4 For-loops
		3.5 Vectorization and functionals
		3.6.1 Conditional expression
		3.6.2 Any and all
		3.6.3 Conditional changes
		3.6.4 Sum of the first n numbers
		3.6.5 Functions of multiple variables
		3.6.6 Namespace
		3.6.7-8 Sum of the first n squares
		3.6.9 Sum of the first n squares (II)
		3.6.10 Sum of the first n squares (III)

5. The tidyverse
21 February 2025 06:00

Up to now we have been manipulating vectors by reordering and subsetting them through indexing. However, once we start more advanced analyses, the preferred unit for data storage is not the vector but the data frame. In this chapter we learn to work directly with data frames, which greatly facilitate the organization of information. We will be using data frames for the majority of this book. We will focus on a specific data format referred to as tidy and on a specific collection of packages, referred to as the tidyverse, that are particularly helpful for working with tidy data.

We can load all the tidyverse packages at once by installing and loading the tidyverse package:

# install.packages("tidyverse")  # run once if the package is not yet installed
library(tidyverse)

We will learn how to implement the tidyverse approach throughout the book, but before delving into the details, in this chapter we introduce some of the most widely used tidyverse functionality, starting with the dplyr package for manipulating data frames and the purrr package for working with functions. Note that the tidyverse also includes a graphing package, ggplot2, which will be introduced in a later course on data visualization; the readr package, discussed in Chapter 5; and many others. In this chapter, we first introduce the concept of tidy data and then demonstrate how we use the tidyverse to work with data frames in this format.
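
To give a flavor of the dplyr workflow mentioned above, here is a minimal sketch that chains a few dplyr verbs with the pipe. The data frame yields and all of its values are made up for illustration; they are not part of the course material.

```r
library(dplyr)

# Hypothetical toy data, invented for this example only
yields <- data.frame(
  field     = c("A", "A", "B", "B"),
  year      = c(2023, 2024, 2023, 2024),
  area_ha   = c(2, 2, 4, 4),
  harvest_t = c(10, 12, 16, 28)
)

yield_summary <- yields %>%
  mutate(yield_t_per_ha = harvest_t / area_ha) %>%  # add a derived column
  filter(year == 2024) %>%                          # keep only the 2024 rows
  group_by(field) %>%                               # one group per field
  summarize(mean_yield = mean(yield_t_per_ha)) %>%  # one row per group
  arrange(desc(mean_yield))                         # largest value first

yield_summary
```

Each row of yields is one observation (a field in a given year) and each column is one variable, which is the tidy layout this chapter builds on.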

		4.1 Tidy data
		4.2.1 CO2
		4.2.2 Weight
		4.2.3 BOD
		4.2.4 Some more datasets
		4.3 Manipulating data frames
		4.4.1-3. Dataframe column operations
		4.4.4-6. Dataframe row operations
		4.4.7. Dataframe mixed operations
		4.5 The pipe: %>%
		4.6.1 Pipes
		4.7 Summarizing data
		4.8 Sorting data frames
		4.9.1. Summarizing
		4.9.2. Grouping
		4.9.3. Sorting
		4.10 Tibbles
		4.11 The dot operator
		4.12 do
		4.13 The purrr package
		4.14 Tidyverse conditionals
		4.15.1. Tibbles 1
		4.15.2. Tibbles 2
		4.15.3. Purrr

6. Introduction to data visualization
11 March 2025 05:00

		6.0 Introduction to data visualization

7. ggplot2
11 March 2025 05:00

Exploratory data visualization is perhaps the greatest strength of R. One can quickly go from idea to data to plot with a unique balance of flexibility and ease. For example, Excel may be easier than R for some plots, but it is nowhere near as flexible. D3.js may be more flexible and powerful than R, but it takes much longer to generate a plot.

Throughout the book, we will be creating plots using the ggplot2 package.

library(dplyr)
library(ggplot2)
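
As a small, hedged illustration of going from data to plot, the sketch below builds a scatterplot from a data object, an aesthetic mapping, a geometry layer, and labels. It uses the mtcars dataset that ships with R purely as a stand-in; the course itself may use different data.

```r
library(ggplot2)

# mtcars ships with R; it is used here only as a stand-in dataset
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +  # data and aesthetic mapping
  geom_point() +                              # geometry layer: one point per car
  labs(
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    title = "Fuel efficiency versus weight"
  )

print(p)  # draws the plot on the active graphics device
```

Because the plot is stored in the object p, further layers can be added to it later with the + operator.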

		7.0 Introduction
		7.1 The components of a graph
		7.2 ggplot objects
		7.3 Geometries
		7.4 Aesthetic mappings
		7.5 Layers
		7.6 Global versus local aesthetic
		7.7 Scales
		7.8 Labels and titles
		7.9 Categories as colors
		7.10 Annotation, shapes, and adjustments
		7.11 Add-on packages
		7.12 Putting it all together
		7.13 Quick plots with qplot
		7.14 Grids of plots
		7.15.1 GGPlot Exercise 1
		7.15.2 Printing a ggplot
		7.15.3 GGPlot Exercise 3
		7.15.4 GGPlot Exercise 4
		7.15.5 GGPlot Exercise 5
		7.15.6 GGPlot Exercise 6
		7.15.7 GGPlot Exercise 7

8. Visualizing data distributions
18 March 2025 05:00

You may have noticed that numerical data is often summarized with the average value. For example, the quality of a high school is sometimes summarized with one number: the average score on a standardized test. Occasionally, a second number is reported: the standard deviation. For example, you might read a report stating that scores were 680 plus or minus 50 (the standard deviation). The report has summarized an entire vector of scores with just two numbers. Is this appropriate? Is there any important piece of information that we are missing by only looking at this summary rather than the entire list?

Our first data visualization building block is learning to summarize lists of factors or numeric vectors. More often than not, the best way to share or explore this summary is through data visualization. The most basic statistical summary of a list of objects or numbers is its distribution. Once a vector has been summarized as a distribution, there are several data visualization techniques to effectively relay this information.

In this chapter, we first discuss properties of a variety of distributions and how to visualize distributions using a motivating example of student heights. We then discuss the ggplot2 geometries for these visualizations in Section 8.16.
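
The question raised above, whether an average and a standard deviation are enough, can be explored with a small sketch. The two vectors below are made up and were constructed so that they share the same mean and standard deviation; the empirical cumulative distribution function (base R's ecdf) then shows that their distributions still differ.

```r
# Two made-up vectors constructed to share the same mean and standard deviation
a <- c(1, 1, 3, 3)
b <- c(2 - sqrt(2), 2, 2, 2 + sqrt(2))

c(mean(a), mean(b))  # both means are 2 (up to floating-point rounding)
c(sd(a), sd(b))      # both standard deviations are sqrt(4/3)

# The empirical CDF reports the proportion of values at or below a point
F_a <- ecdf(a)
F_b <- ecdf(b)
F_a(2)  # 0.5: two of the four values in a are <= 2
F_b(2)  # 0.75: three of the four values in b are <= 2
```

The two-number summary is identical for a and b, yet half of a and three quarters of b lie at or below 2, which is the kind of information the summary alone hides.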

		8.0 Introduction
		8.1 Variable types
		8.2 Case study: describing student heights
		8.3 Distribution function
		8.4 Cumulative distribution functions
		8.5 Histograms
		8.6 Smoothed density
		8.7.1 Proportion in North Central
		8.7.2 On histograms and barplots
		8.7.3 75 or lower
		8.7.4 The average height
		8.7.5 Murder states
		8.7.6 Analyzing the murders
		8.7.7 Analyzing heights
		8.7.9 Large states
		8.7.10 Smooth
		8.8 The normal distribution
		8.9 Standard units
		8.10 Quantile-quantile plots
		8.11 Percentiles
		8.12 Boxplots
		8.13 Stratification
		8.14 Case study: describing student heights (continued)
		8.15.1. Percentiles
		8.15.2. Interpreting a Boxplot
		8.15.3. The Normal Distribution
		8.15.4. NBA
		8.15.5. Interpret the results
		8.16 ggplot2 geometries
		8.17.1. Heights Data
		8.17.2. Histogram
		8.17.3. Smooth Density

9. Data visualization in practice
25 March 2025 05:00

In this chapter, we will demonstrate how relatively simple ggplot2 code can create insightful and aesthetically pleasing plots. As motivation we will create plots that help us better understand trends in world health and economics. We will implement what we learned in Chapter 7 and Section 8.16 and learn how to augment the code to perfect the plots. As we go through our case study, we will describe relevant general data visualization principles and learn concepts such as faceting, time series plots, transformations, and ridge plots.

		9.1. Case study: new insights on poverty
		9.2. Scatterplots
		9.3. Faceting
		9.4. Time series plots
		9.5. Data transformations
		9.6. Visualizing multiple distributions with boxplots and ridge plots
		9.7. The ecological fallacy and importance of showing the data

10. Data visualization principles
08 April 2025 06:00

We have already provided some rules to follow as we created plots for our examples. Here, we aim to provide some general principles we can use as a guide for effective data visualization. Much of this section is based on a talk by Karl Broman titled “Creating Effective Figures and Tables” and includes some of the figures which were made with code that Karl makes available on his GitHub repository, as well as class notes from Peter Aldhous’ Introduction to Data Visualization course. Following Karl’s approach, we show some examples of plot styles we should avoid, explain how to improve them, and use these as motivation for a list of principles. We compare and contrast plots that follow these principles to those that don’t.

The principles are mostly based on research related to how humans detect patterns and make visual comparisons. The preferred approaches are those that best fit the way our brains process visual information. When deciding on a visualization approach, it is also important to keep our goal in mind. We may be comparing a viewable number of quantities, describing distributions for categories or numeric values, comparing the data from two groups, or describing the relationship between two variables. As a final note, we want to emphasize that for a data scientist it is important to adapt and optimize graphs to the audience. For example, an exploratory plot made for ourselves will be different than a chart intended to communicate a finding to a general audience.

We will be using these libraries:

library(tidyverse)
library(dslabs)
library(gridExtra)

		10.1. Encoding data using visual cues
		10.2. Know when to include 0
		10.3. Do not distort quantities
		10.4. Order categories by a meaningful value
		10.5. Show the data
		10.6. Ease comparisons
		10.7. Think of the color blind
		10.8. Plots for two variables
		10.9. Encoding a third variable
		10.10. Avoid pseudo-three-dimensional plots
		10.11. Avoid too many significant digits
		10.12. Know your audience
		10.13.1. Bad Plots
		10.13.2. Reordering
		10.13.3. Boxplot of Murder Rates
		10.13.4. 3D Plots
		10.14. Case study: vaccines and infectious diseases
		10.15.1. Smallpox Tileplot
		10.15.2. Smallpox Time Series
		10.15.3. Comparing Diseases

11. Robust summaries
15 April 2025 06:00

		Outliers
		Median
		The inter quartile range (IQR)
		Tukey's definition of an outlier
		Median absolute deviation
		Case study: self-reported student heights
		11.06.01 Summarizing Exercise
		11.06.02 Influence of mistakes exercise
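
The topics listed above can be tried out on a tiny example. The sketch below uses made-up heights in which the last entry mimics a data-entry mistake, and compares the average and standard deviation with the median, the IQR, and the median absolute deviation, followed by Tukey's rule for flagging outliers. The vector x is invented for illustration only.

```r
# Made-up heights in inches; the last entry mimics a data-entry mistake
x <- c(66, 67, 68, 69, 700)

mean(x)    # 194: pulled far away from the bulk of the data
median(x)  # 68: unchanged by the single mistake
sd(x)      # inflated by the mistake
IQR(x)     # 2: based on the quartiles only
mad(x)     # median absolute deviation, scaled by 1.4826 by default

# Tukey's rule: flag values more than 1.5 * IQR beyond the quartiles
q <- unname(quantile(x, c(0.25, 0.75)))
lower <- q[1] - 1.5 * IQR(x)
upper <- q[2] + 1.5 * IQR(x)
outliers <- x[x < lower | x > upper]
outliers   # 700
```

The median, IQR, and MAD describe the four plausible values, whereas the mean and standard deviation are dominated by the single erroneous entry.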

Data Science Applications in Agriculture (2024–2025)

Miel Hostens · Cornell University

