Why R?
R has become one of the most widely used statistical programming languages, surpassing traditional tools like SAS, SPSS, and Stata.
Millions of users worldwide, spanning academia, industry, and government. Over 20,000 packages on CRAN, covering nearly every domain of data science and statistical modeling and hHundreds of new packages added each month, reflecting a vibrant and innovative developer community.
R’s ecosystem has grown from a niche academic tool into a global platform for reproducible research, machine learning, and advanced analytics. Its open-source nature and rapid package development make it a cornerstone of modern data science.
Why is R so popular?
- Free and open source. Get R from www.r-project.org
- Packages for nearly everything (statistics, graphics, finance, geology …)
- Easy to add/publish packages
- Multiple R interfaces available
Basic Operations in R
3+3 # [1] 6
x <- 3
y = 3
x <- y <- 3
x^2 # square, [1] 9
x <- c(1,2,3,4) # or x = c(1:4)
x # [1] 1 2 3 4
x + 10 # R is vectorized, [1] 11 12 13 14
y <- c(1,2)
x + y # recycling rule, [1] 2 4 4 6
x[4] # show 4th element, [1] 4
x > 2 # [1] FALSE FALSE TRUE TRUE
x[x > 2] # [1] 3 4
x[1:3] # [1] 1 2 3
is.na(x) # [1] FALSE FALSE FALSE FALSE
cumsum(x) # [1] 1 3 6 10
seq(1,5) # [1] 1 2 3 4 5
rep(1,5) # [1] 1 1 1 1 1Basic Mathematical and Statistical Functions
R has a wide range of useful mathematical and statistical functions available through packages.
x <- 1:10
mean(x) # [1] 5.5
median(x) # [1] 5.5
sd(x) # [1] 3.02765
sum(x) # [1] 55
sqrt(x) # [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 ...
summary(x)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 1.00 3.25 5.50 5.50 7.75 10.00 R online and local help
A good place to start learning R is the online documentation. Launch R and type:
help.start()The inline help is accessed by:
?meanA nice (fun) book:
http://www.burns-stat.com/documents/books/the-r-inferno/
There are many introductory (many free) sites, for example:
https://www.codecademy.com/learn/learn-r
R Packages
You can find many packages through CRAN.
Installing packages:
install.packages("ggplot2")To update all your packages:
update.packages(checkBuilt = TRUE, ask = FALSE)To know the libraries already installed in your machine:
search()
library()Data Structure in R
There are four main types of data structures in R:
- Vectors
- Matrix
- Data.frame
- Lists
Vectors
Vectors are the basic unit for data storage in R. Vectors can store one or more values of the same type (numeric or character strings).
# Vector of strings
x <- c("Germany", "Argentina", "Netherlands")
# Replacing the second element of the vector
x[2] <- NA
x # [1] "Germany" NA "Netherlands"
x[!is.na(x)] # [1] "Germany" "Netherlands"Vector Classes
# character vectors
x <- c("Germany", "Argentina", "Netherlands")
class(x) # [1] "character"
x <- c(1.1, 2.5)
class(x) # [1] "numeric"
x <- 1:4
class(x) # [1] "integer"
x <- x > 2
class(x) # [1] "logical"Matrices
There are many ways to define a matrix.
matrix(1:6, ncol=2)
# [,1] [,2]
# [1,] 1 4
# [2,] 2 5
# [3,] 3 6
matrix(1:6, nrow=2, byrow=TRUE)
# [,1] [,2] [,3]
# [1,] 1 2 3
# [2,] 4 5 6
r1 = c(1,2)
r2 = c(3,4)
A = rbind(r1, r2)
A
# [,1] [,2]
# r1 1 2
# r2 3 4
t(A) # transpose
# r1 r2
# [1,] 1 3
# [2,] 2 4
det(A) # determinant, [1] -2
solve(A) # inverse
# r1 r2
# [1,] -2 1.0
# [2,] 1.5 -0.5
r1 = c(1,-2)
r2 = c(0,1)
B = rbind(r1, r2)
A+B # matrix addition
A%*%B # matrix multiplicationData Frames
Think of data frames as specialized type of lists that can do more than usual matrices.
df <- data.frame(
patientID = c("001", "002", "003"),
treatment = c("drug", "placebo", "drug"),
age = c(20, 30, 24)
)
df
# patientID treatment age
# 1 001 drug 20
# 2 002 placebo 30
# 3 003 drug 24
subset(df, treatment == "drug")
# patientID treatment age
# 1 001 drug 20
# 3 003 drug 24Lists
Lists are powerful and versatile data structures. You can store any data structure in lists, even lists.
my.list <- list(
fruits = c("oranges", "bananas", "apples"),
mat = matrix(1:10, nrow=2)
)
my.list
# $fruits
# [1] "oranges" "bananas" "apples"
#
# $mat
# [,1] [,2] [,3] [,4] [,5]
# [1,] 1 3 5 7 9
# [2,] 2 4 6 8 10Importing Data
It is easy to import data into R. Simply invoke import and export R functions, which exist for many formats including:
- text or ASCII files
- proprietary formats (e.g. SAS, Stata, SPSS, Excel, etc)
- databases (Oracle, MySQL, couchDB, SQLite, etc)
- R data and S
- XML, HTML tables and many more
Reading ASCII/Text files
A quick way to access a csv file is:
read.csv(file.choose(), header=TRUE)The most common R command to import data from text/csv datafiles is read.table and read.csv, e.g.:
URL <- "https:// ... /scorer2014.csv"
scorer <- read.table(URL, sep='', header=TRUE)
scorer$name
# [1] Rodriguez Muller Neymer Messi
# Levels: Messi Muller Neymer Rodriguez
class(scorer) # [1] "data.frame"To check your current directory type getwd(), which you can change, for example, by setwd("C:/Users/YY/Documents/R").
Reading Foreign Files
Use haven for SPSS and Stata – it’s actively maintained, handles modern versions, and integrates well with tidyverse workflows.
install.packages("haven") # Install if needed
library(haven)
# Read SPSS file
mydata <- read_sav("myfile.sav")
# Read Stata data file
mydata <- read_dta("mydata.dta")
The readxl package is now the standard way to import Excel files into R, replacing older approaches like gdata::read.xls or xlsx.
# Read a data frame to Excel
install.packages("readxl") # Install if needed
library(readxl)
mydata <- read_excel("mydata.xlsx", sheet = 1)
# Write a data frame to Excel
install.packages("writexl")
library(writexl)
write_xlsx(mydata, "output.xlsx")The more modern alternative is openxlsx:
install.packages("openxlsx") # Install if needed
library(openxlsx)
mydata <- read.xlsx("mydata.xlsx", sheet = 1) # Read Excel file
# Write a data frame to Excel
write.xlsx(mydata, file = "output.xlsx", sheetName = "Sheet1", overwrite = TRUE)Exporting Data
To save as R data:
save(scorer, file="scorer2014.Rda")
load("scorer2014.Rda") # to reload your saved dataIf you want to export your data to a csv file you may use write.csv or to a text file use write.table:
write.csv(mydata, "scorer2014.csv")
write.table(mydata, file="scorer2014.csv", sep = ',')To export data into other formats (SPSS, Stata) use haven package:
# Write to SPSS (.sav)
write_sav(mydata, "output.sav")
# Write to Stata (.dta)
write_dta(mydata, "output.dta"Data Management Basics
ls() # list objects in the working environment
rm() # remove objects
# data entry with data editor
mydata <- data.frame(age=numeric(0), gender=character(0), weight=numeric(0))
mydata <- edit(mydata)
# list the variables in mydata
names(mydata)
# list the structure of mydata
str(mydata)
# class of an object (numeric, matrix, data frame, etc)
class(age)
class(gender)
# print mydata
mydata
# print first 2 rows of mydata
head(mydata, n=2)
# print last 2 rows of mydata
tail(mydata, n=2)
# Ctrl-L to clear console screen
# Esc to interrupt currently executing commandProgramming Basics
- Simulations
- Looping
- IF … ELSE conditions
- Writing Functions
Simulations
R is really nice for doing simulations and monte carlo experiments:
set.seed(10101)
# drawing from normal distribution
a <- rnorm(1000, mean=0, sd=1)
hist(a)You can also make random draws from other distributions:
# drawing from poisson
rpois(10, 2)You can also sample randomly from a specified set of objects:
sample(1:10, 4) # [1] 7 6 1 5
sample(1:10, 4, replace = T) # [1] 8 8 8 10 # with replacement
sample(letters, 4) # without replacementLoops
Loops are useful (but not always an efficient way) to do repetitive tasks.
for (i in 1:5){
print("Hello")
}
# [1] "Hello"
# [1] "Hello"
# [1] "Hello"
# [1] "Hello"
# [1] "Hello"
# Law of Large Numbers
store <- matrix(NA, 1000, 1)
for (i in 1:1000){
a <- rnorm(i)
store[i] <- mean(a)
}
plot(store, type="l")Apply family of functions
R has a series of functions that can simulate small loops called apply.
mat <- matrix(1:10, nrow=2, byrow=T)
mat
# [,1] [,2] [,3] [,4] [,5]
# [1,] 1 2 3 4 5
# [2,] 6 7 8 9 10
# sum the columns
apply(mat, 2, sum) # [1] 7 9 11 13 15
# sum the rows
apply(mat, 1, sum) # [1] 15 40
rowSums(mat) # gives same resultsIF … ELSE Condition
x <- 1
if(x == 1){
print("x is equal 1")
} else{
print("x is NOT equal to 1")
}
# [1] "x is equal 1"
# Shortcut
ifelse(x == 1, "x is equal 1", "x is NOT equal to 1")
# [1] "x is equal 1"Exercise: Fibonacci Sequence
Let’s use the loop and IF … ELSE conditional statement to produce the first 50 Fibonacci numbers.
fib <- rep(0, 50)
for(i in 1:50){
if (i <= 2){
fib[i] <- 1
} else{
fib[i] <- fib[i-1] + fib[i-2]
}
}
fibFunctions
We can define functions in R, which uses the following basic syntax:
function.name <- function(INPUTS){
ARGUMENTS
}Let’s take a simple example:
mr.euclide <- function(a, b){
dist <- sqrt(sum((b - a)^2))
return(dist)
}To execute the mr.euclide function:
a <- c(1, 1)
b <- c(2, 2)
mr.euclide(a, b) # [1] 1.414214Exercise: Simulating a Dice
dice <- function(n) {
k <- sample(1:6, size = n, replace = T)
return(k)
}
a <- dice(100)
hist(a)Time series from Cauchy distribution
crazy <- function(num) {
x <- rep(NA, num)
for (n in 1:num) x[n] <- mean(rcauchy(n))
plot(x, type = "l", xlab = "Cauchy", ylab = "mean")
}
crazy(100)Challenge: Fibonacci Function
Now incorporate functions to produce say the first 20 Fibonacci numbers:
fib <- function(n = 20){
if(n < 3){
return(c(1, 1))
} else{
fib <- rep(0, n)
for(i in 1:n){
if(i <= 2){
fib[i] <- 1
} else{
fib[i] <- fib[i-1] + fib[i-2]
}
}
}
return(fib)
}
fib(20)Statistical Functions
R is a statistical programming language with many useful and powerful graphical libraries.
Let’s look at some of the functions available with the base package. For example, you can get the values from statistical tables:
pnorm(1.96, 0, 1) # cumulative distribution, [1] 0.9750021
pnorm(1.96, 0, 1, lower.tail = F) # [1] 0.0249979
dnorm(1.96, 0, 1) # density, [1] 0.05844094
dchisq(9, 3) # [1] 0.01329555
1 - pf(4, 43, 3.6) # p-value, [1] 0.1080504Graphics in R
Histograms
normal <- rnorm(1000, mean=1, sd=1)
par(mfrow = c(1, 2))
hist(normal, main = "HISTOGRAM")
plot(density(normal), col = 'red', lwd = 2, main = "DENSITY")Scatter Plots
R can generate high quality publication ready graphics.
data() # Displays a list of all built-in datasets in loaded packages
# data(package = .packages(all.available = TRUE))
head(cars, 3)
# speed dist
# 1 4 2
# 2 4 10
# 3 7 4
plot(cars$speed, cars$dist)Scatter plots with linear regression line
plot(cars$speed, cars$dist)
fit <- lm(cars$dist ~ cars$speed)
abline(fit, col = 'red')
summary(fit)
# Call:
# lm(formula = cars$dist ~ cars$speed)
#
# Residuals:
# Min 1Q Median 3Q Max
# -29.069 -9.525 -2.272 9.215 43.201
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -17.5791 6.7584 -2.601 0.0123 *
# cars$speed 3.9324 0.4155 9.464 1.49e-12 ***
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1Exercise: Simulate a Linear Model
Simulate a linear model:
\[y = \alpha +\beta x + \epsilon\]
where \(\epsilon \sim N(0, 2^2)\), \(x \sim N(0, 1^2)\), \(\alpha = \frac{1}{2}\) and \(\beta = 2\)
set.seed(10101)
x <- rnorm(100)
e <- rnorm(100, 0, 2)
y <- 0.5 + 2 * x + e
summary(y)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# -9.3680 -1.1200 0.5318 0.6110 2.7240 7.9840
plot(x, y)Try also fitting a line and find parameters by OLS.
Boxplots
boxplot(len ~ dose * supp, data = ToothGrowth)Saving Pictures
To save your figures use pdf() or savePlot():
pdf(file = 'scatter.pdf', height = 5, width = 5)
plot(cars$speed, cars$dist)
dev.off()Further Resources
There is of course much more to discover in R. Some YouTubes and sites worth visiting are:
- http://www.youtube.com/watch?v=ttv4IA8PEzw
- https://www.youtube.com/watch?v=jWjqLW-u3hc
- https://www.r-project.org/doc/bib/R-books.html
And there are many, many books as well. Good luck and remember to have fun with R.
The abbove is based on my workshop taught at Strathmore University on July 18, 2014. You can view or download the slides here.