Lesson 40 – Discrete distributions in R: Part II

The scarfs and gloves come out of the closet.

The neighborhood Starbucks coffee cups turn red.

It’s a reminder that autumn is ending.

It’s a reminder that 4 pm is 8 pm.

It’s a reminder that winter is coming.

Today’s temperature in New York is below 30F – a cold November day.

  1. Do you want to know what the probability of a cold November day is?
  2. Do you want to know what the return period of such an event is?
  3. Do you want to know how many such events happened in the last five years?

Get yourself some warm tea. Let the room heater crackle. We’ll dive into the rest of the discrete distributions in R.

Get the Data

The National Centers for Environmental Information (NCEI) archives weather data for most of the United States. I requested temperature data for Central Park, NYC. Anyone can go online and submit requests for data. They will deliver it to your email in your preferred file format. I filtered the data for our lesson. You can get it from here.

Preliminary Analysis

You can use the following line to read the data file into R workspace.

# Read the temperature data file #
temperature_data = read.table("temperature_data_CentralPark.txt",header=T)

The data has six columns. The first three columns indicate the year, month and day of the record. The fourth, fifth and sixth columns provide the data for average daily temperature, maximum daily temperature, and minimum daily temperature. We will work with the average daily temperature data.

Next, I want to choose the coldest day in November for all the years in the record. For this, I will look through each year’s November data, identify the day with lowest average daily temperature and store it in a matrix. You can use the following lines to get this subset data.

# Years #
years = unique(temperature_data$Year) # Identifying unique years in the data
nyrs = length(years) # number of years of data

# November Coldest Day #
november_coldtemp = matrix(NA,nrow=nyrs,ncol=1)

for (i in 1:nyrs)
{
 days = which((temperature_data$Year==years[i]) & (temperature_data$Month==11)) # index to find november days in each year
 november_coldtemp[i,1] = min(temperature_data$TAVG[days]) # computing the minimum of these days
}

Notice how I am using the which command to find values.
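The same yearly minimum can also be computed without an explicit loop. Here is a minimal sketch on a small made-up data frame. The values are hypothetical, not the Central Park record, and only the column names Year, Month and TAVG are borrowed from the lesson's file. It compares the loop with tapply.

```r
# Hypothetical toy data with the same column names as the lesson's file
toy = data.frame(Year  = c(2001, 2001, 2001, 2002, 2002, 2002),
                 Month = c(11, 11, 12, 11, 11, 10),
                 TAVG  = c(35, 28, 20, 41, 33, 50))

# Loop version, following the lesson
yrs = unique(toy$Year)
loop_min = rep(NA, length(yrs))
for (i in 1:length(yrs))
{
 days = which((toy$Year == yrs[i]) & (toy$Month == 11)) # November rows of year i
 loop_min[i] = min(toy$TAVG[days])                      # coldest November day
}

# Vectorized version: keep the November rows, then take the minimum by year
nov = toy[toy$Month == 11, ]
tapply_min = tapply(nov$TAVG, nov$Year, min)

print(loop_min)
print(as.numeric(tapply_min))
```

Both approaches should return the same values. The loop is easier to read when you are starting out, and tapply is handy once the logic is familiar.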

When I plot the data, I notice that there is a long-term trend in the temperature data. In later lessons, we will learn about identifying trends and their causes. For now, let’s take recent data from 1982 – 2016 to avoid the issues that come with the trend.

# Plot the time series #
plot(years, november_coldtemp,type="o")
# There is trend in the data #

# Take a subset of data from recent years to avoid issues with trend (for now)-- # 
# 1982 - 2016
november_recent_coldtemp = november_coldtemp[years >= 1982 & years <= 2016] # select by year instead of hard-coded row numbers
plot(1982:2016,november_recent_coldtemp,type="o")

Geometric Distribution

In lesson 33, we learned that the number of trials to the first success follows a Geometric distribution.

If we consider independent Bernoulli trials of 0s and 1s with some probability of occurrence p and assume X to be a random variable that measures the number of trials it takes to see the first success, then, X is said to be Geometrically distributed.

In our example, the independent Bernoulli trials are years. Each year can have a cold November day (if the lowest November temperature in that year is less than 30F) or not.

The probability of occurrence is the probability of experiencing a cold November day. A simple estimate of this probability can be obtained by counting the number of years that had a temperature < 30F and dividing this number by the total sample size. In our restricted example, we chose 35 years of data (1982 – 2016) in which we see ten years with lowest November temperature less than 30F. You can see them in the following table.

Success (Cold November) can happen in the first year, in which case X will be 1. We can see the success in the second year, in which case the sequence will be 01, and X will be 2 and so on.

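In R, we can compute the probability P(X=1), P(X=2), etc., using the command dgeom. The inputs are the number of failures before the first success (that is, X − 1) and p, which is why the code below evaluates dgeom at 0, 1, 2, … and plots the result against x+1. Try the following lines to create a visual of the distribution.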

######################### GEOMETRIC DISTRIBUTION #########################

# The real data case # 
n = length(november_recent_coldtemp)

cold_years = which(november_recent_coldtemp <= 30)

ncold = length(cold_years)

p = ncold/n

x = 0:n
px = dgeom(x,p)
plot((x+1),px,type="h",xlab="Random Variable X (Cold on kth year)",ylab="Probability P(X=k)",font=2,font.lab=2)
abline(v = (1/p),col="red",lwd=2)
txt1 = paste("probability of cold year (p) = ",round(p,2),sep="")
txt2 = paste("Return period of cold years = E[X] = 1/p ~ 3.5 years",sep="")
text(20,0.2,txt1,col="red",cex=1)
text(20,0.1,txt2,col="red",cex=1)

Notice the geometric decay in the distribution. It can take X years to see the first success (or the next success from the current success). You must have seen that I have a thick red line at 3.5 years. This is the expected value of the geometric distribution. In lesson 34, we learned that the expected value of the geometric distribution is the return period of the event. On average, how many years does it take before we see the cold year again?
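If you want to convince yourself that the expected value is 1/p, a quick simulation helps. This is a sketch under the assumption p = 10/35 (ten cold years out of 35, as estimated above). The variable name waits and the number of simulated values are my own choices.

```r
set.seed(1)            # for reproducibility
p = 10/35              # ten cold years out of 35

# rgeom returns the number of failures before the first success,
# so add 1 to get the year in which the first cold year occurs
waits = rgeom(100000, p) + 1

mean(waits)            # average simulated waiting time
1/p                    # theoretical return period, 3.5 years
```

The simulated average should land close to 3.5 years, and it gets closer as you increase the number of simulated values.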

Did we answer the first two questions?

  1. Do you want to know what the probability of a cold November day is?
  2. Do you want to know what the return period of such an event is?

Suppose we want to compute the probability that the first success will occur within the next five years. We can use the command pgeom for this purpose.

pgeom computes the cumulative probability P(X ≤ 5) = P(X = 1) + P(X = 2) + P(X = 3) + P(X = 4) + P(X = 5). Like dgeom, it counts the failures before the first success, so the call is pgeom(4, p). Try it for yourself and verify that they both match.
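The "within five years" probability can be checked in a few lines. This is a minimal sketch that assumes p = 10/35 from the estimate above. The names p_within_5, p_summed and p_formula are mine, introduced only for illustration. Keep in mind that R's dgeom and pgeom take the number of failures before the first success, which is X − 1.

```r
p = 10/35   # estimate from the lesson

p_within_5 = pgeom(4, p)           # P(X <= 5): at most 4 failures before the first success
p_summed   = sum(dgeom(0:4, p))    # P(X = 1) + ... + P(X = 5)
p_formula  = 1 - (1 - p)^5         # complement of five non-cold years in a row

print(c(p_within_5, p_summed, p_formula))
```

All three numbers should agree. The third line is a useful sanity check because it does not depend on how R parameterizes the distribution.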

Suppose the probability is higher or lower, how do you think the distribution will change?

For this, I created an animation of the geometric distribution with changing values of p. See how the distribution is wider for smaller values of p and steeper for larger values of p. A high value of p (probability of the cold November year) indicates that the event will occur more often; hence fewer trials are needed to see a success. On the other hand, a smaller value for p suggests that the event will occur less frequently. It takes more trials to see the first/next success, creating a wider distribution.

Here is the code for creating the animation. We used similar code last week for animating the binomial distribution.

######## Animation (varying p) #########
# Create png files for Geometric distribution #

png(file="geometric%02d.png", width=600, height=300)

n = 35 # to mimic the sample size for november cold 
x = 0:n

p = 0.1
for (i in 1:5)
{
 px = dgeom(x,p)
 
 plot((x+1),px,type="h",xlab="Random Variable X (First Success on kth trial)",ylab="Probability P(X=k)",font=2,font.lab=2) # x counts failures, so the first success is on trial x+1
 txt = paste("p=",p,sep="")
 text(20,0.04,txt,col="red",cex=2)
 p = p+0.2
}
dev.off()

# Combine the png files saved in the folder into a GIF #

library(magick)

geometric_png1 <- image_read("geometric01.png","x150")
geometric_png2 <- image_read("geometric02.png","x150")
geometric_png3 <- image_read("geometric03.png","x150")
geometric_png4 <- image_read("geometric04.png","x150")
geometric_png5 <- image_read("geometric05.png","x150")

frames <- image_morph(c(geometric_png1, geometric_png2, geometric_png3, geometric_png4, geometric_png5), frames = 15)
animation <- image_animate(frames)

image_write(animation, "geometric.gif")

Negative Binomial Distribution

In lesson 35, we learned that the number of trials it takes to see the second success follows a Negative Binomial distribution. So does the number of trials it takes to see the third success. More generally, the number of trials it takes to see the ‘r’th success follows a Negative Binomial distribution.

We can think of a similar situation where we ask the question: how many years does it take to see the third cold year from the current cold year? It can happen in year 3, year 4, year 5, and so on, following a probability distribution.

You can set this up in R using the following lines of code.

################ Negative Binomial DISTRIBUTION #########################
# Number of combinations "n choose x".
# Base R's choose() does this directly, so no extra package is needed.
comb = function(n, x) {
 return(choose(n, x))
}

# The real data case # 
n = length(november_recent_coldtemp)

cold_years = which(november_recent_coldtemp <= 30)

ncold = length(cold_years)

p = ncold/n

r = 3 # third cold year

x = r:n

px = NA

for (i in r:n)
{
 dum = comb((i-1),(r-1))*p^r*(1-p)^(i-r)
 px = c(px,dum)
}

px = px[2:length(px)]

plot(x,px,type="h",xlab="Random Variable X (Third Cold year on kth trial)",ylab="Probability P(X=k)",font=2,font.lab=2)

There is an inbuilt command in R for Negative Binomial distribution (dnbinom). I chose to write the function myself using the logic of the negative binomial distribution for a change.

The distribution has a mean of r/p = 3/(10/35) = 10.5 years. On average, the third cold year occurs about 10.5 years out.
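To see how the hand-written formula lines up with the inbuilt command, here is a small sketch. It assumes p = 10/35 and r = 3 as in the lesson. The names px_manual, px_builtin and k are mine. Note that dnbinom counts the failures before the r-th success, so it is evaluated at x − r rather than x.

```r
p = 10/35   # estimate from the lesson
r = 3       # third cold year
x = r:35    # year in which the r-th cold year occurs

# Manual formula from the lesson, written with base R's choose()
px_manual = choose(x - 1, r - 1) * p^r * (1 - p)^(x - r)

# Inbuilt command: the first argument is the number of failures, x - r
px_builtin = dnbinom(x - r, size = r, prob = p)

print(all.equal(px_manual, px_builtin))

# The mean over a long support approaches r/p
k = r:2000
sum(k * dnbinom(k - r, size = r, prob = p))
r/p
```

The two probability vectors should match, and the weighted sum should come out very close to r/p.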

If you are comfortable so far, think about the following questions:

What happens to the distribution if you change r?

What is the probability that the third cold year will occur within seven years?

Poisson Distribution

Now let’s address the question: how many such events happened in the last five years?

In lesson 36, Able and Mumble taught us about the Poisson distribution. We now know that counts, i.e., the number of times an event occurs in an interval follows a Poisson distribution. In our example, we are counting events that occur in time, and the interval is five years. Observe the data table and start counting how many events (red color rows) are there in each five-year span starting from 1982.

From 1982 – 1986, there is one event; 1987 – 1991, there are two events; 1992 – 1996, there is one event; 1997 – 2001, there is one event; 2002 – 2006, there are two events; 2007 – 2011, there is one event; 2012 – 2016, there are two events.

These counts (1, 2, 1, 1, 2, 1, 2) follow a Poisson distribution with an average rate of occurrence of 1.43 per five-years.

The probability that X can take any particular value P(X = k) can be computed using the dpois command in R.
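As a preview, the sketch below types the seven counts from the text in by hand and evaluates dpois for k = 0 to 5. The names lambda, pk and pk_formula are mine. The last line compares dpois with the Poisson formula written out directly.

```r
counts = c(1, 2, 1, 1, 2, 1, 2)   # events per five-year span, from the text
lambda = mean(counts)             # average rate per five years

k = 0:5
pk = dpois(k, lambda)             # P(X = k) for k = 0, 1, ..., 5

# The Poisson formula written out: exp(-lambda) * lambda^k / k!
pk_formula = exp(-lambda) * lambda^k / factorial(k)

print(round(lambda, 2))
print(round(pk, 3))
print(all.equal(pk, pk_formula))
```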

Before we create the probability distribution, here are a few tricks to prepare the data.

Data Rearrangement

We have the data in a single vector. If we want to rearrange the data into a matrix form with seven columns of five years each, we can use the array command.

# rearrange the data into columns of 5 years #
data_rearrange = array(november_recent_coldtemp,c(5,7))

This rearrangement will help in computing the number of events for each column.
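It helps to know that array fills the matrix one column at a time. The toy example below uses the made-up vector 1:10 (not the temperature data) arranged into two columns of five.

```r
# Hypothetical example: 10 values arranged into 2 columns of 5
v = 1:10
m = array(v, c(5, 2))
print(m)

# The first column holds the first five values, the second the next five
print(m[, 1])
print(m[, 2])
```

With the temperature vector in chronological order, each column of the rearranged matrix therefore holds one consecutive five-year span.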

Counting the number of events

We can write a for loop to count the number of years with a temperature less than 30F for each column. But, R has a convenient function called apply that will perform this same analysis faster.

The apply command can be used to perform any function on the data row-wise, column-wise or both. The user can define the function.

For example, we can count the number of years with November temperature less than 30F for each column using the following one line code.

# count the number of years in each column with temp <= 30
counts = apply(data_rearrange,2,function(x) length(which(x <= 30)))

The first argument is the data matrix; the second argument “2” indicates that the function has to be applied for the columns (1 for rows); the third argument is the definition of the function. In this case, we are counting the number of values with a temperature less than 30F. This one line code will count the number of events.
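For this particular count there is an even shorter route, because a comparison returns TRUE/FALSE values that R sums as 1s and 0s. The sketch below uses a small hypothetical matrix called temps (made-up numbers, two columns of five years each) and compares apply with colSums.

```r
# Hypothetical temperatures: 2 columns (five-year spans) of 5 years each
temps = array(c(31, 29, 35, 30, 40,
                28, 33, 27, 36, 32), c(5, 2))

counts_apply   = apply(temps, 2, function(x) length(which(x <= 30)))
counts_colsums = colSums(temps <= 30)   # TRUE counts as 1, FALSE as 0

print(counts_apply)
print(counts_colsums)
```

Applied to data_rearrange, either form should give the same seven counts used in the lesson.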

The rate of occurrence is the average of these numbers = 1.43 per five-year period.

We are now ready to plot the distribution for counts assuming they follow a Poisson distribution. Use the following line:

plot(0:5,dpois(0:5,mean(counts)),type="h",xlab="Random Variable X (Cold events in 5 years)",ylab="Probability P(X=k)",font=2,font.lab=2)

You can now tune the knobs and see what happens to the distribution. Remember that the tuning parameter for the Poisson distribution is the rate of occurrence, λ.

I will leave you this week with these thoughts.

If we know the function f(x), we can find out the probability of any possible event from it. If the outcomes are discrete (as we have seen so far), the function is also discrete for every outcome.

What if the outcomes are continuous?

How does the probability distribution function look if the random variable is continuous where the possibilities are infinite?

Like the various types of discrete distributions, are there different types of continuous distributions?

I reminded you at the beginning that the autumn is ending. I am reminding you now that continuous distributions are coming.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 39 – Discrete distributions in R: Part I

It happened again. I got a ticket for parking close to the fire hydrant. Since my first hydrant violation ticket, I have been carrying a measuring tape. I may have miscalculated the curb length this time, or the enforcer’s tape is different than mine. Either way, today’s entry for the Department of Finance’s account is +$115, and my account is -$115.

We can’t win this game. Most hydrants in New York City don’t have painted curbs. It is up to the parking enforcer’s expert judgment — also called our fate. We park and hope that our number does not get picked in the lucky draw.

I want to research the fire hydrant violation tickets in my locality. Since we are learning discrete probability distributions, the violation tickets data can serve as a neat example. New York City Open Data has this information. I will use a subset of this data: parking violations on Broadway in precinct 24 in 2017.

I am also only analyzing those unfortunate souls whose vehicles are not registered in New York and who are parked at least seven feet from the hydrant. No excuse for those who park on the hydrant. Here is a look at the 27 instances under the given criteria.

Today’s lesson includes a journey through Bernoulli trials and Binomial distribution in R and how this example fits the description. Let’s start.

First Steps

Step 1: Get the data
You can download the filtered data file here.

Step 2: Create a new folder on your computer
Let us call this folder “lesson39”. Save the data file in this folder.

Step 3: Create a new code in R
Create a new code for this lesson. “File >> New >> R script”. Save the code in the same folder “lesson39” using the “save” button or by using “Ctrl+S”. Use .R as the extension — “lesson39_code.R”. Now your R code and the data file are in the same folder.

Step 4: Choose your working directory
“lesson39” is the folder where we stored the data file. Use “setwd(“path”)” to set the path to this folder. Execute the line by clicking the “Run” button on the top right.

setwd("path to your folder")

Step 5: Read the data into R workspace
I have the filtered data in the file named “parking_data.txt”. It only contains the date when the ticket was issued. Type the following line in your code and execute it. Use header=TRUE in the command.

# Read the data to the workspace #
parking_violations = read.table("parking_data.txt",header=T)

Bernoulli Trials

There are two possibilities each day, a ticket (event occurred – success) or no ticket (event did not occur – failure). A yes or a no. These events can be represented as a sequence of 0’s and 1’s (0001100101001000) called Bernoulli trials with a probability of occurrence of p. This probability is constant over all the trials, and the trials are assumed to be independent, i.e., the occurrence of one event does not influence the occurrence of the subsequent event.

In R, we can use the command “rbinom” to create as many outcomes as we require, assuming each trial is independent.

The input arguments are the number of observations we want, the number of trials per observation (1 in the case of Bernoulli) and the probability p of observing 1s.

#### Bernoulli Trials ####

# The generalized Case #
p = 0.5 # probability of success -- user defined (using 0.5 here)
rbinom(1,1,p) # create 1 random Bernoulli trial
rbinom(10,1,p) # create 10 random Bernoulli trials
rbinom(100,1,p) # create 100 random Bernoulli trials

For this example, based on our data, there are 27 parking violation tickets issued in 181 days from January 1, 2017, to June 30, 2017.

Let us first create a binary coding 0 and 1 from the data. Use the following lines to convert the day into a 1 or a 0.

# Create a Date Series #
days = seq(from=as.Date('2017/1/1'), to=as.Date('2017/6/30'), by="day")

y = as.numeric(format(days,"%y"))
m = as.numeric(format(days,"%m"))
d = as.numeric(format(days,"%d"))

binary_code = matrix(0,nrow=length(m),ncol=1)

for (i in 1:nrow(parking_violations))
{
 dummy = which(m == parking_violations[i,1] & d == parking_violations[i,2])
 
 binary_code[dummy,1] = 1
}

plot(days,binary_code,type="h",font=2,font.lab=2,xlab="Days",ylab="(0,1)")

Assuming each day is a trial where the outcome can be getting a ticket or not getting a ticket, we can estimate the probability of getting a ticket on any day as 27/181 = 0.15.

So, with p = 0.15, we can simulate 181 outcomes (0 if the ticket is not issued or 1 if the ticket is issued) using the rbinom command. An example sequence:

# For the example case #
n = 1 # it is the number of trials 
p = 0.15 # probability of the event 
nobs = 181

rbinom(nobs,n,p)

0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0
0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0
1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 
0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 
0 0 0 0 0 0

plot(rbinom(nobs,n,p),type="h",font=2,font.lab=2,xlab="Days",ylab="(0,1)")

If you simulate such sequences multiple times, on average, you will see about 27 ones.

Run this command multiple times and see how the plot changes. Each run is a new simulation of possible tickets during the 181 days. It occurs randomly, with a probability of 0.15.
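To check the "about 27 ones on average" claim without rerunning the line by hand, you can let R repeat the simulation. This is a sketch with p = 0.15 and nobs = 181 as above. The number of repetitions (5000) and the name ones are my own choices.

```r
set.seed(39)                  # for reproducibility
p = 0.15
nobs = 181

# Each replicate is one simulated 181-day sequence; count its ones
ones = replicate(5000, sum(rbinom(nobs, 1, p)))

mean(ones)                    # average number of tickets across simulations
nobs * p                      # 27.15
range(ones)                   # individual sequences vary around that value
```

Any single sequence can have noticeably more or fewer than 27 ones. It is the average over many sequences that settles near nobs * p.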

Slightly advanced plotting using animation and GIF (for those more familiar with R)

In R, you can create animation and GIF based on these simulations. For example, I want to run the “rbinom” command five times and visualize the changing plots as an animation.

We first save the plots as “.png” files and then combine them into a GIF. You will need to install the “animation” and “magick” packages for this. Try the following lines to create a GIF for the changing Bernoulli plot.

######## Animation #########
# Create png files for Bernoulli sequence #
library(animation)

png(file="bernoulli%02d.png", width=600, height=300)
for (i in 1:5)
 {
 plot(rbinom(nobs,n,p),type="h",font=2,font.lab=2,xlab="Days",ylab="(0,1)")
 }
dev.off()

# Combine the png files saved in the folder into a GIF #

library(magick)

bernoulli_png1 <- image_read("bernoulli01.png","x150")
bernoulli_png2 <- image_read("bernoulli02.png","x150")
bernoulli_png3 <- image_read("bernoulli03.png","x150")
bernoulli_png4 <- image_read("bernoulli04.png","x150")
bernoulli_png5 <- image_read("bernoulli05.png","x150")

frames <- image_morph(c(bernoulli_png1, bernoulli_png2, bernoulli_png3, bernoulli_png4, bernoulli_png5), frames = 10)
animation <- image_animate(frames)

image_write(animation, "bernoulli.gif")

This is how the final GIF looks. Each frame is a simulation from the Bernoulli distribution. Depending on what version of R you are using, there may be more straightforward functions to create a GIF.

Binomial Distribution

From the above, if you are interested in the number of successes (1s) in n trials, this number follows a Binomial distribution. If you assume X is a random variable that represents the number of successes in a Bernoulli sequence of n trials, then this X should follow a binomial distribution.

The number of trials is n = 181. The number of successes (getting a ticket) can be anywhere between 0 (if there are no tickets in all the 181 days) and 181 (if there is a ticket issued every day).

In R, the probability that this random variable X takes any value k (between 0 and 181), i.e., the probability of exactly k successes in n trials is computed using the command “dbinom.”

For computing the probability of exactly ten tickets in 181 days we can input:

px = dbinom(10,181,p)

For computing the probability of exactly 20 tickets in 181 days we can input:

px = dbinom(20,181,p)

To compute this probability for all possible k’s and visualize the probability distribution, we can use the following lines:

n = 181 # define the number of trials 
p = 0.15 # probability of the event 
x = 0:181 # number of successes varying from 0 to 181

px = dbinom(x,n,p)

plot(x,px,type="h",xlab="Random Variable X (Number of tickets in 181 days)",ylab="Probability P(X=k)",font=2,font.lab=2)

Do you know why the probability distribution is centered on 27? What is the expected value of a Binomial distribution?
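One way to explore this question numerically is to compute the expected value straight from its definition, the sum of k times P(X = k), and compare it with n*p. The sketch below reuses n = 181 and p = 0.15 from the lesson.

```r
n = 181
p = 0.15
x = 0:n
px = dbinom(x, n, p)

sum(px)          # the probabilities add up to 1
sum(x * px)      # expected value from the definition
n * p            # the shortcut E[X] = n*p = 27.15
```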

If we want to compute the probability of getting more than five tickets in one month (30 days), we first calculate the probability for k = 6 to 30 (i.e., for exactly 6 tickets in 30 days to exactly 30 tickets in 30 days) with n = 30 to represent 30 trials.

n = 30 # define the number of trials 
p = 0.15 # probability of the event 
x = 6:30 # number of successes varying from 6 to 30

px = dbinom(x,n,p)
sum(px)

We add all these probabilities since
P(More than 5 in 30) = P(6 in 30) or P(7 in 30 ) or P(8 in 30) …

P(X > 5 in 30) = P(X = 6 in 30) + P(X=7 in 30) + ... + P(30 in 30).
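The same tail probability can be cross-checked with the cumulative command pbinom, which returns P(X ≤ k). This is a sketch with n = 30 and p = 0.15 as above. The names p_sum, p_tail and p_upper are mine.

```r
n = 30
p = 0.15

p_sum   = sum(dbinom(6:30, n, p))               # add P(X = 6) ... P(X = 30)
p_tail  = 1 - pbinom(5, n, p)                   # complement of P(X <= 5)
p_upper = pbinom(5, n, p, lower.tail = FALSE)   # P(X > 5) requested directly

print(c(p_sum, p_tail, p_upper))
```

All three should agree. The pbinom versions save you from listing every value of k.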

Slightly advanced plotting using animation and GIF (for those more familiar with R)

I want to experiment with different values of p and check how the probability distribution changes.

I can use the animation and GIF trick to create this visualization.

Run the following lines in your code and see for yourself.

######## Animation #########
# Create png files for Binomial distribution #

png(file="binomial%02d.png", width=600, height=300)

n = 181
x = 0:181

p = 0.1
for (i in 1:5)
{
 px = dbinom(x,n,p)
 
 plot(x,px,type="h",xlab="Random Variable X (Number of tickets in 181 days)",ylab="Probability P(X=k)",font=2,font.lab=2)
 txt = paste("p=",p,sep="")
 text(150,0.04,txt,col="red",cex=2)
 p = p+0.2
}
dev.off()

# Combine the png files saved in the folder into a GIF #

library(magick)

binomial_png1 <- image_read("binomial01.png","x150")
binomial_png2 <- image_read("binomial02.png","x150")
binomial_png3 <- image_read("binomial03.png","x150")
binomial_png4 <- image_read("binomial04.png","x150")
binomial_png5 <- image_read("binomial05.png","x150")

frames <- image_morph(c(binomial_png1, binomial_png2, binomial_png3, binomial_png4, binomial_png5), frames = 15)
animation <- image_animate(frames)

image_write(animation, "binomial.gif")

You can also try changing the values for n and animate those plots.

We will return next week with more R games for the Geometric distribution, Negative Binomial distribution, and the Poisson distribution.

Meanwhile, did you know that RocketMan owes NYC $156,000 for unpaid parking tickets?


Lesson 29 – Large number games in R

“it cannot escape anyone that for judging in this way about any event at all, it is not enough to use one or two trials, but rather a great number of trials is required. And sometimes the stupidest man — by some instinct of nature per se and by no previous instruction (this is truly amazing) — knows for sure that the more observations of this sort that are taken, the less danger will be of straying from the mark.”

wrote James Bernoulli, a Swiss mathematician in his thesis Ars Conjectandi in 1713.

Suppose we have an urn with an unknown number of white and black pebbles. Can we take a sample of 10 pebbles, count the number of white and black ones, and deduce the proportion in the container? If 10 is too small a sample, how about 20? How small is too small? James Bernoulli used this illustration to understand, through observations, the likelihood of diseases in the human body.

How many sample observations do we need before we can confidently say something is true?

In lesson 6, and lesson 23, we have seen that the probability of an event is its long run relative frequency. How many observations should we get before we know the true probability?

Do data from different probability distribution functions take the same amount of sample observations to see this convergence to the truth?

Do we even know if what we observe in the long run is the actual truth? Remember black swans did not exist till they did.

We will let the mathematicians and probability theorists wrestle with these questions. Today, I will show you some simple tricks in R to simulate large number games. We will use three examples:

tossing a coin,

rolling a die, and

drawing pebbles from the urn.

All the code in R is here. Let’s get to the nuts and bolts of our toy models.

Coin Toss

If you toss a coin, the probability of getting a head or a tail is 0.5. The two outcomes are equally likely, and successive tosses are independent. By now you should know that it is 0.5 in the long run. In other words, if you toss a coin many times and count the number of times you get heads and the number of times you get tails, the probability of heads, i.e., the number of heads/N ~ 0.5.

Let us create this experiment in R. Since there is 0.5 probability of getting head or tail; we can use the random number generator to simulate the outcomes. Imagine we have a ruler on the scale of 0 to 1, and we simulate a random number with equal probability, i.e., any number between 0 and 1 is equally possible. If the number is less than 0.5, we assume heads, if it is greater than 0.5, we assume tails.

A random number between 0 and 1 with equal probability can be generated using the runif command in R.

# Generate 1 uniform random number 
r = runif(1)

[1] 0.3627271

# Generate 10 uniform random numbers 
r = runif(10)

 [1] 0.7821440 0.8344471 0.3977171 0.3109202 0.8300459 0.7474802 0.8777750
 [8] 0.7528353 0.9098839 0.3731529

# Generate 100 uniform random numbers 
r = runif(100)

The steps for this coin toss experiment are:
1) Generate a uniform random number r between 0 – 1
2) If r < 0.5 choose H
3) If r ≥ 0.5, choose T
4) Repeat the experiment N times → N coin tosses.

Here is the code.

# coin toss experiment 
N = 100

r = runif(N)

x = matrix(NA,nrow = N,ncol = 1)
for (i in 1:N)
{
 if(r[i] < 0.5 )
 { x[i,1] = "H" } else 
 { x[i,1] = "T" }
}

# Probability of Head 
P_H = length(which(x=="H"))/N
print(P_H)

# Probability of Tail 
P_T = length(which(x=="T"))/N
print(P_T)

We first generate 100 random numbers with equal probability, look through each number, and assign H or T based on whether the number is less than 0.5 or not, and then count the number of heads and tails. Some of you remember the which and length commands from lesson 5 and lesson 8.
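The loop can also be replaced by a single vectorized test. The sketch below is one way to do it, using ifelse. The names x_loop and x_vec are mine, and set.seed is there only so the two versions work on the same random numbers.

```r
set.seed(29)                       # for reproducibility
N = 100
r = runif(N)

# Loop version, following the lesson
x_loop = rep(NA, N)
for (i in 1:N)
{
 if (r[i] < 0.5) { x_loop[i] = "H" } else { x_loop[i] = "T" }
}

# Vectorized version: ifelse tests every element of r at once
x_vec = ifelse(r < 0.5, "H", "T")

identical(x_loop, x_vec)           # both give the same sequence
mean(x_vec == "H")                 # proportion of heads
```

mean(x_vec == "H") works because TRUE counts as 1 and FALSE as 0, so the mean is the fraction of heads.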

The beauty of using R is that it offers shortcuts for lengthy procedures. What we did using the for loop above can be much easier using the sample command. Here is how we do that.

We first create a vector with H and T names (outcomes of a toss) to select from.

# create a vector with H and T named toss to select from 
toss = c("H","T")

Next, we can use the sample command to draw a value from the vector. This function is used to randomly sample a value from the vector (H T), just like tossing.

# use the command sample to sample 1 value from toss with replacement 
sample(toss,1,replace=T) # toss a coin 1 time

You can also use it to sample/toss multiple times, but be sure to use replace = TRUE in the function. That way, the function replicates the process of drawing a value, putting it back and drawing again, so both outcomes remain available for every draw.

# sample 100 times will toss a coin 100 times. use replace = T 
sample(toss,100,replace=T) # toss a coin 100 times

 [1] "H" "H" "H" "H" "T" "T" "H" "T" "T" "H" "T" "H" "H" "H" "T" "T" "H" "H"
 [19] "T" "H" "H" "H" "H" "T" "T" "T" "T" "H" "H" "H" "T" "H" "H" "H" "T" "H"
 [37] "H" "T" "T" "T" "T" "T" "H" "T" "T" "H" "T" "H" "H" "T" "H" "H" "T" "H"
 [55] "H" "T" "H" "T" "H" "H" "H" "T" "T" "T" "T" "H" "H" "H" "H" "T" "H" "T"
 [73] "T" "H" "H" "T" "H" "H" "T" "T" "T" "T" "H" "T" "H" "T" "T" "H" "T" "H"
 [91] "H" "T" "T" "H" "T" "H" "T" "H" "H" "T"

To summarize the number of heads and tails, you can use the table command. It is a bean counter. You should expect close to an even split between H and T from a large number of tosses; with 100 tosses, that is roughly 50 each.

# to summarize the values use the table command.
x = sample(toss,100,replace=T) # toss a coin 100 times
table(x) # summarize the values in x

 H  T 
51 49

A large number of simulations

Set up an experiment with N = 5, 15, 25, 35, 45, … , 19995 (steps of 10, stopping just short of 20000) and record the probability of obtaining a head in each case. Type the following lines in your code and execute. See what you get.

# Large Numbers Experiment 
# Tossing Coin

toss = c("H","T")

number_toss = seq(from = 5, to = 20000,by=10) # increasing number of tosses

P_H = matrix(NA,nrow=length(number_toss),ncol=1) # create an empty matrix to fill probability each time

P_T = matrix(NA,nrow=length(number_toss),ncol=1) # create an empty matrix to fill probability each time

for (i in 1:length(number_toss))
{
 x = sample(toss,number_toss[i],replace=T)
 P_H[i,1]=length(which(x=="H"))/number_toss[i]
 P_T[i,1]=length(which(x=="T"))/number_toss[i] 
}

plot(number_toss,P_H,xlab="Number of Tosses",ylab="Probability of Head",type="l",font=2,font.lab=2)
abline(h=0.5,col="red",lwd=1)

Notice, for a small number of coin tosses, the probability of head has high variability (not stable). But as the number of coin tosses increases, the probability converges and stabilizes at 0.5.
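Another way to watch the convergence is to follow a single long sequence of tosses and track the proportion of heads as it grows. This is a sketch of that variation, not the experiment above. The name running_P_H is mine.

```r
set.seed(29)                                   # for reproducibility
toss = c("H","T")

x = sample(toss, 20000, replace = TRUE)        # one long sequence of tosses
running_P_H = cumsum(x == "H") / seq_along(x)  # proportion of heads so far

plot(running_P_H, type = "l", xlab = "Number of Tosses",
     ylab = "Running Proportion of Heads", font = 2, font.lab = 2)
abline(h = 0.5, col = "red", lwd = 1)

running_P_H[c(10, 100, 1000, 20000)]           # snapshots along the way
```

Here every toss is reused as the sequence grows, whereas the experiment above draws a fresh sample for each value of N. Both should settle near 0.5.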

Dice

You should be able to write the code for this using the same logic. Try it. Create a vector [1, 2, 3, 4, 5, 6] and sample from this vector with replacement.

No cheating by looking at my code right away.

Here is the large number experiment on this. Did you see the convergence to 1/6?
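If you have tried it and want something to compare against, here is one possible version that follows the coin toss code. It is a sketch, not necessarily the code behind the plot. Tracking the face 6 and the names number_rolls and P_six are my own choices.

```r
set.seed(29)                       # for reproducibility
dice = 1:6                         # the six faces

number_rolls = seq(from = 5, to = 20000, by = 10)  # increasing number of rolls
P_six = rep(NA, length(number_rolls))

for (i in 1:length(number_rolls))
{
 x = sample(dice, number_rolls[i], replace = TRUE)
 P_six[i] = mean(x == 6)           # relative frequency of a six
}

plot(number_rolls, P_six, type = "l", xlab = "Number of Rolls",
     ylab = "Probability of a Six", font = 2, font.lab = 2)
abline(h = 1/6, col = "red", lwd = 1)
```

You can swap the 6 for any other face and the line should still settle near 1/6.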

Pebbles

Let us first create a vector of size 10 (an urn with pebbles) with six white and four black pebbles, assuming the true ratio of white and black is 3/2. Run the sampling experiment on this vector by drawing more and more samples and estimate the probability of white pebble in the long run.

# Show that the probability of getting the color stones approaches the true probability
# Pebbles

stone = c("W","W","W","W","W","W","B","B","B","B")

number = seq(from = 5, to = 20000,by=10) # increasing number of draws

P_W = matrix(NA,nrow=length(number),ncol=1) # create an empty matrix to fill probability each time
P_B = matrix(NA,nrow=length(number),ncol=1) # create an empty matrix to fill probability each time

for (i in 1:length(number))
{
 x = sample(stone,number[i],replace=T)
 P_W[i,1]=length(which(x=="W"))/number[i]
 P_B[i,1]=length(which(x=="B"))/number[i]
}

plot(number,P_W,type="l",font=2,font.lab=2,xlab="Number of Draws with Replacement",ylim=c(0,1),ylab="Probability of a White Pebble")
abline(h=(0.6),col="red")

If there were truly three white to two black pebbles in the world, an infinite (very very large) experiment/simulation would reveal a convergence at this number. This large number experiment is our quest for the truth, a process to surmise randomness.

You can now go and look at the full code and have fun programming it yourself. I will close off with this quote from James Bernoulli’s Ars Conjectandi.

“Whence, finally, this one thing seems to follow: that if observations of all events were to be continued throughout all eternity, (and hence the ultimate probability would tend toward perfect certainty), everything in the world would be perceived to happen in fixed ratios and according to a constant law of alternation, so that even in the most accidental and fortuitous occurrences we would be bound to recognize, as it were, a certain necessity and, so to speak, a certain fate.”

I continue to wonder what that “certain fate” is for me.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 21 – Beginners guide to summarize data in R

We went nine lessons without R. During this time we had lessons that taught us how to visualize data using dotplots, boxplots, and frequency plots (lesson 14 and lesson 15). We had lessons on summarizing the data using average (lesson 16), standard deviation (lesson 17), skewness (lesson 18), and outliers (lesson 19). We also had a lesson on comparing datasets using coefficient of variation (lesson 20). Don’t you think it is time to put these things into practice and know how to use R commands to explore data?

Let’s get going. Today’s victim is the natural gas consumption data in New York City available at the zip code level. The Mayor’s office of long-term planning and sustainability provides this data.

Usual Chores

Step 1: Get the data
You can download the data file from here.

Step 2: Create a new folder on your computer
Let us call this folder “lesson21”. The downloaded data file “Natural_Gas_Consumption_by_ZIP_Code_-_2010.csv” will be saved in this folder.

Step 3: Create a new code in R
Create a new code for this lesson. “File >> New >> R script”. Save the code in the same folder “lesson21” using the “save” button or by using “Ctrl+S”. Use .R as the extension — “lesson21_code.R”. Now your R code and the data file are in the same folder.

Step 4: Choose your working directory
“lesson21” is the folder where we stored the data file. Use “setwd(“path”)” to set the path to this folder. Execute the line by clicking the “Run” button on the top right.

setwd("path to your folder")

Step 5: Read the data into R workspace
Since we have a comma separated values file (.csv), we can use the “read.csv” command to read the data file into R workspace. Type the following line in your code and execute it. Use header=TRUE in the command.

# Read the datafile #
naturalgas_data = read.csv("Natural_Gas_Consumption_by_ZIP_Code_-_2010.csv",header=TRUE)

Step 6: Let us begin the exploration
I realized that the data is classified into various building types → commercial, industrial, residential, institutional, etc. Let us extract three building types for this lesson. You can use the “which” command to do this filtering.

# extract subset of the data #
index1 = which(naturalgas_data$Building.type..service.class == "Large residential")

index2 = which(naturalgas_data$Building.type..service.class == "Commercial")

index3 = which(naturalgas_data$Building.type..service.class == "Institutional")

Residential = naturalgas_data$Consumption..GJ.[index1]
Commercial = naturalgas_data$Consumption..GJ.[index2]
Institutional = naturalgas_data$Consumption..GJ.[index3]

The “which” command will identify all the rows that belong to a class. We then extract these rows from the original data.
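To see what “which” returns, here is a small sketch on a made-up table. The column names are copied from the lesson's data, but the four rows are toy values of my own, not the real consumption numbers.

```r
# toy table with the same column names as the natural gas data (values are made up)
toy = data.frame(
 Building.type..service.class = c("Commercial","Large residential","Institutional","Large residential"),
 Consumption..GJ. = c(120, 300, 80, NA)
)

# which() returns the row numbers where the condition is TRUE
index1 = which(toy$Building.type..service.class == "Large residential")
print(index1)

# use the row numbers to pull out the consumption values
by_which = toy$Consumption..GJ.[index1]

# the TRUE/FALSE condition can also be used directly, without which()
by_logical = toy$Consumption..GJ.[toy$Building.type..service.class == "Large residential"]

print(by_which)
print(by_logical)
```

Both routes give the same values here. Notice that the missing value (NA) comes along for the ride, which is why na.rm shows up later in the lesson.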

Now we have all large residential building consumption data under the “Residential”, all commercial building consumption data under the “Commercial” and all institutional building consumption data under the “Institutional”.

Let us look at Large residential consumption data first and then compare it to the others.

Visualize the data using dot plot

As a first step, we can order the data from smallest to largest and place the numbers as points on the line. This dot plot provides a good visual perspective on the range of the data. You can use the following command to do this.

stripchart(Residential,font=2,pch=21,cex=0.5,xlab="Consumption in GJ",font.lab=2)

You can change the values for the font to make it bold or light, pch for different shapes of the dots, and cex for changing the size of the dots. You can have customized labels using xlab.

Visualize the data using boxplot

With boxplot, we can get a nice visual of the data range and its percentiles along with detecting the outliers in the data.

boxplot(Residential, horizontal=TRUE, col="grey",add=T)

You can pick a color of your choice, and make it horizontal or vertical. I usually prefer the box to be horizontal as it aligns with the number line. You must have noticed the add=T at the end. This operation will tell R to add the boxplot on top of the dot plot. Did you see how the dots and the box aligned?

The box is the region with 50 percent of the data. The vertical line in the box is the 50th percentile (the middle data point, or median). Whiskers extend from the box toward the higher and lower percentiles, reaching at most 1.5 times the length of the box. The length of the box is called the interquartile range (75th percentile minus 25th percentile). Points that cannot be reached using the whiskers are outliers.
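Here is a rough sketch of the outlier rule on ten made-up numbers. I am using quantile and IQR to build the fences by hand; boxplot itself works from hinges, which can differ slightly from these quartiles, so treat the hand calculation as an approximation of what R draws.

```r
# ten made-up numbers with one extreme value
x = c(1,2,3,4,5,6,7,8,9,50)

q = unname(quantile(x, c(0.25, 0.75)))  # 25th and 75th percentiles
box_length = IQR(x)                     # interquartile range

lower_fence = q[1] - 1.5*box_length
upper_fence = q[2] + 1.5*box_length

# points beyond the fences cannot be reached by the whiskers
outliers = x[x < lower_fence | x > upper_fence]
print(outliers)

# what R itself flags as outliers for the same numbers
print(boxplot.stats(x)$out)
```

For these numbers both approaches flag only the value 50.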

Visualize the data using frequency plot

For making a frequency plot, we should first partition the data into groups or bins, count the number of data points that fall in each bin and stack building blocks for each bin. All this is done in R using the command:

# histogram #
hist(Residential,font=2,font.lab=2,xlab="Consumption in GJ",col="grey")

Again, you can work with font, xlab, etc. to beautify the plot.

Did you see that the data is not symmetric? There are extreme values on the right creating a positive skew. The average and 50th percentile (median) are not the same.

Summarizing the data using summary statistics

We can summarize the data using the summary statistics → average, standard deviation, and skewness. We can also compute the percentiles.

Average or mean is a measure of the central tendency (balance point). You make a pass through the numbers and add them up, then divide this total by the number of data points. In R, the command for this is:

# average #
mean(Residential,na.rm=T)
[1] 684942.3

Notice that I added an argument na.rm = T since I have some missing values in the data. na.rm is instructing R to remove the “NA” and then compute the average.

You must be thinking about the fact that mean is sensitive to outliers. Yes, we have seen that the data is skewed, so perhaps computing the middle value (median) is necessary. The median is not sensitive to outliers. 50 percent of the data points are less than this number. The command for this in R is:

# median #
median(Residential,na.rm=T)
[1] 414505

While the average provides a measure of the center of the data, variance or standard deviation provides a measure of how spread out the data is from the center. Compute the deviation of each number from the average. Square these deviations. Get the average of all these squared deviations. In R you can do this using the command:

# variance and standard deviation #
var(Residential,na.rm=T)
[1] 707890274303

sd(Residential,na.rm=T)
[1] 841362.2

As you know, the standard deviation is in the same units (gigajoules) as the average.
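One detail worth checking for yourself: the var and sd commands in R divide the sum of squared deviations by n - 1 (the sample version), not by n. A quick sketch on eight made-up numbers shows the difference.

```r
# eight made-up numbers with an average of 5
x = c(2,4,4,4,5,5,7,9)
n = length(x)

dev_sq = (x - mean(x))^2   # squared deviations from the average

avg_sq_dev = sum(dev_sq)/n        # dividing by n
sample_var = sum(dev_sq)/(n-1)    # dividing by n - 1

print(avg_sq_dev)
print(sample_var)
print(var(x))        # matches the n - 1 version
print(sd(x))         # square root of var(x)
```

With thousands of zip codes in the data, the two versions are nearly identical; with a handful of points, they are not.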

The third measure, the skewness can be computed as follows:

# skewness #
library(moments)
skewness(Residential,na.rm=T)
[1] 1.990704

The function skewness is part of a package called “moments”. If you do not have this library, you should first install it using:

install.packages("moments")

After installing, you can call this library using:

library(moments)

The skewness for this data is 1.99. Very high as you saw in the histogram. If the data is symmetric, i.e. evenly spread out around the center, the skewness will be 0.
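If you would like to see what is under the hood, here is a minimal sketch that computes skewness by hand as the average cubed deviation divided by the average squared deviation raised to the power 3/2. As far as I know this moment-based ratio is what the moments package reports, but take that as my assumption; skew_by_hand is a name I made up.

```r
# skewness from the second and third moments about the mean
skew_by_hand = function(x)
{
 x = x[!is.na(x)]               # drop missing values
 m2 = mean((x - mean(x))^2)     # average squared deviation
 m3 = mean((x - mean(x))^3)     # average cubed deviation

 return(m3 / m2^(3/2))
}

# symmetric numbers
print(skew_by_hand(c(1,2,3,4,5)))

# one extreme value on the right
print(skew_by_hand(c(1,2,3,4,50)))
```

The symmetric set gives zero, and the set with an extreme value on the right gives a positive number.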

The “quantile” command will give us the 0th, 25th, 50th, 75th and the 100th percentiles. We can also use the “summary” command to get an overview of the data.

# quantiles and summary #
quantile(Residential,na.rm=T)

     0%     25%     50%     75%    100% 
     30   65247  414505  998733 4098510 

summary(Residential)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
     30   65250  414500  684900  998700 4099000       1 

Comparing the data using plots and coefficient of variation

An easy way to compare the data is to align them on the same scale and visually inspect for differences. In other words, you can plot two or more boxplots on the same scale like this:

# comparing the data # 
all_data = cbind(Residential,Commercial, Institutional)
boxplot(all_data,horizontal=T)

You must have noticed that I am first combining the data vectors into a new object through the “cbind” command. cbind will bind the data as columns (cbind is short for column bind). We can now plot this new object with three columns.
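One thing to watch, assuming the three vectors have different lengths (likely, since each building type covers a different number of zip codes): cbind repeats the values of the shorter vectors to fill the columns. A sketch on made-up numbers shows this, along with an alternative where boxplot is given a list so that each group keeps its own length. The names with _toy and the list groups are mine.

```r
# three made-up vectors of different lengths
Residential_toy = c(10, 20, 30, 40, 500)
Commercial_toy = c(5, 15, 25)
Institutional_toy = c(8, 12)

# cbind fills the shorter columns by repeating their values (with a warning)
bound = suppressWarnings(cbind(Residential_toy, Commercial_toy, Institutional_toy))
print(bound)

# a list keeps each group at its own length
groups = list(Residential=Residential_toy, Commercial=Commercial_toy, Institutional=Institutional_toy)
b = boxplot(groups, horizontal=TRUE)

# number of data points used for each box
print(b$n)
```

If the lengths do match, both routes draw the same picture.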

There are clear differences in the data. Large residential buildings have more energy consumption. Not surprising as there are more residential buildings than commercial or institutional.

Last week, we went over the coefficient of variation. Can you now tell me which consumption data (residential, commercial or institutional) has more variation compared to its mean?

Did you notice that the commercial and institutional data have more variation compared to residential? Do you know why?
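If you want to check this in R, a small helper can do the division for you. This is a sketch: cv is a name I made up, and the two toy vectors are only there to show that the same average can come with very different relative spread.

```r
# coefficient of variation: standard deviation relative to the mean
cv = function(x)
{
 return(sd(x, na.rm=TRUE) / mean(x, na.rm=TRUE))
}

# two made-up sets with the same average of 100
toy_a = c(90, 100, 110)
toy_b = c(10, 100, 190)

print(cv(toy_a))
print(cv(toy_b))
```

You can then try cv(Residential), cv(Commercial) and cv(Institutional) on the lesson's data and compare the three numbers.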

While you are at that, also think about why using graphical methods and summary statistics to summarize the data is an important feature for statistical modeling.

Could it be that there is a universal behavior and we are using a sample of that population?

Can a sample represent the population?


Lesson 11 – Fran, the functionin’ R-bot

Hello, my name is Fran. I am a function in R.

 

 Hello, my name is D. I am a functionin’ person.

 

I can perform any task you give me.

 

Interesting, can you add two numbers?

 

Yes, I can.

 

Can you tell me more about how you work?

 

Sure, I need the inputs and the instruction, i.e. what you want me to do with the inputs.

Okay. I am giving you two numbers, 10 and 15. Can you show me how you will create a function to add them?

 

This is easy. Let me first give you the structure.

# structure of a function #
functionname = function(inputs)
{
 instructions
 
 return(output)
}

You can select these lines and hit the run button to load the function in R. Once you execute these lines, the function will be loaded in R, and you can use this name with any inputs.

Let us say the two numbers are a and b. These numbers are provided as inputs. I will first assign a name to the function — “add”. Since you are asking me to add two numbers, the instruction will be y = a + b, and I will return the value of y.

Here is a short video showing how to create a function to add two numbers a and b. You can try it in your RStudio program.
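In text form, the function Fran describes would look something like this, following the structure above with “add” as the name, a and b as the inputs and y as the output.

```r
# function to add two numbers #
add = function(a, b)
{
 y = a + b # the instruction

 return(y)
}

# test the function #
add(10, 15)
```

Running the last line with 10 and 15 returns 25.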

 Neat. If I give you three numbers, m, x, and b, can you write a function for mx + b?

 

Yes. I believe you are asking me to write a function for a straight line: y = mx + b. I will assign “line_eq” as the name of the function, the inputs will be m, x, and b and the output will be y.

# function for line equation #
line_eq = function (m, x, b)
{
 # m is the slope of the line 
 # b is the intercept of the line
 # x is the point on the x-axis
 
 y = m*x + b # equation for the line 
 
 return(y) 
}

# test the function #
line_eq(0.5, 5, 10)
> 12.5

Can you perform more than one task? For example, if I ask you for y = mx + b and x + y, can you return both the values?

 

Yes, I can. I will have two instructions. In the end, I will combine both the outputs into one vector and return the values. Here is how I do it.

# function for line equation + (x + y) #
two_tasks = function (m, x, b)
{
 # m is the slope of the line 
 # b is the intercept of the line
 # x is the point on the x-axis
 
 y = m*x + b # equation for the line 
 
 z = x + y
 
 return(c(y,z)) 
}

# test the function #
two_tasks(0.5, 5, 10)
> 12.5 17.5

Very impressive. What if some of the inputs are numbers and some of them are a set of numbers? For instance, if I give you many points on the x-axis, m and b, the slope and the intercept, can you give me the values for y?

 

No problemo. The same line_eq function will work. Let us say you give me some numbers x = [1, 2, 3, 4, 5], m = 0.5 and b = 10. I will use the same function line_eq(m, x, b).

# use on vectors #
x = c(1,2,3,4,5)
m = 0.5
b = 10

line_eq(m,x,b)
> 10.5 11.0 11.5 12.0 12.5

I am beginning to like you. But, maybe you are fooling me with simple tricks. I don’t need a robot for doing simple math.

 

Hey, my name is Fran 😡

 

Okay Fran. Prove to me that you can do more complicated things.

 

Bring it on.

 

 It is springtime, and I’d love to get a Citi bike and ride around the city. I want you to tell me how many people rented the bike at the most popular route, the Central Park Southern Loop and the average trip time.

 

aargh… your obsession with the city. Give me the data.

 

Here you go. You can use the March 2017 file. They have data for the trip duration in seconds, check out time and check in time, start station and end station.

Alright. I will name the function “bike_analysis.” The inputs will be the data for the bike ridership for a month, and the name of the station. The function will identify how many people rented the bikes at the Central Park S station and returned it back to the same station — completing the loop. You asked me for total rides and the average trip time. I threw in the maximum and minimum ride time too. You can use this function with data from any month and at any station.

# function to analyze bike data # 
bike_analysis = function(bike_data,station_name)
{
 dum = which(bike_data$Start.Station.Name == station_name & bike_data$End.Station.Name == station_name)
 total_rides = length(dum)
 
 average_time = mean(bike_data$Trip.Duration[dum])/60 # in minutes 
 max_time = max(bike_data$Trip.Duration[dum])/60 # in minutes 
 min_time = min(bike_data$Trip.Duration[dum])/60 # in minutes
 
 output = c(total_rides,average_time,max_time,min_time)
 return(output)
}

# use the function to analyze Central Park South Loop #

# bike data # 
bike_data = read.csv("201703-citibike-tripdata.csv",header=T)

station_name = "Central Park S & 6 Ave"

bike_analysis(bike_data,station_name)
> 212.000000  42.711085 403.000000   1.066667

212 trips, 42 minutes of average trip time. The maximum trip time is 403 minutes and the minimum trip time is ~ 1 minute. Change of mind?
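A possible refinement, sketched here under my own assumptions: label the four outputs so you do not have to remember their order, and return something sensible when no trip matches the station. The name bike_summary and the three toy trips are made up; only the column names come from the function above.

```r
# variant of bike_analysis with labeled outputs and a guard for no matching trips
bike_summary = function(bike_data, station_name)
{
 loop = bike_data$Start.Station.Name == station_name &
        bike_data$End.Station.Name == station_name

 minutes = bike_data$Trip.Duration[loop]/60 # trip time in minutes

 if (length(minutes) == 0)
 {
  return(c(total_rides=0, average_time=NA, max_time=NA, min_time=NA))
 }

 output = c(total_rides=length(minutes), average_time=mean(minutes),
            max_time=max(minutes), min_time=min(minutes))
 return(output)
}

# three made-up trips (durations in seconds)
toy_trips = data.frame(
 Trip.Duration = c(600, 1800, 300),
 Start.Station.Name = c("A St", "A St", "B St"),
 End.Station.Name = c("A St", "A St", "A St")
)

print(bike_summary(toy_trips, "A St"))
print(bike_summary(toy_trips, "C St"))
```

For the toy trips, station "A St" has two loop rides of 10 and 30 minutes, and station "C St" has none.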

Wow. You are truly helpful. I would have spent a lot of time if I were to do this manually. I can use your brains and spend my holiday weekend riding the bike.

 

Have fun … and Happy Easter.

 

How did you know that?

 

Machine Learning man 😉

 


Lesson 8 – The search ‘for’ w’if’i

As I was taking my usual Broadway stroll this morning, I noticed some people immersed in their smartphones, texting or browsing while walking. Guilty as charged. It reminded me of the blog post a few years ago about my memory down the phone lane.

I started paying attention to my surroundings and noticed this kiosk, a free wifi hotspot provided by the City of New York.

So, I used their wifi to look up LinkNYC. Their description tells me that “LinkNYC is a first-of-its-kind communications network that will replace over 7,500 pay phones across the five boroughs with new structures called Links. Each Link provides superfast, free public Wi-Fi, phone calls, device charging and a tablet for access to city services, maps and directions.”

They have been around since January 2016. I want to know how many such kiosks there are in the City and in each borough. They have a map on their website for locating the kiosks, but it is hard for me to count points on a map. My City at service again, I found the entire wifi hotspot locations data on NYC Open Data.

Let us learn some tricks in R while trying to answer this question.

Usual chores.

Step 1: Get the data
I downloaded the hotspot data file from here.

Step 2: Create a new folder on your computer
Let us call this folder “lesson8”. The downloaded data file “Free_WiFi_Hotspots_09042005.csv” is saved in this folder.

Step 3: Create a new code in R
Create a new code for this lesson – “lesson8_code.R”.

Step 4: Choose your working directory
In this lesson, we have a folder named “lesson8”. So we should tell R that “lesson8” is the folder where the data files are stored. You can do this using the “setwd” command. If you followed the above steps and if you type list.files() in the console, you would see “lesson8_code.R” and “Free_WiFi_Hotspots_09042005.csv” on the screen as the listed files in your folder.

Step 5: Read the data into R workspace
Since we have a comma separated values file (.csv), we can use the “read.csv” command to read the data file into R workspace. Type the following line in your code and execute it. Why am I giving header=TRUE in the command?

# Read the data file #
wifi_data = read.csv("Free_WiFi_Hotspots_09042005.csv",header=T)

Step 6: Use the data to answer the questions
It will be helpful to learn some new concepts in R coding before we address the actual kiosk problem: the for loop, storing values in matrices, and if else statements.

Loops: Do you remember your fun trips to batting cages? Just like the pitching machine is set up to pitch the baseball, again and again, you can instruct R to run commands again and again. For example, if you want to print NYC ten times, instead of typing it ten times, you can use the for loop.

# print NYC 10 times
for(i in 1:10) 
{
 print("NYC")
}

You are instructing R to print NYC ten times by using the command for (i in 1:10). The lines within the {} will be repeated ten times. Select the lines and hit the “Run” button to execute. You will see this in the console screen.

[1] "NYC"
[1] "NYC"
[1] "NYC"
[1] "NYC"
[1] "NYC"
[1] "NYC"
[1] "NYC"
[1] "NYC"
[1] "NYC"
[1] "NYC"

Matrices: I am sure you have been to a college or public library to check out books. Can you remember the bookcases that store the books? Just like a bookshelf with many horizontal and vertical dividers, in R you can create matrices with many rows and columns and use them to store values (numbers or text). Type the following lines in your code.

# create an empty matrix of NA in 10 rows and 1 column
x = matrix(NA,nrow=10,ncol=1)
print(x)

These lines create an empty matrix called x with ten rows and 1 column. The NA is an empty space. We can fill this space with values later.

Now imagine we combine the above instructions. We want to store NYC in a matrix or shelf, like printing NYC ten times on paper and arranging the copies in each row of a bookshelf. Type the following lines and see the results for yourself. The empty matrix x will be filled with NYC using the for loop.

# store NYC in a matrix of 10 rows and 1 column
x <- matrix(NA,nrow=10,ncol=1)
for (i in 1:10)
{
 x[i,1] = "NYC"
}
print(x)

If Else Statement: Do you remember cleaning your room on a Sunday? Recall those terrible times for a minute and think about what you did. If you find a book lying around, you would have put it in the bookshelf. If you find a toy, you may have put it in your toy bag. If you find a shirt or a top, it goes into the closet. The if else statement works exactly like this. You are instructing R to perform an action if some condition is true, else some other work.

if (condition) {statement} else {statement}

For example, let us say we want to print a number 1 if the 10th row of the matrix x is “NYC”; else, we want to print a number 0. We can do this using the following lines.

# if else statement #
if (x[10,1] == "NYC") {print(1)} else {print (0)}

The condition is x[10,1] == "NYC"; the output is printing 1 or 0.

Okay, let us get back to the wifi hotspot business. If you look at the Free_WiFi_Hotspots_09042005.csv file, you will notice that the 4th column is an indicator for which borough the kiosk is in, and the 5th column is the indicator for the service provider. LinkNYC is not the only free wifi provider in the city. Time Warner, AT&T, and Cablevision are some of the other providers.

So here is a strategy or a set of instructions we can give R to see how many LinkNYC kiosks there are in Manhattan. We first check for the total number of kiosks. There are 2061 kiosks in the City (number of rows of the table). Wow..

We look through each row (kiosk) and check if the provider is LinkNYC, and the borough is Manhattan. If it is the case, we assign a 1; else we assign a 0. We count the total number of ones. Here is the code.

## finding NYC LINK kiosks in Manhattan ##
n = nrow(wifi_data)

linknyc_hotspots = matrix(NA,nrow=n,ncol=1)

for (i in 1:n)
{
 if((wifi_data[i,4]=="MN") & (wifi_data[i,5]=="BETA LinkNYC - Citybridge")) 
{linknyc_hotspots[i,1] = 1} else {linknyc_hotspots[i,1] = 0}
}

sum(linknyc_hotspots)

Notice what I did.

I first created an empty matrix with 2061 rows and 1 column.

Using the for loop, I am looking through all the rows.

For each row, I have a condition to check – the provider is LinkNYC, and the borough is Manhattan. Notice how I check for the conditions using the (condition 1) & (condition 2).
If these conditions are true, I will fill the empty matrix with 1, else, I will fill it with 0.

In the end, I will count the total number of ones.

There are 521 LinkNYC kiosks in Manhattan. Still Wow..
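Once the loop makes sense, you may like to know that R can check the two conditions on all the rows at once. Here is a sketch on a made-up table of five kiosks. Only the column positions (4 and 5), the code "MN" and the provider text come from the lesson; the column names, the other borough codes and the rows are my assumptions.

```r
# five made-up kiosks; borough is the 4th column and provider is the 5th
toy_wifi = data.frame(
 ID = 1:5,
 Name = c("k1", "k2", "k3", "k4", "k5"),
 Type = "Free",
 Borough = c("MN", "MN", "BK", "MN", "BX"),
 Provider = c("BETA LinkNYC - Citybridge", "Other provider",
              "BETA LinkNYC - Citybridge", "BETA LinkNYC - Citybridge",
              "Other provider")
)

# TRUE for rows where both conditions hold, FALSE otherwise
is_match = (toy_wifi[,4] == "MN") & (toy_wifi[,5] == "BETA LinkNYC - Citybridge")

# TRUE counts as 1 and FALSE as 0, so the sum is the number of matches
print(sum(is_match))

# counts for every borough and provider combination in one table
print(table(toy_wifi[,4], toy_wifi[,5]))
```

The same two lines with wifi_data in place of toy_wifi should reproduce the count from the loop.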

Can you now tell me how many LinkNYC kiosks there are in Brooklyn and how many Time Warner Cable kiosks there are in the Bronx?

While you tell me the answer, I want to sing Frank Sinatra for the rest of the day. I love my City. I can take long walks; I can be car free; I can get free wifi access; I can even get free data to analyze.

Wait, maybe I should not use “free” in my expression of the love for the City.

April 18th is fast approaching.


Lesson 5 – Let us ‘R’eview

I am staring at my MacBook, thinking about the opening lines. As my mind wanders, I look out the window to admire the deceivingly bright sunny morning. The usually pleasant short walk to the coffee shop was rather uncomfortable. I ignored the weatherman’s advice that it will feel like 12 degrees today. Last week, my friend expressed his concern that February was warm. I begin to wonder how often Saturday is cold and how many days in the last month were warm. Can I look at daily temperature data and find these answers? Maybe, if I can get the data, I can ask my companion R to help me with this. Do you think I can ask R to find me the set of all cold days, the set of all warm days and the set of cold Saturdays? Let us check it out. Work with me using your RStudio program.

Step 1: Get the data
I obtained the data for this review from the National Weather Service forecast office, New York. For your convenience, I filtered the data with our goals in mind. You will notice that the data is in a table with rows and columns. Each row is a day. The columns indicate the year, month, day, its name, maximum temperature, minimum temperature and average temperature. According to the National Weather Service’s definition, the maximum temperature is the highest temperature for the day in degrees Fahrenheit, the minimum temperature is the lowest temperature for the day, and the average temperature for the day is the rounded value of the average of the maximum and the minimum temperature.

Step 2: Create a new folder on your computer
When you are working with several data files, it is often convenient to have all the files in one folder on your computer. You can instruct R to read the input data from the folder. Download the “nyc_temperature.txt” file into your chosen folder. Let us call this folder “lesson5”.

Step 3: Create a new code in R
Create a new code for this lesson. “File >> New >> R script”.
Save the code in the same folder “lesson5” using the “save” button or by using “Ctrl+S”. Use .R as the extension — “code_lesson5.R”. Now your R code and the data file are in the same folder.

Step 4: Choose your working directory
Make it a practice to start your code with the first line instructing R to set the working directory to your folder. In this lesson, we have a folder named “lesson5”. So we should tell R that “lesson5” is the folder where the data files are stored. You can do this using the “setwd” command.

The path to the folder is given within the quotes. Execute the line by clicking the “Run” button on the top right. When this line is executed, R will read from the “lesson5” folder. You can check this by typing “list.files()” in the console. The “list.files()” command will show you the files in the folder. If you followed the above steps, you would see “code_lesson5.R” and “nyc_temperature.txt” on the screen as the listed files in your folder.

Step 5: Read the data into R workspace
The most common way to read the data into R workspace is to use the command “read.table”. This command will import the data from your folder into the workspace. Type the following line in your code and execute it.

Notice that I am giving the file name in quotes. I am also telling R that there is a header (header=TRUE) for the data file. The header is the first row of the data file, the names of each column. If there is no header in the file, you can choose header=FALSE.
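A line consistent with this description is sketched below, using the file name from Step 2 and the object name nyctemperature from this lesson. I wrapped it in a check for the file so that the lines run even if you have not downloaded the data yet, and I added two made-up lines of text to show what the header option changes; toy_text, with_header and without_header are my names.

```r
# Read the data file (only if it is in the working directory) #
data_file = "nyc_temperature.txt"
if (file.exists(data_file))
{
 nyctemperature = read.table(data_file, header=TRUE)
 print(head(nyctemperature))
}

# what header=TRUE does, on a tiny made-up table given as text
toy_text = "Year Month Day\n2017 1 1\n2017 1 2"

with_header = read.table(text=toy_text, header=TRUE)     # first row becomes the column names
without_header = read.table(text=toy_text, header=FALSE) # first row is treated as data

print(names(with_header))
print(nrow(with_header))
print(nrow(without_header))
```

With header=TRUE the toy table has two rows of data; with header=FALSE the column names are counted as a third row.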

Once you execute this line, you will see a new name (nyctemperature) appearing in the environment space (right panel). We have just imported the data file from the “lesson5” folder into R.

Step 6: Use the data to answer the questions
Let us go back to the original questions. How many days in the last month were warm, and how often is Saturday cold?

Let us call data for the months of January and February as the sample space S. S is the set of all data for January and February. Type the following lines in your code to define sample space S.

Notice that S is a table/matrix with rows and columns. In R, S[1,1] is the element in the first row and first column. S[1,7] is the element in the first row and seventh column, i.e. the average temperature data for the first day. If you want to choose the entire first row, you can use S[1, ] (1 followed by a comma followed by space within the square brackets). If you want to select an entire column, for instance, the average temperature data (column 7), you can use S[ ,7] (a space followed by a comma followed by the column number 7).
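Here is a small sketch of this row and column selection on a made-up table with the same seven columns (year, month, day, day name, maximum, minimum and average temperature). The column names and the temperatures are my inventions; only the column order follows the lesson's data.

```r
# three made-up days with the same column order as the lesson's data
S_toy = data.frame(
 Year = c(2017, 2017, 2017),
 Month = c(1, 1, 2),
 Day = c(1, 2, 1),
 DayName = c("Sunday", "Monday", "Wednesday"),
 Tmax = c(45, 40, 52),
 Tmin = c(35, 30, 44),
 Tavg = c(40, 35, 48)
)

print(S_toy[1,1])   # first row, first column (the year)
print(S_toy[1,7])   # first row, seventh column (average temperature of the first day)
print(S_toy[1, ])   # the entire first row
print(S_toy[ ,7])   # the entire seventh column
```

Selecting a row gives you a one-row table, while selecting a column gives you a plain set of numbers.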

To address the first question, we should identify warm days in February. We need to define a set A for all February data, and a set B for warm days.

Recall lesson 4 and check whether A and B are subsets of S.

Type the following lines in your code to define set A.

S[ ,2] is selecting the second column (month) from the sample space S. The “which(S[ ,2]==2)” will identify which rows in the month column are equal to 2, i.e. we are selecting the February days. Notice that A will give you numbers 32 to 59, the 32nd row (February 1) to 59th row (February 28) in the data.

Next, we need to define set B as the warm days. For this, we should select a criterion for warm days. For simplicity, let us assume that a warm day is any day with an average temperature greater than or equal to 50 degrees F. Let us call this set B = set of all warm days. Type the following lines in your code and execute to get set B.

S[ ,7] is selecting the 7th column (average temperature) from the sample space S. The “which(S[ ,7]>=50)” will identify which rows have an average temperature greater than or equal to 50. Notice that B will give you the numbers 12, 26, 39, 50, 53, 54, 55, 56, and 59; the rows (days) when the average temperature is greater than or equal to 50 degrees F. February 28th, 2017 had an average temperature of 53 degrees F. I believe it was that day when my friend expressed his unhappiness about the winter being warm!

Now that we have set A for all February data, and set B for warm days, we need to identify how many elements are common in A and B; what is the intersection of A and B. The intersection will find the elements common to both the sets (Recall Lesson 4 – intersection = players interested in basketball and soccer). Type the following line to find the intersection of A and B.

The “intersect” command will find the common elements of A and B. You will get the numbers 39, 50, 53, 54, 55, 56, and 59. The days in February that are warm. Seven warm days last month — worthy of my friend’s displeasure.
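The steps for A, B and their intersection can be sketched on a made-up table of five days. I am using column names (Month and Tavg) instead of the column numbers 2 and 7, and the temperatures are toy values, so the row numbers will not match the ones from the real data.

```r
# five made-up days: two in January and three in February
S_toy = data.frame(
 Month = c(1, 1, 2, 2, 2),
 Tavg = c(52, 30, 55, 41, 50)
)

# set A: rows that are February days
A = which(S_toy$Month == 2)

# set B: rows that are warm days (average temperature of 50 degrees F or more)
B = which(S_toy$Tavg >= 50)

# warm days in February: rows common to A and B
warm_feb = intersect(A, B)

print(A)
print(B)
print(warm_feb)
print(length(warm_feb))   # how many warm days in February

# rows that are in February or warm (or both)
print(sort(union(A, B)))
```

In the toy table, rows 3 and 5 are both February days and warm days, so the count is 2.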

Can you now tell me how often is Saturday cold based on the data we have? Assume cold is defined as an average temperature less than or equal to 25 degrees F.

Did you realize that I am just whining about Saturday being cold? Check out the “union” command before you sign off.

 


Lesson 2 – R is your companion

R is your companion in this journey through data. It is a software environment that performs the analysis and creates plots for us upon instruction. People who are familiar with computer programming need no introduction. Others who are just getting started with data analysis but are skeptical about computer programming – count the number of 1’s in this data. You will need no more convincing to embrace R.

You can follow these steps to install R and its partner RStudio.

1) Installing R: Use this link https://cran.r-project.org to download and install R on your computer. There are separate download links for Mac, Windows or Linux users.

2) Installing RStudio: Use this link https://www.rstudio.com/products/rstudio/download/ to download and install RStudio on your computer. RStudio will be your environment for coding and visualizations. For RStudio to run, you need to install R; which is why we followed step 1.

Getting started with RStudio
Open RStudio – you should see three panels, the console, environment and history and the files panels.

In the console panel, you will get a copyright message and a “>” prompt. Type 1+2 and hit enter here to check that RStudio is working. You should see the number 3 pop up, which means you are ready to go.

Writing your code
You can use a text editor to give instructions to RStudio (writing your code) and save those instructions (saving your code). In the text editor, you can also write comments to help you remember the steps and make it user-friendly and readable. As a simple example, imagine you want to add 2 numbers, 1 and 2 and you want to write instructions for this. You can type

The first line starts with a #; it is a comment — for people to know that you are adding 2 numbers.

The second line is the actual code or instruction to add 1 and 2.
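The two lines being described would look something like this (the wording of the comment is my guess):

```r
# add two numbers
1 + 2
```

When you run the second line, the console shows the number 3.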

Remember, a well-written code with clear comments is like having good handwriting.

Opening the text editor
Once you load RStudio, go to “File >> New >> R script”.
You will now see the text editor as the 4th panel in RStudio. You can save the code using the “save” button or by using “Ctrl+S”. Use .R as the extension, as in “code.R”.

Some simple code
Type the following lines in your code and execute them by clicking the “Run” button on the top right.

Shortcut: You can also place the cursor on the line and hit “Ctrl+Enter” if you are using Windows or “Cmd+Enter” if you are using a Mac. Notice the results of your code in the console.

I have more simple instructions for practice here. Copy them to your code and have fun coding.

We will learn data analysis using RStudio. I will provide more coding tricks as we go along.

 
