Lesson 29 – Large number games in R

“it cannot escape anyone that for judging in this way about any event at all, it is not enough to use one or two trials, but rather a great number of trials is required. And sometimes the stupidest man — by some instinct of nature per se and by no previous instruction (this is truly amazing) — knows for sure that the more observations of this sort that are taken, the less danger will be of straying from the mark.”

wrote James Bernoulli, the Swiss mathematician, in his treatise Ars Conjectandi, published in 1713.

Suppose we have an urn with an unknown number of white and black pebbles. Can we take a sample of 10 pebbles, count the number of white and black ones, and deduce the proportion in the container? If 10 is too small a sample, how about 20? How small is too small? James Bernoulli used this illustration to understand, through observations, the likelihood of diseases in the human body.

How many sample observations do we need before we can confidently say something is true?

In lesson 6 and lesson 23, we saw that the probability of an event is its long-run relative frequency. How many observations should we get before we know the true probability?

Do data from different probability distributions take the same number of sample observations to show this convergence to the truth?

Do we even know if what we observe in the long run is the actual truth? Remember black swans did not exist till they did.

We will let the mathematicians and probability theorists wrestle with these questions. Today, I will show you some simple tricks in R to simulate large number games. We will use three examples:

tossing a coin,

rolling a die, and

drawing pebbles from the urn.

All the code in R is here. Let’s get to the nuts and bolts of our toy models.

Coin Toss

If you toss a fair coin, the probability of getting a head or a tail is 0.5. Successive tosses are independent of each other. By now you should know that this 0.5 is a long-run notion. In other words, if you toss a coin many times and count the number of heads and the number of tails, the probability of a head, i.e., the number of heads/N, approaches 0.5.

Let us create this experiment in R. Since there is a 0.5 probability of getting a head or a tail, we can use a random number generator to simulate the outcomes. Imagine a ruler on the scale of 0 to 1, and we pick a random number anywhere on it, with every value between 0 and 1 equally likely. If the number is less than 0.5, we call it heads; if it is greater than 0.5, we call it tails.

A random number between 0 and 1 with equal probability can be generated using the runif command in R.

# Generate 1 uniform random number 
r = runif(1)

[1] 0.3627271
# Generate 10 uniform random numbers 
r = runif(10)

[1] 0.7821440 0.8344471 0.3977171 0.3109202 0.8300459 0.7474802 0.8777750
 [8] 0.7528353 0.9098839 0.3731529
# Generate 100 uniform random numbers 
r = runif(100)

The steps for this coin toss experiment are:
1) Generate a uniform random number r between 0 and 1
2) If r < 0.5, choose H
3) Else, choose T
4) Repeat the experiment N times → N coin tosses.

Here is the code.

# coin toss experiment 
N = 100

r = runif(N)

x = matrix(NA,nrow = N,ncol = 1)
for (i in 1:N)
{
 if(r[i] < 0.5 )
 { x[i,1] = "H" } else 
 { x[i,1] = "T" }
}

# Probability of Head 
P_H = length(which(x=="H"))/N
print(P_H)

# Probability of Tail 
P_T = length(which(x=="T"))/N
print(P_T)

We first generate 100 random numbers with equal probability, loop through each number, assign H or T based on whether the number is less than 0.5 or not, and then count the number of heads and tails. You may remember the which and length commands from lesson 5 and lesson 8.

The beauty of using R is that it offers shortcuts for lengthy procedures. What we did using the for loop above can be done much more easily with the sample command. Here is how we do that.

We first create a vector with H and T names (outcomes of a toss) to select from.

# create a vector with H and T named toss to select from 
toss = c("H","T")

Next, we can use the sample command to draw a value from the vector. This function randomly samples a value from the vector (H, T), just like tossing a coin.

# use the command sample to sample 1 value from toss with replacement 
sample(toss,1,replace=T) # toss a coin 1 time

You can also use it to sample/toss multiple times, but be sure to use replace = TRUE in the function. That way, the function replicates the process of drawing a value, putting it back, and drawing again, so every toss is drawn from the same two outcomes.

# sample 100 times will toss a coin 100 times. use replace = T 
sample(toss,100,replace=T) # toss a coin 100 times

 [1] "H" "H" "H" "H" "T" "T" "H" "T" "T" "H" "T" "H" "H" "H" "T" "T" "H" "H"
 [19] "T" "H" "H" "H" "H" "T" "T" "T" "T" "H" "H" "H" "T" "H" "H" "H" "T" "H"
 [37] "H" "T" "T" "T" "T" "T" "H" "T" "T" "H" "T" "H" "H" "T" "H" "H" "T" "H"
 [55] "H" "T" "H" "T" "H" "H" "H" "T" "T" "T" "T" "H" "H" "H" "H" "T" "H" "T"
 [73] "T" "H" "H" "T" "H" "H" "T" "T" "T" "T" "H" "T" "H" "T" "T" "H" "T" "H"
 [91] "H" "T" "T" "H" "T" "H" "T" "H" "H" "T"

To summarize the number of heads and tails, you can use the table command. It is a bean counter. You should expect roughly 50 H and 50 T from 100 tosses.

# to summarize the values use the table command.
x = sample(toss,100,replace=T) # toss a coin 100 times
table(x) # summarize the values in x

 H T 
51 49
A large number of simulations

Set up an experiment with N = 5, 15, 25, 35, 45, …, 19995 (every tenth number up to 20000) and record the probability of obtaining a head in each case. Type the following lines in your code and execute them. See what you get.

# Large Numbers Experiment 
# Tossing Coin

toss = c("H","T")

number_toss = seq(from = 5, to = 20000,by=10) # increasing number of tosses

P_H = matrix(NA,nrow=length(number_toss),ncol=1) # create an empty matrix to fill probability each time

P_T = matrix(NA,nrow=length(number_toss),ncol=1) # create an empty matrix to fill probability each time

for (i in 1:length(number_toss))
{
 x = sample(toss,number_toss[i],replace=T)
 P_H[i,1]=length(which(x=="H"))/number_toss[i]
 P_T[i,1]=length(which(x=="T"))/number_toss[i] 
}

plot(number_toss,P_H,xlab="Number of Tosses",ylab="Probability of Head",type="l",font=2,font.lab=2)
abline(h=0.5,col="red",lwd=1)

Notice that for a small number of coin tosses, the estimated probability of a head is highly variable (not stable). But as the number of coin tosses increases, the probability converges and stabilizes at 0.5.

DICE

You should be able to write the code for this using the same logic. Try it. Create a vector [1, 2, 3, 4, 5, 6] and sample from this vector with replacement.

No cheating by looking at my code right away.

Here is the large number experiment on this. Did you see the convergence to 1/6?
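If you want to check your attempt against mine, here is one possible sketch following the same pattern as the coin toss (the object names are my own choices):

```r
# Large Numbers Experiment
# Rolling a die

die = c(1,2,3,4,5,6)

number_rolls = seq(from = 5, to = 20000, by = 10) # increasing number of rolls

P_six = matrix(NA, nrow = length(number_rolls), ncol = 1) # empty matrix to fill the probability of a six each time

for (i in 1:length(number_rolls))
{
 x = sample(die, number_rolls[i], replace = T)
 P_six[i,1] = length(which(x == 6))/number_rolls[i]
}

plot(number_rolls, P_six, xlab = "Number of Rolls", ylab = "Probability of a Six", type = "l", font = 2, font.lab = 2)
abline(h = 1/6, col = "red", lwd = 1)
```

The same logic works for any face of the die; just change the value inside which(x == 6).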

Pebbles

Let us first create a vector of size 10 (an urn with pebbles) with six white and four black pebbles, so the true ratio of white to black is 3:2. Run the sampling experiment on this vector by drawing more and more samples, and estimate the probability of a white pebble in the long run.

# Show that the probability of getting the color stones approaches the true probability
# Pebbles

stone = c("W","W","W","W","W","W","B","B","B","B")

number = seq(from = 5, to = 20000,by=10) # increasing number of draws

P_W = matrix(NA,nrow=length(number),ncol=1) # create an empty matrix to fill probability each time
P_B = matrix(NA,nrow=length(number),ncol=1) # create an empty matrix to fill probability each time

for (i in 1:length(number))
{
 x = sample(stone,number[i],replace=T)
 P_W[i,1]=length(which(x=="W"))/number[i]
 P_B[i,1]=length(which(x=="B"))/number[i]
}

plot(number,P_W,type="l",font=2,font.lab=2,xlab="Number of Draws with Replacement",ylim=c(0,1),ylab="Probability of a White Pebble")
abline(h=(0.6),col="red")

If there were truly three white to two black pebbles in the world, an infinite (very very large) experiment/simulation would reveal a convergence at this number. This large number experiment is our quest for the truth, a process to surmise randomness.

You can now go and look at the full code and have fun programming it yourself. I will close off with this quote from James Bernoulli’s Ars Conjectandi.

“Whence, finally, this one thing seems to follow: that if observations of all events were to be continued throughout all eternity, (and hence the ultimate probability would tend toward perfect certainty), everything in the world would be perceived to happen in fixed ratios and according to a constant law of alternation, so that even in the most accidental and fortuitous occurrences we would be bound to recognize, as it were, a certain necessity and, so to speak, a certain fate.”

I continue to wonder what that “certain fate” is for me.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 21 – Beginners guide to summarize data in R

We went nine lessons without R. During this time we had lessons that taught us how to visualize data using dotplots, boxplots, and frequency plots (lesson 14 and lesson 15). We had lessons on summarizing the data using average (lesson 16), standard deviation (lesson 17), skewness (lesson 18), and outliers (lesson 19). We also had a lesson on comparing datasets using coefficient of variation (lesson 20). Don’t you think it is time to put these things into practice and know how to use R commands to explore data?

Let’s get going. Today’s victim is the natural gas consumption data in New York City available at the zip code level. The Mayor’s office of long-term planning and sustainability provides this data.

Usual Chores

Step 1: Get the data
You can download the data file from here.

Step 2: Create a new folder on your computer
Let us call this folder “lesson21”. The downloaded data file “Natural_Gas_Consumption_by_ZIP_Code_-_2010.csv” will be saved in this folder.

Step 3: Create a new code in R
Create a new code for this lesson. “File >> New >> R script”. Save the code in the same folder “lesson21” using the “save” button or by using “Ctrl+S”. Use .R as the extension — “lesson21_code.R”. Now your R code and the data file are in the same folder.

Step 4: Choose your working directory
“lesson21” is the folder where we stored the data file. Use “setwd(“path”)” to set the path to this folder. Execute the line by clicking the “Run” button on the top right.

setwd("path to your folder")

Step 5: Read the data into R workspace
Since we have a comma separated values file (.csv), we can use the “read.csv” command to read the data file into R workspace. Type the following line in your code and execute it. Use header=TRUE in the command.

# Read the datafile #
naturalgas_data = read.csv("Natural_Gas_Consumption_by_ZIP_Code_-_2010.csv",header=TRUE)

Step 6: Let us begin the exploration
I realized that the data is classified into various building types → commercial, industrial, residential, institutional, etc. Let us extract three building types for this lesson. You can use the “which” command to do this filtering.

# extract subset of the data #
index1 = which(naturalgas_data$Building.type..service.class == "Large residential")

index2 = which(naturalgas_data$Building.type..service.class == "Commercial")

index3 = which(naturalgas_data$Building.type..service.class == "Institutional")

Residential = naturalgas_data$Consumption..GJ.[index1]
Commercial = naturalgas_data$Consumption..GJ.[index2]
Institutional = naturalgas_data$Consumption..GJ.[index3]

The “which” command will identify all the rows that belong to a class. We then extract these rows from the original data.

Now we have all large residential building consumption data under the “Residential“, all commercial building consumption data under the “Commercial” and all institutional building consumption data under the “Institutional“.

Let us look at Large residential consumption data first and then compare it to the others.

Visualize the data using dot plot

As a first step, we can order the data from smallest to largest and place the numbers as points on the line. This dot plot provides a good visual perspective on the range of the data. You can use the following command to do this.

stripchart(Residential,font=2,pch=21,cex=0.5,xlab="Consumption in GJ",font.lab=2)

You can change the values for the font to make it bold or light, pch for different shapes of the dots, and cex for changing the size of the dots. You can have customized labels using xlab.

Visualize the data using boxplot

With boxplot, we can get a nice visual of the data range and its percentiles along with detecting the outliers in the data.

boxplot(Residential, horizontal=TRUE, col="grey",add=T)

You can pick a color of your choice and make it horizontal or vertical. I usually prefer the box to be horizontal as it aligns with the number line. You must have noticed the add=T at the end. This tells R to add the boxplot on top of the dot plot. Did you see how the dots and the box aligned?

The box is the region with the middle 50 percent of the data. The vertical line in the box is the 50th percentile (middle data point — median). The length of the box is called the interquartile range (75th percentile minus 25th percentile). Whiskers extend from the box toward the highest and lowest data points, usually up to 1.5 times the length of the box. Points that cannot be reached by the whiskers are outliers.
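To connect the picture with the numbers, you can compute the box and whisker limits yourself. A small sketch with a made-up demo vector (substitute Residential for x to use the lesson data):

```r
# demo vector -- substitute your own data, e.g., x = Residential
x = c(2, 4, 6, 8, 10, 12, 14, 16, 50)

q25 = quantile(x, 0.25) # lower end of the box (25th percentile)
q75 = quantile(x, 0.75) # upper end of the box (75th percentile)
iqr = q75 - q25         # length of the box (interquartile range)

lower_whisker_limit = q25 - 1.5*iqr
upper_whisker_limit = q75 + 1.5*iqr

# points beyond the whisker limits are the outliers
outliers = x[which(x < lower_whisker_limit | x > upper_whisker_limit)]
print(outliers) # [1] 50
```

For this demo vector, only the extreme value 50 falls beyond the whiskers, just like the extreme consumption values in the boxplot above.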

Visualize the data using frequency plot

For making a frequency plot, we should first partition the data into groups or bins, count the number of data points that fall in each bin and stack building blocks for each bin. All this is done in R using the command:

# histogram #
hist(Residential,font=2,font.lab=2,xlab="Consumption in GJ",col="grey")

Again, you can work with font, xlab, etc. to beautify the plot.

Did you see that the data is not symmetric? There are extreme values on the right creating a positive skew. The average and 50th percentile (median) are not the same.

Summarizing the data using summary statistics

We can summarize the data using the summary statistics → average, standard deviation, and skewness. We can also compute the percentiles.

Average or mean is a measure of the central tendency (balance point). You make a pass through the numbers and add them up, then divide this total by the number of data points. In R, the command for this is:

# average #
mean(Residential,na.rm=T)
[1] 684942.3

Notice that I added an argument na.rm = T since I have some missing values in the data. na.rm is instructing R to remove the “NA” and then compute the average.

You might be thinking that the mean is sensitive to outliers. Yes, we have seen that the data is skewed, so perhaps computing the middle value (median) is necessary. The median is not sensitive to outliers; 50 percent of the data points are less than this number. The command for this in R is:

# median #
median(Residential,na.rm=T)
[1] 414505

While the average provides a measure of the center of the data, the variance or standard deviation provides a measure of how spread out the data is from the center. Compute the deviation of each number from the average. Square these deviations. Average these squared deviations (R's var command divides by n - 1, the sample version, rather than n). In R you can do this using the commands:

# variance and standard deviation #
var(Residential,na.rm=T)
[1] 707890274303

sd(Residential,na.rm=T)
[1] 841362.2

As you know, the standard deviation is in the same units (gigajoules) as the average.
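A quick way to verify the recipe against the built-in commands, on a made-up demo vector (substitute Residential for x, after removing the NAs):

```r
# demo vector -- substitute your own data, e.g., x = na.omit(Residential)
x = c(2, 4, 4, 4, 5, 5, 7, 9)

deviations = x - mean(x) # deviation of each number from the average
# note: R's var() and sd() divide by (n - 1), the sample version, not n
manual_var = sum(deviations^2)/(length(x) - 1)
manual_sd = sqrt(manual_var)

manual_var - var(x) # difference should be practically zero
manual_sd - sd(x)   # same here
```

If you divide by n instead of n - 1, you will see a small discrepancy against var() and sd(); for large samples like ours, the difference is negligible.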

The third measure, the skewness can be computed as follows:

# skewness #
library(moments)
skewness(Residential,na.rm=T)
[1] 1.990704

The function skewness is part of a package called “moments“. If you do not have this library, you should first install it using:

install.packages("moments")

After installing, you can call this library using:

library(moments)

The skewness for this data is 1.99. Very high as you saw in the histogram. If the data is symmetric, i.e. evenly spread out around the center, the skewness will be 0.

The “quantile” command will give us the 0th, 25th, 50th, 75th and 100th percentiles. We can also use the “summary” command to get an overview of the data.

# quantiles and summary #
quantile(Residential,na.rm=T)

 0% 25% 50% 75% 100% 
 30 65247 414505 998733 4098510 

summary(Residential)

 Min. 1st Qu. Median Mean 3rd Qu. Max. NA's 
 30 65250 414500 684900 998700 4099000 1
Comparing the data using plots and coefficient of variation

An easy way to compare the data is to align them on the same scale and visually inspect for differences. In other words, you can plot two or more boxplots on the same scale like this:

# comparing the data # 
all_data = cbind(Residential,Commercial, Institutional)
boxplot(all_data,horizontal=T)

You must have noticed that I am first combining the data vectors into a new matrix using the “cbind” command. cbind will bind the data as columns (column bind — cbind). We can now plot this new matrix with three columns.

There are clear differences in the data. Large residential buildings have more energy consumption. Not surprising as there are more residential buildings than commercial or institutional.

Last week, we went over the coefficient of variation. Can you now tell me which consumption data (residential, commercial or institutional) has more variation compared to its mean?

Did you notice that the commercial and institutional data have more variation compared to the residential? Do you know why?
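One way to answer that with numbers rather than eyeballs is to compute the coefficient of variation (standard deviation divided by the mean) for each vector. A sketch; the function name cv is my own choice:

```r
# coefficient of variation = standard deviation / mean #
cv = function(x)
{
 sd(x, na.rm = TRUE)/mean(x, na.rm = TRUE)
}

# apply it to each consumption vector:
# cv(Residential); cv(Commercial); cv(Institutional)
```

The vector with the largest cv value has the most variation relative to its mean.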

While you are at that, also think about why using graphical methods and summary statistics to summarize the data is an important feature for statistical modeling.

Could it be that there is a universal behavior and we are using a sample of that population?

Can a sample represent the population?

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 11 – Fran, the functionin’ R-bot

Hello, my name is Fran. I am a function in R.

 

 Hello, my name is D. I am a functionin’ person.

 

I can perform any task you give me.

 

Interesting, can you add two numbers?

 

Yes, I can.

 

Can you tell me more about how you work?

 

Sure, I need the inputs and the instruction, i.e. what you want me to do with the inputs.

Okay. I am giving you two numbers, 10 and 15. Can you show me how you will create a function to add them?

 

This is easy. Let me first give you the structure.

# structure of a function #
functionname = function(inputs)
{
 instructions
 
 return(output)
}

You can select these lines and hit the run button to load the function in R. Once you execute these lines, the function will be loaded in R, and you can use this name with any inputs.

Let us say the two numbers are a and b. These numbers are provided as inputs. I will first assign a name to the function — “add”. Since you are asking me to add two numbers, the instruction will be y = a + b, and I will return the value of y.

Here is a short video showing how to create a function to add two numbers a and b. You can try it in your RStudio program.
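If you prefer reading to watching, the function from the video might look like this sketch:

```r
# function to add two numbers #
add = function(a, b)
{
 y = a + b  # the instruction
 return(y)  # the output
}

# test the function #
add(10, 15) # returns 25
```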

 Neat. If I give you three numbers, m, x, and b, can you write a function for mx + b?

 

Yes. I believe you are asking me to write a function for a straight line: y = mx + b. I will assign “line_eq” as the name of the function, the inputs will be m, x, and b and the output will be y.

# function for line equation #
line_eq = function (m, x, b)
{
 # m is the slope of the line 
 # b is the intercept of the line
 # x is the point on the x-axis
 
 y = m*x + b # equation for the line 
 
 return(y) 
}

# test the function #
line_eq(0.5, 5, 10)
[1] 12.5

Can you perform more than one task? For example, if I ask you for y = mx + b and x + y, can you return both values?

 

Yes, I can. I will have two instructions. In the end, I will combine both the outputs into one vector and return the values. Here is how I do it.

# function for line equation + (x + y) #
two_tasks = function (m, x, b)
{
 # m is the slope of the line 
 # b is the intercept of the line
 # x is the point on the x-axis
 
 y = m*x + b # equation for the line 
 
 z = x + y
 
 return(c(y,z)) 
}

# test the function #
two_tasks(0.5, 5, 10)
[1] 12.5 17.5

Very impressive. What if some of the inputs are numbers and some of them are a set of numbers? For instance, if I give you many points on the x-axis, m and b, the slope and the intercept, can you give me the values for y?

 

No problemo. The same line_eq function will work. Let us say you give me some numbers x = [1, 2, 3, 4, 5], m = 0.5 and b = 10. I will use the same function line_eq(m, x, b).

# use on vectors #
x = c(1,2,3,4,5)
m = 0.5
b = 10

line_eq(m,x,b)
[1] 10.5 11.0 11.5 12.0 12.5

I am beginning to like you. But, maybe you are fooling me with simple tricks. I don’t need a robot for doing simple math.

 

Hey, my name is Fran 😡

 

Okay Fran. Prove to me that you can do more complicated things.

 

Bring it on.

 

It is springtime, and I’d love to get a Citi Bike and ride around the city. I want you to tell me how many people rented a bike on the most popular route, the Central Park Southern Loop, and the average trip time.

 

aargh… your obsession with the city. Give me the data.

 

Here you go. You can use the March 2017 file. They have data for the trip duration in seconds, check out time and check in time, start station and end station.

Alright. I will name the function “bike_analysis.” The inputs will be the data for the bike ridership for a month, and the name of the station. The function will identify how many people rented the bikes at the Central Park S station and returned it back to the same station — completing the loop. You asked me for total rides and the average trip time. I threw in the maximum and minimum ride time too. You can use this function with data from any month and at any station.

# function to analyze bike data # 
bike_analysis = function(bike_data,station_name)
{
 dum = which(bike_data$Start.Station.Name == station_name & bike_data$End.Station.Name == station_name)
 total_rides = length(dum)
 
 average_time = mean(bike_data$Trip.Duration[dum])/60 # in minutes 
 max_time = max(bike_data$Trip.Duration[dum])/60 # in minutes 
 min_time = min(bike_data$Trip.Duration[dum])/60 # in minutes
 
 output = c(total_rides,average_time,max_time,min_time)
 return(output)
}

# use the function to analyze Central Park South Loop #

# bike data # 
bike_data = read.csv("201703-citibike-tripdata.csv",header=T)

station_name = "Central Park S & 6 Ave"

bike_analysis(bike_data,station_name)
[1] 212.000000  42.711085 403.000000   1.066667

212 trips, 42 minutes of average trip time. The maximum trip time is 403 minutes and the minimum trip time is ~ 1 minute. Change of mind?

Wow. You are truly helpful. I would have spent a lot of time if I were to do this manually. I can use your brains and spend my holiday weekend riding the bike.

 

Have fun … and Happy Easter.

 

How did you know that?

 

Machine Learning man 😉

 

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 8 – The search ‘for’ w’if’i

As I was taking my usual Broadway stroll this morning, I noticed some people immersed in their smartphones, texting or browsing while walking. Guilty as charged. It reminded me of the blog post a few years ago about my memory down the phone lane.

I started paying attention to my surroundings and noticed this kiosk, a free wifi hotspot provided by the City of New York.

So, I used their wifi to look up LinkNYC. Their description tells me that “LinkNYC is a first-of-its-kind communications network that will replace over 7,500 pay phones across the five boroughs with new structures called Links. Each Link provides superfast, free public Wi-Fi, phone calls, device charging and a tablet for access to city services, maps and directions.”

They have been around since January 2016. I want to know how many such kiosks there are in the City and in each borough. They have a map on their website for locating the kiosks, but it is hard for me to count points on a map. My City at my service again: I found the entire wifi hotspot location data on NYC Open Data.

Let us learn some tricks in R while trying to answer this question.

Usual chores.

Step 1: Get the data
I downloaded the hotspot data file from here.

Step 2: Create a new folder on your computer
Let us call this folder “lesson8”. The downloaded data file “Free_WiFi_Hotspots_09042005.csv” is saved in this folder.

Step 3: Create a new code in R
Create a new code for this lesson – “lesson8_code.R”.

Step 4: Choose your working directory
In this lesson, we have a folder named “lesson8”. So we should tell R that “lesson8” is the folder where the data files are stored. You can do this using the “setwd” command. If you followed the above steps and if you type list.files() in the console, you would see “lesson8_code.R” and “Free_WiFi_Hotspots_09042005.csv” on the screen as the listed files in your folder.

Step 5: Read the data into R workspace
Since we have a comma separated values file (.csv), we can use the “read.csv” command to read the data file into R workspace. Type the following line in your code and execute it. Why am I giving header=TRUE in the command?

# Read the data file #
wifi_data = read.csv("Free_WiFi_Hotspots_09042005.csv",header=T)

Step 6: Use the data to answer the questions
It will be helpful to learn some new concepts in R coding before we address the actual kiosk problem: the for loop, storing values in matrices, and if else statements.

Loops: Do you remember your fun trips to batting cages? Just like the pitching machine is set up to pitch the baseball, again and again, you can instruct R to run commands again and again. For example, if you want to print NYC ten times, instead of typing it ten times, you can use the for loop.

# print NYC 10 times
for(i in 1:10) 
{
 print("NYC")
}

You are instructing R to print NYC ten times by using the command for (i in 1:10). The lines within the {} will be repeated ten times. Select the lines and hit the “Run” button to execute. You will see this in the console screen.

[1] "NYC"
[1] "NYC"
[1] "NYC"
[1] "NYC"
[1] "NYC"
[1] "NYC"
[1] "NYC"
[1] "NYC"
[1] "NYC"
[1] "NYC"

Matrices: I am sure you have been to a college or public library to check out books. Can you remember the bookcases that store the books? Just like a bookshelf with many horizontal and vertical dividers, in R you can create matrices with many rows and columns and use them to store values (numbers or text). Type the following lines in your code.

# create an empty matrix of NA in 10 rows and 1 column
x = matrix(NA,nrow=10,ncol=1)
print(x)

This creates an empty matrix called x with ten rows and one column. The NAs are placeholders; we can fill these spaces with values later.

Now imagine combining the above instructions: we want to store NYC in a matrix, like printing NYC ten times on paper and placing one copy in each row of a bookshelf. Type the following lines and see the results for yourself. The empty matrix x will be filled with NYC using the for loop.

# store NYC in a matrix of 10 rows and 1 columns
x <- matrix(NA,nrow=10,ncol=1)
for (i in 1:10)
{
 x[i,1] = "NYC"
}
print(x)

If Else Statement: Do you remember cleaning your room on a Sunday? Recall those terrible times for a minute and think about what you did. If you find a book lying around, you would have put it in the bookshelf. If you find a toy, you may have put it in your toy bag. If you find a shirt or a top, it goes into the closet. The if else statement works exactly like this. You are instructing R to perform one action if some condition is true, and a different action otherwise.

if (condition) {statement} else {statement}

For example, let us say we want to print a number 1 if the 10th row of the matrix x is “NYC”; else, we want to print a number 0. We can do this using the following lines.

# if else statement #
if (x[10,1] == "NYC") {print(1)} else {print (0)}

The condition is x[10,1] == “NYC”; the action is printing 1 or 0.

Okay, let us get back to the wifi hotspot business. If you look at the Free_WiFi_Hotspots_09042005.csv file, you will notice that the 4th column is an indicator for which borough the kiosk is in, and the 5th column indicates the service provider. LinkNYC is not the only free wifi provider in the city. Time Warner, AT&T, and Cablevision are some of the other providers.

So here is a strategy, a set of instructions we can give R to see how many LinkNYC kiosks there are in Manhattan. We first check the total number of kiosks. There are 2061 kiosks in the City (the number of rows of the table). Wow..

We look through each row (kiosk) and check if the provider is LinkNYC and the borough is Manhattan. If so, we assign a 1; else we assign a 0. We count the total number of ones. Here is the code.

## finding NYC LINK kiosks in Manhattan ##
n = nrow(wifi_data)

linknyc_hotspots = matrix(NA,nrow=n,ncol=1)

for (i in 1:n)
{
 if((wifi_data[i,4]=="MN") & (wifi_data[i,5]=="BETA LinkNYC - Citybridge")) 
{linknyc_hotspots[i,1] = 1} else {linknyc_hotspots[i,1] = 0}
}

sum(linknyc_hotspots)

Notice what I did.

I first created an empty matrix with 2061 rows and 1 column.

Using the for loop, I am looking through all the rows.

For each row, I have a condition to check – the provider is LinkNYC, and the borough is Manhattan. Notice how I check for the conditions using the (condition 1) & (condition 2).

If these conditions are true, I will fill the empty matrix with 1; else, I will fill it with 0.

In the end, I will count the total number of ones.

There are 521 LinkNYC kiosks in Manhattan. Still Wow..
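As an aside, once you are comfortable with the loop, the same count can be done in one line with the which and length commands from lesson 5. A sketch with a tiny made-up table standing in for the real file (columns 4 and 5 are the borough and provider, as in the data):

```r
# a tiny made-up table standing in for wifi_data; in the real file,
# column 4 is the borough and column 5 is the provider
wifi_data = data.frame(id = 1:4, lat = rep(0, 4), lon = rep(0, 4),
                       boro = c("MN", "MN", "BK", "MN"),
                       provider = c("BETA LinkNYC - Citybridge", "Time Warner",
                                    "BETA LinkNYC - Citybridge", "BETA LinkNYC - Citybridge"))

# one line instead of the loop: count the rows satisfying both conditions
length(which(wifi_data[,4] == "MN" & wifi_data[,5] == "BETA LinkNYC - Citybridge"))
# [1] 2
```

On the real file, this one-liner should give you the same 521 as the loop.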

Can you now tell me how many LinkNYC kiosks are there in Brooklyn and how many Time Warner Cable kiosks are there in the Bronx?

While you tell me the answer, I want to sing Frank Sinatra for the rest of the day. I love my City. I can take long walks; I can be car free; I can get free wifi access; I can even get free data to analyze.

Wait, maybe I should not use “free” in my expression of the love for the City.

April 18th is fast approaching.

If you find this useful, please like, share and subscribe.
You can also follow me on Twitter @realDevineni for updates on new lessons.

Lesson 5 – Let us ‘R’eview

I am staring at my MacBook, thinking about the opening lines. As my mind wanders, I look out the window to admire the deceptively bright sunny morning. The usually pleasant short walk to the coffee shop was rather uncomfortable. I ignored the weatherman’s advice that it would feel like 12 degrees today. Last week, my friend expressed his concern that February was warm. I begin to wonder how often Saturday is cold and how many days in the last month were warm. Can I look at daily temperature data and find these answers? Maybe, if I can get the data, I can ask my companion R to help me with this. Do you think I can ask R to find me the set of all cold days, the set of all warm days, and the set of cold Saturdays? Let us check it out. Work with me using your RStudio program.

Step 1: Get the data
I obtained the data for this review from the National Weather Service forecast office, New York. For your convenience, I filtered the data with our goals in mind. You will notice that the data is in a table with rows and columns. Each row is a day. The columns indicate the year, month, day, the name of the day, maximum temperature, minimum temperature and average temperature. According to the National Weather Service’s definition, the maximum temperature is the highest temperature for the day in degrees Fahrenheit, the minimum temperature is the lowest temperature for the day, and the average temperature for the day is the rounded value of the average of the maximum and the minimum temperature.

Step 2: Create a new folder on your computer
When you are working with several data files, it is often convenient to have all the files in one folder on your computer. You can instruct R to read the input data from the folder. Download the “nyc_temperature.txt” file into your chosen folder. Let us call this folder “lesson5”.

Step 3: Create a new code in R
Create a new code for this lesson. “File >> New >> R script”.
Save the code in the same folder “lesson5” using the “save” button or by using “Ctrl+S”. Use .R as the extension — “code_lesson5.R”. Now your R code and the data file are in the same folder.

Step 4: Choose your working directory
Make it a practice to start your code with the first line instructing R to set the working directory to your folder. In this lesson, we have a folder named “lesson5”. So we should tell R that “lesson5” is the folder where the data files are stored. You can do this using the “setwd” command.
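A minimal version of that first line might look like this. The path shown is a placeholder; substitute the actual location of your “lesson5” folder on your computer.

```r
# set the working directory to the folder that holds the data file
# (replace the path below with the location of your own "lesson5" folder)
setwd("C:/Users/yourname/lesson5")
```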

The path to the folder is given within the quotes. Execute the line by clicking the “Run” button on the top right. When this line is executed, R will read from the “lesson5” folder. You can check this by typing “list.files()” in the console. The “list.files()” command will show you the files in the folder. If you followed the above steps, you would see “code_lesson5.R” and “nyc_temperature.txt” on the screen as the listed files in your folder.

Step 5: Read the data into R workspace
The most common way to read the data into R workspace is to use the command “read.table”. This command will import the data from your folder into the workspace. Type the following line in your code and execute it.
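Based on the file name and the options described below, the line would look something like this:

```r
# import the data file into the workspace;
# header = TRUE tells R that the first row contains the column names
nyctemperature = read.table("nyc_temperature.txt", header = TRUE)
```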

Notice that I am giving the file name in quotes. I am also telling R that there is a header (header=TRUE) for the data file. The header is the first row of the data file; it contains the names of the columns. If there is no header in the file, you can choose header=FALSE.

Once you execute this line, you will see a new name (nyctemperature) appearing in the environment space (right panel). We have just imported the data file from the “lesson5” folder into R.

Step 6: Use the data to answer the questions
Let us go back to the original questions: how many days in the last month were warm, and how often is Saturday cold?

Let us call data for the months of January and February as the sample space S. S is the set of all data for January and February. Type the following lines in your code to define sample space S.
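Since the file holds exactly the January and February data, one way to define S (the original post showed the code as an image, so this is my reconstruction) is:

```r
# the sample space S: the full table of January and February data
S = nyctemperature

# check the size of S (number of rows and columns)
dim(S)
```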

Notice that S is a table/matrix with rows and columns. In R, S[1,1] is the element in the first row and first column. S[1,7] is the element in the first row and seventh column, i.e. the average temperature data for the first day. If you want to choose the entire first row, you can use S[1, ] (1 followed by a comma followed by space within the square brackets). If you want to select an entire column, for instance, the average temperature data (column 7), you can use S[ ,7] (a space followed by a comma followed by the column number 7).

To address the first question, we should identify warm days in February. We need to define a set A for all February data, and a set B for warm days.

Recall lesson 4 and check whether A and B are subsets of S.

Type the following lines in your code to define set A.
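Going by the explanation that follows, the line for set A would be:

```r
# set A: the row numbers of all February days (month column == 2)
A = which(S[ ,2] == 2)
```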

S[ ,2] is selecting the second column (month) from the sample space S. The “which(S[ ,2]==2)” will identify which rows in the month column are equal to 2, i.e. we are selecting the February days. Notice that A will give you numbers 32 to 59, the 32nd row (February 1) to 59th row (February 28) in the data.

Next, we need to define set B as the warm days. For this, we should select a criterion for warm days. For simplicity, let us assume that a warm day is any day with an average temperature greater than or equal to 50 degrees F. Let us call this set B = set of all warm days. Type the following lines in your code and execute to get set B.
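Following the same pattern, the line for set B would be:

```r
# set B: the row numbers of warm days (average temperature >= 50 degrees F)
B = which(S[ ,7] >= 50)
```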

S[ ,7] is selecting the 7th column (average temperature) from the sample space S. The “which(S[ ,7]>=50)” will identify which rows have an average temperature greater than or equal to 50. Notice that B will give you the numbers 12, 26, 39, 50, 53, 54, 55, 56, and 59; the rows (days) when the average temperature is greater than or equal to 50 degrees F. February 28th, 2017 had an average temperature of 53 degrees F. I believe it was that day when my friend expressed his unhappiness about the winter being warm!

Now that we have set A for all February data, and set B for warm days, we need to identify how many elements are common in A and B; what is the intersection of A and B. The intersection will find the elements common to both the sets (Recall Lesson 4 – intersection = players interested in basketball and soccer). Type the following line to find the intersection of A and B.
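Assuming A and B have been created in the steps above, the line would be (the variable name AB is my choice):

```r
# the intersection of A and B: February days that are also warm
AB = intersect(A, B)
AB
```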

The “intersect” command will find the common elements of A and B. You will get the numbers 39, 50, 53, 54, 55, 56, and 59: the warm days in February. Seven warm days last month, worthy of my friend’s displeasure.

Can you now tell me how often is Saturday cold based on the data we have? Assume cold is defined as an average temperature less than or equal to 25 degrees F.

Did you realize that I am just whining about Saturday being cold? Check out the “union” command before you sign off.
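As a hint for that sign-off exercise, here is a small self-contained sketch of how “intersect” and “union” behave on plain vectors. The vectors here are made up for illustration; they are not from the temperature data.

```r
A = c(32, 33, 34, 39, 50)   # e.g., row numbers satisfying one condition
B = c(12, 26, 39, 50, 53)   # e.g., row numbers satisfying another condition

intersect(A, B)   # elements common to both sets: returns 39 50
union(A, B)       # elements in either set: returns 32 33 34 39 50 12 26 53
```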

 


Lesson 2 – R is your companion

R is your companion in this journey through data. It is a software environment that performs the analysis and creates plots for us upon instruction. People who are familiar with computer programming need no introduction. Others who are just getting started with data analysis but are skeptical about computer programming: try counting the number of 1’s in this data by hand. You will need no more convincing to embrace R.

You can follow these steps to install R and its partner RStudio.

1) Installing R: Use this link https://cran.r-project.org to download and install R on your computer. There are separate download links for Mac, Windows or Linux users.

2) Installing RStudio: Use this link https://www.rstudio.com/products/rstudio/download/ to download and install RStudio on your computer. RStudio will be your environment for coding and visualizations. For RStudio to run, you need to install R first, which is why we followed step 1.

Getting started with RStudio
Open RStudio. You should see three panels: the console, the environment and history panel, and the files panel.

In the console panel, you will get a copyright message and a “>” prompt. Type 1+2 and hit enter here to check that RStudio is working. You should see the number 3 pop up, which means you are ready to go.

Writing your code
You can use a text editor to give instructions to RStudio (writing your code) and save those instructions (saving your code). In the text editor, you can also write comments to help you remember the steps and make the code user-friendly and readable. As a simple example, imagine you want to add 2 numbers, 1 and 2, and you want to write instructions for this. You can type
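Given the description below of a comment line followed by the instruction, the two lines would be:

```r
# adding two numbers
1 + 2
```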

The first line starts with a #; it is a comment — for people to know that you are adding 2 numbers.

The second line is the actual code or instruction to add 1 and 2.

Remember, a well-written code with clear comments is like having good handwriting.

Opening the text editor
Once you load RStudio, go to “File >> New >> R script”.
You will now see the text editor as the 4th panel in RStudio. You can save the code using the “save” button or by using “Ctrl+S”. Use .R as the extension — “code.R”.

Some simple code
Type the following lines in your code and execute them by clicking the “Run” button on the top right.
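The original post showed these lines as an image; here are a few illustrative lines in the same spirit (these particular examples are my reconstruction, not the original code):

```r
# store numbers in variables and add them
a = 1
b = 2
a + b

# create a vector of numbers and compute its average
x = c(2, 4, 6, 8)
mean(x)
```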

Shortcut: You can also place the cursor on the line and hit “Ctrl+Enter” if you are using Windows or “Cmd+Enter” if you are using a Mac. Notice the results of your code in the console.

I have more simple instructions for practice here. Copy them to your code and have fun coding.

We will learn data analysis using RStudio. I will provide more coding tricks as we go along.

 
