Applied Statistics with R

TOURO COLLEGE COURSE SYLLABUS

LANDER COLLEGE

DEPARTMENT: Mathematics

COURSE TITLE: Applied Statistics with R

COURSE NUMBER: MAT 266

PREREQUISITES: MAT 261 Introduction to Statistics

CREDIT HOURS: 3

DEVELOPER: Dr. Kaganovskiy

LAST UPDATE: 2/4/2018

COURSE DESCRIPTION

This course introduces students to the widely applicable contemporary field of Applied Statistics, with particular emphasis on the free software package R, which offers a quick and precise way to solve statistical problems. The discipline is an integral part of contemporary Natural Science as well as Economics, Finance, Biology, and Medicine. We focus on how to use R to solve real-world problems. The course is at the intermediate level; the only prerequisites are a basic knowledge of Math (not including Calculus), basic Statistics, and basic Computing. The statistical topics included in the course are Data Analysis, Confidence Intervals, Hypothesis Testing, Goodness of Fit, Linear Correlation and Regression, Analysis of Variance, and Non-parametric Methods.

COURSE/DEPARTMENTAL OBJECTIVES The student will:
 * Learn about the R package in relation to solving real-world problems.
 * Learn about applications of Statistics to problems in Medicine, Biology, Ecology, etc.
 * Learn about modeling techniques.

COURSE/INSTITUTIONAL OBJECTIVES

This course is intended to teach students the basic concepts of statistical inference using a computer. This should further the professional and pre-professional career interests of students in the fields of science and business. Goals include fostering analytical and quantitative thinking and the ability to solve problems and interpret data.

HARDWARE/SOFTWARE/MATERIALS REQUIREMENTS:

Freely available R computational packages.

COURSE REQUIREMENTS

Homework assignments. Midterm and final exams.

Individual and group projects.

GRADING GUIDELINES

Students must turn in regular homework as well as longer and more complex projects. Grades are based on the weighted average of the grades for projects, homework, and two exams.

METHODOLOGY

Classroom lectures and assigned homework problems.

COURSE TEXT

Gareth James et al., An Introduction to Statistical Learning http://www.springer.com/us/book/9781461471370

Mosaic Student Guide for R

Additional Textbooks:

 * A. Field and J. Miles, Discovering Statistics Using R.
 * Pace, Beginning R: An Introduction to Statistical Programming, 2nd Ed.
 * B. Shahbaba, "BioStatistics with R". Professor website: [] Book website: []
 * J. Verzani, "Using R for Introductory Statistics".

**HW for Mosaic Guide:**

Mosaic Student Guide HW Ch3

Read about the CPS85 file with the ?CPS85 command. Investigate the numerical variable wage by running it through all the summaries in Ch 3. When a categorical variable is needed, use the factor variable sector. Make sure to explain each picture and/or summary; I will ask for explanations on the exam.

Mosaic Student Guide HW Ch4

Investigate the categorical variables sex and sector (use sector when more than 2 categories are needed) by running them through all the summaries in Ch 4. Make sure to explain each picture and/or summary.

Mosaic Student Guide HW Ch5

Again use the CPS85 file. Do not filter to just females; use the whole file. Investigate correlation and regression between the numerical variables wage and educ by running them through all the summaries in Ch 5. When a 3rd numerical variable is needed, use exper. When a categorical variable is needed, use sex. Make sure to explain each picture and/or summary.

Mosaic Student Guide HW Ch6

Investigate 2 by 2 tables and Chi-Squared tests between the categorical variables married and sex in CPS85 by running them through all the summaries in Ch 6. Make sure to explain each picture and/or summary.

Mosaic Student Guide HW Ch7

Investigate numerical output vs categorical input for wage vs sex in CPS85. For the ANOVA model use wage vs sector. Run them through all the summaries in Ch 7 (NOTE: do not create the new variable subgr as in the text, and remove "subgr" from the Tukey commands). Make sure to explain each picture and/or summary.


**HW for Gareth Book:** (solutions are at http://blog.princehonest.com/stat-learning/)

Gareth et al. Statistical Learning, My HW

Ch3: 1-4, 8, 9, 10, 14
Ch4: 10, 11, 13
Ch5: 3, 4, 6, 8, 9
Ch6: 3, 4, 8, 9

Gareth exam based on book code HW Ch 3:

Use the Ch3 code to investigate multivariable regression for wage vs the other variables in the CPS85 (**library(mosaic)**) file. First do a simple regression of wage vs educ, then add the age predictor. Then do a multivariable regression of wage vs educ+age+exper. Why does the multivariable regression of wage on all other variables look strange? (Hint: investigate the vif numbers for the predictors.) Investigate the interaction of educ with age and of educ with sector. Consider a nonlinear transformation of wage vs educ. Consider regression on the qualitative variable sector and its interaction with educ.

Ch 4: Use the CPS85 (**library(mosaic)**) file. Investigate the correlation between the numerical variables in this file. Create the new factor variable wageF, which marks wage as below ("low") or above ("high") the median:

wageF = rep(0, length(CPS85$wage)); wageF
wageF[CPS85$wage > median(CPS85$wage)] = 1
wageF = factor(wageF, levels=c(0,1), labels=c("low","high")); wageF
CPS85 = data.frame(CPS85, wageF)
summary(CPS85)

Perform glm on wageF vs the other variables, but not wage and not age. WHY? NOTE: replace all hard-coded parts of the book code with proper numbers or commands. To produce the test and train subsets use:

dim(CPS85)
n = dim(CPS85)[1]; n
train = sample(1:n, n/2); train
length(train)
CPS85.train = CPS85[ train, ]
CPS85.test = CPS85[-train, ]
wageF.train = wageF[ train]
wageF.test = wageF[-train]

Perform glm of wageF vs only the variables found to be significant in the previous step. Again, NOTE: replace all hard-coded parts with proper numbers or commands.
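The Ch 4 workflow (fit a logistic model on a training set, then score it on held-out data) can be sketched on a different data set. This is only an illustration, not the assignment's solution. It assumes the MASS package is installed and uses its Pima.tr and Pima.te data frames, which already come split into training and test sets; the predictors glu and bmi and the 0.5 cutoff are arbitrary choices made for the example.

```r
library(MASS)  # assumption: MASS is installed (it ships with standard R)

# Fit a logistic regression on the training set; type is a No/Yes factor,
# so the model describes the probability of "Yes"
fit <- glm(type ~ glu + bmi, data = Pima.tr, family = binomial)
summary(fit)$coefficients

# Predicted probabilities of "Yes" on the held-out test set
prob <- predict(fit, newdata = Pima.te, type = "response")
pred <- factor(ifelse(prob > 0.5, "Yes", "No"), levels = levels(Pima.te$type))

# Confusion matrix and overall test accuracy
conf <- table(predicted = pred, actual = Pima.te$type)
conf
accuracy <- mean(pred == Pima.te$type)
accuracy
```

The same pattern applies when the split is made by hand with sample(): fit on the rows in train, predict on the rows not in train, and compare predictions with the held-out factor.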

Ch 5: Use the CPS85 file. First perform the steps for choosing the proper model for wage vs exper. For basic bootstrapping, create a plot and CI for the sum of education and experience. For the bootstrapping section "Estimating the Accuracy of a Linear Regression Model", again use the linear model for wage vs exper and the model for wage that is quadratic in experience.

Ch 6: Use the CPS85 file. Investigate regsubsets, including the forward and backward techniques. FYI, the folds code is a bit tricky. Perform ridge and lasso regressions. What is the difference between them, and what are the advantages of lasso? Perform Principal Components Regression and Partial Least Squares; make sure to choose the new optimal number of components.


**HW for Pace book:**

Ch 1

1) Create variables x and y with numbers of your choice and illustrate algebraic operations with them. Create a vector xv of numbers from -3 to 6 using the colon operator and illustrate arithmetic operations with it. Use xv = seq(-10,10,by=5) to create a vector with a different step, cube its elements, and sum them using the sum function. Create several strings and combine them into a list.
2) Let our small data set be 2 5 4 10 8. Enter this data into a vector x using c. Find the square of each number. Subtract 6 from each number. Subtract 9 from each number, square the answers, and average them.
3) Make a vector, z, containing a sequence of 10 randomly generated men's heights using a normal distribution with a mean of 69 and a standard deviation of 2.5. Perform operations on it similar to section 1.3.1. Replace one element by NA and compute the mean with and without this NA element.
4) Produce a 3 by 3 matrix with numbers of your choice and carry out operations similar to section 1.3.4.
5) Create a ragged (different size) list of 4 data vectors and find the mean and sd of each element, similar to 1.3.5.

Ch 2

1) Work with today's date similar to section 2.1.
2) Produce a small text file with a quote of your choosing and manipulate it as in 2.2, replacing part of it.
3) The average distance from the center is computed by (|x1 − xbar| + · · · + |xn − xbar|)/n, where xbar is the mean of the data vector. Compute this for the rivers data set (library(UsingR)) using the function sum to add the values and abs to find the absolute value.
4) Use summary to investigate Pima.tr (from library(MASS)); pay particular attention to the last variable, called type (Diabetes). Select subsets of column variables 3 to 5 (also using the names of the variables in subset), then rows 10 to 20, and finally rows with bmi more than 30 and blood pressure more than 85. Select rows satisfying bmi > 29 and display the corresponding glu, age, and type. Use which to discover rows of super-obese women with bmi > 45 and display the corresponding information from the whole data frame. Produce a histogram of bmi. Add a new variable to the Pima.tr data frame called obese, which is TRUE if bmi > 30 and FALSE otherwise.
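The average-distance formula and the row-selection idioms from the Ch 2 problems can be sketched on small made-up data. The vector x and the data frame df below are hypothetical and chosen so the arithmetic is easy to check by hand; they are not the rivers or Pima.tr data.

```r
# Average absolute distance from the mean: (|x1 - xbar| + ... + |xn - xbar|) / n
x <- c(2, 4, 6, 8)
xbar <- mean(x)                               # 5
avg_dist <- sum(abs(x - xbar)) / length(x)    # deviations are 3, 1, 1, 3
avg_dist                                      # 2

# Logical subsetting, which(), and adding a derived column
df <- data.frame(id = 1:5, bmi = c(22, 31, 47, 28, 35))
df[df$bmi > 30, ]                 # rows with bmi above 30
big <- which(df$bmi > 45)         # row index of bmi above 45
df[big, ]
df$obese <- df$bmi > 30           # TRUE/FALSE column
df
```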

Ch4

1) Fibonacci numbers are 0 1 1 2 3 5 8 13 21 34; each number after the first two is the sum of the preceding two: f(n) = f(n-1) + f(n-2). (a) Write a loop computing the first 20 Fibonacci numbers. (b) Write a loop computing Fibonacci numbers until the number goes over 10000.
2) (a) Create nested for loops to compute the sum of the part above the main diagonal of a square matrix with entries 1 to N^2, where N by N are the dimensions of the matrix. (b) Create a while loop to sum the entries of a vector 1^2, 2^2, 3^2, ..., N^2 until N^2 is reached.
3) Produce a data frame of random normally distributed numbers using the command A = data.frame(replicate(10, rnorm(20))). Find the column sums using both apply and colSums.
4) Use tapply in the mtcars data file to compute means of mpg by am, by cyl, and by both am and cyl. Do the same using the aggregate function.
5) In section 4.3, test RejectNull on different numbers and explain the results. Create an ifelse command which computes log for a vector of numbers including 0's and negative numbers, similar to the sqrt code in 4.3.
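Group-wise summaries with tapply and aggregate, and column sums with apply and colSums, can be sketched as follows. The example deliberately uses the built-in ToothGrowth data set and a smaller random data frame rather than the ones named in the problems; the seed is arbitrary.

```r
# ToothGrowth is built in: len (numeric), supp (factor), dose (numeric)
m1 <- tapply(ToothGrowth$len, ToothGrowth$supp, mean)              # means by one factor
m2 <- tapply(ToothGrowth$len,
             list(ToothGrowth$supp, ToothGrowth$dose), mean)       # means by two factors
a1 <- aggregate(len ~ supp, data = ToothGrowth, FUN = mean)        # same as m1, as a data frame
a2 <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean) # same cells as m2, long format
m1; m2; a1; a2

# Column sums two ways
set.seed(1)
A <- data.frame(replicate(3, rnorm(5)))
s1 <- apply(A, 2, sum)   # apply over margin 2 (columns)
s2 <- colSums(A)
all.equal(s1, s2)
```

tapply returns a vector or matrix indexed by the factor levels, while aggregate returns a data frame with one row per group, which is often easier to pass on to plotting functions.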

Ch5

1) Recode Ch4 HW 1 as a function taking the number of Fibonacci numbers, or the limit 10000 from problem 1(b), as user input. Test this function.
2) Recode Ch4 HW 2 as a function taking the number of terms or the error as user inputs and test this function.

Ch 6

1) The probability that a student is accepted to a prestigious college is 0.3. If 10 students from the same school apply, what is the probability that exactly 4 are accepted? What is the probability that 4 or fewer are accepted? What is the probability that 5 or more are accepted?
2) Suppose the average number of lions seen on a 1-day safari is 5. What is the probability that tourists will see exactly 4 lions, 4 or fewer, 5 or more?
3) The National Heart, Lung, and Blood Institute defines the following categories based on Systolic Blood Pressure (SBP):
 * Normal: SBP ≤ 120.
 * Prehypertension: 120 < SBP ≤ 140.
 * High blood pressure: SBP > 140.
If SBP in the US has a normal distribution such that SBP ∼ N(mean = 125, var = 15^2), (a) Find the probability of each group. (b) Find the intervals that include 68, 95, and 99.7% of the population. (c) What are the lower and upper tail probabilities for SBP equal to 115? (d) Repeat question (c) for the sample mean of 20 patients.
4) (a) Find the probability under the t-distribution of t < 2 and of t > 3 with df = 10. (b) Find the 5th and 95th quantiles with df = 10.
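The d/p/q families of distribution functions used for these problems can be sketched with parameters different from the homework. All numbers below are made up for illustration. Note that the normal functions take the standard deviation, not the variance.

```r
# Binomial: X ~ Bin(n = 5, p = 0.5)
p_eq2 <- dbinom(2, size = 5, prob = 0.5)   # P(X = 2)  = 10/32
p_le2 <- pbinom(2, size = 5, prob = 0.5)   # P(X <= 2) = 16/32
p_ge3 <- 1 - p_le2                         # P(X >= 3)

# Poisson: Y ~ Pois(lambda = 2)
q_eq0 <- dpois(0, lambda = 2)              # exp(-2)
q_le1 <- ppois(1, lambda = 2)              # exp(-2) * (1 + 2)

# Normal: Z ~ N(mean = 100, sd = 10)
z_low <- pnorm(90, mean = 100, sd = 10)                # lower tail
z_up  <- pnorm(90, mean = 100, sd = 10, lower.tail = FALSE)
z_mid <- pnorm(110, 100, 10) - pnorm(90, 100, 10)      # within 1 sd, about 0.68
z_int <- qnorm(c(0.025, 0.975), mean = 100, sd = 10)   # central 95% interval

# The mean of n = 25 observations has sd = 10 / sqrt(25)
zbar_low <- pnorm(90, mean = 100, sd = 10 / sqrt(25))

# Student t with 10 degrees of freedom
t_lt <- pt(2, df = 10)                      # P(t < 2)
t_gt <- pt(3, df = 10, lower.tail = FALSE)  # P(t > 3)
t_q  <- qt(c(0.05, 0.95), df = 10)          # 5th and 95th quantiles
```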

Ch 7

1) In birthwt (from library(MASS)) investigate whether race is equally distributed. Note that, although not required here, the original file has race as a number, while it should be made a factor with birthwt$race = factor(birthwt$race, levels=c(1,2,3), labels=c("White","AfrAmer","Other")). Use table to see the frequencies.
2) Professor Bumblefuss takes a random sample of students enrolled in Statistics 101. There are 25 freshmen, 32 sophomores, 18 juniors, and 20 seniors. Test the null hypothesis that freshmen, sophomores, juniors, and seniors are equally represented. Professor Iconoclast argues that Bumblefuss is wrong, and that the number of freshmen and the number of sophomores enrolled are each twice the number of juniors and the number of seniors. The expected probabilities are now 1/3, 1/3, 1/6, and 1/6. Test Iconoclast's hypothesis.
3) In the built-in data set survey (library(MASS)), the Smoke column records the students' smoking habit, while the Exer column records their exercise level. The allowed values in Smoke are "Heavy", "Regul" (regularly), "Occas" (occasionally) and "Never". As for Exer, they are "Freq" (frequently), "Some" and "None". Tally the students' smoking habit against the exercise level with the table function in R and run a chi-squared test of independence. Also use the CrossTable function and interpret the results.
4) The following data show the effectiveness of the antidepressant Celexa in the treatment of compulsive shopping. Conduct an independence test. Also use the CrossTable function and interpret the results.

outcome by treat:
treat     worse   same   better
Celexa    28      31     71
placebo   31      84     22
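The two uses of chisq.test in this chapter (goodness of fit against stated probabilities, and independence in a two-way table) can be sketched with made-up counts that are easy to verify by hand. The counts and labels below are hypothetical, not the homework data.

```r
# Goodness of fit with equal probabilities (the default)
obs <- c(A = 20, B = 30)
gof <- chisq.test(obs)
gof$statistic        # (20-25)^2/25 + (30-25)^2/25 = 2

# Goodness of fit with unequal hypothesized probabilities
gof2 <- chisq.test(c(30, 10, 20), p = c(1/2, 1/4, 1/4))
gof2$expected        # 30, 15, 15
gof2$statistic       # 0 + 25/15 + 25/15 = 10/3

# Independence in a 2 x 2 table (matrix is filled column by column)
tab <- matrix(c(10, 20, 30, 40), nrow = 2,
              dimnames = list(group = c("g1", "g2"), outcome = c("no", "yes")))
ind <- chisq.test(tab, correct = FALSE)   # no continuity correction
ind$expected         # row total * column total / grand total
ind$p.value
```

For tables larger than 2 x 2 the correct argument has no effect, so chisq.test(table(f1, f2)) on two factors is enough.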

Ch 8

1) In birthwt (from library(MASS)) find the mean, median, and mode of the birth weight. Produce a histogram with the mean and median marked, similar to page 75. Compute the variance, standard deviation, and cv. Produce a boxplot.
2) Design a small set of numbers in x = c(...) with an outlier which illustrates well the difference between mean and median.
3) Apply basicStats to a selection of columns in the birthwt data file.
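Base R has no built-in function for the statistical mode or the coefficient of variation, so both need a line or two of code. A minimal sketch on a made-up vector follows; stat_mode is a hypothetical helper name introduced here, not a function from the textbook or from any package.

```r
x <- c(3, 5, 5, 6, 9)

# Hypothetical helper: most frequent value(s).
# (The base function mode() reports the storage type, not the statistical mode.)
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}

mean(x)             # 5.6
median(x)           # 5
stat_mode(x)        # 5
var(x); sd(x)
cv <- sd(x) / mean(x)   # coefficient of variation
cv
```

For continuous measurements every value may occur once, in which case stat_mode returns all of them; rounding or binning first is a common workaround.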

Ch 9

1) In mtcars (built into R's datasets package) produce bar and pie plots of the cyl variable. Although it is not required, it is better to turn cyl into a factor variable with the command: cylF = factor(mtcars$cyl,levels=c(4,6,8),labels=c("Four","Six","Eight"))
2) Produce a boxplot of mpg by cylF in base R and in ggplot; produce a boxplot of just mpg in base R and in ggplot.
3) Produce a histogram, dotplot, frequency polygon, and smoothed density plot of mpg. Experiment with different binwidths. Break the frequency polygon plot down by cylF.
4) Produce a scatterplot with a least squares line and shaded confidence region for mpg vs car weight. Use transparency and jitter to separate points.
5) Produce a hexbin picture similar to the book for price vs depth of diamonds.

Exam 1 prep focus:
 * Chapter 1: numbers 1 and 2
 * Chapter 2: question 4 on the homework
 * Chapter 6: numbers 1, 2, 3
 * Chapter 7: numbers 1, 2, 4
 * Chapter 8: numbers 1 and 3
 * Chapter 9: numbers 1, 2, 3, 4, 5

Ch 10

1) Produce two variables x and y for manual vs automatic transmission mpg. Perform a variance comparison test, a means t-test, and a means Wilcoxon test. Repeat the above tests in the more natural mpg~am notation. You have to use both confidence interval and p-value conclusions. Make sure to look at histograms and boxplots of these dependencies and remark on how they support your findings. Use aggregate functions to find the means and standard deviations of mpg by transmission.
2) (a) Listed below are times (seconds) that animated Disney movies showed the use of tobacco and alcohol (each column corresponds to a movie). Use a 0.05 significance level to test the claim that they are different.
Tobacco use (sec): 176 51 0 299 74 2 23 205 6 155
Alcohol use (sec): 88 33 113 51 0 3 46 73 5 74
Explain how the corresponding confidence interval confirms the hypothesis test above.
(b) Now assume that the same data are NOT paired. Repeat the hypothesis test and CI analysis from part (a).
3) Among 100 people going on the diet, 68 showed marked improvement. Investigate whether this warrants use of the diet from the point of view of statistical significance.
4) In a test of the effectiveness of Echinacea against cold viruses, 35 out of 50 subjects treated with Echinacea developed the virus. In the placebo group, 78 out of 100 subjects developed the virus. Use a 0.05 significance level to test the claim that Echinacea has an effect, i.e. that the proportion for the drug group is less than the proportion for the placebo group. Explain how the corresponding confidence interval confirms the hypothesis test above.
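The tests named in this chapter map onto a handful of base R functions. The sketch below runs them on simulated data (the seed, sample sizes, and counts are arbitrary assumptions) and shows how a two-sided 95% confidence interval and the p-value at the 0.05 level always lead to the same conclusion.

```r
set.seed(42)
x <- rnorm(12, mean = 10, sd = 2)
y <- rnorm(12, mean = 12, sd = 2)

var.test(x, y)                         # F test comparing two variances
tt <- t.test(x, y)                     # Welch two-sample t-test (unpaired)
wilcox.test(x, y)                      # rank-based alternative
paired <- t.test(x, y, paired = TRUE)  # paired version: tests mean of x - y

# The 95% CI for the difference excludes 0 exactly when p < 0.05
ci <- tt$conf.int
excludes_zero <- ci[1] > 0 || ci[2] < 0
agree <- (tt$p.value < 0.05) == excludes_zero
agree

# One-sample and two-sample proportion tests
one_prop <- prop.test(x = 40, n = 60, p = 0.5)
two_prop <- prop.test(x = c(30, 45), n = c(60, 60), alternative = "less")
two_prop$estimate                      # sample proportions 0.50 and 0.75
two_prop$p.value
```

With a one-sided alternative such as "less", the reported confidence interval is also one-sided, so the check is whether its upper bound is below 0.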

Ch 11

1) Apply the t.test, wilcox.test, and yuen tests to the data: group1 <- rexp(10,rate=1); group2 <- rexp(6,rate=3). Note that you have to use and explain all the steps used in my extended code, including the data frame format approach, etc.
2) Apply the bootstrapping code to the data myData <- rexp(1000, 10). Make sure to be able to explain the results of each step.
3) Apply a permutation test to the data frame created in problem 1; again explain the results of each step.
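The ideas behind bootstrapping and permutation testing fit in a few lines of base R. This is a generic sketch under stated assumptions (a percentile bootstrap interval for the mean and a two-sided permutation test on the difference in means); it is not the extended code used in class, and the data, seed, and number of resamples B are made up.

```r
set.seed(123)
B <- 2000                                  # number of resamples (arbitrary)

# Percentile bootstrap CI for the mean
x <- rexp(50, rate = 2)
boot_means <- replicate(B, mean(sample(x, replace = TRUE)))
boot_ci <- quantile(boot_means, c(0.025, 0.975))
boot_ci

# Permutation test for a difference in means between two small groups
g1 <- c(5.1, 4.8, 6.0, 5.5, 5.9)
g2 <- c(4.2, 4.9, 4.4, 4.0)
obs_diff <- mean(g1) - mean(g2)
pooled <- c(g1, g2)
perm_diff <- replicate(B, {
  idx <- sample(length(pooled), length(g1))  # random relabelling of the groups
  mean(pooled[idx]) - mean(pooled[-idx])
})
# Two-sided p-value: share of relabellings at least as extreme as observed
p_value <- mean(abs(perm_diff) >= abs(obs_diff))
p_value
```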

Ch 12

1) Load the anorexia data set from the MASS package. This data set was collected to investigate the effectiveness of different treatments (Treat) on increasing weight for young female anorexia patients. Create a new variable called Difference by subtracting the weight of the patient before the study period (Prewt) from her weight after the study period (Postwt). Use a boxplot and a plot of means to visualize how this variable changes depending on the type of treatment. Use the aov table and summary.lm ANOVA to investigate whether the type of treatment makes a difference in the amount of weight gain. Make sure to use TukeyHSD to interpret which particular pairwise comparisons are important. Use plot to make sure that the ANOVA assumptions hold. Use tapply and aggregate to compare the means and sd's.
2) The data set cabbages for this example is available from the MASS package. In this data set, two different cultivars were planted on three different dates, denoted as d16, d20, or d21. The variable Date is a factor that specifies the planting date for each cabbage. Use two-way ANOVA to evaluate the relationship between the vitamin C content and cultivars while controlling for the effect of planting dates. Use tapply and aggregate to compare the means and sd's. Use the aov table and summary.lm.
3) Run the following commands (explained in class). This will first create a wide format of repeated measures data, and then change it to the proper long format. Conduct repeated measures ANOVA.

groceries = data.frame(
  c(1.17,1.77,1.49,0.65,1.58,3.13,2.09,0.62,5.89,4.46),
  c(1.78,1.98,1.69,0.99,1.70,3.15,1.88,0.65,5.99,4.84),
  c(1.29,1.99,1.79,0.69,1.89,2.99,2.09,0.65,5.99,4.99),
  c(1.29,1.99,1.59,1.09,1.89,3.09,2.49,0.69,6.99,5.15))
rownames(groceries) = c("lettuce","potatoes","milk","eggs","bread","cereal","ground.beef","tomato.soup","laundry.detergent","aspirin")
colnames(groceries) = c("storeA","storeB","storeC","storeD")
groceries
gr2 = stack(groceries)
gr2$subject = rep(rownames(groceries), 4) # create the "subject" variable
gr2$subject = factor(gr2$subject) # "I declare subject to be a factor."
colnames(gr2) = c("price", "store", "subject") # rename the columns
gr2

4) Use command data1 = read.csv("Ch12RepeatedData.csv",header=T) to read in data file (emailed) and analyze using mixed model ezANOVA.
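The one-way ANOVA steps named in problem 1 (aov table, Tukey comparisons, group means and sd's) can be sketched on the built-in PlantGrowth data set instead of the homework data. This is an illustration of the calls only; the diagnostic plots are left commented out so the block runs without opening a graphics device.

```r
# PlantGrowth is built in: weight (numeric), group (factor: ctrl, trt1, trt2)
fit <- aov(weight ~ group, data = PlantGrowth)
tab <- summary(fit)        # ANOVA table: Df, Sum Sq, Mean Sq, F value, Pr(>F)
tab
summary.lm(fit)            # same model viewed as a regression on group dummies

tuk <- TukeyHSD(fit)       # all pairwise differences with adjusted p-values
tuk

# Group means and standard deviations two ways
means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)
sds   <- aggregate(weight ~ group, data = PlantGrowth, FUN = sd)
means; sds

# Diagnostics for the ANOVA assumptions (residuals vs fitted, Q-Q plot, ...)
# plot(fit)
```

A pair in the TukeyHSD output is flagged as different when its interval (lwr, upr) does not contain 0, which matches an adjusted p-value below 0.05.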

Ch 13

1) Use linear regression to examine the relationship between mpg and wt in mtcars. Plot the scatterplot and regression line with basic R, and the 95% confidence interval for the regression line using ggplot. Interpret the estimate of the regression coefficient (slope) and examine its statistical significance. Find the 95% confidence interval for the regression coefficients using confint(model). For cars with wt equal to 2, 3, and 4 (wt is measured in 1000 lbs), estimate mpg (include the CI) using predict. Explain the meaning of //R//^2 and check that it is equal to the square of the sample correlation coefficient. Create diagnostic plots both in basic R and in ggplot and identify possible outliers. Employ a correlation test to tell if the correlation is significant.
2) Create variables x and y using x = seq(4,20,0.5); y = 2*x^3 + rnorm(length(x),mean=0,sd=10); y. Investigate the y~x and log(y)~log(x) linear models. Remark on the significance of the coefficients vs the actual plot fit in both cases.
3) Create variables x and y using x = seq(-5,5,0.5); y = 2*x^2 + 4*x-5 + rnorm(length(x),mean=0,sd=10); y. Investigate y~x and the polynomial model using I(x^2). Remark on the significance of the coefficients vs the actual plot fit in both cases. Compare the two models using AIC and anova.
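The regression calls mentioned in problem 1 can be sketched on the built-in cars data set (stopping distance vs speed) rather than mtcars. The prediction speeds below are arbitrary values chosen for illustration.

```r
# cars is built in: speed (mph), dist (stopping distance, ft)
m <- lm(dist ~ speed, data = cars)
coef(summary(m))                 # estimates, standard errors, t values, p-values
ci_coef <- confint(m)            # 95% CIs for intercept and slope
ci_coef

# Predicted mean response with a 95% confidence interval at chosen speeds
new <- data.frame(speed = c(10, 15, 20))
pred <- predict(m, newdata = new, interval = "confidence")
pred

# In simple regression, R^2 equals the squared sample correlation
r2 <- summary(m)$r.squared
r  <- cor(cars$speed, cars$dist)
all.equal(r2, r^2)

# Test whether the correlation differs from 0
ct <- cor.test(cars$speed, cars$dist)
ct$p.value
```

In simple regression the p-value of cor.test equals the p-value for the slope, since both are based on the same t statistic.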

Ch 14

1) Use __all the steps__ of the multiple regression methods in Ch 14 to investigate the dependence of mpg on the wt, qsec, and am variables in the mtcars file. Include plot(m) to assess the assumptions of the model after each change in the model. NOTE: use the code below to rename the data file and set the am variable as a factor (in mtcars, am = 0 is automatic and am = 1 is manual):

data1 = mtcars
data1$am = factor(data1$am, levels=c(0,1), labels=c("automatic","manual"))

Ch 16

1) Redo problem (2) from HW 10 about Disney movies, applying both parametric and non-parametric tests. (Hint: first create a data frame.) Make sure to do both the paired and unpaired versions (Hint: the unpaired version uses code from the beginning of Ch 11).
2) The data file grades.csv (emailed) contains two columns. The first is grades in a Statistics course and the second is grades on a Math test. Investigate the correlation between them. Is the Pearson test appropriate? Which one should you use?
3) Repeat the Kruskal (kruskal.test) and one-way (oneway.test) tests for disp vs number of cylinders in mtcars. Compare to the parametric aov test.
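The non-parametric calls from this chapter can be sketched on data other than the homework files. The example below uses the built-in airquality data set for the group comparisons and a small made-up pair of vectors for the correlation; the vectors are chosen so that y increases with x but not linearly.

```r
# Group comparison: Ozone by Month in the built-in airquality data (months 5 to 9)
kw <- kruskal.test(Ozone ~ Month, data = airquality)        # rank-based
ow <- oneway.test(Ozone ~ Month, data = airquality)         # Welch, unequal variances
av <- summary(aov(Ozone ~ factor(Month), data = airquality)) # classical ANOVA
kw$p.value; ow$p.value; av

# Correlation: Pearson measures linear association, Spearman monotone association
x <- c(1, 2, 3, 4, 5, 6)
y <- c(1, 4, 9, 16, 25, 200)          # monotone in x, with one extreme value
pe <- cor.test(x, y, method = "pearson")
sp <- cor.test(x, y, method = "spearman")
pe$estimate                            # below 1: the relationship is not linear
sp$estimate                            # exactly 1: the ranks agree perfectly
```

Month is stored as a number in airquality, so it is wrapped in factor() for aov; kruskal.test and oneway.test treat the right-hand side as a grouping variable on their own.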