BioStatistics


**TOURO COLLEGE COURSE SYLLABUS**

**LANDER COLLEGE**

**DEPARTMENT:** Mathematics

**COURSE TITLE:** BIO Statistics

**COURSE NUMBER:** MAT 510

**PREREQUISITES:** MAT 120 (PreCalculus)

**CREDIT HOURS:** 3

**DEVELOPER:** Dr. Kaganovskiy

**LAST UPDATE:** 9/10/15


**__ COURSE DESCRIPTION __**

This course introduces students to Biostatistics, a widely applicable contemporary field, with particular emphasis on the free software package R, which provides a quick and precise way to solve statistical problems. The discipline is an integral part of contemporary Medicine, Pharmacology, Biology, Ecology, Agriculture, and related fields. We focus on how to use R to solve real-world problems. The course is at the introductory level; basic knowledge of Math (not including Calculus) and basic computing are the only prerequisites. Biomedical examples are drawn from a number of fields, including Epidemiology, Bacteriology, Cardiology, Endocrinology, Hematology, Microbiology, Nutrition, and Ophthalmology, to name just a few. In addition, a broader sample of examples from Zoology, Ecology, and Demography is presented. The statistical topics included in the course are Data Analysis, Confidence Intervals, Hypothesis Testing, Goodness of Fit, Linear Correlation and Regression, and Analysis of Variance.

**__ COURSE/DEPARTMENTAL OBJECTIVES __**

The student will:
 * Learn to use the R package to solve real-world problems.
 * Learn about applications of Statistics to problems in Medicine, Biology, Ecology, etc.
 * Learn about Modeling techniques.

**__ COURSE/INSTITUTIONAL OBJECTIVES __**

This course is intended to teach students the basic concepts of statistical inference using a computer. This should further professional and pre-professional career interests for students in the fields of science and business. Goals include the fostering of analytical and quantitative thinking, and the ability to solve problems and interpret data.


**__ HARDWARE/SOFTWARE/MATERIALS REQUIREMENTS: __**

Freely available R computational packages.


**__ COURSE REQUIREMENTS __**

Homework Assignments. Midterm and Final Exams

Individual and group projects.

**__ GRADING GUIDELINES __**

Students must turn in regular homework as well as longer and more complex projects. Grades are based on the weighted average of the grades for projects, homework, and the two exams.

**__ METHODOLOGY __** Classroom lectures and assigned homework problems.


**__ COURSE TEXT __**

M. Crawley "Statistics: An Introduction Using R" 2nd edition


**Additional Textbooks:**

B. Shahbaba "BioStatistics with R"

B. Rosner "Fundamentals of BioStatistics"

J. Verzani "Using R for Introductory Statistics" 2nd edition

A. Field and J. Miles "Discovering Statistics Using R"

Ch 2 HW

1) In the data frame mtcars (from UsingR), select the subset of column variables 3 to 5 (by index and also by variable name), then rows 10 to 20, and finally the rows with displacement more than 200 and weight (wt, recorded in units of 1000 lb) less than 4. Select rows satisfying mpg > 20 and display the corresponding mpg, wt, and cyl. Order the whole data frame by mpg in increasing and decreasing order; do the same for the subset of variables wt, drat, cyl, gear.
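The selections in problem 1 can be sketched in base R as follows. This is only an illustration of the indexing idioms, not the course's reference solution. It assumes the standard mtcars data frame that ships with R's datasets package, and the object names are made up for the example.

```r
# mtcars is available in base R (package 'datasets'); no extra library is needed.
data(mtcars)

# Columns 3 to 5 by position, then the same columns by name.
cols_by_index <- mtcars[, 3:5]
cols_by_name  <- mtcars[, c("disp", "hp", "drat")]

# Rows 10 to 20.
rows_10_20 <- mtcars[10:20, ]

# Logical filter: displacement above 200 and wt below 4.
big_light <- mtcars[mtcars$disp > 200 & mtcars$wt < 4, ]

# Rows with mpg > 20, keeping only mpg, wt and cyl.
efficient <- mtcars[mtcars$mpg > 20, c("mpg", "wt", "cyl")]

# Whole data frame ordered by mpg, increasing and decreasing.
by_mpg_up   <- mtcars[order(mtcars$mpg), ]
by_mpg_down <- mtcars[order(mtcars$mpg, decreasing = TRUE), ]
```

The same order() index can be applied to a column subset, for example mtcars[order(mtcars$mpg), c("wt", "drat", "cyl", "gear")].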

2) Create a summary of mtcars and remark on numerical vs factor summaries. Produce lists of means with tapply and aggregate for mpg based on the number of cyl alone and then on cyl and gear together. Also produce the summary using the function stat.desc (package pastecs).

3) Produce a scatter plot of mpg vs wt and a boxplot of mpg vs cyl. Interpret the results.

4) Produce a coplot of mpg vs number of cylinders, conditioned on wt. Interpret the results.

5) Use tapply to investigate mpg means based on cyl and gear and produce the corresponding barplots. Use my code with ggplot2 to produce better-quality barplots.

Ch 3 HW

1) In the data frame mtcars (from UsingR) assign variable mpg to variable x. Remark on its histogram. Find its mean, median, geometric mean, and harmonic mean.

2) Write a function which computes weighted average using formula

3) Let our small data set be 2 5 4 10 8. Enter this data into a data vector x. Find the square of each number. Subtract 6 from each number. Subtract 9 from each number and then square the answers. Use the vectorization of functions to do so.

4) The average distance from the center is computed by (|x1 − xbar| + · · · + |xn − xbar|)/n, where xbar is the mean of the data vector. Compute this for the rivers data set (UsingR) using the function sum to add the values and abs to find the absolute value.
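A minimal sketch of the vectorized operations and summary measures named in this set, applied to the small data vector from problem 3. The helper names geometric_mean, harmonic_mean, and mean_abs_dev are hypothetical (they are not functions from R or from the textbook), and the geometric and harmonic means assume strictly positive data.

```r
x <- c(2, 5, 4, 10, 8)

# Arithmetic on a vector is applied element by element (vectorization).
squares       <- x^2          # 4 25 16 100 64
minus_six     <- x - 6        # -4 -1 -2 4 2
minus_nine_sq <- (x - 9)^2    # 49 16 25 1 1

# Hypothetical helpers; both means assume all values are > 0.
geometric_mean <- function(v) exp(mean(log(v)))
harmonic_mean  <- function(v) 1 / mean(1 / v)

# Average distance from the center: (|x1 - xbar| + ... + |xn - xbar|) / n.
mean_abs_dev <- function(v) sum(abs(v - mean(v))) / length(v)

mean_abs_dev(x)   # 2.56
```

The same mean_abs_dev call works unchanged on a longer vector such as the rivers data.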

Ch 4 HW

1) In the data frame mtcars (from UsingR) assign variable mpg to variable x, find its variance using the variance function from the book and using var.

2) The data set DDT (run library(MASS)) contains independent measurements of the pesticide DDT on kale. Make a histogram and a boxplot of the data. From these, estimate the mean and standard deviation. Check your answers with the appropriate functions.

3) Produce two variables x and y for manual vs automatic transmission mpg as follows: attach(mtcars); x = mpg[am=="manual"]; y = mpg[am=="automatic"]; Compare means and variances. Do variance test. Compare standard deviations.

4) Apply my code on p 62 to find t-based confidence interval for mpg.

5) Apply book's code on bootstrap analysis to get a plot similar to p 64 for mpg.
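The instructor's code on p 62 is not reproduced on this page. As a stand-in, a t-based confidence interval can be computed directly from its definition, mean ± t × s/√n. The function name t_ci below is hypothetical, and base R's mtcars is assumed.

```r
# Hypothetical helper: two-sided t-based confidence interval for a mean.
t_ci <- function(v, level = 0.95) {
  n    <- length(v)
  se   <- sd(v) / sqrt(n)                         # standard error of the mean
  crit <- qt(1 - (1 - level) / 2, df = n - 1)     # t critical value
  mean(v) + c(-1, 1) * crit * se
}

ci <- t_ci(mtcars$mpg)

# Cross-check against the interval reported by the built-in t.test().
builtin_ci <- as.numeric(t.test(mtcars$mpg)$conf.int)
```

The two intervals should agree, which is a quick way to check a hand-written version.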

Ch 5 HW

1) Use rnorm to create a sample of size 30 from a normal distribution representing IQ scores with known mean 100 and standard deviation 15. Investigate the data using summary, histogram, and boxplot; make sure not just to compute but to explain each picture and the results that follow in terms of the properties of the given distribution. Find the mean, sd, median, IQR, and normal (z-based) and t-based confidence intervals for these numbers (Chapter 4 had code for confidence intervals). Find the probability of obtaining a score of 120 or more, 90 or less, and between 90 and 120. Compute skew and kurtosis for the data. Use a t-test to test the hypothesis that the mean is equal to 100, then to 110. Investigate normality using qqnorm and qqline as well as the Shapiro test. Repeat finding the confidence interval for a sample of 60 numbers; what do you observe about the size of the CI?
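A short sketch of the probability and interval calculations in problem 1. The seed value and variable names are arbitrary choices for the example, so the simulated sample will differ from one produced with another seed.

```r
mu <- 100
sigma <- 15

# Probabilities under the N(100, 15) model.
p_high    <- pnorm(120, mean = mu, sd = sigma, lower.tail = FALSE)  # P(X >= 120)
p_low     <- pnorm(90, mean = mu, sd = sigma)                       # P(X <= 90)
p_between <- pnorm(120, mu, sigma) - pnorm(90, mu, sigma)           # P(90 < X < 120)

# Simulated sample; the seed is an arbitrary assumption for reproducibility.
set.seed(1)
iq <- rnorm(30, mean = mu, sd = sigma)

# z-based interval uses the known sigma; t-based interval uses the sample sd.
z_interval <- mean(iq) + c(-1, 1) * qnorm(0.975) * sigma / sqrt(length(iq))
t_interval <- as.numeric(t.test(iq)$conf.int)
```

Because the z-based half-width is proportional to 1/√n, rerunning with rnorm(60, ...) narrows that interval by a factor of √2.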

2) Use the INCOME variable of the cfb data set (library(UsingR)) to perform tasks similar to the previous problem. Namely: investigate the data using summary, histogram, and boxplot; make sure not just to compute but to explain each picture and the results that follow in terms of the properties of the given distribution. Find the mean, sd, median, skew, and kurtosis. Is it appropriate to find normal (z-based) and t-based confidence intervals for these numbers? Why? Is it appropriate to use a t-test to test the hypothesis that the mean is equal to 50000? Why? What is a good alternative? Use a bootstrapped test too. Investigate normality using qqnorm and qqline as well as the Shapiro test.

Ch 6 HW

1) Produce two variables x and y for manual vs automatic transmission mpg as follows: attach(mtcars); x = mpg[am=="manual"]; y = mpg[am=="automatic"]; Perform a variance comparison test, a t-test of means, and a Wilcoxon test of means. Repeat the above tests in the more natural mpg~am notation. Make sure to look at histograms and boxplots of this dependence and remark on how they support your findings. Use tapply to find the means and standard deviations of mpg by transmission.

2) Listed below are the times (in seconds) that animated Disney movies showed the use of tobacco and alcohol (each column corresponds to a movie). (a) Treating the data as paired, use a 0.05 significance level to test the claim that the tobacco and alcohol times are different.

Tobacco use (sec) 176 51 0 299 74 2 23 205 6 155

Alcohol use (sec) 88 33 113 51 0 3 46 73 5 74

Explain how corresponding confidence interval confirms the hypothesis test above.

(b) Now, assume that the same data are NOT paired. Repeat hypothesis test and CI analysis in part (a)
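The paired and unpaired analyses can be set up as sketched below, using the data listed above. This sketch uses t-tests. Whether a t-test suits such skewed data is left to the problem, and wilcox.test accepts the same arguments as a rank-based alternative.

```r
tobacco <- c(176, 51, 0, 299, 74, 2, 23, 205, 6, 155)
alcohol <- c(88, 33, 113, 51, 0, 3, 46, 73, 5, 74)

# (a) Paired: each movie contributes one tobacco and one alcohol time.
paired_res <- t.test(tobacco, alcohol, paired = TRUE)

# (b) Not paired: two independent samples (Welch test, R's default).
unpaired_res <- t.test(tobacco, alcohol)

# A two-sided 95% interval contains 0 exactly when the p-value exceeds 0.05.
paired_ci   <- paired_res$conf.int
unpaired_ci <- unpaired_res$conf.int
```

Comparing each p-value with whether the matching interval covers 0 shows how the confidence interval confirms the test.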

3) Among 7 people going on the diet, 6 showed marked improvement. Investigate whether this warrants use of the diet from the point of view of statistical significance.

4) In a test of effectiveness of Echinacea against cold viruses, 35 out of 50 subjects treated with Echinacea developed virus. In the placebo group 78 out of 100 subjects developed virus. Use 0.05 significance level to test the claim that Echinacea has an effect: i.e. proportion for drug group is less than the proportion for the placebo group. Explain how corresponding confidence interval confirms the hypothesis test above.

5) Chantix is a drug used as an aid for those who want to stop smoking. The adverse reaction of nausea has been studied in clinical trials, and the table below summarizes results (based on data from Pfizer). Use a 0.01 significance level to test the claim that nausea is independent of whether the subject took a placebo or Chantix. Does nausea appear to be a concern for those using Chantix? Calculate expected counts and check the assumptions.

Group: Placebo, Chantix

Nausea: 10, 30

No nausea: 795, 791
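One way to enter the Chantix table and obtain the expected counts is sketched below. The object names are illustrative. Note that chisq.test applies the Yates continuity correction to 2x2 tables by default.

```r
# Rows are outcomes, columns are treatment groups (matrix is filled by column).
nausea <- matrix(c(10, 795, 30, 791), nrow = 2,
                 dimnames = list(Outcome = c("Nausea", "No nausea"),
                                 Group   = c("Placebo", "Chantix")))

# Chi-square test of independence (continuity-corrected for a 2x2 table).
res <- chisq.test(nausea)

# Expected counts under independence: row total * column total / grand total.
expected <- res$expected

# Common rule of thumb for the test's validity: all expected counts at least 5.
assumption_ok <- all(expected >= 5)
```

The p-value in res$p.value is then compared with the 0.01 significance level stated in the problem.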

6) Produce scatterplot of bmi vs bp in Pima.tr (library(MASS)) data file. Check if correlation test is appropriate using Shapiro-Wilk test. Conduct correlation test using cor and cor.test. Produce correlation matrix for columns glu, bp, skin, bmi. Use both cor and rcorr. Use rank type correlations to find and interpret correlations between birth weight and race in birthwt file. Could we apply Pearson correlation here?

Ch 7 HW

1) Use linear regression to examine the relationship between mpg and wt in mtcars. Make sure to plot the scatterplot and the regression line. Interpret the estimate of the regression coefficient (slope) and examine its statistical significance. Find the 95% confidence interval for the regression coefficient. If a car weighs 3000 lb (wt = 3, since wt is recorded in units of 1000 lb), what would be your estimate of mpg (include the CI)? Use summary to interpret the slope and intercept and their significance. Use summary.aov to explain the sums of squares, degrees of freedom, mean squares, F ratio, and p-value. What is the meaning of //R//^2 in summary? Show that it is equal to the square of the sample correlation coefficient. Create simple diagnostic plots for your model and identify possible outliers; use influence.measures to find the influential points. In addition, employ the correlation test to tell whether the correlation is significant.
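The core calls for this problem can be sketched as follows, assuming base R's mtcars. Plotting and diagnostics are omitted; only the fit, the intervals, and the R^2 identity are shown.

```r
# Simple linear regression of mpg on wt.
fit <- lm(mpg ~ wt, data = mtcars)

# 95% confidence interval for the slope.
slope_ci <- confint(fit, "wt", level = 0.95)

# Estimated mean mpg at wt = 3, with its confidence interval.
pred_at_3 <- predict(fit, newdata = data.frame(wt = 3), interval = "confidence")

# In simple linear regression, R^2 equals the squared sample correlation.
r2         <- summary(fit)$r.squared
r_squared2 <- cor(mtcars$mpg, mtcars$wt)^2
```

Using interval = "prediction" instead gives a wider interval for a single new car rather than for the mean response.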

2) Create variables x and y using x = seq(4,20,0.5); y = 2*x^3 + rnorm(length(x),mean=0,sd=10); y. Investigate y~x and log(y)~log(x) linear models. Remark about significance of coefficients vs actual plot fit in both cases. Compare the two models using AIC and anova.

3) Create variables x and y using x = seq(-5,5,0.5); y = 2*x^2 + 4*x-5 + rnorm(length(x),mean=0,sd=10); y. Investigate y~x and polynomial model using I(x^2). Remark about significance of coefficients vs actual plot fit in both cases. Compare the two models using AIC and anova.

4) The data set wtloss (MASS) contains weight measurements of an obese patient recorded during a weight-rehabilitation program. The variable Weight records the patient’s weight in kilograms, and the variable Days records the number of days since the start of the program. A linear model is not a good model for the data, as it becomes increasingly harder to lose the same amount of weight each week. A more realistic goal is to lose a certain percentage of weight each week.
Fit the nonlinear model Weight = //a// + //b//*2^(−Days///c//). The estimated value of //c// would be the time it takes to lose //b// times half the excess weight. What is the estimated weight for the patient if he stays on this program for the long run? Suppose the model held for 365 days. How much would the patient be expected to weigh? Also use gam to fit this model.
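A sketch of the nonlinear fit with nls, assuming the MASS package is installed. The starting values follow the example in the MASS help page for wtloss; other reasonable starting values should also converge. The gam part of the problem is not covered here.

```r
library(MASS)  # provides the wtloss data frame

# Nonlinear least squares fit of Weight = a + b * 2^(-Days / c).
fit <- nls(Weight ~ a + b * 2^(-Days / c),
           data  = wtloss,
           start = list(a = 90, b = 95, c = 120))

# As Days grows, 2^(-Days / c) tends to 0, so the long-run weight is a.
long_run_weight <- coef(fit)[["a"]]

# Predicted weight after 365 days, assuming the model still holds.
weight_365 <- predict(fit, newdata = data.frame(Days = 365))
```

Running summary(fit) reports the three estimates with their standard errors.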

Ch 8 HW

1. Load the anorexia data set from the MASS package. This data set was collected to investigate the effectiveness of different treatments (Treat) on increasing weight for young female anorexia patients. Create a new variable called Difference by subtracting the weight of the patient before the study period (Prewt) from her weight after the study period (Postwt): Difference = Postwt - Prewt. Use a boxplot and a plot of means to visualize how this variable changes depending on the type of treatment. Use the aov table and summary.lm to investigate whether the type of treatment makes a difference in the amount of weight gain. Make sure to use TukeyHSD to interpret which pairwise differences are significant. Use plot to check that the ANOVA assumptions hold. Use tapply to compare the means.
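The model-fitting steps of problem 1 can be sketched as below, assuming the MASS package is installed. The name anorexia_diff is made up for the example, and the plots are left out.

```r
library(MASS)  # provides the anorexia data frame

# Work on a copy and add the weight change for each patient.
anorexia_diff <- anorexia
anorexia_diff$Difference <- anorexia_diff$Postwt - anorexia_diff$Prewt

# One-way ANOVA of weight change on treatment.
fit <- aov(Difference ~ Treat, data = anorexia_diff)
anova_table <- summary(fit)

# Pairwise comparisons of treatment means with Tukey's HSD.
tukey <- TukeyHSD(fit)

# Group means of the weight change.
group_means <- tapply(anorexia_diff$Difference, anorexia_diff$Treat, mean)
```

Calling plot(fit) afterwards produces the standard residual diagnostics for checking the assumptions.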

2. The data set cabbages is available from the MASS package. In this data set, two different cultivars were planted on three different dates, denoted as d16, d20, or d21. The variable Date is a factor that specifies the planting date for each cabbage. Use two-way ANOVA to evaluate the relationship between the vitamin C content and the cultivars while controlling for the effect of planting dates. Use tapply with list to see the means. Use the aov table and summary.lm.

Ch 9 HW

1) Use ANCOVA approach to investigate the dependence of mpg on wt (weight) and am (manual vs automatic transmission) in mtcars file. Make sure to use models with and without interaction and compare using anova. Use proper graphics with scatter plots, box plots and ggplot approach. Use step to reduce the problem automatically.
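A sketch of the model comparison in this problem, assuming base R's mtcars, where am is coded 0 (automatic) and 1 (manual). The recoding to a labelled factor and the object names are choices made for the example. Graphics are omitted.

```r
# Copy mtcars and turn am into a labelled factor (0 = automatic, 1 = manual).
cars <- mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))

# ANCOVA with interaction: separate slopes and intercepts per transmission type.
full <- lm(mpg ~ wt * am, data = cars)

# ANCOVA without interaction: common slope, separate intercepts.
additive <- lm(mpg ~ wt + am, data = cars)

# F test of whether the interaction term improves the fit.
comparison <- anova(additive, full)

# Automatic simplification by AIC, starting from the full model.
reduced <- step(full, trace = 0)
```

Setting trace = 0 only suppresses the step-by-step printout; omit it to see which terms are considered for removal.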

Ch 10 HW

1) Use the multiple regression methods in Ch 10 to investigate the dependence of glucose level on the other continuous variables in the file Pima.tr. Make sure to use and explain tree models and the gam package to investigate curvature. Use plot to assess the assumptions of the model.

Ch 11 HW

1) Use contrast methods on the file genotype (package MASS) to investigate the effects of the levels of Mother genotype on the weight of her offspring in rats. Make sure to use the orthogonal contrast approach and step-by-step elimination from treatment contrasts as in Ch 11.