BIOE 398 Statistical Analysis for Genomic Data

Course objective:
Topics include probability theory, parameter estimation, hypothesis testing, genomics, and gene regulatory networks. The course builds modeling and analysis skills for genomic data, covers the analysis of molecular and cellular processes across a hierarchy of scales (genetic, molecular, and cellular), and introduces currently emerging research areas in systems biology.

Prerequisites: CS101, Math231, MCB252

Textbooks:

 

Reference:

Software:

  • R and Matlab (available in EWS labs).

 

Grading policies:

  • Breakdown

Homework 15%, Quiz 15%, Midterm 30%, Final 40%

  • Homework

All homework is due at the beginning of class on the designated day. No late homework will be accepted.

 

You may discuss the general aspects of the course materials and assignments with your classmates, but the homework you submit must be your own individual effort. You are encouraged to consult sources beyond the textbooks; any outside source you use must be documented. Grading is based on sample assignments.

  • Regrades

Requests to regrade homework or exams must be submitted within 48 hours after the work is returned in class. A written explanation must accompany each request.

 

  • Exams

    • Midterm. Closed book, closed notes. You may bring one 8.5x11-inch (two-sided) info sheet.
    • Final. The final exam is comprehensive. Closed book, closed notes. You may bring two 8.5x11-inch (two-sided) info sheets.
    • You must take the exams on the designated dates. Makeup exams will be given only in emergency circumstances. In such cases, you must contact the instructor before the examination date. Verifiable documentation and an official letter from the Office of the Dean of Students are required.
  • Attendance

Class attendance is important. Those who attend class learn more. Quizzes will be given in class and announced in advance.

 

Course content

Introduction to basic concepts that underlie important applications of probability and statistics to the analysis of genomic data and biomolecular networks:

·         Introduction to statistics (Montgomery Chapter 1)

·         Probability theory

·         Introduction to probability (Montgomery Chapter 2)

·         Discrete random variables (Montgomery Chapter 3)

·         Continuous random variables (Montgomery Chapter 4)

·         Two or more random variables (Montgomery Chapter 5)

·         Descriptive statistics (Montgomery Chapter 6)

·         Parameter estimation

·         Point estimation (Montgomery Chapter 7)

·         Hypothesis testing

·         One sample hypothesis testing (Montgomery Chapter 9)

·         Two sample hypothesis testing (Montgomery Chapter 10)

·         Simple linear regression (Montgomery Chapter 11)

·         Introduction to genome projects: Organization, objectives and technology (Gibson Chapter 1)

·         Mapping genomes: genetic maps, physical maps, comparative genomics

·         The Human Genome Project

·         Animal Genome Projects

·         Plant Genome Projects / food and bio-energy

·         Gene Expression (Gibson Chapter 4)

·         Parallel Analysis of Gene Expression: Microarrays

·         Microarray Image Processing

·         Visualization

·         Cancer Transcriptomics (as an application of two-sample hypothesis testing)

·        Integrative genomics and biomolecular networks (Gibson Chapter 6)

·         Signal transduction

·         Predicting transcription factor binding sites (reinforcing the concepts of multinomial distribution, independence and likelihood).

·         Gene regulatory network  (as an application of regression)

 

All Montgomery slides.

 

Science Breakthrough of the year 2005-2008

 

All Gibson slides

 

Tentative schedule

01/20      Lecture. Notes for using R: Note1, Note2, Note3. HW assignment: explore the R platform. Read Montgomery book Ch. 1.

01/22      Lecture.

01/27      Lecture. HW1: Montgomery book: 2-72, 2-74, 2-82, 2-85, 2-89, 2-94, 2-112, 2-119, 2-120. Execute the R scripts in notes 2.1 and 2.2 and attach the R output tables and figures from these runs.

01/29      Lecture. Notes for using R: 2.1, 2.2

02/03      Lecture

02/05      Lecture. 

02/10      HW1 due. Lecture

02/12      Lecture

02/17      Quiz. Lecture

02/19      No class.

02/24      Lecture

02/26      Lecture. HW2: Montgomery book: 3-76, 3-100, 4-51, 5-14

03/03      Lecture

03/05      HW2 due. Lecture

03/10      No lecture. In class Q&A session.

03/12      Midterm exam.

03/17      Lecture

03/19      Lecture

03/24      Spring break

03/26      Spring break

03/30      Lecture

04/02      Lecture

04/07      Lecture

04/09      Quiz. Lecture

04/14      Lecture

04/16      Lecture

04/21      Quiz. Lecture

04/23      Lecture. 

HW3:

In one study, Lin et al. (Nature Biotechnology 24(12): 6-7) measured gene expression in five colorectal adenocarcinomas and matched normal colonic tissues. Download the dataset from http://genomics.bioen.uiuc.edu/bioe598/data/colon-cancer.xls.

Perform a t-test between the cancer and the normal samples and identify genes that are either up- or down-regulated in colon cancer. Specify your null hypothesis and alternative hypothesis. Write out the form of the test statistic. With a p-value cutoff of 0.0001, how many genes do you identify? Order the identified genes by p-value. Select one or two of the genes that you have identified and give the p-value and the rank of the p-value for each. Search the related literature and comment on why each selected gene may have been up- or down-regulated in cancer tissues.

Tips:

1. t-tests can be performed with Microsoft Excel; see http://www.wfu.edu/~massd2/T_test.htm for details.

2. If you are tired of looking through PubMed, OMIM can be a good resource to help you interpret some genes.
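
The workflow described in HW3 (one test per gene, a p-value cutoff, then ranking) can be sketched in a few lines of code. The example below is only an illustration under stated assumptions: it uses Python rather than the course software, the gene names and expression values are made up and are not taken from the colon-cancer dataset, and it applies a pooled-variance two-sample t-test. Because the tissues are matched, a paired t-test may also be a reasonable choice; which form to use is part of the assignment. The helper names (t_pdf, two_sided_p, pooled_t) are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=2000):
    """Two-sided p-value: 1 - 2 * (area under the density on [0, |t|]).

    The area is approximated with Simpson's rule (steps must be even).
    This is a teaching sketch; a statistics package is more accurate
    for very small p-values.
    """
    a = abs(t)
    if a == 0:
        return 1.0
    h = a / steps
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1 - 2 * total * h / 3)

def pooled_t(x, y):
    """Pooled-variance two-sample t statistic and its degrees of freedom.

    Assumes the two groups are not both constant (pooled variance > 0).
    """
    n1, n2 = len(x), len(y)
    m1, m2 = sum(x) / n1, sum(y) / n2
    ss1 = sum((v - m1) ** 2 for v in x)
    ss2 = sum((v - m2) ** 2 for v in y)
    df = n1 + n2 - 2
    sp2 = (ss1 + ss2) / df                      # pooled sample variance
    t = (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, df

# Made-up expression values: gene -> (cancer samples, normal samples).
data = {
    "geneA": ([8.1, 8.3, 8.0, 8.2, 8.4], [5.0, 5.1, 4.9, 5.2, 5.0]),
    "geneB": ([6.0, 6.5, 5.8, 6.2, 6.1], [6.1, 6.3, 5.9, 6.4, 6.0]),
    "geneC": ([3.0, 3.2, 2.9, 3.1, 3.0], [7.0, 7.3, 6.8, 7.1, 7.2]),
}

cutoff = 0.0001
results = []
for gene, (cancer, normal) in data.items():
    t, df = pooled_t(cancer, normal)
    results.append((gene, t, two_sided_p(t, df)))

# Rank by p-value; ties are broken by the size of the t statistic.
results.sort(key=lambda r: (r[2], -abs(r[1])))
hits = [r for r in results if r[2] < cutoff]

for rank, (gene, t, p) in enumerate(hits, start=1):
    direction = "up" if t > 0 else "down"
    print(rank, gene, direction, f"t={t:.2f}", f"p={p:.2e}")
```

Here a positive t statistic means higher expression in the cancer samples. With thousands of genes tested at once, some genes will pass any fixed cutoff by chance alone, which is one reason the assignment uses a cutoff as strict as 0.0001.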

04/28      Reserved for invited talk

04/30      No lecture. In class Q&A session.

05/05      HW3 due. Final exam.