homework 2 - multiple testing

bios25328
homework
Author

Haky Im

Published

April 7, 2025

Modified

April 16, 2025

This assignment focuses on simulating multiple testing scenarios, understanding p-value distributions, and applying correction methods.

Part 1: Lecture Quizzes (20 points)

Part 2. Multiple testing simulation (80 points)

Goal: Simulate multiple tests to explore p-value behavior, FWER, and FDR.

Parameters:

  • Samples per sim: N = 100
  • Effect size (alt): beta = 1
  • Number of sims: 10000
  • True null proportion: pi0 = 0.9
  • Significance level: alpha = 0.05
  • Models: X=rnorm(N), eps=rnorm(N), Ynull=0X+eps, Yalt=betaX+eps

Tasks:

2.1. Simulation Run (10 pts)

  • Run 10,000 simulations. In each:
  • Generate X, Ynull, Yalt.
  • Fit lm(Ynull ~ X) and lm(Yalt ~ X).
  • Store the p-value for X from both models.
  • Create two vectors: p_null and p_alt (10,000 p-values each).

2.2. P-value Distributions (10 pts)

  • Plot histograms for p_null and p_alt.
  • Interpret their shapes (Why uniform? Why skewed?).

2.3. Mixed P-values (10 pts)

  • Create a logical vector is_null (length 10k, pi0 proportion TRUE). Shuffle it.
  • Create p_mixed by selecting from p_null if is_null is TRUE, else from p_alt.
  • Plot a histogram and a QQ plot (p_mixed vs. uniform quantiles, add y=x line, feel free to use the qqunif function here https://gist.github.com/hakyim/38431b74c6c0bf90c12f).
  • Interpret the QQ plot (adherence to and deviation from the line).

2.4. Confusion Matrix & Reality Check (15 pts)

  • Using p_mixed, is_null, and alpha = 0.05, calculate the counts for TN, FP, FN, TP.
  • In real data analysis, can this table be perfectly known? Explain why/why not.

2.5. Bonferroni Correction (15 pts)

  • Calculate the Bonferroni-adjusted threshold for p_mixed.
  • How many tests are significant using this threshold?
  • Recalculate TN, FP, FN, TP counts with the Bonferroni threshold.
  • Discuss the FP vs. TP trade-off compared to the uncorrected alpha = 0.05.

2.6. FDR Control with qvalue (20 pts)

  • Use qvalue::qvalue() on p_mixed.
  • Report the estimated pi0 from the result. How does it compare to the true pi0 used?
  • How many tests are significant at an FDR threshold of 0.1 (q-value ≤ 0.1)?
  • Compare this number of significant tests to the numbers from the uncorrected and Bonferroni approaches.
  • Briefly state the conceptual difference between controlling Bonferroni correction and FDR.

Extra Credit

EC.1. More realistic genotype predictor (+10 points)

  • Rerun key parts of the simulation (e.g., 2.1-2.6) using X simulated as genotypes {0, 1, 2} assuming MAF=0.3 and Hardy-Weinberg Equilibrium. You will want to use rbinom(nsam,2,0.3) to simulate genotypes. Report key findings/differences.

EC.2. Binary trait (+10 points)

  • Rerun key parts using a simulated binary trait Y (dependent on X for alternatives) and logistic regression (glm(Y ~ X, family=binomial)) to obtain p-values. Report key findings/differences.

Submission Guidelines

Submit a single, runnable, commented R script (.R) or R Markdown (.Rmd). Include all code, generated plots, and written answers/interpretations clearly labeled.

Hint: feel free to use the code from the lecture slides or here

© HakyImLab and Listed Authors - CC BY 4.0 License