Lecture 8: Introduction to QTL Analysis

bios25328
lecture
Author: Haky Im

Published: April 21, 2025

Find the lecture notes here

Introduction to QTL Analysis

Overview

This lecture covers the core concepts of quantitative trait locus (QTL) analysis, its applications to complex traits and disease susceptibility, and how genomic resources such as the GWAS Catalog, ENCODE, and the GTEx project help explain the mechanisms underlying complex traits.

Learning Objectives

By the end of this lecture, you should be able to:

  1. Understand the purpose and applications of QTL analysis
  2. Describe the GTEx project and its key findings
  3. Understand fine-mapping methods and their applications
  4. Explain functional annotation and heritability partitioning methods

Lecture Summary

I. Introduction to QTL Analysis

A. Definition and Purpose

  • QTL analysis aims to identify genetic variants that influence gene expression levels and other molecular traits (e.g., protein levels and DNA methylation)
  • Variants associated with expression levels, called expression QTLs (eQTLs), can help identify genes that influence complex traits and disease susceptibility
  • By linking genetic variation to gene expression changes, QTL analysis provides insights into the molecular mechanisms underlying complex traits
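
As a minimal sketch of what testing a single variant-gene pair can look like, the example below regresses simulated expression on genotype dosage (0, 1, or 2 copies of the alternate allele). The sample size, allele frequency, and effect size are arbitrary assumptions, and real eQTL pipelines also adjust for covariates and correct for multiple testing.

```r
# Toy eQTL test for one variant-gene pair; all data are simulated
set.seed(1)
n <- 500                                   # assumed number of individuals
maf <- 0.3                                 # assumed minor allele frequency
dosage <- rbinom(n, size = 2, prob = maf)  # genotype coded as allele count
true_beta <- 0.5                           # assumed per-allele effect on expression
expression <- true_beta * dosage + rnorm(n)

# Linear model: expression ~ genotype dosage
fit <- lm(expression ~ dosage)
eqtl <- summary(fit)$coefficients["dosage", ]
eqtl_beta <- unname(eqtl["Estimate"])      # estimated per-allele effect
eqtl_p <- unname(eqtl["Pr(>|t|)"])         # p-value for the genotype term
print(c(beta = eqtl_beta, p_value = eqtl_p))
```

The slope on `dosage` is the estimated eQTL effect size, and its p-value is what would be compared against a significance threshold.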

B. GWAS Catalog

  • Tracks findings from Genome-Wide Association Studies
  • Contains data on numerous loci associated with various traits, including cancer
  • A large proportion of GWAS catalog variants are located in non-coding regions
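
One way to check the non-coding proportion is to tabulate a variant annotation column. The sketch below uses a toy vector standing in for the catalog's CONTEXT column. The labels and counts are illustrative assumptions, not real catalog output, and the list of coding terms is a simplification.

```r
# Toy stand-in for the CONTEXT column of the GWAS Catalog download
# (labels and counts are made up for illustration)
context <- c(rep("intron_variant", 5), rep("intergenic_variant", 3),
             "missense_variant", "synonymous_variant")

# Assumed set of annotations treated as protein-coding
coding_terms <- c("missense_variant", "synonymous_variant",
                  "stop_gained", "frameshift_variant")

is_coding <- context %in% coding_terms
prop_noncoding <- mean(!is_coding)   # fraction of entries outside coding sequence
print(table(context))
print(prop_noncoding)
```

With the catalog loaded as `gwascatalog`, as in the practical analysis, the same tabulation could be applied to `gwascatalog$CONTEXT`.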

II. Mechanisms Linking Genetic Variation to Complex Traits

A. Protein Levels and Disease Risk

  • Genetic variation can affect disease risk by influencing protein levels
  • eQTLs (expression Quantitative Trait Loci) and pQTLs (protein Quantitative Trait Loci) play key roles

B. mRNA and Splicing

  • Genetic variants are thought to affect complex traits primarily by regulating mRNA levels and pre-mRNA splicing
  • Splicing is therefore a key link between genetic variation and disease, alongside expression

III. The GTEx Project

A. Goals and Objectives

  • Characterize genetic effects on the transcriptome across human tissues
  • Connect regulatory mechanisms to trait and disease associations

B. Data and Samples

  • Large dataset of tissue samples from human donors
  • Includes various tissue types, with mRNA-seq and whole-genome sequencing (WGS) data

C. Key Findings

  • eQTLs are genetic variants associated with mRNA levels
  • Number of eGenes (genes with at least one significant eQTL) increases with sample size
  • Reveals complex patterns of cis- and trans-QTLs
  • Identifies sex-biased and population-biased eQTLs
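
The link between sample size and the number of eGenes (genes with at least one significant eQTL) can be illustrated with a small simulation. This is a sketch under assumed values: 200 genes with one variant each, effect sizes drawn from a normal distribution, and a Bonferroni threshold. It is not a reproduction of the GTEx analysis.

```r
# Simulated eGene discovery at increasing sample sizes; all parameters are assumptions
set.seed(2)
n_genes <- 200
betas <- rnorm(n_genes, mean = 0, sd = 0.2)   # assumed true per-allele effects
sample_sizes <- c(50, 200, 800)
alpha <- 0.05 / n_genes                        # Bonferroni threshold across genes

# Count genes whose single tested variant is significant at sample size n
count_egenes <- function(n) {
  p_values <- sapply(betas, function(b) {
    dosage <- rbinom(n, size = 2, prob = 0.3)
    expression <- b * dosage + rnorm(n)
    summary(lm(expression ~ dosage))$coefficients["dosage", "Pr(>|t|)"]
  })
  sum(p_values < alpha)
}

n_egenes <- sapply(sample_sizes, count_egenes)
names(n_egenes) <- sample_sizes
print(n_egenes)
```

The true effects are the same at every sample size, so any increase in discoveries reflects statistical power rather than a change in biology.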

IV. Fine-Mapping

A. Definition and Purpose

  • Identifies specific causal variants within GWAS-associated regions

B. Methods

  • Analysis of association results for potential causal variants
  • Building credible sets of SNPs
  • Calculating posterior probabilities of causality
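
The steps above can be sketched with Wakefield's approximate Bayes factors under the simplifying assumption of exactly one causal variant per locus and equal prior probability for every SNP. The SNP names, effect estimates, and prior variance below are hypothetical, and this is not the algorithm used by any specific tool listed in the next subsection.

```r
# Hypothetical summary statistics for five SNPs in one GWAS locus
snps <- c("rs_a", "rs_b", "rs_c", "rs_d", "rs_e")
beta <- c(0.02, 0.115, 0.12, 0.03, 0.01)   # made-up effect estimates
se <- rep(0.02, 5)                          # made-up standard errors
prior_var <- 0.2^2                          # assumed prior variance of the true effect

# Log approximate Bayes factor for association at each SNP
z <- beta / se
v <- se^2
log_abf <- 0.5 * log(v / (v + prior_var)) + 0.5 * z^2 * prior_var / (v + prior_var)

# Posterior probability of causality, assuming one causal variant and equal priors
pip <- exp(log_abf - max(log_abf))          # subtract the max for numerical stability
pip <- pip / sum(pip)
names(pip) <- snps

# 95% credible set: smallest set of top-ranked SNPs with cumulative probability >= 0.95
ranked <- sort(pip, decreasing = TRUE)
n_keep <- which(cumsum(ranked) >= 0.95)[1]
credible_set <- names(ranked)[seq_len(n_keep)]
print(round(pip, 3))
print(credible_set)
```

In this toy locus the two SNPs with the largest z-scores share nearly all of the posterior probability, so the credible set contains both of them.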

C. Fine-mapping Methods

  • CAVIAR
  • FINEMAP
  • fastPAINTOR
  • DAP-G
  • SuSiE

V. Functional Annotation and Enrichment

A. QTLs and Functional Annotations

  • Enrichment analysis of cis- and trans-QTLs in functional annotations

B. GWAS-Associated SNPs and QTLs

  • GWAS-associated SNPs are enriched among cis-eQTLs and cis-sQTLs (splicing QTLs)
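
A simple way to quantify this kind of enrichment is a 2x2 contingency test. The counts below are hypothetical, and published analyses typically also match background SNPs on allele frequency and LD, which this sketch does not do.

```r
# Hypothetical counts: rows are SNP sets, columns are overlap with a cis-eQTL
counts <- matrix(c(60, 40,      # GWAS-associated SNPs: 60 are eQTLs, 40 are not
                   300, 700),   # background SNPs: 300 are eQTLs, 700 are not
                 nrow = 2, byrow = TRUE,
                 dimnames = list(set = c("gwas", "background"),
                                 eqtl = c("yes", "no")))

# One-sided Fisher's exact test: are GWAS SNPs more often eQTLs than background SNPs?
enrich <- fisher.test(counts, alternative = "greater")
odds_ratio <- unname(enrich$estimate)
print(counts)
print(odds_ratio)
print(enrich$p.value)
```

An odds ratio above 1 with a small p-value indicates that the GWAS set overlaps eQTLs more often than the background set does.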

C. ENCODE Project

  • Catalogs functional elements in the human genome
  • Provides information on regulatory elements

VI. Partitioning Heritability

A. Methods

  • Estimate the contribution of different genetic and functional categories of variants to heritability
  • LD Score Regression
  • Functional Annotation Enrichment analysis
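
LD score regression rests on the relationship E[chi2_j] = N * h2 * l_j / M + N * a + 1, where l_j is the LD score of SNP j, N the sample size, M the number of SNPs, h2 the SNP heritability, and a captures confounding. The sketch below simulates statistics from this model with a = 0 and recovers h2 from the regression slope. All parameter values are assumptions, SNPs are treated as independent, and ordinary least squares is used in place of the weighted regression applied in practice.

```r
# Simulated LD score regression with no confounding (a = 0); all values are assumptions
set.seed(3)
n_snps <- 100000       # M
n_samples <- 20000     # N
h2_true <- 0.4         # assumed SNP heritability

# Made-up LD scores, then chi-square statistics with the model's expected value
ld_score <- 1 + rgamma(n_snps, shape = 2, scale = 25)
expected_chi2 <- 1 + n_samples * h2_true * ld_score / n_snps
chi2 <- expected_chi2 * rchisq(n_snps, df = 1)

# Regress chi-square on LD score: slope = N * h2 / M, intercept = 1 + N * a
fit <- lm(chi2 ~ ld_score)
intercept <- unname(coef(fit)["(Intercept)"])
h2_est <- unname(coef(fit)["ld_score"]) * n_snps / n_samples
print(c(intercept = intercept, h2_est = h2_est))
```

Partitioned heritability extends the same idea by using one LD score per functional annotation, so that each annotation gets its own slope.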

Practical Analysis: Download and analyze the GWAS Catalog

suppressMessages(library(tidyverse))
suppressMessages(library(glue))
# Set up the working directory
PRE = "~/Downloads"
DATA = glue("{PRE}/2025-04-21-gwas-catalog-analysis-2025")
if(!dir.exists(DATA)) dir.create(DATA, recursive = TRUE)
WORK = DATA

DOWNLOAD_DATE = "2025-04-21"

# Download the full GWAS Catalog file (skipped if it is already present)
DATAFILE = "full"
filename = glue("{DATA}/{DATAFILE}")

if(!file.exists(filename)) {
  system(glue("wget -P {DATA} https://www.ebi.ac.uk/gwas/api/search/downloads/full"))
}
# Read and explore GWAS catalog data
gwascatalog = read_tsv(filename)
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
Rows: 796754 Columns: 34
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr  (23): FIRST AUTHOR, JOURNAL, LINK, STUDY, DISEASE/TRAIT, INITIAL SAMPLE...
dbl   (9): PUBMEDID, UPSTREAM_GENE_DISTANCE, DOWNSTREAM_GENE_DISTANCE, MERGE...
date  (2): DATE ADDED TO CATALOG, DATE

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dim(gwascatalog)
[1] 796754     34
names(gwascatalog)
 [1] "DATE ADDED TO CATALOG"      "PUBMEDID"                  
 [3] "FIRST AUTHOR"               "DATE"                      
 [5] "JOURNAL"                    "LINK"                      
 [7] "STUDY"                      "DISEASE/TRAIT"             
 [9] "INITIAL SAMPLE SIZE"        "REPLICATION SAMPLE SIZE"   
[11] "REGION"                     "CHR_ID"                    
[13] "CHR_POS"                    "REPORTED GENE(S)"          
[15] "MAPPED_GENE"                "UPSTREAM_GENE_ID"          
[17] "DOWNSTREAM_GENE_ID"         "SNP_GENE_IDS"              
[19] "UPSTREAM_GENE_DISTANCE"     "DOWNSTREAM_GENE_DISTANCE"  
[21] "STRONGEST SNP-RISK ALLELE"  "SNPS"                      
[23] "MERGED"                     "SNP_ID_CURRENT"            
[25] "CONTEXT"                    "INTERGENIC"                
[27] "RISK ALLELE FREQUENCY"      "P-VALUE"                   
[29] "PVALUE_MLOG"                "P-VALUE (TEXT)"            
[31] "OR or BETA"                 "95% CI (TEXT)"             
[33] "PLATFORM [SNPS PASSING QC]" "CNV"                       
# Analyze cancer-related entries
gwascatalog %>% 
  select(`DISEASE/TRAIT`, MAPPED_GENE) %>% 
  filter(grepl("cancer", `DISEASE/TRAIT`)) %>% 
  dim()
[1] 13091     2
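
Note that `grepl()` is case-sensitive by default, so the pattern "cancer" does not match trait names in which the word is capitalized. The toy vector below (hypothetical trait names, not taken from the catalog) shows the difference that `ignore.case = TRUE` makes.

```r
# Hypothetical trait names to illustrate case sensitivity in grepl()
traits <- c("Breast cancer", "Prostate cancer", "Cancer (pleiotropy)", "Height")

n_sensitive <- sum(grepl("cancer", traits))                        # default: case-sensitive
n_insensitive <- sum(grepl("cancer", traits, ignore.case = TRUE))  # also matches "Cancer"
print(c(case_sensitive = n_sensitive, case_insensitive = n_insensitive))
```

Whether this changes the count on the real catalog depends on how trait names are capitalized in the download.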
# Count unique cancer loci
# Counting unique trait-gene pairs double counts many genes:
# gwascatalog %>% 
#   select(`DISEASE/TRAIT`, MAPPED_GENE) %>% 
#   filter(grepl("cancer", `DISEASE/TRAIT`)) %>% 
#   unique() %>% 
#   dim() 

# Instead, count unique trait-region pairs by cytogenetic location
# (some double counting remains because locations are reported with different sub-bands)
gwascatalog %>% 
  select(`DISEASE/TRAIT`, REGION) %>% 
  filter(grepl("cancer", `DISEASE/TRAIT`)) %>% 
  unique() %>% 
  dim()
[1] 5536    2
# Count cancer associations per cytogenetic region, most frequent first
gwascatalog %>% 
  select(`DISEASE/TRAIT`, REGION) %>% 
  filter(grepl("cancer", `DISEASE/TRAIT`)) %>% 
  count(REGION) %>% 
  arrange(desc(n))
# A tibble: 816 × 2
   REGION       n
   <chr>    <int>
 1 <NA>      1012
 2 8q24.21    529
 3 5p15.33    230
 4 11q13.3    157
 5 6p22.1     155
 6 6p21.33    154
 7 17q12      110
 8 6p21.32    106
 9 10q26.13    86
10 19q13.33    83
# ℹ 806 more rows
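8q24.21 is a hot spot of cancer susceptibility variants: with 529 entries, it is the most frequent annotated region among the cancer associations counted above.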

The REGION column in the GWAS catalog uses standard cytogenetic band notation. For example, “8q24.21” refers to:

  • Chromosome 8
  • Long arm (q)
  • Band 24 (region 2, band 4)
  • Sub-band 21 (sub-band 2, sub-sub-band 1)

This region (8q24.21) is particularly well-known as it contains multiple cancer susceptibility loci, including those associated with prostate cancer, breast cancer, and colorectal cancer.
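
As a small illustration of the notation, the helper below splits a REGION string into its parts with a regular expression. The function name `parse_band` and the pattern are hypothetical and cover only simple single-band labels, not ranges or other formats that may appear in the column.

```r
# Hypothetical helper: split a cytogenetic band such as "8q24.21" into its components
parse_band <- function(region) {
  pattern <- "^([0-9]{1,2}|X|Y)([pq])([0-9]+)(\\.([0-9]+))?$"
  m <- regmatches(region, regexec(pattern, region))[[1]]
  if (length(m) == 0) return(NULL)   # the string does not look like a simple band
  list(chromosome = m[2],            # e.g. "8"
       arm = m[3],                   # "p" (short arm) or "q" (long arm)
       band = m[4],                  # digits before the dot, e.g. "24"
       sub_band = m[6])              # digits after the dot, e.g. "21"
}

parsed <- parse_band("8q24.21")
print(parsed)
```

Grouping parsed regions by chromosome and band, and ignoring the sub-band, would be one way to reduce the double counting caused by loci reported at different sub-band resolutions.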

# Plot GWAS loci by year
gwascat_sig = gwascatalog %>% 
  mutate(year = as.factor(lubridate::year(lubridate::as_date(`DATE ADDED TO CATALOG`)))) %>% 
  filter(`P-VALUE` < 5e-8)

gwascat_sig %>% 
  filter(year != substr(DOWNLOAD_DATE, 1, 4)) %>%  # drop the download year, which is only partially covered
  ggplot(aes(year)) + 
  geom_bar() + 
  theme_bw(base_size = 15) + 
  scale_x_discrete(breaks = c("2008", "2012", "2016", "2020", "2022")) + 
  xlab("Year") + 
  ylab("GWAS loci reported (p < 5e-8)") + 
  ggtitle(paste0("GWAS Catalog Downloaded ", DOWNLOAD_DATE))

© HakyImLab and Listed Authors - CC BY 4.0 License