Author

Liz Gibbons

Published

April 12, 2024

Goals and Introduction

In this lab we are going to use the PRSice software to take height GWAS results and predict height of the Personal Genome Project individuals using their genotype data.

Goals

  • Download data from box
  • Download PRSice
  • Run exploratory analysis of data
  • Run PRSice
  • Plot observed versus predicted for height

Setup

load R packages and define paths

You will have to install all these packages first:

(r code block)

Code
install.packages("tidyverse")
install.packages("RSQLite")
install.packages("glue")

(r code block)

Code
suppressMessages(library(tidyverse))
suppressMessages(library(RSQLite))
suppressMessages(library(glue))

Download data

If you are on a Mac install the mac appropriate binary:

(bash code block)

Code
# wget https://github.com/choishingwan/PRSice/releases/download/2.3.5/PRSice_mac.zip

If you are on Posit cloud do the linux binary:

(bash code block)

Code
wget https://github.com/choishingwan/PRSice/releases/download/2.3.5/PRSice_linux.zip

Unzip the binary: (bash code block)

Code
unzip ./PRSice_linux.zip

move the executable to the directory where you keep your software (optional) ** the first time you run PRSice, it will install additional components make sure PRSice is executable ** chmod u+x PRSice

(bash code block)

Code
chmod +x ./PRSice_linux

Terminal Setup

(bash code block)

Code
export PRE="/cloud/project"
export DATA="$PRE/Lab-personal-genomes-project-data"

export PRSice="/cloud/project/" # file/path/to/PRSice
export PRSice_os="PRSice_linux" # optional: which operating system is your prsice named for
#$PRSice/$PRSice_os --help
cd $DATA
export WORK=$DATA

Exploration of the Data

open the database with phenotype information and list tables

(r code block)

Code
dbname <- paste0(DATA,"/repgp-data.sqlite3") ##This is just to create the file path to the sqlite3 file 
## connect to db
db = RSQLite::dbConnect(SQLite(), dbname)
## list tables
dbListTables(db)

query database users table

(r code block)

Code
dbListFields(db,"users")

(r code block)

Code
query <- function(...) dbGetQuery(db, ...)
users = query("select * from users")
names(users)

look at distribution by count for each column

(r code block)

Code
users %>% count(race)

(r code block)

Code
users  %>% count(gender)

(r code block)

Code
users %>% count(blood)

plot height distribution by gender

(r code block)

Code
users %>% ggplot(aes(height,fill=gender)) + geom_density(alpha=0.6) + ggtitle("Height by gender - Missing gender, *, has bi-modal distr.") + theme_minimal(base_size = 15)

run PRSice

create phenotype data file

create phenotype data file as intersection of the plink formatted fam file and the users table from the database repgp-data.sqlite3

(r code block)

Code
fam = read_tsv(glue::glue("{DATA}/repgp.fam"),col_names = FALSE)

(r code block)

Code
names(fam)[1:2] = c("FID","IID")
fam <- fam %>% select(FID, IID) %>% inner_join(users %>% select(sample,height,weight,gender), by=c("IID"="sample")) 
write_tsv(fam,file=glue::glue("{DATA}/phenodata.txt"))

execute bash command to run PRSice

(bash code block)

Code
mkdir /cloud/project/Lab-personal-genomes-project-data/output

Rscript /cloud/project/PRSice.R --dir /cloud/project \
    --prsice /cloud/project/PRSice_linux \
    --base /cloud/project/Lab-personal-genomes-project-data/ukb_height.gz \
    --target cloud/project/Lab-personal-genomes-project-data/repgp.chr# \
    --snp variant_id \
    --A1 effect_allele \
    --A2 non_effect_allele \
    --stat effect_size \
    --beta \
    --pvalue pvalue \
    --pheno-file /cloud/project/Lab-personal-genomes-project-data/phenodata.txt \
    --pheno-col height \
    --bar-levels 5e-08,5e-07,5e-06,5e-05,5e-04,5e-03,5e-02,5e-01,1 \
    --fastscore \
    --binary-target F \
    --thread 2 \
    --out /cloud/project/Lab-personal-genomes-project-data/output/height_score_all

If you get duplicated SNP ID, follow the instructions

(bash code block)

Code
PRSice="/cloud/project/"
WORK="/cloud/project/Lab-personal-genomes-project-data/"

Rscript $PRSice/PRSice.R --dir $PRSice \
    --prsice $PRSice/PRSice_mac \
    --base $WORK/ukb_height.gz \
    --extract $WORK/output/height_score_all.valid \
    --target $WORK/repgp.chr# \
    --snp variant_id \
    --A1 effect_allele \
    --A2 non_effect_allele \
    --stat effect_size \
    --beta \
    --pvalue pvalue \
    --pheno-file $WORK/phenodata.txt \
    --pheno-col height \
    --bar-levels 5e-08,5e-07,5e-06,5e-05,5e-04,5e-03,5e-02,5e-01,1 \
    --fastscore \
    --binary-target F \
    --thread 2 \
    --out $WORK/output/height_score_all

This may take a couple minutes

plot observed vs predicted height

(r code block)

Code
## we already have the phenotype data in the fam variable
predicted_height = read_delim(glue::glue("{DATA}/output/height_score_all.best"),delim=" ")

(r code block)

Code
combined <- predicted_height %>% inner_join(fam,by=c("IID"="IID")) 
combined %>% ggplot(aes(PRS,height)) + geom_point() + theme_minimal(base_size = 15)

regress observed height with predicted height (PRS)

(r code block)

Code
summary(lm(height ~ PRS,data=combined)) %>% coef()