Host variation data practical session

Dataset description

For this practical, we will be working with genomes and metagenomes from farmed Atlantic salmon. Shotgun metagenomics libraries were prepared from gut content samples. After adapter removal and other quality control steps, we aligned the reads to the salmon reference genome, using methods described in the first half of this session.

The unmapped metagenomic reads were then used to generate MAGs, to characterise the salmon gut microbiome, using a similar pipeline as covered in the Metagenomics session on Day 1.

The mapped genomic reads were used to identify single nucleotide polymorphisms (SNPs) in our samples compared to the published reference genome. Our genomes are relative low coverage (mean: 4X, range: 0.0003-17.3X), so instead of calling genotypes, we generated genotype likelihoods instead, using ANGSD.

Processing steps included:

Identifying callable regions with GATK (i.e. excluding regions of the genome that had extremely low or extremely high coverage)
Generating genotype likelihoods with ANGSD
Linkage-disequilibrium pruning with plink

To identify population structure, we also ran PCangsd on the pruned likelihoods and generated a PCA, which will be used as covariates in GEMMA (see Step 2).

Due to time constraints, we won’t be running these steps in this practical.

Starting files

We will work in the host-variation session directory. Open the terminal and change to this directory:

cd /course/docs/sessions/host-variation/

The following input files are needed to start the practical:

/course/host-variation-extra-data/genotype_probabilities_all_LDpruned_noheader.dose.gz : mean genotype probabilities file (see Step 0)
genetic_data/GCF_905237065.1_Ssal_v3.1_genomic.gff.gz: salmon reference genome annotation file
genetic_data/genotype_probabilities_all_LDpruned_SNPanno.txt: SNP annotation file (all SNPs included in genotype probabilities file with SNP name, position and chromosome ID)
data/individual-phenotype-data.csv: salmon sample metadata (e.g. sea pen, size, etc; also includes PCA coordinates from PCangsd)
data/metagenome-abundance-data.csv: salmon metagenome MAG abundance data

In case you get stuck during the practical, expected output files can be found in expected_output/

Step 0: Generate genotype probabilities

ANGSD produces genotype likelihoods; however, GWAS should be ran on imputed genotype probabilities. We used BEAGLE v3.3.2 (note later versions of BEAGLE do not accept ANGSD .beagle files as input). Again, we have already run this step for you, due to the resources needed (on this dataset, this step took 60 hours with 40 cores & 1000GB RAM).

BEAGLE produces both a gprobs genotype probabilities file and a dose file. The gprobs file typically looks like this:

Example of gprobs file

The first 3 columns correspond to the SNP name and the major and minor alleles (with the alleles coded as 0=A, 1=C, 2=G, 3=T). The other columns correspond to the genotype probabilities for each individual, in sets of three: major/major genotype probability, major/minor genotype probability and minor/minor genotype probability.

However, GEMMA requires the input in “BIMBAM” file format (or PLINK Binary PED format, but we won’t be covering that in this session). This format requires the genotype file to be mean genotypes, also known as genotype doses. View the doses file produced by BEAGLE:

less -S /course/host-variation-extra-data/genotype_probabilities_all_LDpruned_noheader.dose.gz

Example of dose file

In this format, each SNP for each individual has a value from 0 to 2, based on the following calculation:

mean genotype = 2 x P(minor/minor) + 1 x P(major/minor) + 0 x P(major/major)

The SNP annotation file looks like this, with the SNP name in the first column (ChromosomeName_Position), the position on the chromosome in the second column and the chromosome number in the last column:

head -5 genetic_data/genotype_probabilities_all_LDpruned_SNPanno.txt

## NC_059442.1_84727,84727,1
## NC_059442.1_132329,132329,1
## NC_059442.1_141907,141907,1
## NC_059442.1_170531,170531,1
## NC_059442.1_170604,170604,1

Step 1: Prepare microbiome composition data

In a microbiome-GWAS, microbiome composition data (e.g. taxon abundance) are used as the phenotypes of interest. These can be abundance data or presence/absence data generated from any method of characterising the microbiome, including 16S amplicon sequencing, shotgun metagenomics sequencing, etc.

Today, we will use abundance and presence/absence of the MAGs generated from the salmon gut.

If you haven’t already, first activate the conda environment we will be using for this session:

conda activate mgwas-env

And make a directory to store the output in:

mkdir output/

Then open R from the terminal:

In R, load the following libraries:

library(ggplot2)
library(reshape2)

Now load our microbiome data:

metag.data <- read.csv("data/metagenome-abundance-data.csv", row.names = 1)
head(metag.data)

##          MAG01     MAG02 MAG03     MAG04 MAG05     MAG06      MAG07    MAG08
## F010 23.483799 0.0000000     0 0.0000000     0 0.2509768 0.00000000 4.957079
## F101 55.213452 0.1947917     0 0.0000000     0 0.0000000 0.00000000 0.000000
## F102  5.250481 0.0000000     0 0.0000000     0 0.0000000 0.00000000 1.311248
## F104 61.387989 0.0000000     0 0.1076455     0 0.0000000 0.00000000 0.000000
## F108 56.180997 0.4947632     0 0.0000000     0 0.0000000 0.01223235 0.000000
## F011 61.458915 0.0322237     0 0.0000000     0 0.0000000 0.00000000 0.000000
##         MAG09     MAG10    MAG11    MAG12       MAG13 MAG14 MAG15
## F010 0.000000 0.0000000 0.000000 0.000000 0.000000000     0     0
## F101 0.000000 0.0000000 2.513862 0.000000 0.000000000     0     0
## F102 6.277749 0.7441082 0.000000 0.000000 0.000000000     0     0
## F104 0.000000 0.0000000 0.000000 0.000000 0.009031163     0     0
## F108 0.000000 0.0000000 0.000000 1.010648 0.000000000     0     0
## F011 0.000000 0.0000000 0.000000 0.000000 0.000000000     0     0

Each row is a salmon individual and each column has the abundance value of a particular MAG.

Now load the salmon metadata:

salmon.data <- read.csv("data/individual-phenotype-data.csv")
head(salmon.data)

##   Fish.ID Sea.pen Mass.kg Parasite.detected   HostG.PC1   HostG.PC2
## 1    F001    Pen1    3.24              TRUE  0.05975953 -0.01488651
## 2    F010    Pen1    1.77              TRUE  0.06046524 -0.00784219
## 3    F101    Pen1    3.21              TRUE -0.03687514  0.10366554
## 4    F102    Pen1    3.34              TRUE  0.07105788 -0.01421380
## 5    F103    Pen1    5.22              TRUE -0.07668275 -0.07878436
## 6    F104    Pen1    3.57              TRUE -0.06973302 -0.05835064
##      HostG.PC3    HostG.PC4    HostG.PC5 HostG.beagle.order
## 1  0.005764388 -0.002512078 -0.023682409                  1
## 2 -0.002734040  0.006211329  0.010953599                  9
## 3  0.080678337 -0.020533602  0.047165296                 88
## 4 -0.000827069  0.008436523 -0.009851735                 89
## 5  0.000480112  0.007998161  0.031364163                 90
## 6  0.078336013  0.031590117 -0.005405106                 91

Again, each row is the salmon individual, while the columns include information collected about the individual during sampling, such as the sea pen it came from (Sea.pen), it’s mass during sampling (Mass.kg) and whether a parasitic worm was observed in its gut (Parasite.detected). There are also some host genomic variables that we will talk about later.

Let’s plot the microbiome community composition:

metag.long <- melt(as.matrix(metag.data), varnames = c("Fish.ID","MAG"), value.name = "abundance")

ggplot(metag.long, aes(Fish.ID, MAG, fill = abundance))+
  geom_tile()+
  scale_fill_gradient(low="white", high="blue")+
  theme_bw()+
  theme(title = element_text(size = 16), 
        axis.text = element_text(size = 14, colour = "black"), axis.text.x = element_blank(),
        legend.text = element_text(size = 14, colour = "black"))

The community in these salmon is dominated by MAG01. However, several other MAGs are present in some samples at lower abundances, as can be seen if we view presence/absence of MAGs in the community instead.

metag.long$presence <- ifelse(metag.long$abundance > 0, 1, 0) 
#if abundance is > 0, MAG is present so return value 1; if not, return value 0

ggplot(metag.long, aes(Fish.ID, MAG, fill = as.factor(presence)))+
  geom_tile()+
  scale_fill_manual(values=c("white","blue"))+
  labs(fill = "presence")+
  theme_bw()+
  theme(title = element_text(size = 16), 
        axis.text = element_text(size = 14, colour = "black"), axis.text.x = element_blank(),
        legend.text = element_text(size = 14, colour = "black"))

# MAG presence/absence in wide format
metag.binary <- dcast(metag.long, Fish.ID ~ MAG, value.var = "presence")
row.names(metag.binary) <- metag.binary$Fish.ID
metag.binary <- as.matrix(metag.binary[,-1])

# Most frequently detected MAGs
sort(colSums(metag.binary), decreasing = T)

## MAG01 MAG02 MAG06 MAG11 MAG03 MAG04 MAG08 MAG12 MAG05 MAG13 MAG07 MAG09 MAG10 
##   302   169   139    84    67    56    55    39    37    22    16    16     4 
## MAG14 MAG15 
##     2     1

Normally, I would run microbiome-GWAS with MAG abundances for MAGs detected in at least 50 samples and for MAG presence/absence for MAGs detected in at least 10 samples (these numbers are loose guides). For today’s session, we will only run microbiome-GWAS for:

abundance of MAG01
presence of MAG05

Quantitative phenotypes like abundance should be normalised to a gaussian distribution, while binary phenotypes like presence/absence should be labelled with 0 values as “controls” (in this case, absence of MAG) and 1s as “cases” (in this case, presence of MAG). MAG presence/absence is already in this format in the metag.binary matrix. However, MAG01 abundance is very skewed:

ggplot(metag.data, aes(MAG01))+
  geom_histogram(binwidth = 1)+
  labs(x = "abundance", title = "MAG01")+
  theme_bw()+
  theme(plot.title = element_text(size = 16),
        axis.title = element_text(size = 16), 
        axis.text = element_text(size = 14, colour = "black"),
        legend.text = element_text(size = 14, colour = "black"))

In GWAS studies, it is common to use the Inverse Normal Transformation (INT) to force skewed data into a normal distribution. This transformation basically ranks each sample’s abundance from highest to lowest, then reassigns the abundance value to fit a normal distribution. However, first we should exclude samples where MAG01 was not detected and thus abundance values are 0 (we can label these as NA, then GEMMA will skip these samples).

pheno.for.gemma <- data.frame(
  Fish.ID=row.names(metag.data),
  MAG01_abundance=metag.data$MAG01
  )

pheno.for.gemma[which(pheno.for.gemma$MAG01_abundance == 0),"MAG01_abundance"] <- NA

# Function for INT
qtrans <-
  function(x){
    k<-!is.na(x)
    ran<-rank(x[k])
    y<-qnorm((1:sum(k)-0.5)/sum(k))
    x[k]<-y[ran]
    x
  }

pheno.for.gemma$MAG01_INT=qtrans(pheno.for.gemma$MAG01_abundance)

ggplot(pheno.for.gemma, aes(MAG01_INT))+
  geom_histogram(binwidth = 0.1)+
  labs(x = "INT-abundance", title = "MAG01")+
  theme_bw()+
  theme(plot.title = element_text(size = 16),
        axis.title = element_text(size = 16), 
        axis.text = element_text(size = 14, colour = "black"),
        legend.text = element_text(size = 14, colour = "black"))

## Warning: Removed 3 rows containing non-finite values (stat_bin).

The sample order for all GEMMA inputs needs to be the same, as Sample IDs are not provided. This includes the SNP data input beagle file. If this file was generated with ANGSD, then the sample order will be the same as the file input order used in the original ANGSD command. For this dataset, the sample order is stored as the variable HostG.beagle.order in salmon.data. We can add the MAG05 presence data to the pheno.for.gemma dataframe, then reorder the dataframe to match HostG.beagle.order. We can then write this to a text file for GEMMA.

# Add MAG05 presence data
pheno.for.gemma <- cbind.data.frame(
  pheno.for.gemma[match(pheno.for.gemma$Fish.ID, row.names(metag.binary)),], 
  data.frame(MAG05_presence=metag.binary[match(pheno.for.gemma$Fish.ID, row.names(metag.binary)),"MAG05"])
  )

pheno.for.gemma <- merge(pheno.for.gemma, salmon.data[,c("Fish.ID","HostG.beagle.order")], all = T)
# Reorder to match SNP data file
pheno.for.gemma <- pheno.for.gemma[order(pheno.for.gemma$HostG.beagle.order),]

# Write phenotypes to text file
# Tab delimited, don't include colnames
write.table(pheno.for.gemma[,c("Fish.ID","MAG01_INT","MAG05_presence")], 
            "microbiome-phenotypes-for-gemma.txt", 
            sep = "\t", row.names = F, quote = F, col.names = F)

Step 2: Prepare individual covariate data

GEMMA can take into account covariates that may affect the output, e.g. batch effects or other experimental factors. For the linear model, GEMMA needs an intercept term. If no covariates are included, GEMMA will automatically include an intercept. However, if a covariates file is included, then it must include the intercept term as the first column (usually a column of 1s). For this dataset, salmon were from two different pens, which might introduce some pen-specific bias. Some individuals were also infected by a gut parasite. We also have individuals with a range of sizes (from 1 to 10 kg in mass). We know from previous studies that pen environment and parasite infection can affect microbiome composition, and size can be influenced by genetics, so we will include these variables as covariates. Size can remain as a continuous variable, but parasite detection and sea pen should be converted to binary 0s and 1s.

# include the intercept term, mass and convert sea pen & parasite detection to binary 
covariates.for.gemma <- data.frame(
  Fish.ID=salmon.data$Fish.ID,
  Intercept=1,
  Mass.kg=salmon.data$Mass.kg,
  Sea.pen=ifelse(salmon.data$Sea.pen == "Pen1",0,1),
  Parasite.detected=ifelse(salmon.data$Parasite.detected,1,0)
)

GWAS can also be affected by population genetic structure (see Price et al. 2006), which should also be taken into account in the model. Previously population structure was evaluated for this dataset using PCangsd, to produce a PCA. From this PCA, we can see that the salmon population is highly related, but some population structure is evident:

ggplot(salmon.data, aes(HostG.PC1, HostG.PC2, fill=Mass.kg, shape=Sea.pen))+
  geom_vline(xintercept = 0, linetype = 3, size = 1, color = "grey42")+
  geom_hline(yintercept = 0, linetype = 3, size = 1, color = "grey42")+
  geom_point(size = 3, color = "black")+
  scale_shape_manual(values = c(21,24))+
  scale_fill_gradient(low = "blue", high = "yellow")+
  labs(x = "PC1 (5.5%)", y = "PC2 (3.2%)", shape = "Sea pen", fill = "Mass (kg)")+
  theme_bw()+
  theme(axis.text = element_text(color = "black"))

The number of PCs to include in a GWAS depends very much on the population, and there is not great agreement in the literature on how many to use. Since our population is highly related, we will use additional methods to control for population structure (relatedness matrices, see Step 3). We will therefore only include the first 5 PCs from the PCA as covariates.

covariates.for.gemma <- cbind.data.frame(
  covariates.for.gemma,
  salmon.data[,c("HostG.PC1","HostG.PC2","HostG.PC3","HostG.PC4","HostG.PC5")]
)

And again, we need to make sure the final covariates file has the same sample order as the SNP data file. Then we can write it to a text file for GEMMA input.

# Reorder to match SNP data file
covariates.for.gemma <- covariates.for.gemma[match(
  salmon.data[order(salmon.data$HostG.beagle.order),"Fish.ID"], 
  covariates.for.gemma$Fish.ID
  ),]

# Write covariates to text file
# Tab delimited, don't include colnames, don't include Fish.ID
write.table(covariates.for.gemma[,-1], 
            "covariates-for-gemma.txt", 
            sep = "\t", row.names = F, quote = F, col.names = F)

Now we have finished preparing our data for GEMMA, we can leave R and go back to the terminal to run GEMMA.

To quit R, run the following (at the prompt, type ‘n’ to not save the workspace):

q()

Step 3: Run GEMMA

GEMMA must be run separately for each phenotype. Let’s start with MAG01 abundance.

GEMMA run for MAG01 abundance

We will run GEMMA in two steps. First, a relatedness matrix must be generated. This corrects for fine-level population structure not captured by PCA. Like with many farmed animals, the salmon in this dataset are highly related, so it’s particularly important to account for relatedness in this population.

Linear mixed models like GEMMA can overcorrect p-value inflation when the relatedness matrix is run on the same regions of the genome as the association test. It is therefore common to use a ‘leave-one-chromosome-out’ or LOCO strategy. With LOCO, the association test is run separately for each chromosome, and all the other chromosomes are used for the relatedness matrix calculation. For example, today we will run GEMMA on chr5, so we will use chr 1-4 & 6-30 for the relatedness matrix.

First we need to create two files: a list of the SNPs on chr5 to be included in the association test, and a list of SNPs not on chr5 to be used for the relatedness matrix. We can create these lists from the SNP annotation file.

awk -F"," '{if($3==5) print $1}' \
   genetic_data/genotype_probabilities_all_LDpruned_SNPanno.txt > snps_for_gwas_chr5.txt

awk -F"," '{if($3!=5) print $1}' \
   genetic_data/genotype_probabilities_all_LDpruned_SNPanno.txt > snps_for_relatedmatrix_NOchr5.txt

Now we can generate the relatedness matrix with the following parameters:

-g: genotype probabilities file
-p: phenotype file (GEMMA will exclude samples with a missing phenotype from the calculation)
-n: column in the phenotype file that should be used as the phenotype (in our case, column 1 is the sample IDs, column 2 is MAG01 abundance and column 3 is MAG05 presence)
-snps: list of SNPs to use for the relatedness matrix calculation
-gk: which method to use to calculate the relatedness matrix. -gk 1 uses the centered relatedness matrix and is preferred when the SNP effect size does not depend on its minor allele frequency (see the manual for details). -gk 2 uses the standardized relatedness matrix and is preferred when SNPs with lower minor allele frequency tend to have larger effects. Today we will use the centered relatedness matrix.
-outdir: path of the output directory
-o: output prefix

gemma -g /course/host-variation-extra-data/genotype_probabilities_all_LDpruned_noheader.dose.gz \
  -p microbiome-phenotypes-for-gemma.txt -n 2 \
  -snps snps_for_relatedmatrix_NOchr5.txt -gk 1 \
  -outdir output/ -o MAG01_gk1_NOchr5

This step should take approx. 5 mins. The command produces a log file and a .cXX.txt file containing the relatedness matrix.

Then we can do the main GEMMA association test. Today we will use a univariate linear mixed model, with the following parameters:

-g: genotype probabilities file
-p: phenotype file
-n: column in the phenotype file that should be used as the phenotype (in our case, column 2 for MAG01 abundance and column 3 for MAG05 presence)
-snps: list of SNPs to use for the association test
-k: relatedness matrix file from above (has the extension *.cXX.txt)
-a: SNP annotation file
-c: covariates file
-lmm: which p-value test to run. Today we will use the Wald test with -lmm 1 (see the GEMMA manual for other options)
-outdir: path of the output directory
-o: output prefix

gemma -g /course/host-variation-extra-data/genotype_probabilities_all_LDpruned_noheader.dose.gz \
  -p microbiome-phenotypes-for-gemma.txt -n 2 \
  -snps snps_for_gwas_chr5.txt -k output/MAG01_gk1_NOchr5.cXX.txt \
  -a genetic_data/genotype_probabilities_all_LDpruned_SNPanno.txt \
  -c covariates-for-gemma.txt -lmm 1 \
  -outdir output/ -o MAG01_abundance_chr5

This step should take approx. 2 mins. The command produces another log file and a .assoc.txt file containing the results of the association test.

View the start of the association results:

head -5 output/MAG01_abundance_chr5.assoc.txt

## chr  rs  ps  n_miss  allele1 allele0 af  beta    se  logl_H1 l_remle p_wald
## 5    NC_059446.1_64608   64608   0   1   2   0.216   -1.437943e-01   1.033518e-01    -3.768229e+02   1.000000e-05    1.651907e-01
## 5    NC_059446.1_64680   64680   0   3   1   0.380   6.764786e-02    9.038766e-02    -3.775078e+02   1.000000e-05    4.548097e-01
## 5    NC_059446.1_69918   69918   0   2   0   0.059   1.367337e-01    2.014546e-01    -3.775574e+02   1.000000e-05    4.978452e-01
## 5    NC_059446.1_69939   69939   0   3   1   0.055   5.639072e-02    2.306739e-01    -3.777577e+02   1.000000e-05    8.070457e-01

The columns correspond to:

chr: chromosome number (all chr 5 in this case)
rs: SNP name
ps: SNP position on the chromosome
n_miss: number of missing values for a given SNP (i.e. SNPs that do not pass the quality control filters - see the GEMMA manual for details)
allele1: minor allele, coded in the same way as the genotype probabilities file
allele0: major allele, coded in the same way as the genotype probabilities file
af: allele frequency
beta: beta estimates
se: standard errors for beta
logl_H1: log likelihood under the alternative hypothesis as a measure of goodness of fit
l_remle: REML estimates for lambda
p_wald: p-values from Wald test

For GWAS, the distribution of p-values are what we use to identify SNPs that might be associated with our phenotypes of interest.

Normally, we would then repeat these two commands for all chromosomes and concatenate the results, but since this will take around 2 hours to run (unless run in parallel), we won’t do so today. But you could write a for loop to run on per chromosome, like this:

## EXAMPLE ONLY, DO NOT RUN
chrs=$(awk -F"," '{print $3}' genetic_data/genotype_probabilities_all_LDpruned_SNPanno.txt | uniq)
for c in $chrs;
do
  # create SNP lists for relatedness matrix & association test
  grep "^$c" -v genetic_data/genotype_probabilities_all_LDpruned_SNPanno.txt \
    | awk -F"," '{print $1}' > snps_for_relatedmatrix_NOchr${c}.txt
  grep "^$c" genetic_data/genotype_probabilities_all_LDpruned_SNPanno.txt \
    | awk -F"," '{print $1}' > snps_for_gwas_chr${c}.txt
  # run relatedness matrix
  gemma -g /course/host-variation-extra-data/genotype_probabilities_all_LDpruned_noheader.dose.gz \
    -p microbiome-phenotypes-for-gemma.txt -n 2 \
    -snps snps_for_relatedmatrix_NOchr${c}.txt -gk 1 \
    -outdir output/ -o MAG01_gk1_NOchr${c}
   # run gemma association test
   gemma -g /course/host-variation-extra-data/genotype_probabilities_all_LDpruned_noheader.dose.gz \
     -p microbiome-phenotypes-for-gemma.txt -n 2 \
     -snps snps_for_gwas_chr${c}.txt -k output/MAG01_gk1_NOchr${c}.cXX.txt \
     -a genetic_data/genotype_probabilities_all_LDpruned_SNPanno.txt \
     -c covariates-for-gemma.txt -lmm 1 \
     -outdir output/ -o MAG01_abundance_chr${c};
done

# concatenate all chrs (save header on first file, then skip header on all others)
head -n 1 output/MAG01_abundance_chr1.assoc.txt > output/MAG01_abundance_all.assoc.txt
tail -n +2 -q output/MAG01_abundance_chr*.assoc.txt >> output/MAG01_abundance_all.assoc.txt

GEMMA run for MAG05 presence/absence

Now let’s repeat the two commands, but for MAG05 presence/absence. This means we should change -n to column 3 and change our output prefix.

gemma -g /course/host-variation-extra-data/genotype_probabilities_all_LDpruned_noheader.dose.gz \
  -p microbiome-phenotypes-for-gemma.txt -n 3 \
  -snps snps_for_relatedmatrix_NOchr5.txt -gk 1 \
  -outdir output/ -o MAG05_gk1_NOchr5

gemma -g /course/host-variation-extra-data/genotype_probabilities_all_LDpruned_noheader.dose.gz \
  -p microbiome-phenotypes-for-gemma.txt -n 3 \
  -snps snps_for_gwas_chr5.txt -k output/MAG05_gk1_NOchr5.cXX.txt \
  -a genetic_data/genotype_probabilities_all_LDpruned_SNPanno.txt \
  -c covariates-for-gemma.txt -lmm 1 \
  -outdir output/ -o MAG05_presence_chr5

Again, this is just for chr 5, but normally we would run GEMMA on each chromosome. We have provided some example plots of the results of running GEMMA on all chromosomes for these two MAG variables in the next step.

Step 4: Visualise results

To plot the results of our GWAS runs, we will go back into R or RStudio and use the qqman package.

Once in R, load the following libraries:

library(reshape2)
library(ggplot2)
library(qqman)

Again, we will start with MAG01.

Results for MAG01 abundance

Read in the GEMMA results file:

res <- read.delim("output/MAG01_abundance_chr5.assoc.txt")

To evaluate how well the GWAS ran and whether we can trust the results, we plot the actual p-values against a theoretical or expected distribution. This is called a qqplot. We can also calculate lambda, the inflation value, which gives us a numerical summary of the qqplot. Ideally this value should be ~ 1. Values < 1 suggests something is wrong with the model used for the association test. Values > 1 mean that the p-values are inflated and many are likely false positives.

The inflation value can be calculated with the following function:

# define function
inflation <- function(ps) {
  chisq <- qchisq(1 - ps, 1)
  lambda <- median(chisq) / qchisq(0.5, 1)
  lambda
}

# run function
i <- inflation(res$p_wald)
print(i)

## [1] 1.075436

The inflation value is 1.075.

We can then generate a qqplot with the following function:

# define function
gg_qqplot <- function(ps, ci = 0.95) {
  n  <- length(ps)
  df <- data.frame(
    observed = -log10(sort(ps)),
    expected = -log10(ppoints(n)),
    clower   = -log10(qbeta(p = (1 - ci) / 2, shape1 = 1:n, shape2 = n:1)),
    cupper   = -log10(qbeta(p = (1 + ci) / 2, shape1 = 1:n, shape2 = n:1))
  )
  log10Pe <- expression(paste("Expected -log"[10], plain(P)))
  log10Po <- expression(paste("Observed -log"[10], plain(P)))
  ggplot(df) +
    geom_ribbon(
      mapping = aes(x = expected, ymin = clower, ymax = cupper),
      alpha = 0.1
    ) +
    geom_point(aes(expected, observed), shape = 1, size = 3) +
    geom_abline(intercept = 0, slope = 1, alpha = 0.5) +
    xlab(log10Pe) +
    ylab(log10Po)
}

# plot
gg_qqplot(res$p_wald)

We can see that observed p-values match the expected p-values pretty well.

Now we know our GWAS model is good, we can view the Manhattan plot to visualise SNPs significantly associated with the phenotype.

First we define a “significance” p-value threshold to highlight interesting SNPs. In typical (usually human) GWAS, 5e-8 is often used, given the large number of hypotheses being tested. However, there are many other methods, including more formal adjustments for multiple hypothesis testing (for further reading, see van der Berg et al. 2019). Today, we will stick with p < 5e-8.

res$sigSNP <- res$p_wald < 5e-8

manhattan(res, ylim = c(0,9), chr="chr", bp="ps", p="p_wald", snp="sigSNP")

The log of the p-values is on the y axis, and the position of each SNP along chr 5 is plotted on the x axis. A red line is drawn at our significant threshold, p = 5e-8 (7.3 in log-transformed space). A blue line is drawn at p = 1e-5 (5 in log-transformed space). This is a less conservative threshold that is sometimes used in GWAS. As can be seen, no SNPs on chr5 are above even the blue line, suggesting that no SNPs on chr5 are associated with MAG01 abundance.

If we had run GEMMA on all chromosomes, we would see a qqplot and Manhattan plot that look like this (in this qqplot, the “blue line” of p-value significance is included):

Example of MAG01 abundance qqplot, all chromosomes

Example MAG01 abundance Manhattan plot, all chromosomes

As can be seen, no SNPs pass our significance threshold (red line), and few are even above the less conservative threshold (blue line). So, we won’t take these results further.

Results for MAG05 presence/absence

Now let’s looks at the results for MAG05 presence. We run the same commands as before (we don’t have to re-run the functions if we are in the same workspace, as they are already loaded).

# read in the data
res <- read.delim("output/MAG05_presence_chr5.assoc.txt")

# get the inflation value
i <- inflation(res$p_wald)
print(i)

## [1] 1.272533

# qqplot
gg_qqplot(res$p_wald)+
   geom_hline(yintercept = -log10(1e-5), colour = "blue")

The inflation value is 1.273. We can tell from this and the qqplot that our p-values are a little inflated. However, had we run GEMMA on all chromosomes, the inflation value would be 1.043. The corresponding qqplot would look like this:

Example of MAG05 presence qqplot, all chromosomes This suggests that the model is ok, we just needed more data to evaluate it properly.

Now we can generate the Manhattan plot for chr 5:

res$sigSNP <- res$p_wald < 5e-8

manhattan(res, ylim = c(0,9), chr="chr", bp="ps", p="p_wald", snp="sigSNP")

We can see that at least 2 SNPs are significant on chr 5 and several neighbouring SNPs are tending towards significance (i.e. above the blue line). To put this in context of the entire genome, here are the results from running GEMMA on all chromosomes:

Example of MAG05 presence Manhattan plot, all chromosomes

Only the SNPs forming a peak on chr 5 are significant. We will look into the SNPs on this peak in more detail, by looking at what genes they fall in. We will include the SNPs above the blue line (p < 1e-5).

We can extract the positions of significant SNPs to save as a BED file, then we can use BEDTools to pull down their annotations from the reference genome gff file. BED files consist of the chr name, the start position and the end position of the region of interest. Today we will just extract positions 1 bp to either side of the significant SNPs (win = 1), but you can use a larger flanking window to identify e.g. SNPs in promoter regions of genes.

Note that BED files are zero indexed (i.e. the first base on the chromosome is position 0), whereas all ANGSD-associated files (and therefore GEMMA output files) record the first base on a chromosome as position 1 (more info available here). So we need to minus one from each SNP position. You should also be aware that BED files read a region from start_position to before end_position (e.g. if start_position = 10 and end_position = 15, BEDTools will return information for positions 10, 11, 12, 13 and 14).

# define function
print_bed_f <- function(snp.out, win = 1, tab.name = "output.bed"){
  # format output from ANGSD doAsso "significant" SNPs (user-defined) as a BED file
  
  # note we need the chr NAME for the BED file, not the chr number (can get from the marker name)
  snp.out$Chromosome <- sapply(snp.out$rs, function(x){
    paste0(strsplit(as.character(x),"_")[[1]][1:2], collapse = "_") 
    })
  
  # calculate start and end positions based on window size
  df <- snp.out[,c("Chromosome","ps")]
  # convert ANGSD positions to BED zero indexed positions
  df$ps_bed <- df$ps - 1
  if(win == 1){
    df$start_pos <- df$ps_bed
  } else {
    df$start_pos <- df$ps_bed - win
  }
  df$stop_pos <- df$ps_bed + win
  
  write.table(df[,c("Chromosome","start_pos","stop_pos")], 
              file = tab.name, sep = "\t", 
              row.names = FALSE, col.names = FALSE, quote = FALSE)
}

print_bed_f(subset(res, p_wald < 1e-5), win = 1, tab.name = "output/MAG05_presence_sigSNPs.bed")

Step 5: Extract SNP annotation

Back in the terminal, we can use BEDTools intersect to extract annotations in the salmon reference genome from the regions in our BED file.

bedtools intersect -wb -a genetic_data/GCF_905237065.1_Ssal_v3.1_genomic.gff.gz \
  -b output/MAG05_presence_sigSNPs.bed > output/MAG05_presence_sigSNPs_anno.txt

head -n 5 output/MAG05_presence_sigSNPs_anno.txt

## NC_059446.1  RefSeq  region  15487786    15487786    .   +   .   ID=NC_059446.1:1..92788608;Dbxref=taxon:8030;Name=ssa05;chromosome=ssa05;gbkey=Src;genome=chromosome;mol_type=genomic DNA   NC_059446.1 15487785    15487786
## NC_059446.1  RefSeq  region  16099761    16099761    .   +   .   ID=NC_059446.1:1..92788608;Dbxref=taxon:8030;Name=ssa05;chromosome=ssa05;gbkey=Src;genome=chromosome;mol_type=genomic DNA   NC_059446.1 16099760    16099761
## NC_059446.1  RefSeq  region  16256914    16256914    .   +   .   ID=NC_059446.1:1..92788608;Dbxref=taxon:8030;Name=ssa05;chromosome=ssa05;gbkey=Src;genome=chromosome;mol_type=genomic DNA   NC_059446.1 16256913    16256914
## NC_059446.1  RefSeq  region  16274373    16274373    .   +   .   ID=NC_059446.1:1..92788608;Dbxref=taxon:8030;Name=ssa05;chromosome=ssa05;gbkey=Src;genome=chromosome;mol_type=genomic DNA   NC_059446.1 16274372    16274373
## NC_059446.1  RefSeq  region  16363100    16363100    .   +   .   ID=NC_059446.1:1..92788608;Dbxref=taxon:8030;Name=ssa05;chromosome=ssa05;gbkey=Src;genome=chromosome;mol_type=genomic DNA   NC_059446.1 16363099    16363100

The annotation file is a little messy, so we can use the following awk and sed commands to view the results in a quick, cleaner format:

awk 'BEGIN{FS="\t"}{split($9,a,";"); if($3~"gene") print a[1]"\t"a[3]"\t"$1":"$4"-"$5"\t"a[6]}' \
  output/MAG05_presence_sigSNPs_anno.txt | sed 's/ID=//' | sed 's/Name=//' \
  | sed 's/gene_biotype=//'| sed 's/ //g' | sed '1igene_id\tGeneSymbol\tChromosome\tClass'

## gene_id  GeneSymbol  Chromosome  Class
## gene-LOC106604312    LOC106604312    NC_059446.1:15487786-15487786   protein_coding
## gene-LOC123743081    LOC123743081    NC_059446.1:16099761-16099761   protein_coding
## gene-LOC123743236    LOC123743236    NC_059446.1:16274373-16274373   pseudogene
## gene-LOC106604348    LOC106604348    NC_059446.1:17334765-17334765   protein_coding
## gene-LOC106604348    LOC106604348    NC_059446.1:17347221-17347221   protein_coding
## gene-LOC106604348    LOC106604348    NC_059446.1:17359592-17359592   protein_coding

Some SNPs might not be annotated. Here we get results for 6 SNPs, 2 which are located on different protein-coding genes, 1 on a pseudogene and 3 which are located on the same protein-coding gene.

The protein-coding genes have gene names in the associated mRNA transcripts records, so we can pull this information from the annotation as well:

awk 'BEGIN{FS="\t"}{split($9,a,";"); if($3~"mRNA") print a[length(a)-4]"\t"a[length(a)-3]"\t"a[length(a)-1]}' \
  output/MAG05_presence_sigSNPs_anno.txt | sed 's/gene=//' | sed 's/gbkey=//' | sed 's/product=//' \
  | sed 's/ //g' | sed '1itype\tGeneSymbol\tProductName'

## type GeneSymbol  ProductName
## mRNA LOC106604312    ankyrinrepeatandKHdomain-containingprotein1-like
## mRNA LOC123743081    teneurin-2-like
## mRNA LOC106604348    glutamatereceptor1-like%2CtranscriptvariantX2
## mRNA LOC106604348    glutamatereceptor1-like%2CtranscriptvariantX2
## mRNA LOC106604348    glutamatereceptor1-like%2CtranscriptvariantX2
## mRNA LOC106604348    glutamatereceptor1-like%2CtranscriptvariantX4
## mRNA LOC106604348    glutamatereceptor1-like%2CtranscriptvariantX4
## mRNA LOC106604348    glutamatereceptor1-like%2CtranscriptvariantX4
## mRNA LOC106604348    glutamatereceptor1-like%2CtranscriptvariantX1
## mRNA LOC106604348    glutamatereceptor1-like%2CtranscriptvariantX1
## mRNA LOC106604348    glutamatereceptor1-like%2CtranscriptvariantX1
## mRNA LOC106604348    glutamatereceptor1-like%2CtranscriptvariantX3
## mRNA LOC106604348    glutamatereceptor1-like%2CtranscriptvariantX3
## mRNA LOC106604348    glutamatereceptor1-like%2CtranscriptvariantX3
## mRNA LOC106604348    glutamatereceptor1-like%2CtranscriptvariantX5
## mRNA LOC106604348    glutamatereceptor1-like%2CtranscriptvariantX5
## mRNA LOC106604348    glutamatereceptor1-like%2CtranscriptvariantX5

If we look the gene names up on UniProt, we can learn more about what these 3 genes do (at least in humans and mice):

ankyrin repeat and KH domain-containing protein 1 (ANKHD1) like
teneurin-2-like
glutamate receptor1-like (note that the annotation record shows multiple variants exist of this mRNA transcript)

We can then search the literature to generate hypotheses about what these genes might be doing in relation to the microbiome in our host organism.

EBI-HoloFood-HostG

Instructor: Jaelle Brealey

August 30, 2022