4.15 Homework

4.15.0.1 Learning Objectives

  • Calculate and interpret LD statistics

4.15.0.2 Assignment

We’ve subset the VCF from class to show haplotypes for two different pair of SNPs:

  • chr21:15005329 and chr21:15007704
  • chr21:15336586 and chr21:15336794
# read data for first set of SNPs
hw1 <- read.table("snp_haplotypes_hw1.txt", header = TRUE)

# read data for second set of SNPs
hw2 <- read.table("snp_haplotypes_hw2.txt", header = TRUE)

Assignment: Using the code from class, calculate \(D\), \(D'\), and \(r^2\) for these sets of SNPs. Which alleles are segregating together? What does each LD statistic indicate? (Feel free to check your work on LDpair.)


Solution for first set of SNPs

First use table to count the occurences of the four haplotypes.

table(hw1$snp1_allele, hw1$snp2_allele)
##    
##        A    G
##   C  747 2508
##   T    2 1751

All four possible haplotypes exist in this population.

\(\mathbf{D = h_{12} - p_1*p_2}\)

h <- 747 / 5008
p1 <- (747 + 2508)/5008
p2 <- (747 + 2)/5008

D <- h - p1 * p2
D
## [1] 0.05195286

\(D\) is non-zero, which suggests that these SNPs might be in LD.

\(\mathbf{D' = \frac{D}{\mathrm{min}(p_1 (1-p_2), p_2(1-p_1))}}\) (because \(D > 0\))

First we determine the denominator by calculating which of \(p_1 (1-p_2)\) and \(p_2(1-p_1)\) is smaller:

p1 * (1-p2)
## [1] 0.5527516
p2 * (1-p1)
## [1] 0.05235222

\(p_2(1-p_1)\) is smaller, so we use it for the denominator. \(D'\) is:

Dprime <- D / ((1-p1) * p2)
Dprime
## [1] 0.9923717

\(D' = 0.99\), which indicates very high LD (almost no recombination has occurred between alleles on this haplotype).

\(\mathbf{r^2 = \frac{D^2}{p_1 (1-p_1) q_1 (1-q_1)}}\)

r2 <- D^2 / (p1 * (1-p1) * p2 * (1-p2))
r2
## [1] 0.09327254

However, \(r^2 = 0.09\) (linkage equilibrium)! This is because one of the haplotypes, A T, is very rare – there are only two copies in the population. \(r^2\) tells us that the counts of the A T haplotype are so low that an A at SNP1 doesn’t do a great job of predicting when SNP2 is T.



Solution for second set of SNPs

First use table to count the occurences of the four haplotypes.

table(hw2$snp1_allele, hw2$snp2_allele)
##    
##        A    G
##   A 3522    0
##   G    0 1486

The only haplotypes that exist in this population are A C and G G.

\(\mathbf{D = h_{12} - p_1*p_2}\)

h <- 3522 / 5008
p1 <- (3522 + 0)/5008
p2 <- (3522 + 0)/5008

D <- h - p1 * p2
D
## [1] 0.2086794

\(D\) is non-zero, which suggests that these SNPs might be in LD.

\(\mathbf{D' = \frac{D}{\mathrm{min}(p_1 (1-q_1), (1-p_1)q_1 )}}\) (because \(D > 0\))

First we determine the denominator by calculating which of \(p_1 (1-p_1)\) and \(p_2 (1-p_2)\) is smaller:

p1 * (1-p2)
## [1] 0.2086794
p2 * (1-p1)
## [1] 0.2086794

The two values are exactly the same, so we can use either for the denominator. \(D'\) is:

Dprime <- D / (p1 * (1-p2))
Dprime
## [1] 1

\(D' = 1\)! These SNPs are in maximum LD (no recombination has occured between them).

\(\mathbf{r^2 = \frac{D^2}{p_1 (1-p_1) p_2 (1-p_2)}}\)

r2 <- D^2 / (p1 * (1-p1) * p2 * (1-p2))
r2
## [1] 1

\(r^2 = 1\)! These SNPs are in maximum LD (everyone who carries an A at SNP1 has an A at SNP2, and everyone with a G at SNP1 has a G at SNP2).