10.21 Homework

10.21.0.1 Learning Objectives

Practice calculating introgression statistics in admixr
Interpret the biological significance of region-specific values of the \(f_4\)-ratio

10.21.0.2 Assignment

Follow these steps to create your own genome stratifications for calculating the \(f_4\) ratio statistic.

Go to the UCSC Table Browser, where you can find a wide selection of annotations for the human genome.
Make sure you set the assembly: drop-down box to Feb. 2009 (GRCh37/hg19).
Use the group: and track: menus to select any set of genomic regions.
- You can click the data format description button and scroll to the Description section to find out what each annotation represents.
Under the Retrieve and display data section, set the output format: to BED.
Enter an output filename: (ex: all_genes.bed).
Click get output to download the file.
In Posit Cloud, upload your file using the Upload button in the Files panel (bottom right).
Run the code block below to reformat the BED file. The code matches the UCSC’s chromosome naming format with the format used in the snps data:

# fill in blank with the name of your bed file
system(command = "sed -i 's/chr//g' ________")

# get the path to your bed file
bed <- file.path("________")

Compute the \(f_4\) ratio statistic within and outside of the genomic intervals. Repeat for another set of genome annotations to contrast Neanderthal ancestry in different genomic elements.

Solution

Download tracklist of haploinsufficient genes (Phenotype and Literature -> Haploinsufficiency).

# get the path to the `regions.bed` file
bed <- file.path("haploinsufficient.bed")

# option 1: KEEP only these regions for analysis
new_snps_keep <- filter_bed(snps, bed)
# option 2: REMOVE these regions from analysis
new_snps_remove <- filter_bed(snps, bed, remove = TRUE)

Re-calculate the \(f_4\)-ratio:

# f4-ratio with the regions kept
f4_keep <- f4ratio(data = new_snps_keep,
                       X = pops, A = "Altai", B = "Vindija", C = "Yoruba", O = "Chimp") %>%
  # convert z score to pvalue
  mutate(p = 2 * pnorm(-abs(Zscore)))

f4_keep

##       A       B           X      C     O     alpha   stderr Zscore            p
## 1 Altai Vindija      French Yoruba Chimp  0.030580 0.009385  3.258 1.122004e-03
## 2 Altai Vindija   Sardinian Yoruba Chimp  0.018057 0.009117  1.981 4.759127e-02
## 3 Altai Vindija         Han Yoruba Chimp  0.023127 0.009617  2.405 1.617247e-02
## 4 Altai Vindija      Papuan Yoruba Chimp  0.036462 0.008964  4.068 4.741838e-05
## 5 Altai Vindija Khomani_San Yoruba Chimp -0.001223 0.008774 -0.139 8.894501e-01
## 6 Altai Vindija       Mbuti Yoruba Chimp -0.015781 0.009100 -1.734 8.291808e-02
## 7 Altai Vindija       Dinka Yoruba Chimp -0.004929 0.008235 -0.599 5.491729e-01

# f4-ratio with the regions removed
f4_remove <- f4ratio(data = new_snps_remove,
                       X = pops, A = "Altai", B = "Vindija", C = "Yoruba", O = "Chimp") %>%
  # convert z score to pvalue
  mutate(p = 2 * pnorm(-abs(Zscore)))

f4_remove

##       A       B           X      C     O     alpha   stderr Zscore            p
## 1 Altai Vindija      French Yoruba Chimp  0.016554 0.007929  2.088 3.679783e-02
## 2 Altai Vindija   Sardinian Yoruba Chimp  0.027980 0.007826  3.575 3.502279e-04
## 3 Altai Vindija         Han Yoruba Chimp  0.020285 0.007348  2.761 5.762468e-03
## 4 Altai Vindija      Papuan Yoruba Chimp  0.036136 0.007550  4.786 1.701381e-06
## 5 Altai Vindija Khomani_San Yoruba Chimp  0.005713 0.007593  0.752 4.520511e-01
## 6 Altai Vindija       Mbuti Yoruba Chimp  0.009803 0.007233  1.355 1.754176e-01
## 7 Altai Vindija       Dinka Yoruba Chimp -0.000663 0.006798 -0.097 9.227264e-01

Some of the alpha values for each population change when excluding/restricting to haploinsufficient genes, but their standard error ranges still overlap between the two \(f_4\)-ratio calculations, so they likely aren’t truly different.