  • Published with bookdown & the OTTR template

    Style adapted from: rstudio4edu-book (CC-BY 2.0)

Human Genome Variation Lab

8.16 GWAS of all SNPs with PLINK

Now let’s have PLINK run the association test for every SNP by removing the --snp flag.

system(command = "./plink --file genotypes --linear --allow-no-sex --pheno GS451_IC50.txt --pheno-name GS451_IC50")

The plink.assoc.linear file should now have ~260,000 lines. Load the file into R to look at the results:

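# dplyr provides %>% and arrange(); skip this line if the tidyverse is already loaded
library(dplyr)

results <- read.table(file = "plink.assoc.linear",
                      header = TRUE) %>%
  # sort rows by ascending p-value so the strongest associations come first
  arrange(P)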

head(results)
##   CHR        SNP       BP A1 TEST NMISS   BETA   STAT         P
## 1  19  rs7257475 20372113  T  ADD    88 -3.008 -6.876 9.311e-10
## 2  19 rs10413538 20370690  T  ADD    86 -3.026 -6.805 1.395e-09
## 3  21  rs2826383 20844081  A  ADD   166  3.031  5.866 2.392e-08
## 4  19 rs12972967 20358400  T  ADD    89 -2.421 -5.939 5.760e-08
## 5   2  rs1358578 51626897  A  ADD   166  2.111  5.307 3.571e-07
## 6  17  rs3094508 33137048  C  ADD    89  3.532  5.230 1.156e-06
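
With this many tests, small p-values are expected by chance alone, so it can help to compare the top hits against a multiple-testing threshold. The sketch below applies a Bonferroni correction to the six rows printed above. It makes two assumptions that are not part of the original analysis: the number of tests is taken to be roughly 260,000 (the approximate line count of plink.assoc.linear), and the significance level is set to 0.05. The names top_hits, n_tests, alpha, and bonferroni_threshold are illustrative only.

```r
# Top six rows copied from the head(results) output above
top_hits <- data.frame(
  SNP = c("rs7257475", "rs10413538", "rs2826383",
          "rs12972967", "rs1358578", "rs3094508"),
  CHR = c(19, 19, 21, 19, 2, 17),
  P   = c(9.311e-10, 1.395e-09, 2.392e-08,
          5.760e-08, 3.571e-07, 1.156e-06)
)

# Assumption: about 260,000 tests, based on the approximate line count
n_tests <- 260000
# Assumption: conventional significance level of 0.05
alpha <- 0.05

# Bonferroni correction: divide the significance level by the number of tests
bonferroni_threshold <- alpha / n_tests

# Keep only the SNPs whose p-value falls below the corrected threshold
significant <- top_hits[top_hits$P < bonferroni_threshold, ]
significant
```

Under these assumptions the threshold is about 1.9e-07, so the first four SNPs in the table pass and the last two do not. On the full table, the same filter could be written as results[results$P < 0.05 / nrow(results), ], which uses the actual number of tests instead of the rounded figure.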

All illustrations CC-BY.
All other materials CC-BY unless noted otherwise.