8.13 Linear regression

The function for fitting a linear regression in R is lm(). It takes as arguments a data frame (here, gwas_data) and a model formula of the form outcome ~ predictors.

In a GWAS, the outcome is the phenotype and the predictor is the SNP genotype. We may also include variables such as sex, age, or ancestry as additional predictors (called covariates) to control for their potential confounding effects. No such data are available here, so we just run the simple genotype vs. phenotype test.
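For illustration only, here is a minimal sketch of how covariates would enter the model formula. It runs on simulated data, because gwas_data has no covariate columns; the data frame `sim_data`, the columns `sex`, `age`, and `phenotype`, and the effect sizes used to simulate them are all hypothetical.

```r
# Simulated data only: `sim_data`, `sex`, `age`, and `phenotype` are
# hypothetical names, not columns of gwas_data
set.seed(1)
n <- 100
sim_data <- data.frame(
  dosage = rbinom(n, size = 2, prob = 0.3),  # 0, 1, or 2 allele copies
  sex    = factor(sample(c("F", "M"), n, replace = TRUE), levels = c("F", "M")),
  age    = round(runif(n, min = 20, max = 70))
)
# simulate a phenotype that depends on dosage and age, plus noise
sim_data$phenotype <- 6.5 + 1.4 * sim_data$dosage +
  0.02 * sim_data$age + rnorm(n, sd = 2.5)

# covariates are added to the right-hand side of the formula with `+`
fit <- lm(phenotype ~ dosage + sex + age, data = sim_data)

# one row per term: estimate, standard error, t value, p-value
coefs <- summary(fit)$coefficients
coefs
```

The dosage row of this table is then interpreted as the effect of genotype after adjusting for the other predictors in the model.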

# test for association between genotype and phenotype
lm(data = gwas_data,
   formula = GS451_IC50 ~ dosage) %>%
  # directly pipe (%>%) model results to the `summary()` function
  summary()
## 
## Call:
## lm(formula = GS451_IC50 ~ dosage, data = gwas_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0247 -1.9643 -0.3867  2.1967  6.6201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.5712     0.3299  19.921   <2e-16 ***
## dosage        1.3846     0.6800   2.036   0.0449 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.659 on 83 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.04757,    Adjusted R-squared:  0.0361 
## F-statistic: 4.146 on 1 and 83 DF,  p-value: 0.04493

How do we interpret the results of the linear model?

The coefficient for dosage indicates that, on average, each additional copy of the “G” allele is associated with an increase in \(\mathrm{IC_{50}}\) of \(1.38\).
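As a worked example, the fitted line can be written out using the estimates from the coefficient table. This assumes that dosage counts copies of the “G” allele and takes the values 0, 1, or 2:

\[
\widehat{\mathrm{IC_{50}}} = 6.5712 + 1.3846 \times \mathrm{dosage}
\]

Under that assumption, the predicted \(\mathrm{IC_{50}}\) is about \(6.57\) for individuals with no copies, \(6.5712 + 1.3846 \approx 7.96\) for one copy, and \(6.5712 + 2 \times 1.3846 \approx 9.34\) for two copies.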

The p-value indicates that this slope of \(1.38\) is significantly different from 0 at the \(0.05\) level (\(p = 0.0449\)).
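The p-value comes from the t value reported in the coefficient table, which is the estimate divided by its standard error:

\[
t = \frac{\hat{\beta}_{\mathrm{dosage}}}{\mathrm{SE}(\hat{\beta}_{\mathrm{dosage}})} = \frac{1.3846}{0.6800} \approx 2.036
\]

This statistic is compared against a t distribution with 83 degrees of freedom, using a two-sided test, to give \(p = 0.0449\). Because the model has a single predictor, the F-statistic at the bottom of the output is the square of this t value (\(2.036^2 \approx 4.15\)) and has the same p-value.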



Do you think this SNP would reach genome-wide significance?

This p-value is borderline, sitting very close to the arbitrary cutoff of \(0.05\) that is generally used to determine statistical significance.

If this were the only SNP that we were investigating, we might find this result promising. However, this SNP is just one of hundreds of thousands of SNPs that we will test for association, so the burden of proof will need to be much higher. Recall that the genome-wide significance threshold for GWAS in humans is \(5 \times 10^{-8}\).
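One common way to motivate this threshold, stated here as background rather than something derived in this section, is as a Bonferroni correction of the usual \(0.05\) cutoff for roughly one million independent tests across the genome:

\[
\alpha_{\text{genome-wide}} = \frac{0.05}{10^{6}} = 5 \times 10^{-8}
\]

Our p-value of \(0.0449\) is nearly a million times larger than this threshold (\(0.0449 / (5 \times 10^{-8}) \approx 9 \times 10^{5}\)), so this SNP would fall far short of genome-wide significance.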