11.9 Data

The GTEx Portal provides links for downloading curated and summarized forms of its data, including giant matrices that encode the expression of every gene across all samples and tissues.

To make the data easier to work with in R, we’ve subset it to 150 samples, restricted to highly expressed genes and to liver and lung tissue only.

gtex <- read.table("gtex_subset.txt.gz", header = TRUE)

head(gtex)
##       Sample   Age Sex Death_Hardy Tissue            Gene_ID Gene_Name Counts
## 1 GTEX-111YS 60-69   M           0   Lung ENSG00000187634.11    SAMD11     59
## 2 GTEX-111YS 60-69   M           0   Lung ENSG00000188976.10     NOC2L   2789
## 3 GTEX-111YS 60-69   M           0   Lung ENSG00000187961.13    KLHL17    716
## 4 GTEX-111YS 60-69   M           0   Lung ENSG00000187583.10   PLEKHN1     47
## 5 GTEX-111YS 60-69   M           0   Lung  ENSG00000187642.9     PERM1     23
## 6 GTEX-111YS 60-69   M           0   Lung ENSG00000188290.10      HES4    534

The columns of this dataframe are:

  • Sample: Individual sequenced
  • Age: Individual’s age range
  • Sex: Individual’s sex
  • Death_Hardy: Individual’s circumstances of death, classified on the Hardy Scale
  • Tissue: Tissue measured
  • Gene_ID: Ensembl gene ID
  • Gene_Name: The common gene name
  • Counts: Expression level for the gene
    • Ex: GTEX-111YS has 59 sequencing reads that mapped to the SAMD11 gene
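
As a quick check of these columns, here is a minimal sketch. So that it runs without the file, it rebuilds the six rows printed by head(gtex) above into a toy data frame; the name gtex_head is just a stand-in for this illustration. The same calls should work on the full gtex object.

```r
# Toy stand-in for `gtex`: the six rows shown by head(gtex).
gtex_head <- data.frame(
  Sample      = rep("GTEX-111YS", 6),
  Age         = rep("60-69", 6),
  Sex         = rep("M", 6),
  Death_Hardy = rep(0L, 6),
  Tissue      = rep("Lung", 6),
  Gene_ID     = c("ENSG00000187634.11", "ENSG00000188976.10",
                  "ENSG00000187961.13", "ENSG00000187583.10",
                  "ENSG00000187642.9",  "ENSG00000188290.10"),
  Gene_Name   = c("SAMD11", "NOC2L", "KLHL17", "PLEKHN1", "PERM1", "HES4"),
  Counts      = c(59, 2789, 716, 47, 23, 534)
)

str(gtex_head)                     # type of each column
table(gtex_head$Tissue)            # number of rows per tissue
length(unique(gtex_head$Sample))   # number of distinct individuals

# Look up the expression value for one gene in one sample
samd11 <- subset(gtex_head, Sample == "GTEX-111YS" & Gene_Name == "SAMD11")
samd11$Counts
```

On the full dataset, table(gtex$Tissue) is a quick way to confirm that only liver and lung are present.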

Data normalization

The expression levels in this table have been normalized to account for technical differences between samples, such as sequencing depth: if we collected more sequencing data from one individual than from another, the raw read counts would not be directly comparable.
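
The normalization method used for this table is not described here. To illustrate the general idea only, the sketch below applies a simple counts-per-million scaling, which is an assumption chosen for simplicity and not necessarily what was done to these data. The sample names, gene names, and counts are all made up.

```r
# Hypothetical raw counts: sample S2 was sequenced twice as deeply as S1,
# but the relative expression of the three genes is identical.
toy <- data.frame(
  Sample    = rep(c("S1", "S2"), each = 3),
  Gene_Name = rep(c("geneA", "geneB", "geneC"), times = 2),
  Counts    = c(100, 300, 600,
                200, 600, 1200)
)

# Total reads per sample, repeated on every row of that sample
lib_size <- ave(toy$Counts, toy$Sample, FUN = sum)

# Counts per million: scale each count by its sample's total
toy$CPM <- toy$Counts / lib_size * 1e6

toy
```

The raw counts for S2 are double those for S1, but after scaling the two samples have the same values, so the difference in sequencing depth no longer looks like a difference in expression.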