6.5 Genotype data

We’ve summarized genotype data from the 1000 Genomes Project into two files:

  • all_variants.txt.gz, which contains a random selection of variants on chr21
  • common_variants.txt.gz, which contains only those variants in all_variants.txt.gz that are common across populations

# all variants
all <- read.table("all_variants.txt.gz")

# only common variants
common <- read.table("common_variants.txt.gz")

# preview first 10 columns of `all` dataframe
head(all[, 1:10])
##                  AF AFR_AF AMR_AF EAS_AF EUR_AF SAS_AF HG00096 HG00097 HG00099
## chr21_10005999 0.02   0.06   0.00      0   0.00   0.00       0       0       0
## chr21_10325486 0.02   0.00   0.01      0   0.03   0.04       0       0       0
## chr21_10336823 0.00   0.00   0.00      0   0.00   0.00       0       0       0
## chr21_10337236 0.00   0.00   0.00      0   0.00   0.00       0       0       0
## chr21_10339129 0.00   0.00   0.00      0   0.00   0.00       0       0       0
## chr21_10339141 0.00   0.00   0.00      0   0.00   0.00       0       0       0
##                HG00100
## chr21_10005999       0
## chr21_10325486       0
## chr21_10336823       0
## chr21_10337236       0
## chr21_10339129       0
## chr21_10339141       0

The row names of the dataframe are the variant IDs. The columns are organized as follows:

  • The first column (AF) contains the variant’s dataset-wide allele frequency (AF).
  • The next five columns contain the variant’s AF in each of the five 1000 Genomes superpopulations (AFR, AMR, EAS, EUR, and SAS).
  • The remaining columns provide the variant’s genotype for each individual in 1000 Genomes, where:
      • 0 is homozygous reference
      • 1 is heterozygous
      • 2 is homozygous for the variant allele
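
Because each genotype code counts the copies of the variant allele that an individual carries, a variant's allele frequency can be recovered from the genotype columns. The sketch below illustrates this on a small made-up dataframe (`toy`, with hypothetical sample names and values) that mimics the layout of `all`. It assumes biallelic variants, no missing genotypes, and that the genotype columns begin at column 7, so treat it as an illustration rather than the exact procedure used to build the files.

```r
# Toy dataframe with the same layout as `all`; all values are invented
toy <- data.frame(
  AF      = c(0.25, 0.00, 0.50),
  AFR_AF  = c(0.50, 0.00, 0.50),
  AMR_AF  = c(0.00, 0.00, 0.50),
  EAS_AF  = c(0.00, 0.00, 0.50),
  EUR_AF  = c(0.50, 0.00, 0.50),
  SAS_AF  = c(0.00, 0.00, 0.50),
  sample1 = c(0, 0, 2),
  sample2 = c(1, 0, 1),
  sample3 = c(1, 0, 1),
  sample4 = c(0, 0, 0),
  row.names = c("var1", "var2", "var3")
)

# Genotype columns come after the six allele frequency columns
genotypes <- toy[, 7:ncol(toy)]

# Each individual carries two alleles, so the variant allele frequency is
# the sum of the genotype codes divided by twice the number of individuals
calc_af <- rowSums(genotypes) / (2 * ncol(genotypes))
calc_af

# Count how many individuals have each genotype for a single variant
table(unlist(genotypes["var3", ]))
```

On the real data, a frequency computed this way should match the AF column only if the genotype columns cover the same individuals that were used to calculate AF.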