6.5 Genotype data
We’ve summarized genotype data from the 1000 Genomes Project into two files:
* all_variants.txt.gz, which contains a random selection of variants on chr21
* common_variants.txt.gz, which contains only those variants in all_variants that are common across populations
# all variants
all <- read.table("all_variants.txt.gz")
# only common variants
common <- read.table("common_variants.txt.gz")
# preview first 10 columns of `all` dataframe
head(all[, 1:10])
##                  AF AFR_AF AMR_AF EAS_AF EUR_AF SAS_AF HG00096 HG00097 HG00099
## chr21_10005999 0.02   0.06   0.00      0   0.00   0.00       0       0       0
## chr21_10325486 0.02   0.00   0.01      0   0.03   0.04       0       0       0
## chr21_10336823 0.00   0.00   0.00      0   0.00   0.00       0       0       0
## chr21_10337236 0.00   0.00   0.00      0   0.00   0.00       0       0       0
## chr21_10339129 0.00   0.00   0.00      0   0.00   0.00       0       0       0
## chr21_10339141 0.00   0.00   0.00      0   0.00   0.00       0       0       0
##                HG00100
## chr21_10005999       0
## chr21_10325486       0
## chr21_10336823       0
## chr21_10337236       0
## chr21_10339129       0
## chr21_10339141       0
The row names of the dataframe are the variant IDs (for example, chr21_10005999).
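* The first column (AF) contains the variant’s allele frequency (AF) across the entire dataset.
* The next five columns (AFR_AF, AMR_AF, EAS_AF, EUR_AF, SAS_AF) contain the variant’s AF in each of the five 1000 Genomes superpopulations: African, Admixed American, East Asian, European, and South Asian.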
* The remaining columns give each 1000 Genomes individual’s genotype at the variant, coded as the number of copies of the alternate allele:
  * 0 is homozygous reference
  * 1 is heterozygous
  * 2 is homozygous alternate
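To make the column layout and the 0/1/2 coding concrete, here is a minimal sketch on a small made-up dataframe with the same structure. The variant IDs (var_A, var_B, var_C), sample names (S1 to S4), and all values are hypothetical. It assumes diploid genotypes with no missing data, in which case a variant's allele frequency is the sum of its genotypes divided by twice the number of individuals.

```r
# Toy dataframe with the same layout as `all`: six AF columns, then one
# genotype column per individual. All names and values here are invented.
toy <- data.frame(
  AF     = c(0.25, 0.50, 0.00),
  AFR_AF = c(0.50, 0.50, 0.00),
  AMR_AF = c(0.00, 0.50, 0.00),
  EAS_AF = c(0.00, 0.50, 0.00),
  EUR_AF = c(0.50, 0.50, 0.00),
  SAS_AF = c(0.00, 0.50, 0.00),
  S1 = c(0, 1, 0),
  S2 = c(1, 1, 0),
  S3 = c(1, 0, 0),
  S4 = c(0, 2, 0),
  row.names = c("var_A", "var_B", "var_C")
)

# Columns 1-6 hold allele frequencies; everything after that is genotypes.
af_cols   <- toy[, 1:6]
genotypes <- as.matrix(toy[, -(1:6)])

# Each genotype is a count of alternate alleles (0, 1, or 2) and each
# individual carries two alleles, so AF = sum(genotypes) / (2 * N).
af_from_gt <- rowSums(genotypes) / (2 * ncol(genotypes))

# Number of individuals in each genotype class (0, 1, 2) per variant.
gt_counts <- t(apply(genotypes, 1, function(g) table(factor(g, levels = 0:2))))

print(af_from_gt)
print(gt_counts)
```

On the real data, the same column split would apply to `all`, but the recomputed frequencies need not match the AF column exactly: the values in the file appear to be rounded, and they may have been calculated from a different set of individuals.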