6.5 Genotype data
We’ve summarized genotype data from the 1000 Genomes Project into two files:
* all_variants.txt.gz, which contains a random selection of variants on chr21
* common_variants.txt.gz, which contains only those variants in all_variants.txt.gz that are common across populations
# all variants
all <- read.table("all_variants.txt.gz")
# only common variants
common <- read.table("common_variants.txt.gz")
# preview first 10 columns of `all` dataframe
head(all[, 1:10])
##                  AF AFR_AF AMR_AF EAS_AF EUR_AF SAS_AF HG00096 HG00097 HG00099
## chr21_10005999 0.02   0.06   0.00      0   0.00   0.00       0       0       0
## chr21_10325486 0.02   0.00   0.01      0   0.03   0.04       0       0       0
## chr21_10336823 0.00   0.00   0.00      0   0.00   0.00       0       0       0
## chr21_10337236 0.00   0.00   0.00      0   0.00   0.00       0       0       0
## chr21_10339129 0.00   0.00   0.00      0   0.00   0.00       0       0       0
## chr21_10339141 0.00   0.00   0.00      0   0.00   0.00       0       0       0
##                HG00100
## chr21_10005999       0
## chr21_10325486       0
## chr21_10336823       0
## chr21_10337236       0
## chr21_10339129       0
## chr21_10339141       0
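The read.table() calls above pass no header or row.names arguments, yet the preview shows column names and variant IDs as row names. This works because read.table() treats the first line as a header when it has one fewer field than the remaining lines, and then uses the first column as row names. It can also read gzip-compressed files directly. The sketch below illustrates this with a small made-up table written to a temporary file; the object names (toy, toy_read, tmp) and all values are hypothetical and are not taken from the course files.

```r
# Toy table in a similar layout (IDs and values are made up for illustration)
toy <- data.frame(
  AF   = c(0.25, 0.50),
  HG_A = c(0, 1),
  HG_B = c(1, 1),
  row.names = c("chr21_100", "chr21_200")
)

# Write it gzip-compressed. write.table() emits a header line with no
# entry for the row-name column, so the header has one fewer field than
# the data lines.
tmp <- tempfile(fileext = ".txt.gz")
con <- gzfile(tmp, "w")
write.table(toy, con, quote = FALSE)
close(con)

# read.table() detects the compression, the header, and the row names
toy_read <- read.table(tmp)
rownames(toy_read)
colnames(toy_read)
```

If a file did not follow this layout, header = TRUE and row.names = 1 could be passed explicitly instead of relying on the automatic detection.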
The row names of the dataframe are the variant IDs.
* The first column (AF) contains the variant’s allele frequency (AF) dataset-wide.
* The next five columns contain the variant’s AF in each of the five 1000 Genomes superpopulations (AFR, AMR, EAS, EUR, SAS).
* The rest of the columns provide variant genotypes for each individual in 1000 Genomes, where:
  * 0 is homozygous reference
  * 1 is heterozygous
  * 2 is homozygous for the variant allele
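Because each genotype value counts the copies of the variant allele an individual carries, an allele frequency can be recomputed from the genotype columns as the sum of genotypes divided by twice the number of individuals. The following is a minimal sketch on a made-up dataframe with the column layout described above; the names toy, freqs, genos and af_from_genos are hypothetical, and the superpopulation values are placeholders. On the real data, the recomputed value may not match the AF column exactly, since that column appears to be rounded and may have been calculated on a different set of individuals.

```r
# Toy data in the layout described above (all values are made up)
toy <- data.frame(
  AF     = c(0.25, 0.50, 0.00),  # dataset-wide allele frequency
  AFR_AF = c(0.50, 0.50, 0.00),  # placeholder superpopulation AFs
  AMR_AF = c(0.00, 0.50, 0.00),
  EAS_AF = c(0.00, 0.50, 0.00),
  EUR_AF = c(0.50, 0.50, 0.00),
  SAS_AF = c(0.00, 0.50, 0.00),
  ind1   = c(1, 2, 0),           # genotypes: 0, 1, or 2 variant alleles
  ind2   = c(0, 0, 0),
  ind3   = c(1, 1, 0),
  ind4   = c(0, 1, 0),
  row.names = c("var1", "var2", "var3")
)

# Split the frequency columns (1 to 6) from the genotype columns (7 onward)
freqs <- toy[, 1:6]
genos <- toy[, 7:ncol(toy)]

# Each individual carries two alleles, so divide by 2 * number of individuals
af_from_genos <- rowSums(genos) / (2 * ncol(genos))

# Number of heterozygous individuals per variant
n_het <- rowSums(genos == 1)

# Compare the recomputed frequency with the AF column
data.frame(AF = freqs$AF, af_from_genos, n_het)
```

The same column split (1:6 for frequencies, 7 onward for genotypes) should carry over to the all and common dataframes, assuming they follow the layout shown in the preview.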