6.11 Principal component analysis

Principal component analysis (PCA) is a method for representing high-dimensional data (i.e., data with many variables) within a smaller number of dimensions. In our case, every individual in the VCF has genotype measurements for hundreds of variants, so each variant counts as one dimension.

You can think of PCA as projecting all the individuals in our dataset into a cloud of points, where each individual's position is determined by their combination of genotypes.

The first principal component (PC) is the direction through the cloud of data points along which the data vary the most.
The second PC is the direction that captures the most remaining variance while being perpendicular to the first.
The same idea applies to the third, fourth, fifth, etc. PCs: each one captures the most remaining variance while staying perpendicular to all the PCs before it.
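
To make the idea of successive perpendicular PCs concrete, here is a minimal Python sketch that computes them with a singular value decomposition. It is an illustration under stated assumptions, not the method used elsewhere in this material: the function name `pca` is hypothetical, the genotypes are random toy data rather than values read from the VCF, and each genotype is assumed to be coded as 0, 1, or 2 copies of the alternate allele.

```python
import numpy as np


def pca(genotypes, n_components=2):
    """Project individuals onto the top principal components (hypothetical helper)."""
    X = np.asarray(genotypes, dtype=float)
    # Center each variant (column) so the cloud of points sits on the origin.
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are perpendicular unit vectors, ordered from most to least variance.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Coordinates of each individual along the chosen PCs.
    scores = X_centered @ components.T
    # Fraction of the total variance captured by each PC.
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, components, explained[:n_components]


# Toy data (assumption): 20 individuals x 100 variants, genotypes coded 0/1/2.
rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(20, 100))

scores, components, explained = pca(genotypes)
print(scores.shape)  # one row per individual, one column per PC
print(explained)     # share of variance captured by PC1 and PC2
```

Plotting the first column of `scores` against the second would give the kind of two-dimensional PCA plot described in this section. With random genotypes like these, no meaningful clusters should be expected.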

Fig. 3. A PCA plot that simplifies three-dimensional data into two dimensions.

For an in-depth visual walkthrough of PCA, you can go to this website.