6.11 Principal component analysis
Principal component analysis (PCA) is a method for representing high-dimensional data (i.e., data with many variables) within a smaller number of dimensions. In our case, every individual in the VCF has genotype measurements for hundreds of variants.
You can think of PCA as projecting all the individuals in our dataset into a cloud of points, where each individual's position is determined by their combination of genotypes.
- The first principal component (PC) is the vector through the cloud of data points that captures the greatest possible variance.
- The second PC is the vector that captures the greatest remaining variance, subject to being perpendicular to the first PC.
- The same idea applies to the third, fourth, fifth, etc. PCs: each captures the greatest remaining variance while being perpendicular to all of the PCs before it.
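
The definitions above can be illustrated with a minimal sketch in Python using NumPy. It is not the analysis pipeline used in this chapter. The toy genotype matrix and the `pca` helper are hypothetical, and the sketch assumes genotypes are coded as 0, 1, or 2 copies of the alternate allele. It centers each variant and uses a singular value decomposition to obtain the PC directions, each individual's coordinates along them, and the fraction of variance each PC captures.

```python
import numpy as np

def pca(genotypes, n_components=2):
    """Project individuals (rows) onto the top principal components."""
    X = np.asarray(genotypes, dtype=float)
    # Center each variant (column) so PCs describe variance around the mean.
    X_centered = X - X.mean(axis=0)
    # SVD of the centered matrix: the rows of Vt are the PC directions,
    # ordered from greatest to least variance captured.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Coordinates of each individual along each PC.
    scores = X_centered @ components.T
    # Fraction of the total variance captured by each PC.
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, components, explained[:n_components]

# Hypothetical toy data: 6 individuals x 5 variants, coded 0/1/2.
genotypes = np.array([
    [0, 0, 1, 2, 2],
    [0, 1, 1, 2, 2],
    [0, 0, 0, 2, 1],
    [2, 2, 1, 0, 0],
    [2, 1, 2, 0, 0],
    [1, 2, 2, 0, 1],
])

scores, components, explained = pca(genotypes)
print(scores.shape)                    # one (PC1, PC2) coordinate pair per individual
print(components[0] @ components[1])   # close to 0: PC1 and PC2 are perpendicular
print(explained)                       # PC1 captures at least as much variance as PC2
```

Plotting the two columns of `scores` against each other gives the familiar PC1 versus PC2 scatter plot, with one point per individual.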
For an in-depth visual walkthrough of PCA, you can go to this website.