principal component analysis



When you take many measurements of the same thing, principal component analysis (PCA) looks for groups of variables that move together because they are driven by one or more underlying factors. For example, if you measure someone’s body (i.e., the length of their arms, the length of their legs, the width of their wrists, etc.), you will see that many of those measurements are correlated with one another because they all relate to one underlying factor: overall body size. PCA helps identify these intercorrelations.

You can use this in many situations to reduce the number of variables you have to work with, i.e., to combine a set of correlated variables into a single new variable (a weighted sum of the originals) that captures most of what they have in common.
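To make the body-measurement idea concrete, here is a minimal sketch in Python with NumPy. The data are synthetic: the hidden `size` factor, the three measurements, and all the numbers are assumptions made up for illustration, not real anthropometric values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: one hidden "body size" factor drives three measurements.
n = 500
size = rng.normal(0.0, 1.0, n)
arm = 60 + 5.0 * size + rng.normal(0, 1.0, n)     # arm length, cm
leg = 80 + 7.0 * size + rng.normal(0, 1.5, n)     # leg length, cm
wrist = 16 + 1.0 * size + rng.normal(0, 0.3, n)   # wrist width, cm
X = np.column_stack([arm, leg, wrist])

# Standardize each column so that no measurement dominates because of its scale.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD: the rows of Vt are the principal directions.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
explained = s**2 / np.sum(s**2)   # share of total variance per component

# One combined variable per person: the projection onto the first component.
pc1 = Z @ Vt[0]

# The sign of a principal component is arbitrary, so compare absolute correlation.
corr_with_size = abs(np.corrcoef(pc1, size)[0, 1])

print("variance share per component:", np.round(explained, 3))
print("|correlation| of PC1 with hidden size:", round(corr_with_size, 3))
```

In this made-up setting the first component carries most of the variance and tracks the hidden size factor closely, which is the sense in which one new variable can stand in for the three original ones.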

As another example, suppose you collect data on the age and size of children. These two quantities are correlated; older children tend to be taller. So if you make a scatter plot of age against size, you’ll see a cloud of points roughly following an upward-sloping diagonal. PCA is a mathematical operation that rotates the coordinate system so that the new x-axis runs directly along that diagonal. Now most of the variance lies along the new x-axis, and much less along the new y-axis.
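The rotation can be sketched in a few lines of NumPy. The ages, heights, and growth rate below are invented for illustration. The sketch uses the eigenvectors of the covariance matrix as the new axes, which is one common way to compute PCA. Note that the data are left unstandardized here, so the result depends on the units chosen, and the "rotation" may include a flip of an axis.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical children: age in years, height in cm (assumed ~6 cm per year).
age = rng.uniform(2, 12, 300)
height = 85 + 6.0 * age + rng.normal(0, 4.0, 300)
X = np.column_stack([age, height])

# Center the cloud of points on the origin before rotating.
Xc = X - X.mean(axis=0)

# Eigenvectors of the covariance matrix give the new axes.
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]        # reorder: largest variance first
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]

# Express every point in the new coordinate system.
rotated = Xc @ eigvecs
var_new = rotated.var(axis=0, ddof=1)
cov_new = np.cov(rotated, rowvar=False)

print("variance along new x and y axes:", np.round(var_new, 2))
print("covariance between new axes:", round(float(cov_new[0, 1]), 6))
```

After the rotation the two new coordinates are uncorrelated, and the variance along each new axis equals the corresponding eigenvalue.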

For two dimensions, this isn’t tremendously important, but if you have highly correlated data in many dimensions (say, hundreds of genetic markers, or spectral data sampled at hundreds of wavelengths), PCA allows you to rotate the coordinate system so that you can plot the data using only the first two of the new dimensions and still lose as little information as possible.
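As a rough sketch of the high-dimensional case, the code below builds synthetic "spectra" with 300 features that are all driven by two hidden factors plus a little noise, then keeps only the first two principal components. The sample count, feature count, noise level, and the two-factor structure are assumptions chosen to make the effect visible; real data will usually retain less variance in two dimensions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: 200 samples, 300 correlated features (e.g. wavelengths),
# generated from only two hidden factors plus small measurement noise.
n_samples, n_features = 200, 300
latent = rng.normal(size=(n_samples, 2))
mixing = rng.normal(size=(2, n_features))
X = latent @ mixing + 0.1 * rng.normal(size=(n_samples, n_features))

# Center, then take the SVD; the rows of Vt are the new coordinate axes.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Keep only the first two new dimensions: these are the plotting coordinates.
scores_2d = Xc @ Vt[:2].T

# Fraction of the total variance that survives the reduction to 2 dimensions.
retained = np.sum(s[:2] ** 2) / np.sum(s ** 2)

print("reduced shape:", scores_2d.shape)
print("fraction of variance retained:", round(float(retained), 4))
```

The two columns of `scores_2d` can be passed directly to any scatter-plot routine, and `retained` gives a quick check of how much information the two-dimensional picture keeps.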