Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is an orthogonal linear transformation that converts a set of possibly correlated variables into a set of linearly uncorrelated variables. The new variables lie in a new coordinate system such that the greatest variance is obtained by projecting the data onto the first coordinate, the second greatest variance by projecting onto the second coordinate, and so on. These new coordinates are called principal components.
NOTE: we have as many principal components as the number of original dimensions, but we keep only those with high variance.
PCA will allow us to reduce a high-dimensional space to a low-dimensional one while preserving as much variance as possible. It is an unsupervised method since it does not need a target class to perform its transformations; it relies only on the values of the learning attributes.
In this post, two major purposes of PCA will be addressed:
- 1. Visualization: Projecting a high-dimensional space into, for example, two dimensions allows us to map our instances onto a two-dimensional graph. Using these visualizations, we can gain insights into the distribution of instances and see how separable the instances of different classes are. In this section, PCA will be used to transform and visualize a dataset.
- 2. Feature Selection: Since PCA can transform instances from a high-dimensional space to a lower-dimensional one, we can use this method to address the curse of dimensionality. Instead of learning from the original set of features, we can transform our instances with PCA and then apply a learning algorithm on top of the new feature space.
Example:
Say we have hundreds of features in the original dataset. After applying PCA, we notice that the top 20 principal components explain more than 95% of the variance in the data. So, we can train a model with only those 20 components, since the remaining ones hardly contribute to the variance.
Visualization with PCA
Visualizing many dimensions at once is impossible for a human being, so we will use PCA to reduce the instances to two dimensions and visualize their distribution in a 2-D scatter plot.
Let's look at the Iris dataset. It has 4 dimensions, and we will transform the data into 2 dimensions:
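The original code is not shown here, but the transformation can be sketched with scikit-learn as follows; the variable names and plotting details are illustrative assumptions, not the post's actual code.

```python
# Minimal sketch of the Iris visualization, assuming scikit-learn and matplotlib.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
X, y = iris.data, iris.target            # 150 instances, 4 features

# Project the 4-dimensional data onto the first 2 principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# Scatter plot of the projected instances, coloured by class
for label, name in enumerate(iris.target_names):
    plt.scatter(X_2d[y == label, 0], X_2d[y == label, 1], label=name)
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.legend()
plt.show()
```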
Now, from the plot, we can see that the 2-D transformation cleanly separates setosa from versicolor and virginica (but not versicolor from virginica, which in fact requires another dimension).
Feature Selection with PCA
To use PCA as a feature selection technique, we need to calculate the explained variance of each principal component, i.e. how much of the variance in the data each component accounts for.
In Scikit-learn, this can be easily obtained from "pca.explained_variance_ratio_". In our case, we have 60 features, and after the PCA transformation each dimension contributes to the variance as follows:
1st Dimension = 20.3%
2nd Dimension = 18.8%
So, the first 2 dimensions contribute a total of 39.2%, and so on.
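These ratios can be read directly from a fitted PCA object. Below is a sketch, where `X_train` stands in for the 60-feature training data used in this post (the loading code is not shown here).

```python
# Sketch: inspecting how much of the data's variance each component explains.
# `X_train` is a placeholder for the 60-feature training data.
from sklearn.decomposition import PCA

pca = PCA()        # keep all components so every ratio can be inspected
pca.fit(X_train)

for i, ratio in enumerate(pca.explained_variance_ratio_, start=1):
    print(f"Dimension {i}: {ratio:.1%}")
```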
Now we need to select features for our predictive model. To do that, I will select the top components whose cumulative explained variance amounts to more than 95% of the total.
The following graph shows the cumulative sums over all the dimensions. As we can see, the top 30 dimensions explain more than 95% of the variance, and the remaining ones hardly contribute.
So, my new feature set will be those top 30 components, which will be used to train the model.
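A sketch of this selection step, continuing with the placeholder `X_train`/`X_test` names from the previous snippet:

```python
# Sketch: keep the smallest number of components whose cumulative
# explained variance exceeds 95% (30 in this post).
import numpy as np
from sklearn.decomposition import PCA

pca_full = PCA().fit(X_train)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_components = int(np.argmax(cumulative >= 0.95)) + 1   # first index crossing 95%

# Refit with the chosen number of components and transform both splits
pca = PCA(n_components=n_components).fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
```

Note that scikit-learn also accepts a float here, e.g. PCA(n_components=0.95), which directly selects the number of components needed to explain 95% of the variance.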
Making Predictions:
Let's use a model to make predictions: first with the original features, and then with the PCA-transformed feature set.
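The post does not say which model was used; the comparison might look like the following sketch, where logistic regression is an assumed stand-in and the train/test splits are the placeholders from above.

```python
# Sketch: the same classifier trained on the original features
# and on the PCA-transformed features (LogisticRegression is an assumption).
from sklearn.linear_model import LogisticRegression

model_raw = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Original features - train:", model_raw.score(X_train, y_train),
      " test:", model_raw.score(X_test, y_test))

model_pca = LogisticRegression(max_iter=1000).fit(X_train_pca, y_train)
print("PCA features      - train:", model_pca.score(X_train_pca, y_train),
      " test:", model_pca.score(X_test_pca, y_test))
```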
As you can see, the model performed better with the PCA-transformed feature set. Also, with the original feature set the model was overfitting, as the training and testing scores were far apart; with the PCA-transformed features, that gap was much smaller.