Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is an orthogonal linear transformation that turns a set of possibly correlated variables into a new set of linearly uncorrelated variables.

The new variables lie in a new coordinate system such that the greatest variance is obtained by projecting the data onto the first coordinate, the second greatest variance by projecting onto the second coordinate, and so on. These new coordinates are called principal components.

NOTE: we have as many principal components as the number of original dimensions, but we keep only those with high variance.

PCA will allow us to reduce a high-dimensional space into a low-dimensional one while preserving as much variance as possible. It is an unsupervised method, since it does not need a target class to perform its transformation; it relies only on the values of the learning attributes.
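As a rough numeric sketch of the idea (toy data and plain NumPy rather than a library call), the principal components are the eigenvectors of the data's covariance matrix, ordered by the variance they capture:

import numpy as np

# Toy correlated data, purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.0, 0.5, 0.2]])

# Center the data, then eigendecompose its covariance matrix
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Sort from largest to smallest variance: these columns are the principal components
order = np.argsort(eigenvalues)[::-1]
components = eigenvectors[:, order]
print("Variance captured along each principal component:", eigenvalues[order])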

In this post, two major purposes of PCA will be addressed:
  • 1. Visualization: Projecting a high-dimensional space into, for example, two dimensions allows us to map our instances onto a two-dimensional graph. Using these graphical visualizations, we can gain insight into the distribution of instances and see how separable instances from different classes are. In this section, PCA will be used to transform and visualize a dataset.
  • 2. Feature Selection: Since PCA can transform instances from a high-dimensional to a lower-dimensional space, we can use it to address the curse of dimensionality. Instead of learning from the original set of features, we transform our instances with PCA and then apply a learning algorithm on top of the new feature space.
Example:
Say we have hundreds of features in the original dataset. After applying PCA, we notice that the top 20 principal components explain more than 95% of the variance in the data. So, we can train a model on just those 20 components, since the remaining ones hardly contribute to the variance.

Visualization with PCA
Visualizing many dimensions at once is impossible for a human being, so we will use PCA to reduce the instances to two dimensions and visualize their distribution in a 2-D scatter plot.

Let's look at the Iris dataset. It has 4 dimensions, and we will transform the data into 2 dimensions:
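The snippet below is a minimal sketch of that transformation with scikit-learn (the plotting details are illustrative):

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
X, y = iris.data, iris.target          # 150 instances, 4 features

# Project the 4-dimensional data onto the first 2 principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# Scatter plot of the instances in the new 2-D space, colored by class
for label, name in enumerate(iris.target_names):
    plt.scatter(X_2d[y == label, 0], X_2d[y == label, 1], label=name)
plt.xlabel("1st principal component")
plt.ylabel("2nd principal component")
plt.legend()
plt.show()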
From the plot, we can see that the 2-D transformation cleanly separates setosa from versicolor and virginica (but not versicolor from virginica, which in fact would require another dimension).

Feature Selection with PCA
To use PCA as a feature selection technique, we need to calculate the explained variance of each principal component, i.e. how much of the total variance in the data each new dimension explains.

In Scikit-learn, this can be easily obtained from “pca.explained_variance_ratio_”.
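As a minimal sketch (assuming the feature matrix X has already been loaded), the ratios can be inspected like this:

from sklearn.decomposition import PCA

pca = PCA()          # keep all the components
pca.fit(X)           # X is assumed to be the numeric feature matrix

# One ratio per principal component; together they sum to 1.0
for i, ratio in enumerate(pca.explained_variance_ratio_, start=1):
    print(f"Dimension {i}: {ratio:.1%}")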

In our case, we have 60 features, and after the PCA transformation the following graph shows how much each dimension contributes to the variance:
          1st Dimension = 20.3%
          2nd Dimension = 18.8%
So, the first 2 dimensions together contribute 39.2%, and so on.

Now we need to select features for our predictive model. To do that, I will select the top components whose cumulative explained variance adds up to more than 95% of the variance.
The following graph shows the cumulative sums across all the dimensions.
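A sketch of how that cumulative curve and the 95% cut-off could be computed (re-using the fitted pca object from the sketch above; the plotting details are illustrative):

import numpy as np
import matplotlib.pyplot as plt

# Cumulative explained variance across the principal components
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%
n_keep = int(np.argmax(cumulative >= 0.95)) + 1
print(f"{n_keep} components explain {cumulative[n_keep - 1]:.1%} of the variance")

plt.plot(range(1, len(cumulative) + 1), cumulative)
plt.axhline(0.95, linestyle="--")
plt.xlabel("Number of principal components")
plt.ylabel("Cumulative explained variance")
plt.show()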


As we can see, the top 30 dimensions explain more than 95% of the variance, and the remaining ones hardly contribute.

So, my new feature set will be those top 30 components, and they will be used to train the model.

Making Predictions:
Let's use a model to make predictions: first with the original features, and then with the PCA-transformed feature set.
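A rough sketch of that comparison (the split X_train, X_test, y_train, y_test is assumed to exist already, and Logistic Regression merely stands in for whichever model was actually used):

from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Baseline: train on the original 60 features
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline.fit(X_train, y_train)
print("Original features   -> train:", baseline.score(X_train, y_train),
      " test:", baseline.score(X_test, y_test))

# PCA version: keep the top 30 components, then train the same model
pca_model = make_pipeline(StandardScaler(), PCA(n_components=30),
                          LogisticRegression(max_iter=1000))
pca_model.fit(X_train, y_train)
print("PCA (30 components) -> train:", pca_model.score(X_train, y_train),
      " test:", pca_model.score(X_test, y_test))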



As you can see, the model performed better with the PCA-transformed feature set.
Also, with the original feature set the model was over-fitting, as the training and testing scores were far apart; with the PCA-transformed features that gap was much smaller.



