Saturday 29 November 2014

Polynomial Regression, Overfitting, Underfitting, Choosing polynomial degree

Overview:

1. Overview of Linear Regression.
2. Polynomial Regression.
3. Choosing the best degree of polynomial.
4. Overfitting (High variance) and Underfitting (High bias).
5. Error(training data) vs. Error(cross-validation) - determining polynomial degree.
6. Pseudo Code for Octave.


Let's Begin:


Linear Regression: 
Linear regression is an approach for modeling the relationship between one or more independent attributes (x1, x2, x3, ….. xn) and a dependent attribute (y).
We can also say that linear regression is the technique of fitting the best straight line to our data set, regardless of its dimension. But sometimes a straight line is not the best plot for the given data; such situations may demand a curve as the best fit,
i.e. the squared error of the curve <<  the squared error of the straight line.
Hence, to fit a curve, we use the technique widely known as polynomial regression.
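As a quick sanity check of that inequality, here is a toy sketch in Python (the data points and both fits are hypothetical, chosen by hand for illustration): on data that actually lies on a curve, even a reasonable straight line leaves a much larger squared error than the matching curve does.

```python
# Toy data generated from y = x^2 (hypothetical example).
xs = [0, 1, 2, 3, 4]
ys = [x * x for x in xs]

def squared_error(hypothesis, xs, ys):
    """Sum of squared differences between hypothesis(x) and y."""
    return sum((hypothesis(x) - y) ** 2 for x, y in zip(xs, ys))

linear = lambda x: 4 * x - 2   # a hand-picked straight-line fit
curve = lambda x: x * x        # the quadratic hypothesis

err_line = squared_error(linear, xs, ys)
err_curve = squared_error(curve, xs, ys)
print(err_line, err_curve)     # the curve's squared error is far smaller
```

Here the quadratic hypothesis matches the data exactly, while the straight line, however well placed, cannot.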

For more info on implementing Linear regression:  


Polynomial Regression: 
It is a type of linear regression in which the relationship between the independent attributes (x1, x2, x3, ….. xn) and the dependent attribute (y) is modeled as an nth-degree polynomial.
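Concretely, an nth-degree polynomial in a single feature is fitted by running ordinary linear regression on the expanded feature columns [1, x, x^2, …, x^n]. A minimal Python sketch of that expansion (the Octave code in the last section builds the same matrices with `ones(m,1)`, `X`, `X.^2`, …):

```python
def poly_features(xs, degree):
    """Build the design matrix [1, x, x^2, ..., x^degree] row by row."""
    return [[x ** p for p in range(degree + 1)] for x in xs]

rows = poly_features([1, 2, 3], 2)
print(rows)   # [[1, 1, 1], [1, 2, 4], [1, 3, 9]]
```

Each row is then treated as an ordinary set of linear-regression features, which is why polynomial regression is still "linear" in the parameters.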

For example, look at the plots below:


        Fig-1

In the above plot we see that, for the given data set, a straight line (linear fit) is not the best fit because its error is high, whereas the curve on the RHS fits the data set well with much less error.




How to Choose the degree of Polynomial:
Now that we know what polynomial regression is, it is even more important to know how to choose the degree of the polynomial.
For a 2D data set, which can be easily visualized, one can tell approximately what degree of polynomial will fit the data best just by looking at it.

For example, by looking at the data set in Fig-1, one could say that a 2nd-order or 4th-order polynomial curve would best fit it, but that is still just a guess.

The polynomial degree for the above data set is easy to guess because it is a 2D data set. But what about a data set of higher dimension, say 4D or 5D (4 or 5 features)? Since we cannot visualize a 4D, 5D, or higher-dimensional data set, guessing the degree of polynomial is not recommended. So what do we do?

In such a case we train on our data set with different polynomial degrees and see for which degree the error (j_theta) is minimum on both the training data and the cross-validation data. Despite the simplicity of this solution, in doing so we need to take care of overfitting and underfitting, which we shall cover below.
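The whole selection procedure can be sketched in Python (the data set here is made up for illustration; the post's own Octave implementation appears in the last section). Each candidate degree is fitted on the training set with the normal equation, and the degree with the smallest cross-validation cost wins:

```python
def poly_row(x, degree):
    return [x ** p for p in range(degree + 1)]

def solve(A, b):
    """Gaussian elimination with partial pivoting for small systems."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    theta = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = M[r][n] - sum(M[r][c] * theta[c] for c in range(r + 1, n))
        theta[r] = s / M[r][r]
    return theta

def fit(xs, ys, degree):
    """Normal equation theta = (X'X)^{-1} X'y on polynomial features."""
    X = [poly_row(x, degree) for x in xs]
    n = degree + 1
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(n)]
           for a in range(n)]
    Xty = [sum(X[i][a] * ys[i] for i in range(len(X))) for a in range(n)]
    return solve(XtX, Xty)

def cost(xs, ys, theta):
    """j_theta = (1/2m) * sum of squared errors."""
    m = len(xs)
    preds = [sum(t * x ** p for p, t in enumerate(theta)) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / (2 * m)

# Hypothetical training / cross-validation split of roughly quadratic data.
tr_x, tr_y = [0, 1, 2, 3, 4, 5], [1.0, 2.1, 5.2, 10.1, 16.9, 26.2]
cv_x, cv_y = [0.5, 1.5, 2.5, 3.5], [1.3, 3.2, 7.4, 13.1]

best = min((1, 2, 4), key=lambda d: cost(cv_x, cv_y, fit(tr_x, tr_y, d)))
print(best)
```

Note that the training cost can only go down as the degree increases (higher-degree models contain the lower-degree ones), which is exactly why the cross-validation cost, not the training cost, must drive the choice.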





Overfitting and Underfitting:
It is best to understand overfitting and underfitting with plots and an example.

Let us take a simple example: suppose we have a 2D data set, as given below, with X as the independent attribute and Y as the dependent attribute (the output or class). We divide the data set into two parts:
1. Training data (TR) - the data on which we train the algorithm.
2. Cross-validation data (CV) - we take the parameters (weights) derived from the training data, apply them to the attributes of the cross-validation data, and compare the resulting hypothesis to the actual output.






We use gradient descent or the normal equation on the 2D data given above to find the theta values for different polynomial degrees (linear, 2nd, 4th, 6th) on the training data, and plot the resulting hypothesis curve over both the training data set and the cross-validation data set.
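For reference, the normal equation is theta = (X'X)^(-1) X'y. In the special case of a single feature plus an intercept it reduces to the familiar closed form, sketched below in Python (the example data is made up and chosen to lie exactly on a line):

```python
def normal_eq_1d(xs, ys):
    """Closed-form least squares for one feature plus an intercept:
    slope = covariance / variance, intercept = mean_y - slope * mean_x."""
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope

b, w = normal_eq_1d([0, 1, 2, 3], [1, 3, 5, 7])   # data on y = 2x + 1
print(b, w)
```

For polynomial features the same equation is solved with the full matrix form, which is what the Octave line `theta=(x'*x)\ x'*y;` below does.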

Given below are the plots of the hypothesis curve for both training data and the cross-validation data set
Plotting the hypothesis curve with linear, 2nd, 4th and 6th degree polynomial on Training Data:
Fig-2

Plotting the hypothesis curve with linear, 2nd, 4th and 6th degree polynomial on Cross-validation Data:
Fig-3
What's with the plots:
The 1st plot of Fig-2 shows a hypothesis line that underfits: it does not fit the training data well, and the error (j_theta) is high on the training data itself. In the 1st plot of Fig-3, the same hypothesis line fits the cross-validation data reasonably but still has some error (j_theta).

The 4th plot of Fig-2 is the hypothesis curve for 6th-degree polynomial regression; it fits the training data very well (it touches every point) and gives zero error (j_theta), whereas in the 4th plot of Fig-3 it fails to fit the cross-validation data, giving the maximum error. This situation is called overfitting - the hypothesis curve fits the training data with almost zero error but fails to give a good hypothesis for new data.
 error(training data) <<  error(cross-validation)

The 3rd plot of Fig-2 is the hypothesis curve for the 4th-degree polynomial; it fits the training data very well with negligible error, and in Fig-3 it also fits the cross-validation data with approximately zero error. Hence we conclude that the 4th-degree polynomial would give a better hypothesis on new data than any other degree.
Conclusion drawn - underfitting and overfitting play an important role in choosing the degree of polynomial.




Error(training data) vs. Error(cross-validation) - determining polynomial degree.
For the above 2D data set, one can easily tell what order of polynomial would best fit by visualizing the data or the hypothesis curve, but for a multidimensional data set, which cannot be visualized, selecting a polynomial degree becomes challenging.

In such a case we compute the error for multiple polynomial degrees on both the training data and the cross-validation data and select the degree with minimum error on both data sets.

Intuition:
Considering the above data set: for the training data, on which we train our algorithm, the error (cost function j_theta) decreases as the polynomial degree increases. But for the cross-validation data, on which we apply the theta values (parameters) obtained from training, the error (cost function j_theta) decreases up to a certain polynomial degree and then increases sharply.
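The error referred to here is the cost function j_theta = (1/2m) * sum((h - y)^2), the same quantity the Octave code below captures with `(0.5/m).*(h - y)'*(h - y)`. A minimal Python version:

```python
def j_theta(h, y):
    """Half mean squared error between hypothesis values h and targets y."""
    m = len(y)
    return (0.5 / m) * sum((hi - yi) ** 2 for hi, yi in zip(h, y))

# Hypothetical hypothesis values vs. actual outputs:
print(j_theta([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))   # (0.5/3) * 4
```

Plotting this quantity against the polynomial degree, once for the training set and once for the cross-validation set, produces the two curves discussed next.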

Below plot will make the intuition clear:

 
In the plot on the LHS, the blue line shows that the error on the training data decreases as the degree of polynomial increases. But for the cross-validation data set, the red line indicates that the error decreases until it reaches its minimum at the 4th degree and then increases sharply.

The above plot leads to the conclusion that the 4th-degree polynomial best fits both our training and cross-validation data, considering the error in both cases.
Hence, by plotting the error function against the degree of polynomial, one can decide which degree best fits the data set.



Pseudo Code for Octave:
clear all;clc;

1. Loading the training data set    
    x=load('tr_x.ext');
    X=x;
    y = load ('tr_y.ext');
    m=length(y); 

2. Build the polynomial feature matrices for X
    x1=[ones(m,1),X];
    x2=[ones(m,1),X,X.^2];  
    x4=[ones(m,1),X,X.^2,X.^3,X.^4];
    x6=[ones(m,1),X,X.^2,X.^3,X.^4,X.^5,X.^6];
    x8=[ones(m,1),X,X.^2,X.^3,X.^4,X.^5,X.^6,X.^7,X.^8];

3. Loading the cross validation data set

    x_cv= load ('cv_x.ext');
    X_cv=x_cv;
    y_cv= load ('cv_y.ext');
    m_cv=length(y_cv);

4. Take polynomial degrees for the cross_validation data
    x1_cv=[ones(m_cv,1),X_cv];
    x2_cv=[ones(m_cv,1),X_cv,X_cv.^2];
    x4_cv=[ones(m_cv,1),X_cv,X_cv.^2,X_cv.^3,X_cv.^4];
    x6_cv=[ones(m_cv,1),X_cv,X_cv.^2,X_cv.^3,X_cv.^4,X_cv.^5,X_cv.^6];
    x8_cv=[ones(m_cv,1),X_cv,X_cv.^2,X_cv.^3,X_cv.^4,X_cv.^5,X_cv.^6,X_cv.^7,X_cv.^8];

5. Take x values to plot the hypothesis curve
    X_vals = (min(X):0.05:max(X))';

6. Build the polynomial feature matrices for the plotting values
    x_vals1=[ones(size(X_vals)),X_vals];
    x_vals2=[ones(size(X_vals)),X_vals,X_vals.^2];
    x_vals4=[ones(size(X_vals)),X_vals,X_vals.^2,X_vals.^3,X_vals.^4];
    x_vals6=[ones(size(X_vals)),X_vals,X_vals.^2,X_vals.^3,X_vals.^4,X_vals.^5,X_vals.^6];
    x_vals8=[ones(size(X_vals)),X_vals,X_vals.^2,X_vals.^3,X_vals.^4,X_vals.^5,X_vals.^6,X_vals.^7,X_vals.^8];

 

7. Algorithm
    a={x1,x2,x4,x6,x8};
    b={x_vals1,x_vals2,x_vals4,x_vals6,x_vals8};
    c={x1_cv,x2_cv,x4_cv,x6_cv,x8_cv};

    for i=1:length(a)
        x=a{i};           % Training data set
        x_cv=c{i};        % Cross-validation data set
        x_vals=b{i};      % Plotting feature matrix

        % Finding theta using normal equation 
        theta=(x'*x)\ x'*y;
        h_tr=x * theta;
        h_cv=x_cv * theta;

        % Capturing the error 
        j_theta_tr(i)=(0.5/m).* (h_tr - y)' * (h_tr - y);
        j_theta_cv(i)=(0.5/m_cv).* (h_cv - y_cv)' * (h_cv- y_cv);
   
   
        % Calculating hypothesis curve using the theta and x_vals
        h_vals_tr=x_vals * theta;

        % Plotting hypothesis curve for training dataset 2D
        figure;
        plot(x(:,2), y, 'o', 'MarkerFacecolor', 'r', 'MarkerSize', 8);
        hold on
        plot (x_vals(:,2), h_vals_tr,'b', 'MarkerSize', 8)
        hold off
   
        % Plotting training data's hypothesis curve for cross-validation dataset 2D
        figure;
        plot(x_cv(:,2), y_cv, 'o', 'MarkerFacecolor', 'r', 'MarkerSize', 8);
        hold on
        plot (x_vals(:,2), h_vals_tr,'b', 'MarkerSize', 8)
        hold off
    end

% Plotting the error for training and cross-validation data set
    figure;
    plot([1,2,4,6,8], j_theta_tr, 'b-o', 'MarkerSize', 13)
    xlabel('Polynomial Degree')
    ylabel('j\_tr (Cost function, training data)')
    figure;
    plot([1,2,4,6,8], j_theta_cv, 'r-o', 'MarkerSize', 13)
    xlabel('Polynomial Degree')
    ylabel('j\_cv (Cost function, cross-validation data)')