Sunday 7 December 2014

The Tug of War - ACID, BASE and CAP


For decades there has been a lot of debate about which of ACID, BASE and CAP is the finest approach, and with each approach having its merits and demerits the debate seems unceasing. People who stick to traditional databases might argue that ACID is the best approach, whereas people running distributed systems that store and process terabytes of data every day might not entertain it despite its fastidiousness.
The approach you choose, however, depends entirely on your data: ACID gets the upper hand when it comes to processing transactional data, but it fails to deliver performance when it experiences a huge influx of data.

Let's take a tour of each approach, and by the end you should be able to figure out which one is a solution to your problem.
  

ACID (Atomicity, Consistency, Isolation, Durability):

ACID is the oldest set of properties and almost all relational databases adhere to it. When your data is not too large, the writes aren't that frequent, and you want to maintain data security, integrity and consistency, believe me, ACID is what you want to stick to.


Atomicity: All of the operations in the transaction complete, or none of them do. In other words, a transaction is atomic: if any one operation fails, the whole transaction fails.
Consider the example below, analogous to the scenario:

BEGIN TRANSACTION
    INSERT row;
    INSERT row;
    UPDATE row;
    COMMIT;
END TRANSACTION

Only when the commit is executed is the transaction considered complete, since the commit is what makes all the changes persist. If the transaction is disrupted at any point before the commit, no change persists.

Consistency: The database is in a consistent state when the transaction begins and when it ends. In other words, once a transaction is complete, different users connecting to the database see the same data set. Consistency in ACID also means preserving all the database rules, such as unique key constraints, check constraints, foreign key and primary key constraints.

For example:
"Suppose you have two table with id column present in both. The id column in first table (parent table) is the primary key and the id column in the second table (child table) is the foreign key referring the id column of the first table"
In such a case:
  • The consistency in ACID wont allow to insert any row in the child table with id that's not present in the parent table.
  • The consistency in ACID wont allow to delete any row from the parent table if any row in the child table is still referring to that row of parent table.


Isolation: The isolation property of ACID says that every transaction behaves as if it were the only operation being performed on the database. Transactions never interfere with each other.
In other words:
"If two users are logged in to the same database in different sessions, with user1 making changes to a table and user2 running a select query on the same table simultaneously, the changes made by user1 won't be visible to user2 until user1 commits and user2 re-runs the select query on that table."

Durability: Durability, as the name suggests, states that upon completion of the transaction, the operation will not be reversed.
In other words

"Once a commit is made the changes cannot be rolled back."

It also implies that the database survives system failure: before a commit is made on any transaction, many databases record all the transaction's changes in logs known as commit logs, which are replayed once the database is recovered after a failure.

"The problem with ACID lies on the fact that, when the size of your data increases and that's inevitable- you would wanna have a distributed system where the data is replicated and distributed among different nodes. Adhering to high consistency in such a case wouldn't leave no space for availability- which is another important factor when considering a distributed system"




BASE (Basically Available, Soft State, Eventually Consistent):

As distributed systems burgeoned everywhere, the problem of availability arose. Distributed systems complying with ACID found it very hard to provide high availability without compromising consistency and isolation. (Imagine two users buying the last unit of a product on Amazon. One cannot lock the product until one user's transaction is complete, so the available count shown to both users would be the same; in other words, both would be given a chance to compete for that product, and only later does one of them learn that the product is sold.) The emergence of social media sites increased the size and velocity of incoming data. With myriad users connecting simultaneously, enterprises could not afford to turn away user requests, so they had to make their systems available. Imagine you want to send an urgent message to a friend who is unreachable by phone, so you decide to send the message via Facebook, and Facebook says "Our messaging system is currently unavailable, sorry for the inconvenience." Wouldn't that be frustrating?

BASE was the solution back then. It essentially says: I will guarantee a response to the user's request, but I won't guarantee that it reflects the latest data.

Basically Available (BA):
It says that the system will be basically available at any point, i.e. there will always be a response to a user's request, but there is no guarantee that the response won't be a failure. Even when there is a response, there is no guarantee that the data fetched is the latest; it could still be inconsistent. This is where BA challenges the ACID properties.

Eventually Consistent (E):
Eventual consistency is what many distributed databases run on today. It says that once a transaction is complete, the data will eventually be disseminated across the system. For example, when an update is made to the data and committed, a user connecting to that node will see the latest data, whereas users accessing a copy of the data elsewhere might not see the latest snapshot yet. The system guarantees, however, that the update will eventually reach all replicated nodes.

Soft State (S):
The system is said to be in a soft state primarily because of its eventual-consistency property.
For example, consider the case above: when an update is made to the data, the node where the change was made is consistent, whereas a user connecting to another node holding a replica of the data may see no change at that moment; eventually, though, the change reaches all the other nodes as well. So on those other nodes there is always the possibility that a re-read will return the updated data.
The system therefore always remains susceptible to change, which is why it is said to be in a soft state: you literally don't know when the changes are going to take place. Just imagine a thousand users making changes to the same data.
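To make the soft state / eventual consistency idea concrete, here is a toy Octave sketch (Octave only because the rest of this blog uses it; a real database obviously doesn't work like this). A write is applied to node 1 immediately and queued for replication, so a read from node 2 returns stale data until the sync job runs:

    % Toy illustration of eventual consistency (purely illustrative, not a real database)
    node = [100, 100];            % two replicas of the same value, initially consistent
    pending = [];                 % updates not yet propagated to node 2

    node(1) = 0;                  % a write lands on node 1 ...
    pending(end+1) = 0;           % ... and is queued for replication

    printf("read from node 1: %d\n", node(1));   % 0   -> latest data
    printf("read from node 2: %d\n", node(2));   % 100 -> stale (soft state)

    % Eventually the sync/replication job runs and node 2 catches up
    node(2) = pending(end);
    pending = [];
    printf("read from node 2 after sync: %d\n", node(2));   % 0 -> consistent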



CAP (Consistency, Availability, Partition Tolerance):

The ACID properties come with many drawbacks for distributed databases, and there are a lot of uncertainties in BASE, so most distributed (NoSQL) databases today are designed with the CAP theorem in mind. But CAP comes with a limitation: at any point, or for any transaction, the database can satisfy only two of the three properties, i.e. CA, CP or AP.

Consistency and Availability (CA):
Consistency and availability state that all the nodes in the distributed system will be consistent whenever they are available. Consistency in CAP is very different from ACID's consistency; one could call it a subset of ACID's consistency. Here consistency only assures that all copies of the data agree.

Case 1: Consider the figure below (fig-1). Let's say a user is connected to node-1 and pays his credit card's outstanding balance. After the payment is made, the outstanding balance for the user in node-1 is updated to 0. A consistent system will immediately update the outstanding balance to 0 in node-2 as well, and only once both nodes are consistent does the transaction reach a complete state. When the user reconnects to check his outstanding balance and is routed to node-2, he sees the updated data.


 Fig-1
Case 2: Let's say node-2 is down, as shown in fig-2 below. The update can never take place in node-2 because it is down, and our consistency principle says that a transaction can never be complete if the data is not consistent across all nodes. At the same time we can't hold the user until node-2 comes back up, and the user's request can't be denied either (that would violate availability). Meanwhile node-1 can't simply update and close the transaction, because that would leave the data inconsistent: node-2, when brought back up, would still have the old data saying that the user has an outstanding balance; in our case, let's assume $100.

Fig-2

Hence in such a case node-1 takes the user's request, updates the outstanding balance, and maintains a log that is executed on node-2 as soon as it is available or brought back up, making the distributed system both consistent and available. Now even if the user connects to node-2, he will see the updated data. Since the network link between the nodes stayed intact while one of them was down, restoring consistency poses no problem.

Many distributed relational databases like MySQL, PostgreSQL and Aster Data, and column-oriented databases like Vertica, work on the CA principle.


Where does CA fail: A distributed database consists of many servers spread across a cluster, possibly in different data centers for different regions. In such a setup network failure is quite common, as shown in fig-3. During a network failure the nodes are no longer in contact with each other, which creates a partition. The nodes cannot be updated simultaneously due to the lack of communication, so the data gets out of sync, and unlike Case 2, where the nodes never lost the connection between them, here the data is not guaranteed to sync back even when the partition is resolved.
The conclusion readily drawn is that in a distributed database consistency, availability and partition tolerance cannot all be fulfilled at the same time, or for the same transaction. Network failure being so common, distributed systems treat partition tolerance as a mandatory criterion and then satisfy one of the other two: availability or consistency.

 Fig-3



Consistency and Partition Tolerance (CP):
It says that when a network partition takes place, the system, in order to maintain consistency, makes the affected nodes unavailable, blocking any transaction until the partition is resolved.

Considering the above scenario, a CP distributed system won't allow the user to make any update to the data until the partition is resolved.

Let us for a moment assume the user is allowed to pay his outstanding balance, so the information in node-1 is updated with the user's outstanding balance as 0. A couple of days later, when the partition is resolved and node-2 is back in working condition, the user logs in to view his outstanding balance and is routed to node-2 this time. Suddenly he finds that his outstanding balance is $100 even though he has paid his bill. The frustrated user sues the bank. This happened because the information never got updated in node-2.

Many NoSQL databases like HBase, MongoDB and Bigtable work on the CP principle.

Availability and Partition Tolerance (AP):
It says that nodes under partition remain fully functional and available for any sort of read and write operation, even with consistency at stake.

In social media sites with frequent reads and writes, a node being unavailable even for a minute can hurt the user experience. Such sites demand that all their nodes be up and running 24/7 (in case of node failure the request is simply routed to a different node containing the replicated data). But in case of a network failure between two nodes, which results in a partition, the nodes behave as two independent entities with no communication between each other. A user accessing node-1 can read or write data on node-1, and likewise users connecting to node-2 can read or write data on node-2, without any delay.
This availability comes at the price of weak consistency: the data in the two nodes can diverge. A re-sync happens once the partition is resolved and the nodes are able to communicate again, but even then there is no guarantee that both nodes will hold the same data.

Many distributed databases adhering to the AP side of CAP provide eventual consistency of the data; the case is similar to that of BASE.
Cassandra is one of the most popular distributed systems that adheres to AP with eventual consistency.



" ACID Vs BASE is quite analogical to CP Vs AP with eventual consistency"
     





Thought for the blog:
"There are many sites, probably social media sites, that cannot afford to compromise on both availability and consistency, at least up to a certain point. In order to achieve both, they take an approach where, if a change is made on a node, all the users accessing the modified data are routed to that node for some time, maybe 20-30 minutes, until all the other nodes are consistent. This approach might increase the load on that node for a while, but it's a fair trade-off because the users get promising consistency."


Saturday 29 November 2014

Polynomial Regression, Overfitting, Underfitting, Choosing polynomial degree

Overview:

1. Overview of Linear Regression.
2. Polynomial Regression.
3. Choose the best degree of polynomial.
4. Overfitting (High variance) and Underfitting (High bias).
5. Error(training data) vs Error(cross-validation) - determining the polynomial degree.
6. Pseudo code for Octave.


Let's Begin:


Linear Regression: 
Linear regression is an approach for modeling the relationship between one or more independent attributes (x1, x2, x3, ..., xn) and the dependent attribute (y):

    h(x) = theta0 + theta1*x1 + theta2*x2 + ... + thetan*xn

We can also say linear regression is the technique of fitting the best straight line to our data set, regardless of its dimension. But sometimes a straight line isn't the best fit for the given data; such situations may demand a curve as the best fit, i.e.
    squared error with the curve << squared error with the straight line
Hence, to fit a curve we use the technique widely known as polynomial regression.

For more info on implementing Linear regression:  
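As a quick sketch of what "fitting the best straight line" means in practice, here is a minimal Octave example using the normal equation (the data and variable names are made up purely for illustration; the cost j_theta follows the same (0.5/m) convention used in the pseudo code at the end of this post):

    % Minimal straight-line fit with the normal equation (illustrative data)
    x = [1; 2; 3; 4; 5];                      % single independent attribute
    y = [2.1; 3.9; 6.2; 8.1; 9.8];            % dependent attribute
    m = length(y);

    X = [ones(m,1), x];                       % add the intercept column
    theta = (X' * X) \ (X' * y);              % theta(1) = intercept, theta(2) = slope

    h = X * theta;                            % hypothesis (fitted line)
    j_theta = (0.5/m) * (h - y)' * (h - y);   % squared-error cost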


Polynomial Regression: 
It is a type of linear regression in which the relationship between the independent attributes (x1, x2, x3, ..., xn) and the dependent attribute (y) is modeled as an nth degree polynomial:

    h(x) = theta0 + theta1*x + theta2*x^2 + ... + thetan*x^n
For example, look at the plots below:


        Fig-1

In the plots above we see that, for the given data set, a straight line (linear) is not the best fit because the error is high, but the curve on the right fits the data set well with very little error.
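In code, fitting such a curve is still linear regression; we simply augment the feature matrix with powers of x before solving for theta. A minimal Octave sketch with made-up data (a 4th degree fit, mirroring the feature construction in the pseudo code at the end of the post):

    % Polynomial regression = linear regression on powers of x (illustrative data)
    x = [1; 2; 3; 4; 5; 6; 7];
    y = [1.2; 4.1; 8.8; 16.3; 24.9; 36.2; 48.7];
    m = length(y);

    X4 = [ones(m,1), x, x.^2, x.^3, x.^4];     % 4th degree design matrix
    theta4 = (X4' * X4) \ (X4' * y);           % same normal equation as before
    h4 = X4 * theta4;                          % fitted curve at the data points
    j_theta4 = (0.5/m) * (h4 - y)' * (h4 - y); % squared-error cost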




How to Choose the degree of Polynomial:
Now that we know what polynomial regression is, it is even more important to know how to choose the degree of the polynomial.
For a 2D data set, which can easily be visualized, one can tell approximately what degree of polynomial will fit the data best just by looking at it.

For example, looking at the data set in Fig-1 above, one can easily say that a 2nd or 4th order polynomial curve would fit it well, but that is still just a guess.

The polynomial degree for the above data set is easy to guess because it is a 2D data set. What about a data set of higher dimension, say 4D or 5D (4 or 5 features)? Knowing that we can't visualize a 4D, 5D or higher-dimensional data set, guessing the degree of the polynomial is not recommended. So what do we do?

In such a case we train on our data set with different polynomial degrees and see for which degree the error (j_theta) is minimum for both the training data and the cross-validation data. Despite being a simple solution, in doing so we need to take care of overfitting and underfitting, which we cover below.





Overfitting and Underfitting:
It is best that we understand Overfitting and Underfitting with plots and example.

Let us take a simple example: suppose we have a 2D data set, as given below, with X as the independent attribute and Y as the dependent attribute (the output or class). We divide the data set into two parts:
1. Training data (TR) - the data on which we train the algorithm.
2. Cross-validation data (CV) - we take the parameters (weights) derived from the training data, apply them to the attributes of the cross-validation data, and compare the resulting hypothesis to the actual output (see the sketch after this list).
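A minimal Octave sketch of that comparison (tiny made-up data; the point is only that theta comes from the training data and is then evaluated on the cross-validation data):

    % Train on TR, evaluate on CV (tiny made-up data, purely illustrative)
    x_tr = [1; 2; 3; 4];      y_tr = [2.0; 4.1; 6.1; 7.9];
    x_cv = [1.5; 2.5; 3.5];   y_cv = [3.1; 5.0; 7.2];

    X_tr = [ones(length(y_tr),1), x_tr];
    X_cv = [ones(length(y_cv),1), x_cv];

    theta = (X_tr' * X_tr) \ (X_tr' * y_tr);     % parameters from the training data only

    h_cv = X_cv * theta;                         % hypothesis on the CV attributes
    j_theta_cv = (0.5/length(y_cv)) * (h_cv - y_cv)' * (h_cv - y_cv);   % CV error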






We use gradient descent or the normal equation on the 2D data given above to find the theta values for different polynomial degrees (2nd, 4th, 6th), and plot the resulting hypothesis curve for both the training data set and the cross-validation data set.

Given below are the plots of the hypothesis curve for both training data and the cross-validation data set
Plotting the hypothesis curve with linear, 2nd, 4th and 6th degree polynomial on Training Data:
Fig-2

Plotting the hypothesis curve with linear, 2nd, 4th and 6th degree polynomial on Cross-validation Data:
Fig-3
What's with the plots:
The 1st plot of Fig-2 shows a hypothesis line in a situation of underfitting: it doesn't fit the training data well and the error (j_theta) is high even on the training data, whereas in the 1st plot of Fig-3, on the cross-validation data, the hypothesis line fits okay but still has some error (j_theta).

The 4th plot of Fig-2 is the hypothesis curve for 6th degree polynomial regression. It fits the training data very well (it touches every point) and gives zero error (j_theta), whereas in the 4th plot of Fig-3 it fails to fit the cross-validation data, giving maximum error. This situation is called overfitting: the hypothesis curve fits the training data very well with almost zero error but fails to give a good hypothesis for new data.
    error(training data) <<< error(cross-validation)

The 3rd plot of Fig-2 is the hypothesis curve for the 4th degree polynomial; it fits the training data very well with negligible error, and in Fig-3 it also fits the cross-validation data with approximately zero error. Hence we conclude that the 4th degree polynomial would give a better hypothesis for new data than any other degree.
Conclusion drawn: underfitting and overfitting play an important role when choosing the degree of the polynomial.
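To see overfitting numerically rather than in a plot, here is a toy Octave sketch (made-up, noisy, roughly linear data). It prints the training and cross-validation error for a 1st and a 6th degree fit; with 7 training points the 6th degree polynomial interpolates them exactly, so its training error is essentially zero while its cross-validation error is typically much worse:

    % Toy demonstration of overfitting (illustrative data only)
    x_tr = (1:7)';          y_tr = [2.3; 3.4; 6.9; 7.5; 10.6; 11.4; 14.8];
    x_cv = (1.5:1:6.5)';    y_cv = [3.0; 5.1; 7.0; 9.0; 11.1; 13.0];

    build = @(x, d) bsxfun(@power, x, 0:d);    % design matrix [1, x, x.^2, ..., x.^d]

    for d = [1, 6]
        X_tr = build(x_tr, d);   X_cv = build(x_cv, d);
        theta = pinv(X_tr) * y_tr;             % least squares fit on the training data
        e_tr  = mean((X_tr*theta - y_tr).^2);  % training error
        e_cv  = mean((X_cv*theta - y_cv).^2);  % cross-validation error
        printf("degree %d: training error %.4f, cv error %.4f\n", d, e_tr, e_cv);
    end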




Error(training data) vs Error(cross-validation) - determining the polynomial degree.
For the 2D data set above, one can easily tell what order of polynomial would best fit the data by visualizing the data or the hypothesis curve, but for a multidimensional data set, since the data cannot be visualized, selecting a polynomial degree becomes challenging.

In such a case we compute the error for multiple polynomial degrees on both the training data and the cross-validation data and select the degree with minimum error on both data sets.

Intuition:
Considering the above data set: for the training data, on which we train our algorithm, the error (cost function j_theta) decreases as the polynomial degree increases. But for the cross-validation data, on which we apply the theta values (parameters) obtained from training, the error (cost function j_theta) decreases up to a certain polynomial degree and then increases sharply.

Below plot will make the intuition clear:

 
If you look at the plot on the LHS, the blue line shows that for the training data set the error decreases as the degree of the polynomial increases, while for the cross-validation data set the red line shows the error decreasing until it reaches its minimum at the 4th degree and then increasing sharply.

The plot leads to the conclusion that the 4th degree polynomial best fits both our training and cross-validation data, considering the error in both cases.
Hence, by plotting the degree of the polynomial against the error function, one can decide which polynomial degree best fits the data set.
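In code, that decision is just picking the degree with the minimum cross-validation error. A minimal sketch, assuming j_theta_cv holds the cross-validation cost for each candidate degree (the numbers below are illustrative; the pseudo code in the next section actually produces such a vector):

    % Pick the polynomial degree with the lowest cross-validation error
    degrees    = [1, 2, 4, 6, 8];                % candidate degrees, as in the pseudo code
    j_theta_cv = [3.2, 1.1, 0.4, 0.9, 2.7];      % illustrative CV costs for those degrees
    [~, idx] = min(j_theta_cv);
    best_degree = degrees(idx);                  % here: 4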



Pseudo Code for Octave:
clear all;clc;

1. Loading the training data set    
    x=load('tr_x.ext');
    X=x;
    y = load ('tr_y.ext');
    m=length(y); 

2. Build polynomial feature matrices for the training data (X)
    x1=[ones(m,1),X];
    x2=[ones(m,1),X,X.^2];  
    x4=[ones(m,1),X,X.^2,X.^3,X.^4];
    x6=[ones(m,1),X,X.^2,X.^3,X.^4,X.^5,X.^6];
    x8=[ones(m,1),X,X.^2,X.^3,X.^4,X.^5,X.^6,X.^7,X.^8];
    [m,n]=size(x);

3. Loading the cross validation data set

    x_cv= load ('cv_x.ext');
    X_cv=x_cv;
    y_cv= load ('cv_y.ext');
    m_cv=length(y_cv);

4. Build polynomial feature matrices for the cross-validation data
    x1_cv=[ones(m_cv,1),X_cv];
    x2_cv=[ones(m_cv,1),X_cv,X_cv.^2];
    x4_cv=[ones(m_cv,1),X_cv,X_cv.^2,X_cv.^3,X_cv.^4];
    x6_cv=[ones(m_cv,1),X_cv,X_cv.^2,X_cv.^3,X_cv.^4,X_cv.^5,X_cv.^6];
    x8_cv=[ones(m_cv,1),X_cv,X_cv.^2,X_cv.^3,X_cv.^4,X_cv.^5,X_cv.^6,X_cv.^7,X_cv.^8];
    [m_cv,n_cv]=size(x_cv);

5. Take x values to plot the hypothesis curve
    X_vals = (min(X):0.05:max(X))';

6. Build polynomial feature matrices for the plotting values (x_vals)
    x_vals1=[ones(size(X_vals)),X_vals];
    x_vals2=[ones(size(X_vals)),X_vals,X_vals.^2];
    x_vals4=[ones(size(X_vals)),X_vals,X_vals.^2,X_vals.^3,X_vals.^4];
    x_vals6=[ones(size(X_vals)),X_vals,X_vals.^2,X_vals.^3,X_vals.^4,X_vals.^5,X_vals.^6];
    x_vals8=[ones(size(X_vals)),X_vals,X_vals.^2,X_vals.^3,X_vals.^4,X_vals.^5,X_vals.^6,X_vals.^7,X_vals.^8];

 

7. Algorithm
    a={x1,x2,x4,x6,x8};
    b={x_vals1,x_vals2,x_vals4,x_vals6,x_vals8};
    c={x1_cv,x2_cv,x4_cv,x6_cv,x8_cv};

    for i=1:length(a);
        x=a{i};                   % Training data set
        x_cv=c{i};             % Cross validation data set
        x_vals=b{i};          % x_vals feature data set

        % Finding theta using normal equation 
        theta=(x'*x)\ x'*y;
        h_tr=x * theta;
        h_cv=x_cv * theta;

        % Capturing the error 
        j_theta_tr(i)=(0.5/m).* (h_tr - y)' * (h_tr - y);
        j_theta_cv(i)=(0.5/m_cv).* (h_cv - y_cv)' * (h_cv- y_cv);
   
   
        % Calculating hypothesis curve using the theta and x_vals
        h_vals_tr=x_vals * theta;

        % Plotting hypothesis curve for training dataset 2D
        figure;
        plot(x(:,2), y, 'o', 'MarkerFacecolor', 'r', 'MarkerSize', 8);
        hold on
        plot (x_vals(:,2), h_vals_tr,'b', 'MarkerSize', 8)
        hold off
   
        % Plotting training data's hypothesis curve for cross-validation dataset 2D
        figure;
        plot(x_cv(:,2), y_cv, 'o', 'MarkerFacecolor', 'r', 'MarkerSize', 8);
        hold on
        plot (x_vals(:,2), h_vals_tr,'b', 'MarkerSize', 8)
        hold off
    end

% Plotting the error for training and cross-validation data set
    figure;
    plot ([1,2,4,6,8],j_theta_tr,'b', 'MarkerSize',13)
    xlabel('Polynomial Degree')
    ylabel('j_tr (Cost function training data)')
    figure;
    plot ([1,2,4,6,8],j_theta_cv,  'r', 'MarkerSize', 13)
    xlabel('Polynomial Degree')
    ylabel('j_cv (Cost function crossvalidation data)')
    hold off