Gradient Descent for Logistic Regression

Implementation of gradient descent for optimizing the logistic regression cost function.

Assumptions:
1. For the sake of simplicity, assume that there are only two features; the algorithm generalizes to m training examples. Vectorized notation will take care of multiple features and training examples.
2. Also, to keep the notation simple, the derivative of the cost function with respect to a variable ‘x’, $$ \frac{\partial J(w,b)}{\partial x} $$, will be written as $$ dx $$.

 

Logistic Regression: Derivative calculation with two features

Input: $$ x_1, x_2 $$
Parameters: $$ w_1, w_2, b $$

$$ z = w_1x_1 + w_2x_2 + b $$ → $$ a = \sigma(z) $$ → $$ L(a,y) $$

Objective: Calculate the derivative of loss function w.r.t. $$ w_1, w_2 $$ & $$ b $$

Backpropagating Step By Step:

* Calculate $$ \frac{\partial L(a,y)}{\partial a} $$ or $$da$$

$$ \frac{\partial L(a,y)}{\partial a} = -\frac{y}{a} + \frac{1-y}{1-a} $$

* Calculate $$ dz $$

$$ dz = \frac{\partial L}{\partial z} = \frac{\partial L}{\partial a}\cdot\frac{\partial a}{\partial z} $$

$$\frac{\partial a}{\partial z} = a(1-a) $$

$$ dz = \left(-\frac{y}{a} + \frac{1-y}{1-a}\right)a(1-a) = a - y $$

* Calculate $$ dw_1, dw_2, db $$

$$dw_1 = x_1dz$$
$$dw_2 = x_2dz$$
$$db = dz$$
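As a quick sanity check, here is a minimal numeric sketch of these formulas for a single example with two features (all values below are made up for illustration):

import numpy as np

# Made-up values for a single training example with two features
x1, x2, y = 1.5, -0.5, 1.0
w1, w2, b = 0.1, 0.2, 0.0

# Forward pass
z = w1*x1 + w2*x2 + b
a = 1 / (1 + np.exp(-z))   # sigmoid(z)

# Backward pass, using the formulas above
dz  = a - y       # dL/dz
dw1 = x1 * dz     # dL/dw1
dw2 = x2 * dz     # dL/dw2
db  = dz          # dL/db

print(dz, dw1, dw2, db)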

—————–

Looping over m examples: Pseudocode

Initialize: $$ J=0 ; dw_1=0 ; dw_2=0 ; db=0 $$

for i = 1 to m:
$$ z^{(i)} = w^Tx^{(i)} + b $$
$$ a^{(i)} = \sigma(z^{(i)}) $$
$$ J += -[y^{(i)}\log a^{(i)} + (1-y^{(i)})\log (1-a^{(i)})] $$
$$ dz^{(i)} = a^{(i)} - y^{(i)} $$
$$ dw_1 += x_1^{(i)}dz^{(i)} $$
$$ dw_2 += x_2^{(i)}dz^{(i)} $$
$$ db += dz^{(i)} $$

$$ J /= m $$
$$ dw_1 /= m $$
$$ dw_2 /= m $$
$$ db /= m $$

Update the parameters:
$$ w_1 := w_1 - \alpha\, dw_1 $$
$$ w_2 := w_2 - \alpha\, dw_2 $$
$$ b := b - \alpha\, db $$

$$\alpha$$ is the learning rate.
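The pseudocode above translates almost line by line into NumPy. Below is a minimal, illustrative sketch of a single gradient descent step with an explicit loop over the m examples (the function name and the array shapes are my own choices for this sketch; the vectorized version used in the rest of this post comes later):

import numpy as np

def gradient_descent_step_loop(w, b, X, Y, alpha):
    # One gradient descent step with an explicit loop over the m examples.
    # Assumed shapes: X is (n_features, m), Y is (m,), w is (n_features,).
    n, m = X.shape
    J, dw, db = 0.0, np.zeros(n), 0.0

    for i in range(m):
        z_i = np.dot(w, X[:, i]) + b
        a_i = 1 / (1 + np.exp(-z_i))   # sigmoid
        J += -(Y[i]*np.log(a_i) + (1 - Y[i])*np.log(1 - a_i))
        dz_i = a_i - Y[i]
        dw += X[:, i] * dz_i
        db += dz_i

    J /= m
    dw /= m
    db /= m

    # Parameter update
    w = w - alpha*dw
    b = b - alpha*db
    return w, b, J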

Using a Neural Network and Backpropagation to Implement Logistic Regression

Logistic Regression is one of the most widely used classification techniques in Data Science. It is probably one of the first few algorithms anyone learns when starting out with Data Science or machine learning (think of “Hello World!” when learning a new language).

This post assumes that you are well versed in implementing logistic regression, at least the basics (I’ll write another post later on a basic logistic regression implementation). This one is about how to implement Logistic Regression using the backpropagation algorithm and a neural network architecture.

The full code can be found at the following GitHub repo.

We will proceed through the following steps:
1. Define the architecture
2. Write the Sigmoid Function
3. Initialize the parameters W and b
4. Write the cost function, and minimize it while learning the parameters
5. Use the learnt parameters to predict new data

Apart from the basic NumPy and pandas libraries, we won’t be using anything else and will write each function from scratch.

Write the sigmoid function
The sigmoid function is the activation used in Logistic Regression, though it is just one of the many activation functions used in the layers of a deep neural network (it has been losing its place to faster alternatives like ReLU, the Rectified Linear Unit). A sigmoid function takes a number as input and outputs another number between 0 and 1 (great for predicting probabilities).

import numpy as np

def sigmoid(z):
    # Squashes any real number into the (0, 1) interval
    sig = 1/(1 + np.exp(-1*z))
    return sig
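A quick check with a few arbitrary inputs:

print(sigmoid(0))                         # 0.5
print(sigmoid(np.array([-10, 0, 10])))    # approximately [0.0000454, 0.5, 0.9999546]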

Initialize the parameters W and b
The cost function for logistic regression is represented as

Loss Function for one example:

$$ L(\hat y^{(i)}, y^{(i)}) = -\left[y^{(i)}\log(\hat y^{(i)}) + (1-y^{(i)})\log(1-\hat y^{(i)})\right] $$

Cost Function: averaging the loss function over the m examples:

$$ J(w,b) = -\frac{1}{m}\sum^m_{i=1}\left[y^{(i)}\log(\hat y^{(i)}) + (1-y^{(i)})\log(1-\hat y^{(i)})\right] $$

w and b are the weight vector and bias applied to the input matrix X. Don’t worry about the explicit summation in the cost function above: when working with matrices, a vectorized matrix operation is typically used, which is essentially the same thing.
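For reference, the vectorized gradients used in the code further below (with X of shape (n, m) and A, Y of shape (1, m)) follow directly from the per-example derivatives worked out earlier:

$$ dZ = A - Y $$
$$ dw = \frac{1}{m}X\,dZ^T \qquad db = \frac{1}{m}\sum^m_{i=1} dz^{(i)} $$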

Input: number of features (dimension of W)

def initialize_with_zeros(dim):
    # w as a (dim, 1) column vector so that shapes line up with X of shape (dim, m)
    w = np.zeros((dim, 1))
    b = 0
    return w, b
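For example, for an input with two features:

w, b = initialize_with_zeros(2)
print(w.shape, b)    # (2, 1) 0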

Cost Function, and Forward and Backward Propagation
Steps:
1. Write the cost function and perform forward propagation
2. Find dw and db for later use in backpropagation for the gradient descent algorithm
3. Calculate the cost, and return the cost and gradients for gradient descent

Forward Propagation and Backward Propagation

def propagate(w, b, X, Y):

    m = X.shape[1]  # number of training examples

    # Forward propagation
    A = sigmoid(np.dot(w.T, X) + b)
    cost = (-1/m)*np.sum(Y*np.log(A) + (1-Y)*np.log(1-A))

    # Backpropagation: derivatives of the cost w.r.t. w and b, used to update the weights
    dw = (1/m)*np.dot(X, (A-Y).T)
    db = (1/m)*np.sum(A-Y)

    cost = np.squeeze(cost)

    # Return dw and db in a dictionary for later use
    grads = {"dw": dw, "db": db}

    return grads, cost
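As a sanity check (not part of the pipeline itself), the analytic gradients from propagate can be compared against numerical finite-difference gradients on a tiny made-up dataset; the data and epsilon below are arbitrary:

# Tiny made-up dataset: 2 features, 3 examples
X = np.array([[1.0, 2.0, -1.0],
              [0.5, -1.5, 3.0]])
Y = np.array([[1.0, 0.0, 1.0]])
w, b = np.array([[0.1], [-0.2]]), 0.05

grads, cost = propagate(w, b, X, Y)

# Numerical derivative of the cost w.r.t. w[0] via central differences
eps = 1e-6
w_plus, w_minus = w.copy(), w.copy()
w_plus[0, 0] += eps
w_minus[0, 0] -= eps
_, cost_plus = propagate(w_plus, b, X, Y)
_, cost_minus = propagate(w_minus, b, X, Y)
numeric_dw0 = (cost_plus - cost_minus) / (2*eps)

print(grads["dw"][0, 0], numeric_dw0)   # the two values should be very close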

Optimization function: updating the parameters using gradient descent

This is where the weights are updated and the cost is minimized.

Parameters:
X: input data set
Y: target
w: weight matrix
b: bias (a scalar)

num_iterations: number of iterations
learning_rate (alpha): the learning rate of the gradient descent algorithm

Steps:
1. Loop through the number of iterations
2. Forward pass: pass X, Y, w and b to the propagate function to get the gradients (dw, db) and the cost
3. Update the parameters w and b using dw and db
4. Record the cost in the array costs[]
5. After all the iterations, return the params (w, b) and grads (dw, db)

def optimize(w, b, X, Y, num_iterations, learning_rate, print_cost=True):
    costs = []  # array for storing costs

    for i in range(num_iterations):

        # Cost and gradient calculation
        grads, cost = propagate(w, b, X, Y)

        # Retrieve the derivatives from the backprop step
        dw = grads["dw"]
        db = grads["db"]

        # Update the parameters
        w = w - learning_rate*dw
        b = b - learning_rate*db

        # Record the cost every 100 iterations
        if i % 100 == 0:
            costs.append(cost)

        # Print the cost
        if print_cost and i % 100 == 0:
            print("Cost after iteration %i: %f" % (i, cost))

    # Pass back the parameters and grads after gradient descent is complete
    params = {"w": w, "b": b}
    grads = {"dw": dw, "db": db}

    return params, grads, costs

Predict on New Data

Now that the algorithm is trained on the data and the parameters w and b are optimized, we can use them to predict on a new dataset.

def predict(w, b, X):
    m = X.shape[1]  # number of examples

    Y_pred = np.zeros((1, m))
    w = w.reshape(X.shape[0], 1)

    # Compute the vector of probabilities A with a forward pass
    A = sigmoid(np.dot(w.T, X) + b)

    for i in range(A.shape[1]):
        # Convert probabilities to actual predictions
        if A[0, i] > 0.5:
            Y_pred[0, i] = 1
        else:
            Y_pred[0, i] = 0

    return Y_pred

 

All functions merged into a single function

 

def model(X_train, Y_train, X_test, Y_test, num_iterations=2000, learning_rate=0.5, print_cost=True):

    # Step-by-step calls to the other functions

    # Initialization
    w, b = initialize_with_zeros(X_train.shape[0])

    # Gradient descent
    parameters, grads, costs = optimize(w, b, X_train, Y_train, num_iterations=num_iterations, learning_rate=learning_rate, print_cost=print_cost)

    w = parameters["w"]
    b = parameters["b"]

    # Predict on the test/train set examples
    Y_prediction_test = predict(w, b, X_test)
    Y_prediction_train = predict(w, b, X_train)

    # Print train/test accuracy
    print("train accuracy: {} %".format(100 - np.mean(np.abs(Y_prediction_train - Y_train)) * 100))
    print("test accuracy: {} %".format(100 - np.mean(np.abs(Y_prediction_test - Y_test)) * 100))

    d = {"costs": costs,
         "Y_prediction_test": Y_prediction_test,
         "Y_prediction_train": Y_prediction_train,
         "w": w,
         "b": b,
         "learning_rate": learning_rate,
         "num_iterations": num_iterations}

    return d
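To see the whole pipeline in action, here is a minimal, hypothetical usage example on a small synthetic dataset (the make_blobs helper, the blob centres and the hyperparameters are made up for illustration; shapes follow the convention above, with X of shape (n_features, m) and Y of shape (1, m)). Running it produces the dictionary d used by the plotting snippet below.

# Hypothetical end-to-end run on made-up data: two Gaussian blobs with 2 features each
np.random.seed(0)

def make_blobs(m):
    # Half the examples around (+1, +1) with label 1, half around (-1, -1) with label 0
    X_pos = np.random.randn(2, m // 2) + 1.0
    X_neg = np.random.randn(2, m // 2) - 1.0
    X = np.hstack([X_pos, X_neg])
    Y = np.hstack([np.ones((1, m // 2)), np.zeros((1, m // 2))])
    return X, Y

X_train, Y_train = make_blobs(200)
X_test, Y_test = make_blobs(50)

d = model(X_train, Y_train, X_test, Y_test, num_iterations=2000, learning_rate=0.1, print_cost=True)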

# Plot the costs recorded during training (d is the dictionary returned by model)
import matplotlib.pyplot as plt

costs = np.squeeze(d['costs'])
plt.plot(costs)
plt.ylabel('cost')
plt.xlabel('iterations (per hundreds)')
plt.title("Learning rate = " + str(d["learning_rate"]))
plt.show()

Using Deep learning to create Art

This is probably what will make artists obsolete: using deep learning to generate artwork.

There has been plenty of news about, and fear of, AI taking over the world (in the long term), or at least taking over some jobs or professions in the near future. Think of Google’s AlphaGo, which learnt to play competitive Go by itself rather than from other humans.

Image understanding is another field where AI is making big forays! Much of this became possible with the rise of Convolutional Neural Networks (CNNs). CNNs turn an image into multiple layers of matrices and then deploy a learning algorithm to learn the patterns, and they have become remarkably good at a lot of image recognition tasks.

The same CNNs, when applied to artwork, can learn the patterns (the painting style), which can then be used to convert any image or photograph into a painting in the ‘painting style’ picked up by the CNN. This requires a content image (the picture that needs to be transformed), a style image (the painting whose style is to be applied to the content image) and some parameters (weights of content and style, speed of the algorithm, learning rate, etc.). All this generates an image which is a blend of content + style.

Below are some of the images after blending a content and a style image. The styles were picked randomly from a Google search (as I am illiterate when it comes to art) and the content is a well-publicized image.

I’ll be posting the methodology, details, references and source code later (will post the link).