Regression in Data Science

Linear regression with 1 variable
Linear regression equation
Linear regression with 2 variables (centroid), with an example (read multiple times)
Linear regression: good fit vs. best fit
Linear regression (NPTEL)
Linear regression: manual example

Creating a simple machine learning model

Create a linear regression model in Python using a randomly generated data set.

Linear Regression Model (source: GeeksforGeeks)

Generating the Training Set


# python library to generate random numbers
from random import randint

# the limit within which random numbers are generated
TRAIN_SET_LIMIT = 1000

# to create exactly 100 data items
TRAIN_SET_COUNT = 100

# lists that contain the inputs and the corresponding outputs
TRAIN_INPUT = list()
TRAIN_OUTPUT = list()

# loop to create 100 data items with three columns each
for i in range(TRAIN_SET_COUNT):
    a = randint(0, TRAIN_SET_LIMIT)
    b = randint(0, TRAIN_SET_LIMIT)
    c = randint(0, TRAIN_SET_LIMIT)

    # creating the output for each data item
    op = a + (2 * b) + (3 * c)
    TRAIN_INPUT.append([a, b, c])

    # adding each output to the output list
    TRAIN_OUTPUT.append(op)
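As a quick sanity check (an addition, not part of the original article), the first few generated rows can be printed; this assumes the snippet above has already run in the same session:

# peek at the first three generated samples
for inputs, output in zip(TRAIN_INPUT[:3], TRAIN_OUTPUT[:3]):
    print(inputs, '->', output)   # e.g. [a, b, c] -> a + 2*b + 3*c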

Machine Learning Model – Linear Regression

The model can be created in two steps:
1. Training the model with training data
2. Testing the model with test data
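This article performs the testing step manually with handcrafted data. As an alternative sketch (not from the original), scikit-learn can split the generated lists automatically; the variable names below are illustrative:

from sklearn.model_selection import train_test_split

# hold out 20% of the generated data for testing (illustrative split)
X_train, X_test, y_train, y_test = train_test_split(
    TRAIN_INPUT, TRAIN_OUTPUT, test_size=0.2, random_state=42)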

Training the Model
The data generated by the code above is used to train the model.


# scikit-learn provides the linear regression model
from sklearn.linear_model import LinearRegression

# Initialize the linear regression model
# (n_jobs=-1 uses all available CPU cores)
predictor = LinearRegression(n_jobs=-1)

# Fit the model to the training data
predictor.fit(X=TRAIN_INPUT, y=TRAIN_OUTPUT)
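As a quick check (an addition, not in the original), the fitted model exposes the learned intercept and a goodness-of-fit score through standard scikit-learn attributes:

# the learned intercept should be close to 0 for this noise-free data
print(predictor.intercept_)

# R^2 score on the training data; should be (almost) exactly 1.0 here
print(predictor.score(TRAIN_INPUT, TRAIN_OUTPUT))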

Testing the Model
Testing is done manually here: we feed the model some hand-picked input data and check whether it returns the expected result.


# Random test data
X_TEST = [[10, 20, 30]]

# Predict the result for X_TEST, which holds the testing data
outcome = predictor.predict(X=X_TEST)

# Retrieve the learned coefficients
coefficients = predictor.coef_

# Print the result obtained for the test data
print('Outcome : {}\nCoefficients : {}'.format(outcome, coefficients))

The expected outcome for this test data is 10 + (2 * 20) + (3 * 30) = 140, and the learned coefficients should match the weights 1, 2 and 3 used to generate the training outputs.
Output

Outcome : [ 140.]
Coefficients : [ 1. 2. 3.]
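As a small added check (not in the original), the prediction can also be verified programmatically against the rule op = a + (2 * b) + (3 * c) used to generate the training data:

# verify the prediction against the generating rule (same session as above)
a, b, c = 10, 20, 30
expected = a + (2 * b) + (3 * c)          # 140
assert abs(outcome[0] - expected) < 1e-6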

Linear Regression (Python Implementation)

This article discusses the basics of linear regression and its implementation in the Python programming language.

Linear regression is a statistical approach for modelling the relationship between a dependent variable and a given set of independent variables.

Note: In this article, we refer to the dependent variable as the response and to the independent variables as features, for simplicity.


To build a basic understanding of linear regression, we start with its simplest version: simple linear regression.

Simple Linear Regression

Simple linear regression is an approach for predicting a response using a single feature.

It is assumed that the two variables are linearly related. Hence, we try to find a linear function that predicts the response value (y) as accurately as possible as a function of the feature or independent variable (x).

Let us consider a dataset where we have a value of the response y for every feature x:

x : 0  1  2  3  4  5  6  7  8  9
y : 1  3  2  5  7  8  8  9  10  12

For generality, we define:

x as the feature vector, i.e. x = [x_1, x_2, …, x_n],

y as the response vector, i.e. y = [y_1, y_2, …, y_n]

for n observations (in the above example, n = 10).

A scatter plot of the above dataset shows the points rising roughly linearly with x; the plot itself is generated by the implementation at the end of this section.


Now, the task is to find the line that best fits the above scatter plot, so that we can predict the response for any new feature value (i.e. a value of x not present in the dataset).

This line is called the regression line.

The equation of the regression line is:

 h(x_i) = \beta_0 + \beta_1 x_i

Here,

  • h(x_i) represents the predicted response value for the i-th observation.
  • \beta_0 and \beta_1 are regression coefficients, representing the y-intercept and the slope of the regression line respectively.

To create our model, we must “learn” or estimate the values of the regression coefficients \beta_0 and \beta_1. Once we have estimated these coefficients, we can use the model to predict responses!

In this article, we are going to use the Least Squares technique.

Now consider:

 y_i = \beta_0 + \beta_1 x_i + \varepsilon_i = h(x_i) + \varepsilon_i \Rightarrow \varepsilon_i = y_i - h(x_i)


Here, \varepsilon_i is the residual error in the i-th observation.
So, our aim is to minimize the total residual error.

We define the squared error or cost function J as:
 J(\beta_0, \beta_1) = \frac{1}{2n} \sum_{i=1}^{n} \varepsilon_i^{2}

and our task is to find the values of \beta_0 and \beta_1 for which J(\beta_0, \beta_1) is minimum.
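As an illustrative sketch (an addition, not from the original article), the cost function J can be written directly with NumPy, assuming x and y are the feature and response arrays defined above:

import numpy as np

def cost(b_0, b_1, x, y):
    # residuals between observed and predicted responses
    residuals = y - (b_0 + b_1 * x)
    # J = (1 / 2n) * sum of squared residuals, as in the formula above
    return np.sum(residuals ** 2) / (2 * x.size)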

Without going into the mathematical details, we present the result here:

 \beta_1 = \frac{SS_{xy}}{SS_{xx}}

 \beta_0 = \bar{y} - \beta_1\bar{x}

where SS_xy is the sum of the cross-deviations of y and x:
 SS_{xy} = \sum_{i=1}^{n} (x_i-\bar{x})(y_i-\bar{y}) = \sum_{i=1}^{n} y_ix_i - n\bar{x}\bar{y}

and SS_xx is the sum of the squared deviations of x:
 SS_{xx} = \sum_{i=1}^{n} (x_i-\bar{x})^2 = \sum_{i=1}^{n}x_i^2 - n(\bar{x})^2

Note: The complete derivation for finding least squares estimates in simple linear regression can be found here.

Given below is the Python implementation of the above technique on our small dataset:


import numpy as np
import matplotlib.pyplot as plt
 
def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)
 
    # mean of x and y vector
    m_x, m_y = np.mean(x), np.mean(y)
 
    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x
 
    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x
 
    return (b_0, b_1)
 
def plot_regression_line(x, y, b):
    # plotting the actual points as a scatter plot
    plt.scatter(x, y, color="m", marker="o", s=30)

    # predicted response vector
    y_pred = b[0] + b[1]*x

    # plotting the regression line
    plt.plot(x, y_pred, color="g")

    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')

    # showing the plot
    plt.show()
 
def main():
    # observations
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
 
    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {}  \
          \nb_1 = {}".format(b[0], b[1]))
 
    # plotting regression line
    plot_regression_line(x, y, b)
 
if __name__ == "__main__":
    main()

The output of the above code is:

Estimated coefficients:
b_0 = -0.0586206896552
b_1 = 1.45747126437

The resulting graph shows the data points (magenta) together with the fitted regression line (green).
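As a follow-up sketch (an addition, not from the original), the estimated coefficients can be used to predict the response for a new feature value, and NumPy's polyfit offers an independent cross-check of the least-squares estimates:

import numpy as np

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

# least-squares fit of a degree-1 polynomial; returns (slope, intercept)
b_1, b_0 = np.polyfit(x, y, 1)

# predict the response for a new feature value, e.g. x = 4.5
x_new = 4.5
print(b_0 + b_1 * x_new)   # approximately 6.5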