Scikit-learn in Python

Learning Model Building in Scikit-learn: A Python Machine Learning Library

Pre-requisite: Getting started with machine learning
scikit-learn is an open source Python library that implements a range of machine learning, pre-processing, cross-validation and visualization algorithms using a unified interface.

Important features of scikit-learn:

  • Simple and efficient tools for data mining and data analysis. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means, etc.
  • Accessible to everybody and reusable in various contexts.
  • Built on the top of NumPy, SciPy, and matplotlib.
  • Open source, commercially usable – BSD license.

In this article, we are going to see how we can easily build a machine learning model using scikit-learn.

Installation:

Scikit-learn requires the following as its dependencies:

  • NumPy
  • SciPy

Before installing scikit-learn, ensure that you have NumPy and SciPy installed. Once you have a working installation of NumPy and SciPy, the easiest way to install scikit-learn is using pip:

pip install -U scikit-learn
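To verify the installation, a quick sanity check is to import the package and print its version (a minimal sketch; the version string on your machine will differ):

# confirm scikit-learn is importable and show which version is installed
import sklearn
print(sklearn.__version__)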

Let us get started with the modeling process now.

Step 1: Load a dataset

A dataset is nothing but a collection of data. A dataset generally has two main components:

  • Features: (also known as predictors, inputs, or attributes) the input variables of our data. Since there is usually more than one, they are represented by a feature matrix (‘X’ is the common notation for the feature matrix). The list of all the feature names is termed feature names.
  • Response: (also known as the target, label, or output) the output variable that depends on the feature variables. We generally have a single response column, represented by a response vector (‘y’ is the common notation for the response vector). The set of all possible values taken by the response vector is termed target names.

Loading exemplar dataset: scikit-learn comes loaded with a few example datasets, like the iris and digits datasets for classification and the Boston house prices dataset for regression.
Given below is an example of how one can load an exemplar dataset:

 

# load the iris dataset as an example
from sklearn.datasets import load_iris
iris = load_iris()
# store the feature matrix (X) and response vector (y)
X = iris.data
y = iris.target
# store the feature and target names
feature_names = iris.feature_names
target_names = iris.target_names
# printing features and target names of our dataset
print("Feature names:", feature_names)
print("Target names:", target_names)
# X and y are numpy arrays
print("\nType of X is:", type(X))
# printing first 5 input rows
print("\nFirst 5 rows of X:\n", X[:5])

Output:

Feature names: ['sepal length (cm)','sepal width (cm)',
                'petal length (cm)','petal width (cm)']
Target names: ['setosa' 'versicolor' 'virginica']

Type of X is: <class 'numpy.ndarray'>

First 5 rows of X:
 [[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]]

Loading external dataset: Now, consider the case when we want to load an external dataset. For this purpose, we can use the pandas library to easily load and manipulate datasets.

To install pandas, use the following pip command:

pip install pandas

In pandas, important data types are:

Series: a one-dimensional labeled array capable of holding any data type.

DataFrame: a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. A minimal sketch of both types is given below.
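The values and labels in this sketch are made up purely for illustration:

import pandas as pd

# a Series: one-dimensional labeled array
humidity = pd.Series([65, 70, 80], index=['day1', 'day2', 'day3'], name='Humidity')
print(humidity)

# a DataFrame: 2-dimensional table whose columns may have different types
df = pd.DataFrame({'Outlook': ['sunny', 'rainy', 'overcast'],
                   'Windy': [False, True, False]})
print(df)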

Note: The CSV file used in example below can be downloaded from here: weather.csv

 

import pandas as pd
# reading csv file
data = pd.read_csv('weather.csv')
# shape of dataset
print("Shape:", data.shape)
# column names
print("\nFeatures:", data.columns)
# storing the feature matrix (X) and response vector (y)
X = data[data.columns[:-1]]
y = data[data.columns[-1]]
# printing first 5 rows of feature matrix
print("\nFeature matrix:\n", X.head())
# printing first 5 values of response vector
print("\nResponse vector:\n", y.head())

Output:

Shape: (14, 5)

Features: Index(['Outlook', 'Temperature', 'Humidity',
                'Windy', 'Play'], dtype='object')

Feature matrix:
     Outlook Temperature Humidity  Windy
0  overcast         hot     high  False
1  overcast        cool   normal   True
2  overcast        mild     high   True
3  overcast         hot   normal  False
4     rainy        mild     high  False

Response vector:
0    yes
1    yes
2    yes
3    yes
4    yes
Name: Play, dtype: object

Step 2: Splitting the dataset

This step uses scikit-learn's train_test_split helper. For reference, here are its signature and parameters as given in the scikit-learn documentation:

sklearn.model_selection.train_test_split(*arrays, **options)

Split arrays or matrices into random train and test subsets

Quick utility that wraps input validation and next(ShuffleSplit().split(X, y)) and application to input data into a single call for splitting (and optionally subsampling) data in a oneliner.


Parameters:
*arrays : sequence of indexables with same length / shape[0]

Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.

test_size : float, int or None, optional (default=None)

If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.

train_size : float, int or None, optional (default=None)

If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.

random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

shuffle : boolean, optional (default=True)

Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None.

stratify : array-like or None (default=None)

If not None, data is split in a stratified fashion, using this as the class labels.

Returns:
splitting : list, length=2 * len(arrays)

List containing train-test split of inputs.

New in version 0.16: If the input is sparse, the output will be a scipy.sparse.csr_matrix. Else, output type is the same as the input type.

Examples


>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
>>> list(y)
[0, 1, 2, 3, 4]


>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.33, random_state=42)
...
>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7]])
>>> y_train
[2, 0, 3]
>>> X_test
array([[2, 3],
       [8, 9]])
>>> y_test
[1, 4]


>>> train_test_split(y, shuffle=False)
[[0, 1, 2], [3, 4]]

 

One important aspect of any machine learning model is determining its accuracy. A naive approach is to train the model on the given dataset and then predict response values for that same dataset, measuring accuracy on data the model has already seen.
But this method has several flaws:

  • The goal is to estimate the likely performance of a model on out-of-sample data.
  • Maximizing training accuracy rewards overly complex models that won't necessarily generalize well.
  • Unnecessarily complex models may over-fit the training data, as the sketch below demonstrates.
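This short sketch illustrates the problem (the exact test accuracy depends on the split, but it sits below the perfect training accuracy):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

# with n_neighbors=1 each training point is its own nearest neighbor,
# so the training accuracy is a perfect, and misleading, 1.0
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print("training accuracy:", knn.score(X_train, y_train))
# accuracy on unseen data is the number that actually matters
print("testing accuracy:", knn.score(X_test, y_test))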

A better option is to split our data into two parts: the first for training our machine learning model, and the second for testing it.
To summarize:

  • Split the dataset into two pieces: a training set and a testing set.
  • Train the model on the training set.
  • Test the model on the testing set, and evaluate how well our model did.

Advantages of train/test split:

  • The model is tested on data different from the data it was trained on.
  • Response values are known for the test dataset, hence predictions can be evaluated.
  • Testing accuracy is a better estimate than training accuracy of out-of-sample performance.

Consider the example below:

 

# load the iris dataset as an example
from sklearn.datasets import load_iris
iris = load_iris()
# store the feature matrix (X) and response vector (y)
X = iris.data
y = iris.target
# splitting X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)
# printing the shapes of the new X objects
print(X_train.shape)
print(X_test.shape)
# printing the shapes of the new y objects
print(y_train.shape)
print(y_test.shape)

Output:

(90, 4)
(60, 4)
(90,)
(60,)

The train_test_split function takes several arguments which are explained below:

  • X, y: the feature matrix and response vector that need to be split.
  • test_size: the proportion of the dataset to hold out for testing. For example, setting test_size=0.4 for 150 rows of X produces a test set of 150 × 0.4 = 60 rows.
  • random_state: if you pass random_state=some_number, the split is guaranteed to be the same every time. This is useful when you want reproducible results, for example when testing for consistency in documentation (so that everybody sees the same numbers). A short sketch of two further options, shuffle and stratify, follows this list.
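The sketch below (not part of the original example) shows that passing stratify=y preserves the 50/50/50 class balance of the iris dataset in both subsets:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# stratify=y splits each of the three classes in the same 60/40 proportion
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1, stratify=y)
print(np.bincount(y_train))   # [30 30 30]
print(np.bincount(y_test))    # [20 20 20]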

Step 3: Training the model

Now, it's time to train a prediction model using our dataset. Scikit-learn provides a wide range of machine learning algorithms which share a unified/consistent interface for fitting, predicting accuracy, etc.

The example given below uses KNN (K nearest neighbors) classifier.

Note: We will not go into the details of how the algorithm works as we are interested in understanding its implementation only.

Now, consider the example below:

# load the iris dataset as an example
from sklearn.datasets import load_iris
iris = load_iris()
# store the feature matrix (X) and response vector (y)
X = iris.data
y = iris.target
# splitting X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)
# training the model on training set
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
# making predictions on the testing set
y_pred = knn.predict(X_test)
# comparing actual response values (y_test) with predicted response values (y_pred)
from sklearn import metrics
print("kNN model accuracy:", metrics.accuracy_score(y_test, y_pred))
# making prediction for out of sample data
sample = [[3, 5, 4, 2], [2, 3, 5, 4]]
preds = knn.predict(sample)
pred_species = [iris.target_names[p] for p in preds]
print("Predictions:", pred_species)
# saving the model (note: use the standalone joblib package, as
# sklearn.externals.joblib has been removed from recent scikit-learn versions)
import joblib
joblib.dump(knn, 'iris_knn.pkl')

Output:

kNN model accuracy: 0.9833333333333333
Predictions: ['versicolor', 'virginica']

Important points to note from the above code:

  • We create a knn classifier object using:
    knn = KNeighborsClassifier(n_neighbors=3)
    
  • The classifier is trained using the X_train data. The process is termed fitting. We pass the feature matrix and the corresponding response vector.
    knn.fit(X_train, y_train)
    
  • Now, we need to test our classifier on the X_test data. knn.predict method is used for this purpose. It returns the predicted response vector, y_pred.
    y_pred = knn.predict(X_test)
    
  • Now, we are interested in finding the accuracy of our model by comparing y_test and y_pred. This is done using metrics module’s method accuracy_score:
    print(metrics.accuracy_score(y_test, y_pred))
    
  • Consider the case when you want your model to make predictions on out-of-sample data. The sample input can simply be passed in the same way as we pass any feature matrix.
    sample = [[3, 5, 4, 2], [2, 3, 5, 4]]
    preds = knn.predict(sample)
    
  • If you do not want to train your classifier again and again, you can save a trained classifier using joblib and reuse it later. All you need to do is:
    joblib.dump(knn, 'iris_knn.pkl')
    
  • In case you want to load an already saved classifier, use the following method (a complete save-and-load sketch follows this list):
    knn = joblib.load('iris_knn.pkl')
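Putting saving and loading together, here is a minimal round-trip sketch. Recent scikit-learn versions have removed sklearn.externals.joblib, so the standalone joblib package is imported directly:

import joblib
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# persist the trained classifier to disk
joblib.dump(knn, 'iris_knn.pkl')

# reload it later and use it exactly like the original object
knn_loaded = joblib.load('iris_knn.pkl')
print(knn_loaded.predict([[3, 5, 4, 2]]))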

As we approach the end of this article, here are some benefits of using scikit-learn over other machine learning libraries (such as R's machine learning packages):

  • Consistent interface to machine learning models
  • Provides many tuning parameters but with sensible defaults
  • Exceptional documentation
  • Rich set of functionality for companion tasks
  • Active community for development and support

Worked example: train/test intuition

Suppose the true relationship between a single input x and the response is y = x + 2:

   x      y
   1      3
   6      8
   8     10
  99    101

The input data has 4 rows. The first 2 rows serve as training data and the last 2 rows as test data:

  x_train  y_train
    1        3
    6        8

From the training pairs the model identifies the formula y = x + 2. Applying it to x_test yields predictions y_pred, which are compared against the true values y_test. A prediction of 7.5 for a true value of 8, for instance, is close but not exact, so the score drops below the perfect value of 1 (scores range from 0 to 1, e.g. 0.8).
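The same toy example can be run in scikit-learn. This is an illustrative sketch only (the notes above describe the idea without code); LinearRegression is used because the toy response is a straight line:

import numpy as np
from sklearn.linear_model import LinearRegression

# toy data following y = x + 2
X = np.array([[1], [6], [8], [99]])
y = np.array([3, 8, 10, 101])

# first 2 rows are training data, last 2 rows are test data (no shuffling)
X_train, y_train = X[:2], y[:2]
X_test, y_test = X[2:], y[2:]

# the model recovers the formula y = x + 2 from the training pairs
model = LinearRegression().fit(X_train, y_train)
print(model.predict(X_test))        # approximately [ 10. 101.]
print(model.score(X_test, y_test))  # R^2 score; 1.0 means a perfect fit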

 

For reference, here is the complete pipeline again as a single script with step markers:

# load the iris dataset as an example
from sklearn.datasets import load_iris

from sklearn.model_selection import train_test_split

iris = load_iris()

# store the feature matrix (X) and response vector (y)
X = iris.data
y = iris.target

###################################################
#Step 2: Splitting the dataset
###################################################

# splitting X and y into training and testing sets

# note: random_state has no effect when shuffle=False
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, shuffle=False, random_state=1)

# printing the shapes of the new X objects
print(X_train.shape)
print(X_test.shape)

###################################################
## Step 3: training the model on the training set
###################################################

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)

knn.fit(X_train, y_train)

# making predictions on the testing set

y_pred = knn.predict(X_test)

# comparing actual response values (y_test) with predicted response values (y_pred)

from sklearn import metrics
print("kNN model accuracy:", metrics.accuracy_score(y_test, y_pred))

# making prediction for out of sample data
#sample = [[3, 5, 4, 2], [2, 3, 5, 4]]

#preds = knn.predict(sample)

#pred_species = [iris.target_names[p] for p in preds]
#print("Predictions:", pred_species)
# saving the model (uncomment to run; use the standalone joblib package,
# since sklearn.externals.joblib was removed in newer scikit-learn versions)
#import joblib
#joblib.dump(knn, 'iris_knn.pkl')

# printing the shapes of the new y objects
#print(y_train.shape)
#print(y_test.shape)


 

D:\lab\batch84>python
Python 3.8.1 (tags/v3.8.1:1b293b6, Dec 18 2019, 23:11:46) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from sklearn.datasets import load_iris
>>> iris=load_iris()
>>> x=iris.data
>>> print(x)
[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5. 3.6 1.4 0.2]
[5.4 3.9 1.7 0.4]
[4.6 3.4 1.4 0.3]
[5. 3.4 1.5 0.2]
[4.4 2.9 1.4 0.2]
[4.9 3.1 1.5 0.1]
[5.4 3.7 1.5 0.2]
[4.8 3.4 1.6 0.2]
[4.8 3. 1.4 0.1]
[4.3 3. 1.1 0.1]
[5.8 4. 1.2 0.2]
[5.7 4.4 1.5 0.4]
[5.4 3.9 1.3 0.4]
[5.1 3.5 1.4 0.3]
[5.7 3.8 1.7 0.3]
[5.1 3.8 1.5 0.3]
[5.4 3.4 1.7 0.2]
[5.1 3.7 1.5 0.4]
[4.6 3.6 1. 0.2]
[5.1 3.3 1.7 0.5]
[4.8 3.4 1.9 0.2]
[5. 3. 1.6 0.2]
[5. 3.4 1.6 0.4]
[5.2 3.5 1.5 0.2]
[5.2 3.4 1.4 0.2]
[4.7 3.2 1.6 0.2]
[4.8 3.1 1.6 0.2]
[5.4 3.4 1.5 0.4]
[5.2 4.1 1.5 0.1]
[5.5 4.2 1.4 0.2]
[4.9 3.1 1.5 0.2]
[5. 3.2 1.2 0.2]
[5.5 3.5 1.3 0.2]
[4.9 3.6 1.4 0.1]
[4.4 3. 1.3 0.2]
[5.1 3.4 1.5 0.2]
[5. 3.5 1.3 0.3]
[4.5 2.3 1.3 0.3]
[4.4 3.2 1.3 0.2]
[5. 3.5 1.6 0.6]
[5.1 3.8 1.9 0.4]
[4.8 3. 1.4 0.3]
[5.1 3.8 1.6 0.2]
[4.6 3.2 1.4 0.2]
[5.3 3.7 1.5 0.2]
[5. 3.3 1.4 0.2]
[7. 3.2 4.7 1.4]
[6.4 3.2 4.5 1.5]
[6.9 3.1 4.9 1.5]
[5.5 2.3 4. 1.3]
[6.5 2.8 4.6 1.5]
[5.7 2.8 4.5 1.3]
[6.3 3.3 4.7 1.6]
[4.9 2.4 3.3 1. ]
[6.6 2.9 4.6 1.3]
[5.2 2.7 3.9 1.4]
[5. 2. 3.5 1. ]
[5.9 3. 4.2 1.5]
[6. 2.2 4. 1. ]
[6.1 2.9 4.7 1.4]
[5.6 2.9 3.6 1.3]
[6.7 3.1 4.4 1.4]
[5.6 3. 4.5 1.5]
[5.8 2.7 4.1 1. ]
[6.2 2.2 4.5 1.5]
[5.6 2.5 3.9 1.1]
[5.9 3.2 4.8 1.8]
[6.1 2.8 4. 1.3]
[6.3 2.5 4.9 1.5]
[6.1 2.8 4.7 1.2]
[6.4 2.9 4.3 1.3]
[6.6 3. 4.4 1.4]
[6.8 2.8 4.8 1.4]
[6.7 3. 5. 1.7]
[6. 2.9 4.5 1.5]
[5.7 2.6 3.5 1. ]
[5.5 2.4 3.8 1.1]
[5.5 2.4 3.7 1. ]
[5.8 2.7 3.9 1.2]
[6. 2.7 5.1 1.6]
[5.4 3. 4.5 1.5]
[6. 3.4 4.5 1.6]
[6.7 3.1 4.7 1.5]
[6.3 2.3 4.4 1.3]
[5.6 3. 4.1 1.3]
[5.5 2.5 4. 1.3]
[5.5 2.6 4.4 1.2]
[6.1 3. 4.6 1.4]
[5.8 2.6 4. 1.2]
[5. 2.3 3.3 1. ]
[5.6 2.7 4.2 1.3]
[5.7 3. 4.2 1.2]
[5.7 2.9 4.2 1.3]
[6.2 2.9 4.3 1.3]
[5.1 2.5 3. 1.1]
[5.7 2.8 4.1 1.3]
[6.3 3.3 6. 2.5]
[5.8 2.7 5.1 1.9]
[7.1 3. 5.9 2.1]
[6.3 2.9 5.6 1.8]
[6.5 3. 5.8 2.2]
[7.6 3. 6.6 2.1]
[4.9 2.5 4.5 1.7]
[7.3 2.9 6.3 1.8]
[6.7 2.5 5.8 1.8]
[7.2 3.6 6.1 2.5]
[6.5 3.2 5.1 2. ]
[6.4 2.7 5.3 1.9]
[6.8 3. 5.5 2.1]
[5.7 2.5 5. 2. ]
[5.8 2.8 5.1 2.4]
[6.4 3.2 5.3 2.3]
[6.5 3. 5.5 1.8]
[7.7 3.8 6.7 2.2]
[7.7 2.6 6.9 2.3]
[6. 2.2 5. 1.5]
[6.9 3.2 5.7 2.3]
[5.6 2.8 4.9 2. ]
[7.7 2.8 6.7 2. ]
[6.3 2.7 4.9 1.8]
[6.7 3.3 5.7 2.1]
[7.2 3.2 6. 1.8]
[6.2 2.8 4.8 1.8]
[6.1 3. 4.9 1.8]
[6.4 2.8 5.6 2.1]
[7.2 3. 5.8 1.6]
[7.4 2.8 6.1 1.9]
[7.9 3.8 6.4 2. ]
[6.4 2.8 5.6 2.2]
[6.3 2.8 5.1 1.5]
[6.1 2.6 5.6 1.4]
[7.7 3. 6.1 2.3]
[6.3 3.4 5.6 2.4]
[6.4 3.1 5.5 1.8]
[6. 3. 4.8 1.8]
[6.9 3.1 5.4 2.1]
[6.7 3.1 5.6 2.4]
[6.9 3.1 5.1 2.3]
[5.8 2.7 5.1 1.9]
[6.8 3.2 5.9 2.3]
[6.7 3.3 5.7 2.5]
[6.7 3. 5.2 2.3]
[6.3 2.5 5. 1.9]
[6.5 3. 5.2 2. ]
[6.2 3.4 5.4 2.3]
[5.9 3. 5.1 1.8]]

>>> y=iris.target
>>> print(y)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
>>> from sklearn.model_selection import train_test_split
>>> x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.4,shuffle=True)
>>> print(x_train.shape)
(90, 4)
>>> print(x_test.shape)
(60, 4)
>>> from sklearn.neighbors import KNeighborsClassifier
>>> knn=KNeighborsClassifier(n_neighbors=3)
>>> knn.fit(x_train,y_train)
KNeighborsClassifier(n_neighbors=3)
>>> y_pred=knn.predict(x_test)
>>> from sklearn import metrics
>>> print("accuracy:",metrics.accuracy_score(y_test,y_pred)
... )
accuracy: 0.95
>>>