Alex's Notes

Scikit-learn

A machine learning library in Python, built on top of NumPy, SciPy, and matplotlib.

Datasets

Scikit-learn provides a number of small toy datasets, plus utilities for fetching larger ‘real world’ datasets, as outlined in its documentation.
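As a minimal sketch of loading one of the bundled toy datasets, the snippet below uses load_iris. The choice of the iris dataset is only an illustration; the notes above do not single out any particular dataset.

```python
from sklearn.datasets import load_iris

# load_iris is one of the small toy datasets bundled with the library.
# return_X_y=True returns the (data, target) pair instead of a Bunch object.
X, y = load_iris(return_X_y=True)

print(X.shape)  # (150, 4): 150 samples, 4 features
print(y.shape)  # (150,): one class label per sample
```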

Estimators

Scikit-learn provides implementations of dozens of algorithms and models, which it calls estimators.

Each estimator provides a fit method. For supervised estimators it accepts two inputs, typically NumPy arrays (unsupervised estimators only need X):

  1. The samples matrix X, whose shape is typically (n_samples, n_features), i.e. samples are the rows and features are the columns of a 2-D array.

  2. The target values y: real numbers for regression, integers or another discrete set of labels for classification. y is typically a 1-D array where the ith entry is the target for the ith row of X.

Once the estimator has been fitted, it can be used to predict the target values of new data using its predict method. Here’s a very simple example:

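from sklearn.ensemble import RandomForestClassifier

# random_state fixes the seed so the fitted forest is reproducible
clf = RandomForestClassifier(random_state=0)
X = [[1, 2, 3], [11, 12, 13]]  # 2 samples, 3 features
y = [0, 1]                     # class label of each sample
clf.fit(X, y)
clf.predict([[4, 5, 6], [14, 15, 16]])  # returns array([0, 1])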

Transformers and Pre-processing

Typically we want to transform or impute the data in a pre-processing step before the final prediction. In scikit-learn this is done with transformers, which follow the same estimator API (they are fitted with fit) but expose a transform method instead of a predict method. A ColumnTransformer can apply different transformations to different features (columns).
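A rough sketch of both ideas, assuming a small made-up numeric array. The data, the step names ("standardise", "rescale") and the choice of StandardScaler and MinMaxScaler are illustrative assumptions, not something these notes prescribe.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy data: 3 samples, 2 features
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

# A transformer is fitted with fit, then applied with transform
scaler = StandardScaler()
scaler.fit(X)                   # learns the per-column mean and variance
X_scaled = scaler.transform(X)  # each column now has mean 0 and unit variance

# ColumnTransformer applies a different transformer to each listed column
ct = ColumnTransformer([
    ("standardise", StandardScaler(), [0]),  # column 0: zero mean, unit variance
    ("rescale", MinMaxScaler(), [1]),        # column 1: rescaled to [0, 1]
])
X_mixed = ct.fit_transform(X)   # fit and transform in one call
```

Columns not listed in a ColumnTransformer are dropped by default; here both columns are covered, so the output keeps the original shape.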

Pipelines

A Pipeline can be used to chain transformers and a final predictor into a single object with the same API as a regular estimator (i.e. you can call fit and predict on it).
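The chaining can be sketched as follows, assuming a scaler followed by a logistic regression classifier on the iris toy dataset. The step names, the choice of steps and the max_iter value are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Each step is a (name, estimator) pair; every step except the last
# must be a transformer, and the last one here is a predictor.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("classify", LogisticRegression(max_iter=1000)),
])

# fit scales X, then fits the classifier on the scaled data
pipe.fit(X, y)

# predict applies the fitted scaler to new data before classifying it
predictions = pipe.predict(X[:5])
```

Because the scaler is fitted inside the pipeline, the same scaling parameters learned during fit are reused at prediction time.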

Evaluation

The library provides a train_test_split helper function to split your dataset into training and test sets. It also provides support for cross-validation.
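A short sketch of both approaches, assuming the iris toy dataset and a k-nearest-neighbours classifier. The test_size, n_neighbors and cv values are arbitrary choices for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out 25% of the samples as a test set; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)  # mean accuracy on the held-out data

# 5-fold cross-validation: fits a fresh copy of the estimator on each
# training fold and returns one score per fold
cv_scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
print(test_accuracy, cv_scores.mean())
```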

Specific Algorithms

Nearest Neighbours

Decision Trees

Linear Regression