Scikit Learn
A machine learning library in Python, built on top of NumPy, SciPy, and matplotlib.
Datasets
Scikit-learn provides a number of toy datasets, along with utilities for accessing ‘real world’ datasets, as outlined in its documentation.
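As a minimal sketch of what loading one of the toy datasets looks like, the snippet below uses load_iris, which ships with the library and needs no download. The choice of the iris dataset is only for illustration; the fetch_* utilities (for example fetch_openml) follow a similar pattern but download their data on first use.

```python
from sklearn.datasets import load_iris

# return_X_y=True returns the samples matrix and the targets as two
# NumPy arrays instead of a dictionary-like Bunch object.
X, y = load_iris(return_X_y=True)

n_samples, n_features = X.shape  # 150 samples, 4 features each
n_targets = y.shape[0]           # one target per sample
```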
Estimators
Scikit-learn provides implementations of dozens of algorithms and models, which it calls estimators.
Each estimator provides a fit method which accepts two inputs, typically NumPy arrays:
The samples matrix X, the shape of which is typically (n_samples, n_features), i.e. samples are rows and features are columns of a 2-D array.
The target values y, which are real values for regression, and integers or another discrete set for classification. y is typically a 1-D array where the ith entry corresponds to the target of the ith row of X.
Once the estimator has been fitted, it can be used to predict the target values of new data using its predict method. Here’s a very simple example:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=0)
X = [[1, 2, 3], [11, 12, 13]]  # 2 samples, 3 features
y = [0, 1]  # class of each sample
clf.fit(X, y)
clf.predict([[4, 5, 6], [14, 15, 16]])  # predicts [0, 1]
Transformers and Pre-processing
Typically we want to transform or impute the data in a pre-processing step before our final prediction. This is also done with estimator subclasses in scikit-learn, except such transformers have a transform method instead of a predict method. A ColumnTransformer can apply different transformations to different features.
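A small sketch of both ideas follows. The toy array, the choice of StandardScaler and MinMaxScaler, and the step names "standardise" and "minmax" are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# 3 samples, 2 features.
X = np.array([[0.0, 10.0], [1.0, 20.0], [2.0, 30.0]])

# A transformer is fitted like any estimator, then applied with transform().
scaler = StandardScaler()
X_scaled = scaler.fit(X).transform(X)  # every column now has zero mean, unit variance

# ColumnTransformer applies a different transformer to each listed column
# and concatenates the results in the order given.
ct = ColumnTransformer([
    ("standardise", StandardScaler(), [0]),  # column 0: zero mean, unit variance
    ("minmax", MinMaxScaler(), [1]),         # column 1: rescaled to [0, 1]
])
X_mixed = ct.fit_transform(X)
```

Here fit_transform is a shorthand for calling fit and then transform on the same data.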
Pipelines
A Pipeline can be used to chain together transformers and predictors into a single object with the same API as a regular estimator (i.e. you can use fit and predict on it).
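As a rough illustration, the sketch below chains a scaler and a classifier using the make_pipeline helper, which builds a Pipeline and names the steps automatically. The iris data and the particular scaler and classifier are assumptions chosen only to keep the example short.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Chain a transformer (scaling) and a predictor (classifier).
pipe = make_pipeline(StandardScaler(), LogisticRegression())

# fit() fits the scaler, transforms X, then fits the classifier on the result.
pipe.fit(X, y)

# predict() applies the already-fitted scaler, then the classifier's predict.
predictions = pipe.predict(X[:5])
```

Keeping the pre-processing inside the pipeline means the same transformations are applied at prediction time as were learned during fitting.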
Evaluation
The library provides a train_test_split helper function to split your dataset into training and test sets. It also provides support for cross-validation.
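A minimal sketch of both evaluation approaches is given here. The dataset, the 25% test size, the 5 folds, and the use of cross_val_score (one of several cross-validation helpers in sklearn.model_selection) are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 25% of the samples for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)  # mean accuracy on the held-out set

# 5-fold cross-validation: the estimator is refitted on each training fold
# and scored on the corresponding held-out fold.
cv_scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
mean_cv_accuracy = cv_scores.mean()
```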