Alex's Notes

VanderPlas: Chapter 05: Machine Learning

Metadata

Naive Bayes

Includes an example on text classification with multinomial Naive Bayes.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import confusion_matrix

# 20 Newsgroups data, as in the chapter's example
train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')

model = make_pipeline(TfidfVectorizer(), MultinomialNB())

model.fit(train.data, train.target)
labels = model.predict(test.data)

mat = confusion_matrix(test.target, labels)
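The same pipeline pattern can be sketched on a tiny made-up corpus (the documents and labels below are invented for illustration, not from the chapter), which makes it easy to see that the fitted pipeline predicts directly from raw strings:

```python
# Minimal sketch of the tf-idf + multinomial NB pipeline on a toy corpus.
# Labels are made up: 0 = sports, 1 = space.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["the team won the game",
        "the rocket reached orbit",
        "players scored in the match",
        "the satellite orbits the moon"]
labels = [0, 1, 0, 1]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(docs, labels)

# The pipeline vectorizes new strings itself, so prediction works on raw text.
pred = model.predict(["the astronauts boarded the rocket"])
print(pred)  # predicts the space class (1)
```

Because the vectorizer lives inside the pipeline, there is no separate transform step to remember at prediction time.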

When to use it? It generally won’t perform as well as a more complex model, but its advantages are:

  • Very fast for training and prediction

  • Straightforward probabilistic prediction

  • Easily interpretable

  • Few tunable parameters
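The "straightforward probabilistic prediction" point can be seen via predict_proba, which every scikit-learn naive Bayes estimator exposes. A minimal sketch with made-up word-count data:

```python
# Hedged sketch: MultinomialNB returns calibratable class probabilities
# via predict_proba. The count matrix here is invented for illustration.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

X = np.array([[3, 0],   # counts of two words per document
              [2, 1],
              [0, 4],
              [1, 3]])
y = np.array([0, 0, 1, 1])

clf = MultinomialNB().fit(X, y)
proba = clf.predict_proba([[0, 5]])
print(proba)  # one row per sample; each row sums to 1, class 1 dominates here
```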

A good choice as a baseline classifier: if it works, great, job done; if not, you have a baseline against which to check more complex models.

Works very well in the following situations:

  • When the naive assumptions match the data (very rare).

  • For well-separated categories, when model complexity is less important.

  • For very high-dimensional data.

    Clusters in high dimensions tend to be more separated than clusters in low dimensions, assuming the new dimensions do add information. Simplistic classifiers can work better than more complex ones as the dimensionality grows.
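The well-separated, high-dimensional regime can be sketched with synthetic data (the blob parameters below are made up, not from the chapter): Gaussian naive Bayes scores near-perfectly when the classes barely overlap.

```python
# Hedged sketch: Gaussian naive Bayes on well-separated high-dimensional
# synthetic blobs — the regime where a simple classifier suffices.
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_blobs(n_samples=400, centers=2, n_features=50,
                  cluster_std=2.0, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

score = GaussianNB().fit(Xtr, ytr).score(Xte, yte)
print(score)  # near-perfect accuracy on well-separated clusters
```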

Key Quotes

Further Notes