VanderPlas: Chapter 05: Machine Learning
Metadata
Title: Machine Learning
Number: 5
Naive Bayes
Includes an example on text classification with multinomial Naive Bayes.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import confusion_matrix

# Load the train/test splits of the 20 Newsgroups corpus
train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')

# Pipeline: TF-IDF features feeding a multinomial naive Bayes classifier
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train.data, train.target)
labels = model.predict(test.data)
mat = confusion_matrix(test.target, labels)
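The same pipeline pattern can be exercised end to end without downloading the newsgroups data. A minimal self-contained sketch using an invented toy corpus (the documents and labels below are hypothetical stand-ins, not the book's data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: two classes of short documents
train_docs = ["the rocket launched into orbit",
              "nasa plans a new space mission",
              "the team won the championship game",
              "the striker scored a late goal"]
train_labels = ["space", "space", "sports", "sports"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)

# The fitted pipeline classifies raw strings directly;
# vectorization happens inside the pipeline.
pred = model.predict(["the rocket reached orbit"])
```

Because the pipeline bundles the vectorizer and the classifier, new text goes in as plain strings and the same TF-IDF vocabulary learned at fit time is reused at prediction time.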
When to use it? It generally won’t perform as well as a more complex model, but it has several advantages:
Very fast for training and prediction
Straightforward probabilistic prediction
Easily interpretable
Few tunable parameters
A good choice as a baseline classifier: if it works, great, job done; if not, you have a baseline against which to compare more complex models.
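The baseline idea can be made concrete by scoring naive Bayes next to a heavier model on the same split. A hedged sketch using scikit-learn's bundled digits data (the choice of random forest as the "more complex model" is illustrative, not from the notes; the pixel features are non-negative counts, so multinomial NB applies directly):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier

X, y = load_digits(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

# Fast baseline vs. a more complex model on identical data
nb_score = MultinomialNB().fit(Xtr, ytr).score(Xte, yte)
rf_score = RandomForestClassifier(random_state=0).fit(Xtr, ytr).score(Xte, yte)
# The gap between rf_score and nb_score shows what the extra
# model complexity is actually buying you.
```

If the gap is small, the extra training cost and tuning effort of the complex model may not be worth it.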
Works very well in the following situations:
When the naive independence assumptions actually match the data (rare in practice).
For well-separated categories, when model complexity is less important.
For very high-dimensional data.
Clusters in high dimensions tend to be more separated than clusters in low dimensions, provided the added dimensions actually carry information. As dimensionality grows, simple classifiers can therefore outperform more complex ones.
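The separation effect is easy to see on synthetic data: the same two-cluster problem, drawn at increasing dimensionality, gets easier for naive Bayes. An illustrative sketch (Gaussian NB here rather than multinomial, since the blob features are continuous; the specific sample sizes and noise level are arbitrary choices):

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Cross-validated accuracy of Gaussian NB on two noisy clusters,
# keeping everything fixed except the number of (informative) dimensions
accs = {}
for d in (2, 10, 100):
    X, y = make_blobs(n_samples=500, centers=2, n_features=d,
                      cluster_std=6.0, random_state=1)
    accs[d] = cross_val_score(GaussianNB(), X, y, cv=5).mean()
```

Each added dimension contributes a little extra distance between the cluster centers while the per-dimension noise stays fixed, so the high-dimensional version of the problem is nearly linearly separable even though the 2-D version overlaps.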