Naive Bayes vs. Logistic Regression: A Deep Dive
October 30, 2022 / 4 min read
Last Updated: June 4, 2024

Premise
Before we dive into the math, let's first define a small yet meaningful example: classifying words as either complex or not complex.
Let's say we have a dataset of words, and we want to classify them as either complex or not complex. We have three features for each word: length, frequency, and syllables. We then use these features to train a model to predict the class of a word via Naive Bayes or Logistic Regression.
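To make the setup concrete, here is a minimal sketch of what such a dataset might look like in Python. The words, frequency values, and syllable counts below are invented purely for illustration.

```python
# Toy dataset: (word, length, frequency, syllables, is_complex)
# Frequency and syllable values are made up for illustration.
data = [
    ("cat",             3, 92, 1, 0),
    ("apple",           5, 61, 2, 0),
    ("notwithstanding", 15,  3, 4, 1),
    ("serendipity",     11,  2, 5, 1),
]

# Feature vectors (length, frequency, syllables) and binary labels
X = [[length, freq, syll] for _, length, freq, syll, _ in data]
y = [label for *_, label in data]
```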
Naive Bayes
For Naive Bayes, we are assuming each feature is conditionally independent of the others given the class, in a common-cause Bayes net configuration. That is, given a sample $x = (x_1, x_2, x_3)$ of length, frequency, and syllables with class label $y$, the joint factorizes as

$$P(y, x_1, x_2, x_3) = P(y) \prod_{i=1}^{3} P(x_i \mid y)$$

where $y \in \{\text{complex}, \text{not complex}\}$.

Note our parameters $P(y)$ and $P(x_i \mid y)$ are simply relative frequencies counted from the training data.
Again, to visualize our joint distribution, we can lay it out as a table with one cell per assignment of $(y, x_1, x_2, x_3)$, each cell filled in by the factored product above.

And the size of this joint distribution table is $|Y| \cdot |X_1| \cdot |X_2| \cdot |X_3|$ cells, yet the Naive Bayes factorization only needs the prior table $P(y)$ and the three conditional tables $P(x_i \mid y)$ to fill it in. For example, with a binary class and each feature discretized into 10 buckets, the full table has $2 \cdot 10^3 = 2000$ cells, but only $2 + 3 \cdot (2 \cdot 10) = 62$ estimated numbers are needed.
Inference for our Naive Bayes is done by first computing the posterior from the calculated joint:

$$P(y \mid x_1, x_2, x_3) = \frac{P(y)\prod_{i=1}^{3} P(x_i \mid y)}{\sum_{y'} P(y')\prod_{i=1}^{3} P(x_i \mid y')}$$

Then our prediction is

$$\hat{y} = \operatorname*{arg\,max}_{y} \; P(y \mid x_1, x_2, x_3)$$
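As a rough sketch of what this looks like in code (not any particular library's implementation), here is a hand-rolled version where the parameters are literally the relative frequencies counted from the training data. It assumes the three features have already been discretized into categorical buckets.

```python
from collections import Counter, defaultdict

def train_naive_bayes(X, y):
    """Estimate P(y) and P(x_j | y) as relative frequencies from the data."""
    n = len(y)
    class_counts = Counter(y)             # counts of each class label
    cond_counts = defaultdict(Counter)    # (feature index, class) -> value counts
    for xi, yi in zip(X, y):
        for j, v in enumerate(xi):
            cond_counts[(j, yi)][v] += 1
    priors = {c: class_counts[c] / n for c in class_counts}
    likelihoods = {
        key: {v: cnt / sum(ctr.values()) for v, cnt in ctr.items()}
        for key, ctr in cond_counts.items()
    }
    return priors, likelihoods

def predict(x, priors, likelihoods):
    """Return argmax_y P(y) * prod_j P(x_j | y); unseen values get probability 0."""
    scores = {}
    for c, p_y in priors.items():
        score = p_y
        for j, v in enumerate(x):
            score *= likelihoods[(j, c)].get(v, 0.0)
        scores[c] = score
    return max(scores, key=scores.get)
```

The `predict` function is a direct translation of the argmax above; since the posterior's denominator is the same for every class, comparing the unnormalized joint scores is enough.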
Logistic Regression
Now for logistic regression, we are stipulating the following parametric representation; let $x = (x_1, x_2, x_3)$ be the feature vector, then

$$P(y = 1 \mid x) = \sigma(w^\top x + b) = \frac{1}{1 + e^{-(w^\top x + b)}}$$

where $w$ is the weight vector, $b$ is the bias, and $\sigma$ is the sigmoid function.
Instead of using relative frequencies from training data, logistic regression's maximum likelihood estimation finds the parameters $w, b$ that maximize the conditional log-likelihood of the training labels:

$$\hat{w}, \hat{b} = \operatorname*{arg\,max}_{w,\, b} \; \sum_{i} \log P(y^{(i)} \mid x^{(i)}; w, b)$$

which has no closed-form solution and is typically optimized with gradient-based methods.
Inference is done by directly calculating the posterior $P(y = 1 \mid x)$ and thresholding it:

$$\hat{y} = \mathbb{1}\left[\, P(y = 1 \mid x) > 0.5 \,\right]$$
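Here is a minimal sketch of the same idea with scikit-learn, reusing the toy feature rows from the premise. Note that sklearn's `LogisticRegression` applies L2 regularization by default, so this is penalized rather than plain maximum likelihood.

```python
from sklearn.linear_model import LogisticRegression

# Toy feature rows [length, frequency, syllables] and labels from the premise sketch.
X = [[3, 92, 1], [5, 61, 2], [15, 3, 4], [11, 2, 5]]
y = [0, 0, 1, 1]

# Fit w and b by (regularized) maximum likelihood rather than counting.
clf = LogisticRegression()            # L2 regularization is on by default
clf.fit(X, y)

# Inference: compute P(y=1 | x) via the sigmoid and threshold at 0.5.
probs = clf.predict_proba(X)[:, 1]    # sigmoid(w.x + b) for each row
preds = (probs > 0.5).astype(int)     # equivalent to clf.predict(X)
```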
Now it's clear that Naive Bayes is a simpler model than logistic regression. The conditional independence assumption stipulates that each feature is conditionally independent of the others given the class, which may not be true. In contrast, logistic regression does not make such an assumption and instead stipulates a sigmoidal relation between the features and the class label. The parameters of logistic regression are also more flexible, since they are fit by optimization rather than calculated directly from counts in the training dataset.
Logistic regression has its own host of issues as well, though. If features are highly correlated, the variance inflation factor (VIF) arising from that multicollinearity will be high, which inflates the variance of the estimated coefficients and makes them unstable.
The issue of multicollinearity is more pronounced when we take statistical significance into account. I don't think scikit-learn filters features based on significance (which itself could be an issue), but when we do, the standard errors we get from the regression won't be correct. A high VIF leads to decreased statistical power, and thus less reliable coefficients and predictions.
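As an illustration (not from the original example), here is a self-contained sketch of checking VIF with statsmodels on synthetic data, where the syllable count is deliberately constructed to track word length so its VIF comes out high.

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic stand-ins for (length, frequency, syllables); syllables is built to
# correlate strongly with length. A VIF above roughly 5-10 is a common warning sign.
rng = np.random.default_rng(0)
length = rng.integers(3, 15, size=200).astype(float)
frequency = rng.integers(1, 100, size=200).astype(float)
syllables = length / 3 + rng.normal(0, 0.3, size=200)

X_feat = np.column_stack([length, frequency, syllables])
X_design = np.column_stack([np.ones(len(X_feat)), X_feat])  # intercept for the auxiliary regressions

for j, name in enumerate(["length", "frequency", "syllables"], start=1):
    print(f"{name}: VIF = {variance_inflation_factor(X_design, j):.2f}")
```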
Have a wonderful day.
– Frank