@frankzhu

Naive Bayes vs. Logistic Regression: A Deep Dive

October 30, 2022 / 4 min read

Last Updated: June 4, 2024

Premise

Before we dive into the math, let's first define a smallest meaningful example: classifying words as either complex or not complex.

Let's say we have a dataset of words, and we want to classify them as either complex or not complex. We have three features for each word: length, frequency, and syllables. We then use these features to train a model to predict the class of a word via Naive Bayes or Logistic Regression.
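To make the premise concrete, here is a minimal sketch of what such a dataset might look like; the words, feature values, and labels below are made up purely for illustration.

```python
import numpy as np

# Hypothetical toy dataset: each row is (length, corpus frequency, syllable count).
words = ["cat", "ephemeral", "run", "ubiquitous", "dog", "cacophony"]
X = np.array([
    [3,  9500, 1],   # cat
    [9,    40, 4],   # ephemeral
    [3, 12000, 1],   # run
    [10,   35, 5],   # ubiquitous
    [3, 11000, 1],   # dog
    [9,    20, 4],   # cacophony
])
y = np.array([0, 1, 0, 1, 0, 1])  # 0 = not complex, 1 = complex

print(X.shape, y.shape)  # (6, 3) (6,)
```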

Naive Bayes

For Naive Bayes, we are assuming each feature is conditionally independent of the others given the class, in a common-cause Bayes net configuration, i.e., given a sample of $n$ words and their features:

$$P\left(x^{(i)}_1, \ldots, x^{(i)}_d \mid y^{(i)}\right) = \prod_{j=1}^{d} P\left(x^{(i)}_j \mid y^{(i)}\right),$$

where $i = 1, \ldots, n$ and $y^{(i)} \in \{0, 1\}$. Each $x^{(i)}$ is a feature vector and $d$ is the number of features; thus $X$ is an $n \times d$ matrix and $y$ is an $n \times 1$ vector of class labels. To visualize our data matrix:

$$X = \begin{bmatrix} x^{(1)}_1 & \cdots & x^{(1)}_d \\ \vdots & \ddots & \vdots \\ x^{(n)}_1 & \cdots & x^{(n)}_d \end{bmatrix}, \qquad y = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{bmatrix}$$

Note our parameters (the prior $P(y)$ and the likelihoods $P(x_j \mid y)$) are calculated from the training data using maximum likelihood, i.e., relative frequencies. In our case, we have two classes, so $y \in \{0, 1\}$.

Again, to visualize our joint distribution matrix, where each column corresponds to one feature assignment $a_m$ (a feature taking one of its possible values), we have the following matrix:

$$\begin{bmatrix} P(y = 0,\, a_1) & P(y = 0,\, a_2) & \cdots & P(y = 0,\, a_k) \\ P(y = 1,\, a_1) & P(y = 1,\, a_2) & \cdots & P(y = 1,\, a_k) \end{bmatrix}$$

And the size of the joint distribution matrix is $2 \times k$, where $k$ is our number of feature assignments.

Inference for our Naive Bayes is done by first computing the posterior from the calculated joint:

$$P(y \mid x_1, \ldots, x_d) = \frac{P(y) \prod_{j=1}^{d} P(x_j \mid y)}{P(x_1, \ldots, x_d)}$$

Then our prediction for a word is found by maximizing the posterior (note the denominator $P(x_1, \ldots, x_d)$ is a normalizing constant), or equivalently by maximizing the joint distribution:

$$\hat{y} = \arg\max_{y \in \{0, 1\}} \; P(y) \prod_{j=1}^{d} P(x_j \mid y)$$
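To make both steps concrete, here is a minimal sketch of that procedure. It assumes the three features have already been discretized into integer bins, uses made-up data, and adds Laplace smoothing; none of this is prescribed above, it simply illustrates estimating the prior and likelihoods by relative frequencies and predicting by maximizing the joint in log space.

```python
import numpy as np

# Binned toy features (columns: binned length, binned frequency, binned syllables).
X = np.array([
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
])
y = np.array([0, 1, 0, 1, 0, 1])  # 0 = not complex, 1 = complex
classes = np.unique(y)

# Priors and per-feature likelihoods from relative frequencies (Laplace smoothed).
priors = {c: np.mean(y == c) for c in classes}
likelihoods = {}  # (feature index, class) -> {value: P(x_j = value | y = class)}
for j in range(X.shape[1]):
    values = np.unique(X[:, j])
    for c in classes:
        counts = np.array([np.sum(X[y == c, j] == v) for v in values])
        likelihoods[(j, c)] = dict(zip(values, (counts + 1) / (counts.sum() + len(values))))

def predict(x):
    # argmax over the joint P(y) * prod_j P(x_j | y); logs avoid underflow
    scores = {c: np.log(priors[c]) + sum(np.log(likelihoods[(j, c)][x[j]])
                                         for j in range(len(x)))
              for c in classes}
    return max(scores, key=scores.get)

print(predict(np.array([1, 0, 1])))  # expected: 1 (complex)
```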

Logistic Regression

Now for logistic regression, we are stipulating the following parametric representation; let $z = \theta^{T} x$:

$$P(y = 1 \mid x; \theta) = \sigma(z) = \frac{1}{1 + e^{-z}},$$

where $\sigma(\cdot)$ is the sigmoid function that models the conditional class probability, and $\theta$ is a vector of parameters.

Instead of using relative frequencies from the training data, logistic regression's maximum likelihood estimation finds the parameters $\theta$ that maximize the conditional likelihood function $L(\theta)$:

$$L(\theta) = \prod_{i=1}^{n} P\left(y^{(i)} \mid x^{(i)}; \theta\right) = \prod_{i=1}^{n} \sigma\left(\theta^{T} x^{(i)}\right)^{y^{(i)}} \left(1 - \sigma\left(\theta^{T} x^{(i)}\right)\right)^{1 - y^{(i)}}$$

Inference is done by directly calculating $P(y = 1 \mid x; \theta) = \sigma(\theta^{T} x)$; if $\sigma(\theta^{T} x) \geq 0.5$, we set $\hat{y} = 1$, and $\hat{y} = 0$ otherwise. In other words, we are finding the class label that maximizes the conditional likelihood of the data given the parameters.
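For the logistic regression side, here is a comparable sketch: gradient ascent on the log-likelihood above, followed by thresholding $\sigma(\theta^{T} x)$ at 0.5. The toy data, feature standardization, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

# Toy data: columns are length, corpus frequency, syllable count.
X = np.array([[3, 9500, 1], [9, 40, 4], [3, 12000, 1],
              [10, 35, 5], [3, 11000, 1], [9, 20, 4]], dtype=float)
y = np.array([0, 1, 0, 1, 0, 1], dtype=float)

# Standardize features and add an intercept column.
X = (X - X.mean(axis=0)) / X.std(axis=0)
X = np.hstack([np.ones((X.shape[0], 1)), X])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.zeros(X.shape[1])
lr = 0.1
for _ in range(2000):
    p = sigmoid(X @ theta)          # P(y = 1 | x; theta)
    theta += lr * (X.T @ (y - p))   # gradient of the log-likelihood

y_hat = (sigmoid(X @ theta) >= 0.5).astype(int)  # threshold at 0.5
print(theta, y_hat)
```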

Now it's clear that Naive Bayes is a simpler model than logistic regression. The conditional independence assumption stipulates that each feature is conditionally independent of the others given the class, which may not be true. In contrast, logistic regression does not make such an assumption and instead stipulates a sigmoidal relation between the features and the class probability. The parameters of logistic regression are also more flexible, since they are learned by optimization rather than calculated directly from the training dataset.

Logistic regression has its own host of issues as well, though. If features are highly correlated, the variance inflation factor (VIF) for those features will be high. The issue of multicollinearity is more pronounced if our library uses Newton-Raphson instead of vanilla gradient descent, since Newton-Raphson requires us to calculate and invert the Hessian. If the design matrix is (nearly) collinear, the Hessian is singular (or close to it) and thus non-invertible, leading to unstable coefficient estimates.

The issue of multicollinearity is more pronounced when we take statistical significance into account. I don't think scikit-learn filters features based on significance (which itself could be an issue), but when we do, the standard errors we get from the regression won't be correct. A high VIF inflates standard errors and decreases statistical power, and thus leads to unreliable coefficient estimates and conclusions.
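If you want to check for this in practice, one option is statsmodels' variance_inflation_factor; in the made-up data below, word length and syllable count are nearly collinear, so their VIFs should come out large.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Illustrative multicollinearity check: length and syllables track each other closely.
df = pd.DataFrame({
    "length":    [3, 9, 3, 10, 3, 9],
    "frequency": [9500, 40, 12000, 35, 11000, 20],
    "syllables": [1, 4, 1, 5, 1, 4],
})
X = np.column_stack([np.ones(len(df)), df.values])  # add an intercept column
vifs = {col: variance_inflation_factor(X, i + 1) for i, col in enumerate(df.columns)}
print(vifs)
```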

Have a wonderful day.

– Frank
