Linear classifier using Stochastic Gradient Descent (SGD) learning

Gradient descent (GD) is an optimization technique used in many learning algorithms (e.g., neural network training with backpropagation); it does not correspond to a specific family of machine learning models. The gradient descent search repeatedly updates the parameters/feature weights, moving them against the gradient of the error, until it finds parameters that minimize the error. Because the method uses the whole training set to compute each update, converging to a local minimum can be quite slow. If the training set is enormous, GD cannot be run on a small machine at all. Another issue with this batch optimization is that there is no easy way to incorporate new data in an online setting. Also, if the error surface has multiple local minima, there is no guarantee that the procedure will find the global minimum.
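To make the batch update concrete, here is a minimal sketch of full-batch gradient descent for a least-squares problem. The function name batch_gradient_descent, the learning rate, the iteration count, and the synthetic data are all assumptions chosen for illustration; none of them come from scikit-learn.

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.1, n_iters=500):
    """Minimise mean squared error with full-batch gradient descent (illustrative)."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_iters):
        residual = X @ w - y                      # touches every training example
        grad = 2.0 * X.T @ residual / n_samples   # gradient of the mean squared error
        w -= lr * grad                            # step against the gradient
    return w

# Synthetic data (an assumption for this sketch): y is a noisy linear function of X.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=200)

w_hat = batch_gradient_descent(X, y)
print(w_hat)  # should be close to true_w
```

Note that every iteration multiplies by the full matrix X, which is exactly the cost that becomes prohibitive on very large training sets.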

Stochastic Gradient Descent (SGD) is a variation of the GD method that alleviates the practical difficulties mentioned above. It is also called incremental gradient descent. The idea behind SGD is to approximate the GD search by updating the parameters/feature weights incrementally, one training example at a time. In practice, each parameter update is usually computed from a minibatch of a few training examples instead of a single example. This reduces the variance of the parameter update and leads to more stable convergence. The learning rate in SGD is typically smaller than in batch GD because there is much more variance in the update. It can be either constant or gradually decaying. The order of the data can also affect convergence. If the data is presented in some meaningful order (for example, sorted by class), convergence can be poor, so it’s recommended to shuffle the data randomly before using SGD. One common problem with standard SGD is that convergence can become very slow, particularly after the initial steep gains. Adding momentum to the parameter update rule is one technique for handling this issue.
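The ingredients described above (minibatches, shuffling, a decaying learning rate, and momentum) can be sketched in a few lines of NumPy. This is only an illustration for logistic regression, not how scikit-learn implements SGD; the function name minibatch_sgd_logistic, the inverse-decay schedule, and every default value are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def minibatch_sgd_logistic(X, y, lr0=0.1, decay=0.01, momentum=0.9,
                           batch_size=32, n_epochs=20, seed=0):
    """Minibatch SGD with momentum for logistic regression (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    velocity = np.zeros(n_features)
    step = 0
    for _ in range(n_epochs):
        order = rng.permutation(n_samples)           # reshuffle the data every epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]    # one minibatch of examples
            p = sigmoid(X[idx] @ w)
            grad = X[idx].T @ (p - y[idx]) / len(idx)  # log-loss gradient on the batch
            lr = lr0 / (1.0 + decay * step)          # gradually decaying learning rate
            velocity = momentum * velocity - lr * grad
            w += velocity                            # momentum update
            step += 1
    return w

# Synthetic, nearly linearly separable data (an assumption for this sketch).
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
true_w = np.array([2.0, -1.0, 0.5, 0.0, 1.0])
y = (X @ true_w + 0.3 * rng.normal(size=1000) > 0).astype(float)

w_hat = minibatch_sgd_logistic(X, y)
accuracy = np.mean((sigmoid(X @ w_hat) > 0.5) == (y == 1.0))
print(accuracy)
```

Each update here looks at only 32 examples, so memory use and per-step cost stay small no matter how large the training set grows.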

As mentioned earlier, SGD itself is not a classification method. The SGDClassifier of scikit-learn implements a simple stochastic gradient descent learning routine that supports different loss functions and penalties for classification.
For example,

  • SGDClassifier(loss='log_loss') results in logistic regression, i.e. a model equivalent to LogisticRegression but fitted with SGD. (Before scikit-learn 1.1 this loss was spelled loss='log'.)
  • SGDRegressor(loss='squared_error', penalty='l2') solves the same optimization problem as Ridge, by different means.
  • SGDClassifier(loss='hinge', penalty='l2') is equivalent to a linear SVM.
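As a rough check of the last equivalence, the sketch below fits SGDClassifier with the hinge loss and LinearSVC on the same synthetic data and compares cross-validated accuracy. The dataset parameters are assumptions chosen so that the classes are well separated. LinearSVC uses the squared hinge loss by default, so loss="hinge" is set explicitly; the two models still use different solvers and regularization scaling, so the scores should be close rather than identical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic data whose labels actually depend on the features (illustrative assumption).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           n_clusters_per_class=1, class_sep=2.0, random_state=0)

# Linear SVM objective optimised with SGD.
sgd_svm = make_pipeline(StandardScaler(),
                        SGDClassifier(loss="hinge", penalty="l2", random_state=0))

# Linear SVM optimised with liblinear; the hinge loss requires the dual formulation.
lin_svm = make_pipeline(StandardScaler(),
                        LinearSVC(loss="hinge", dual=True, max_iter=10000, random_state=0))

sgd_acc = cross_val_score(sgd_svm, X, y, cv=5).mean()
svm_acc = cross_val_score(lin_svm, X, y, cv=5).mean()
print(sgd_acc, svm_acc)
```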

The advantages of SGD are efficiency and ease of implementation. Its disadvantages are that it requires tuning several hyperparameters, such as the regularization parameter and the number of iterations, and that it is sensitive to feature scaling. That’s why you should always scale the input before running SGD. The scikit-learn package provides the sklearn.preprocessing module for this (e.g., StandardScaler).
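One common way to deal with the hyperparameters is a cross-validated grid search over a pipeline that also handles the scaling. The sketch below is a minimal example under assumed settings: the synthetic dataset and the grid values are illustrative, not recommendations. The step name "sgdclassifier" is the name make_pipeline derives automatically from the class name.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for this sketch).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_clusters_per_class=1, class_sep=2.0, random_state=0)

# Scaling lives inside the pipeline, so it is re-fitted on each training fold.
pipe = make_pipeline(StandardScaler(),
                     SGDClassifier(max_iter=1000, tol=1e-3, random_state=0))

# Illustrative grid over the regularization strength and the penalty type.
param_grid = {
    "sgdclassifier__alpha": [1e-5, 1e-4, 1e-3, 1e-2],
    "sgdclassifier__penalty": ["l2", "l1"],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```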

The following example shows how to apply SGDClassifier to a classification problem. I generated 5000 random samples with 20 features each, together with random binary labels. With the logistic loss selected through the loss parameter, the classifier learns a logistic regression model. The code uses only one classification performance metric, accuracy.

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

# generate random training data and labels
# (the labels are drawn independently of the features)
X = np.random.random((5000, 20))
Y = np.random.randint(0, 2, size=5000)

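# Always scale the input. The most convenient way is to use a pipeline.
# loss="log_loss" selects the logistic loss (it was spelled loss="log" before scikit-learn 1.1).
clf = make_pipeline(StandardScaler(), SGDClassifier(max_iter=5000, tol=1e-3, loss="log_loss"))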

# run 5-fold CV
y_pred = cross_val_predict(clf, X, Y, cv=5)
print(accuracy_score(Y, y_pred))

To check the accuracy value, I ran LogisticRegression on the same data and found that the accuracy of the SGDClassifier with the logistic loss is very close to the accuracy of the logistic regression model.

clf = LogisticRegression(random_state=0)
y_pred = cross_val_predict(clf, X, Y, cv=5)
print(accuracy_score(Y, y_pred))

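The accuracy returned by the above code is 0.512 (SGDClassifier) and 0.5046 (logistic regression). Both values are at chance level, which is expected because the labels were generated at random and carry no information about the features. Since no random seed is fixed, the exact numbers will vary slightly from run to run.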


