

Stochastic Gradient Descent

Let’s say there are three students working on the same 100 math questions.

Student A works through all 100 questions at once, then checks the answers.

Student B works on 10 questions at a time, checks the answers, reviews the mistakes, and then moves on to the next 10.

Student C works on 1 question at a time, checks the answer, and then moves to the next question, step by step.

Student C stands for Stochastic Gradient Descent: the model updates its weights after every single example. (By the same analogy, Student A corresponds to full-batch gradient descent and Student B to mini-batch gradient descent.) As you can imagine, each step takes less time, but any single update is noisy and can’t guarantee accuracy.
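The Student C analogy can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the toy data (`y = 3x`), the learning rate, and the epoch count are all assumptions chosen so the example converges.

```python
import random

# Toy data: y = 3x, so the true weight is 3.0 (a hypothetical example).
data = [(x, 3.0 * x) for x in range(1, 6)]

def sgd(samples, lr=0.005, epochs=200):
    """Stochastic Gradient Descent: update the weight after every
    single example, like Student C checking one answer at a time."""
    w = 0.0
    for _ in range(epochs):
        random.shuffle(samples)          # "stochastic": random order
        for x, y in samples:
            grad = 2 * (w * x - y) * x   # gradient of (w*x - y)**2 w.r.t. w
            w -= lr * grad               # one update per "question"
    return w

w = sgd(list(data))  # converges close to the true weight, 3.0
```

Because each update looks at only one example, individual steps are cheap but noisy; convergence comes from averaging out that noise over many steps.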


Layer-wise Adaptive Rate Scaling (LARS)

To address the difficulty of training models with very large batch sizes, LARS was proposed. Instead of a single global learning rate, LARS gives each layer its own local learning rate, scaled by the ratio of that layer’s weight norm to its gradient norm.
