Optimizer
Stochastic Gradient Descent
Let’s say there are three students who each work on the same 100 math questions.
Student A works through all 100 questions at once, then checks the answers.
Student B works on 10 questions at a time, checks those answers, reworks any mistakes, and then moves on to the next 10 questions.
Student C works on one question at a time, checking each answer before moving on to the next, step by step.
Student C stands for Stochastic Gradient Descent (Student A corresponds to full-batch gradient descent, and Student B to mini-batch gradient descent). As you can imagine, each of Student C’s steps takes less time, but because every update is based on a single example, the updates are noisy and no single step is guaranteed to move in the right direction.
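To make the difference concrete, here is a minimal NumPy sketch of Student C’s strategy (one example per update) on a toy one-parameter linear-regression problem; the data, learning rate, and number of epochs are all made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100)                       # toy inputs (assumption)
y = 3.0 * X + rng.normal(scale=0.1, size=100)  # toy targets: y ~ 3x + noise

def grad(w, xb, yb):
    # Gradient of the mean squared error 0.5 * mean((w*x - y)^2) w.r.t. w.
    return np.mean((w * xb - yb) * xb)

w = 0.0   # single learnable parameter
lr = 0.1  # learning rate (assumption)
for epoch in range(20):
    # Student C: update after every single question (example).
    for i in rng.permutation(len(X)):
        w -= lr * grad(w, X[i:i+1], y[i:i+1])
    # Student B would instead use slices of 10 examples per update,
    # and Student A would use the full X, y once per epoch.

print(w)  # ends up close to the true slope 3.0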
Layer-wise Adaptive Rate Scaling (LARS)
LARS was proposed to address the difficulty of training models with very large batch sizes. The key observation is that the ratio between the norm of a layer’s weights and the norm of its gradients varies widely from layer to layer, so a single global learning rate is too large for some layers and too small for others; LARS therefore gives each layer its own local learning rate proportional to that ratio.
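Below is a minimal sketch of the layer-wise “trust ratio” update from the LARS paper applied on top of a plain SGD step; the function name lars_step and the hyperparameter values (trust coefficient eta, weight decay beta, epsilon) are illustrative assumptions, and the momentum and learning-rate warm-up used in practice are omitted.

import numpy as np

def lars_step(weights, grads, base_lr=0.1, eta=0.001, beta=1e-4, eps=1e-9):
    # One LARS update over a list of per-layer weight/gradient arrays.
    # Each layer gets a local learning rate (the "trust ratio"):
    #   lambda = eta * ||w|| / (||g|| + beta * ||w||)
    # so the step size adapts to how large the gradients are
    # relative to the weights in that particular layer.
    new_weights = []
    for w, g in zip(weights, grads):
        w_norm = np.linalg.norm(w)
        g_norm = np.linalg.norm(g)
        local_lr = eta * w_norm / (g_norm + beta * w_norm + eps)
        # Apply weight decay inside the update, then step.
        new_weights.append(w - base_lr * local_lr * (g + beta * w))
    return new_weights

# Usage sketch: two "layers" with dummy weights and gradients.
weights = [np.ones((3, 3)), np.ones(3)]
grads = [0.1 * np.ones((3, 3)), 0.1 * np.ones(3)]
weights = lars_step(weights, grads)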