Questions tagged [gradient-descent]

Gradient Descent is an algorithm for finding a minimum of a function. It iteratively calculates the gradient (the vector of partial derivatives) of the function and descends in steps proportional to the negative of that gradient. One major application of Gradient Descent is fitting a parameterized model to a set of data: the function to be minimized is an error function for the model.

Gradient descent is a first-order iterative optimization algorithm used to find the values of the parameters (coefficients) of a function (f) that minimize a cost function.

To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point.

Gradient descent is best used when the parameters cannot be calculated analytically (e.g. using linear algebra) and must be searched for by an optimization algorithm.

Gradient descent is also known as steepest descent, or the method of steepest descent.

https://en.wikipedia.org/wiki/Gradient_descent
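
As a concrete illustration of the update rule described above, here is a minimal sketch of gradient descent on a least-squares cost; the toy data, learning rate, and step count are illustrative choices.

```python
import numpy as np

def gradient(w, X, y):
    # Vector of partial derivatives of the mean squared error with respect to w
    return 2 * X.T @ (X @ w - y) / len(y)

def gradient_descent(X, y, lr=0.1, n_steps=100):
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        w -= lr * gradient(w, X, y)   # step proportional to the negative gradient
    return w

# Toy data generated from y = 3*x0 - 2*x1 plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -2.0]) + 0.01 * rng.normal(size=200)
print(gradient_descent(X, y))   # should end up close to [3, -2]
```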

446 questions
79 votes · 7 answers

What is the difference between Gradient Descent and Stochastic Gradient Descent?

What is the difference between Gradient Descent and Stochastic Gradient Descent? I am not very familiar with these; can you describe the difference with a short example?
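
A minimal sketch of the contrast the question asks about, using a toy linear-model gradient: batch gradient descent averages the per-example gradients and takes one step per pass over the data, while stochastic gradient descent takes a (noisier) step after every example.

```python
import numpy as np

def grad_i(w, x, t):
    # Illustrative per-example gradient: squared error of a linear model w · x
    return 2 * (w @ x - t) * x

def batch_gd(w, X, y, lr=0.1, epochs=50):
    # One update per pass over the data, using the averaged gradient
    for _ in range(epochs):
        g = np.mean([grad_i(w, x, t) for x, t in zip(X, y)], axis=0)
        w = w - lr * g
    return w

def sgd(w, X, y, lr=0.05, epochs=50, seed=0):
    # One (noisier) update per example, visiting examples in a shuffled order;
    # a smaller step size compensates for the extra noise
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            w = w - lr * grad_i(w, X[i], y[i])
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -0.5])
print(batch_gd(np.zeros(2), X, y))   # both should approach [1.5, -0.5]
print(sgd(np.zeros(2), X, y))
```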
47 votes · 5 answers

Does gradient descent always converge to an optimum?

I am wondering whether there is any scenario in which gradient descent does not converge to a minimum. I am aware that gradient descent is not always guaranteed to converge to a global optimum. I am also aware that it might diverge from an optimum…
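
One failure mode the question allows for can be seen on f(x) = x²: with a step size that is too large, the iterates oscillate with growing magnitude instead of converging (a toy sketch; the step sizes are illustrative).

```python
def descend(x, lr, steps=20):
    # Gradient descent on f(x) = x**2, whose gradient is 2*x
    for _ in range(steps):
        x = x - lr * 2 * x
    return x

print(descend(1.0, lr=0.1))   # shrinks toward the minimum at 0
print(descend(1.0, lr=1.1))   # |1 - 2*1.1| = 1.2 > 1, so the iterate blows up
```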
28 votes · 4 answers

Scikit-learn: Getting SGDClassifier to predict as well as a Logistic Regression

A way to train a Logistic Regression is by using stochastic gradient descent, which scikit-learn offers an interface to. What I would like to do is take scikit-learn's SGDClassifier and have it score the same as a Logistic Regression here…
hlin117 • 685
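
A minimal sketch of the kind of side-by-side comparison the question describes, with illustrative hyperparameters (matching scores in general also requires tuning the learning-rate schedule and regularization):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

log_reg = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Same model family, fitted by stochastic gradient descent.
# On older scikit-learn versions the loss is spelled 'log' instead of 'log_loss'.
sgd = SGDClassifier(loss="log_loss", max_iter=1000, tol=1e-3, random_state=0).fit(X_tr, y_tr)

print("LogisticRegression:", log_reg.score(X_te, y_te))
print("SGDClassifier     :", sgd.score(X_te, y_te))
```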
24 votes · 1 answer

How does Gradient Descent and Backpropagation work together?

Please forgive me as I am new to this. I have attached a diagram trying to model my understanding of neural networks and back-propagation. From videos on Coursera and resources online I formed the following understanding of how neural network…
23 votes · 2 answers

Why is ReLU better than the other activation functions?

The answer here refers to the vanishing and exploding gradients seen with sigmoid-like activation functions, but I guess ReLU has a disadvantage of its own: its expected value. There is no limit on the output of ReLU, so its expected…
15 votes · 4 answers

How to prevent vanishing gradient or exploding gradient?

What causes vanishing or exploding gradients, and what measures can be taken to prevent them?
yashdk • 189
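
One widely used measure against exploding gradients is gradient clipping; below is a minimal, framework-agnostic sketch of clipping by global norm (the threshold is an illustrative value).

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    # Rescale the whole list of gradient arrays if their combined L2 norm
    # exceeds max_norm; this bounds the size of a single update step.
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

grads = [np.array([30.0, -40.0]), np.array([120.0])]
print(clip_by_global_norm(grads))  # rescaled so the global norm is 5
```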
15 votes · 4 answers

Is Gradient Descent central to every optimizer?

I want to know whether gradient descent is the main algorithm underlying optimizers like Adam, Adagrad, RMSProp, and several others.
15 votes · 2 answers

Why averaging the gradient works in Gradient Descent?

In full-batch gradient descent or minibatch GD we get gradients from several training examples. We then average them to obtain a "high-quality" gradient from several estimates and finally use it to correct the network at once. But why…
Kari • 2,756
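
A small numerical sketch of the intuition behind the question: averaging per-example gradients over a larger minibatch yields an estimate that is, on average, closer to the full-batch gradient (toy linear-regression setup; the batch sizes are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 2))
y = X @ np.array([2.0, -1.0])
w = np.zeros(2)

def grad(w, Xb, yb):
    # Mean gradient of the squared error over the examples in the (mini)batch
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

full = grad(w, X, y)                      # the full-batch gradient
for batch_size in (1, 32, 1024):
    errs = []
    for _ in range(200):
        idx = rng.choice(len(y), size=batch_size, replace=False)
        errs.append(np.linalg.norm(grad(w, X[idx], y[idx]) - full))
    print(batch_size, np.mean(errs))      # larger batches drift less from the full gradient
```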
12 votes · 1 answer

What feature engineering is necessary with tree based algorithms?

I understand data hygiene, which is probably the most basic feature engineering. That is, making sure all your data is properly loaded, making sure N/As are treated as a special value rather than a number between -1 and 1, and tagging your…
12 votes · 4 answers

In training a neural network, why don’t we take the derivative with respect to the step size in gradient descent?

This is one of those questions where I know I am wrong, but I don't know how. I understand that when training a neural network, we calculate the derivatives of the loss function with respect to the parameters. I also understand that these…
Leo Juhlin • 123
11 votes · 2 answers

Why is learning rate causing my neural network's weights to skyrocket?

I am using TensorFlow to write simple neural networks for a bit of research, and I have had many problems with 'nan' weights while training. I tried many different solutions like changing the optimizer, changing the loss, the data size, etc., but with…
10 votes · 4 answers

What is momentum in neural network?

While using "Two-Class Neural Network" in Azure ML, I encountered the "Momentum" property. The documentation, which is not clear, says: "For The momentum, type a value to apply during learning as a weight on nodes from previous…"
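
Independently of the Azure ML documentation quoted above, the classical momentum update keeps a running velocity of past gradients and moves the weights along it; a minimal sketch with illustrative coefficients.

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.01, momentum=0.9):
    # Classical momentum: the velocity accumulates a decaying sum of past
    # gradients, so consistent directions speed up and oscillations damp out.
    v = momentum * v - lr * grad
    return w + v, v

w = np.array([5.0])
v = np.zeros(1)
for _ in range(200):
    g = 2 * w                 # gradient of f(w) = w**2
    w, v = momentum_step(w, v, g)
print(w)                      # approaches the minimum at 0
```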
10 votes · 1 answer

How flexible is the link between objective function and output layer activation function?

It seems standard in many neural network packages to pair up the objective function to be minimised with the activation function in the output layer. For instance, for a linear output layer used for regression it is standard (and often the only choice)…
Neil Slater • 29,388
10 votes · 1 answer

What is the difference between the SGD classifier and Logistic regression?

To my understanding, the SGD classifier and Logistic regression seem similar. An SGD classifier with loss = 'log' implements Logistic regression, and loss = 'hinge' implements a linear SVM. I also understand that logistic regression uses gradient…
10 votes · 3 answers

Why do we use gradients instead of residuals in Gradient Boosting?

I have found mentions of two advantages of using gradients instead of actual residuals: 1) Using gradients will allow us to plug in any loss function (not just MSE) without having to change our base learners to make them compatible with the loss…
eyio • 101
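
A short check of the connection the question is about, assuming a squared-error loss L = 0.5·(y − F)²: the negative gradient of L with respect to the current prediction F is exactly the residual y − F, which is why fitting base learners to negative gradients generalizes the "fit the residuals" view to other losses.

```python
import numpy as np

y = np.array([3.0, -1.0, 2.5])
F = np.array([2.0,  0.5, 2.0])   # current ensemble predictions

# Squared-error loss L = 0.5 * (y - F)**2, elementwise
residual = y - F
neg_gradient = -(F - y)          # -dL/dF = -(F - y) = y - F

print(residual)                  # [ 1.  -1.5  0.5]
print(neg_gradient)              # identical: for this loss the negative gradient is the residual
```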