Questions tagged [gradient]
37 questions
15
votes
4 answers
How to prevent vanishing gradient or exploding gradient?
What causes vanishing or exploding gradients, and what measures can be taken to prevent them?
yashdk
- 189
- 1
- 1
- 4
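A hedged PyTorch sketch of two standard mitigations (network, data, and hyperparameters are assumed for illustration): ReLU with He initialization against vanishing gradients, and gradient-norm clipping against exploding gradients.

```python
import torch
import torch.nn as nn

# Toy network: ReLU avoids sigmoid/tanh saturation, a common vanishing-gradient cause.
model = nn.Sequential(
    nn.Linear(64, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 10),
)

# He (Kaiming) initialization keeps activation variance stable across ReLU layers.
for m in model.modules():
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
        nn.init.zeros_(m.bias)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 64), torch.randint(0, 10, (32,))

loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
# Clip the global gradient norm before the update to tame exploding gradients.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```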
4
votes
3 answers
Forward pass vs backward pass vs backpropagation
As mentioned in the question, I have some trouble understanding the differences between these terms.
From what I have understood:
Forward pass: compute the output of the network given the input data
Backward pass: compute the output error…
Mattia Surricchio
- 421
- 3
- 5
- 15
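A minimal PyTorch sketch (toy model and data assumed) that separates the three steps the question asks about: the forward pass producing the output, the backward pass backpropagating the loss into gradients, and the parameter update that uses them.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
x, target = torch.randn(8, 4), torch.randn(8, 1)

output = model(x)                       # forward pass: input -> prediction
loss = nn.functional.mse_loss(output, target)

model.zero_grad()
loss.backward()                         # backward pass: backpropagation fills .grad
with torch.no_grad():
    for p in model.parameters():
        p -= 0.01 * p.grad              # update step using the backpropagated gradients
```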
4
votes
1 answer
Differentiable approximation for counting negative values in array
I have an array of arrival times and I want to convert it to count data using PyTorch in a differentiable way.
Example arrival times:
arrival_times = [2.1, 2.9, 5.1]
and let's say the total range is 6 seconds. What I want to have is:
counts = [0,…
iRestMyCaseYourHonor
- 159
- 5
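A hedged sketch of one common workaround: replace the hard comparison "arrival ≤ t" with a sigmoid, so the counts stay differentiable with respect to the arrival times (the temperature value below is an assumed hyperparameter, not from the question).

```python
import torch

arrival_times = torch.tensor([2.1, 2.9, 5.1], requires_grad=True)
t_grid = torch.arange(1, 7, dtype=torch.float32)   # the 6-second range: t = 1..6
temperature = 0.05

# soft_counts[k] ~ number of arrivals with time <= t_grid[k]
soft_counts = torch.sigmoid(
    (t_grid[:, None] - arrival_times[None, :]) / temperature
).sum(dim=1)

soft_counts.sum().backward()    # gradients flow back into arrival_times
print(soft_counts)              # approaches the hard counts [0, 0, 2, 2, 2, 3] as temperature -> 0
print(arrival_times.grad)
```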
3
votes
1 answer
How does a batch normalization layer resolve the vanishing gradient problem?
According to this article:
https://towardsdatascience.com/the-vanishing-gradient-problem-69bf08b15484
The vanishing gradient problem occurs when using the sigmoid activation function because sigmoid maps a large input space into a small output range, so the…
user3668129
- 769
- 4
- 15
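A minimal PyTorch sketch (architecture assumed) of the usual intuition behind that claim: batch normalization keeps each layer's pre-activations roughly zero-mean and unit-variance, so a following sigmoid operates near its steep central region rather than its flat, gradient-killing tails.

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(32, 64),
    nn.BatchNorm1d(64),    # normalizes pre-activations before the sigmoid
    nn.Sigmoid(),
    nn.Linear(64, 1),
)

x = torch.randn(16, 32) * 10             # deliberately large-scale inputs
h = net[1](net[0](x))                     # pre-activations after BatchNorm
print(h.mean().item(), h.std().item())    # close to 0 and 1, so sigmoid'(h) is not tiny
```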
3
votes
3 answers
Why is the sign of the gradient (plus or minus) not enough for finding the steepest ascent?
Consider a simple 1-D function $y = x^2$, on which we want to find a maximum with the gradient ascent method.
If we start at the point $x = 3$ on the x-axis:
$$ \frac{\partial f}{\partial x} \biggr\rvert_{x=3} = 2x \biggr\rvert_{x=3} = 6 $$
This means that a direction in which…
Kenenbek Arzymatov
- 189
- 6
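A hedged two-dimensional illustration (not from the original question) of why the component magnitudes matter and not only the signs: take $f(x_1, x_2) = x_1^2 + 10x_2^2$ at the point $(3, 1)$, so
$$ \nabla f(3,1) = (2x_1,\ 20x_2)\big|_{(3,1)} = (6,\ 20). $$
Moving along the sign-only direction $(1,1)/\sqrt{2}$ gives a directional derivative of $(6+20)/\sqrt{2} \approx 18.4$, while moving along the normalized gradient gives $\sqrt{6^2+20^2} \approx 20.9$; only the full gradient, with its component magnitudes, picks out the steepest ascent direction.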
3
votes
1 answer
Gradient Checking: Mean Squared Error. Why does a huge epsilon improve the discrepancy?
I am using custom C++ code, and have coded a simple "Mean Squared Error" layer.
I'm temporarily using it for a classification task, not simple regression... maybe this causes the issues?
I don't have anything else before this layer - not even a simple…
Kari
- 2,756
- 2
- 21
- 51
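The question is about custom C++ code; the following is only a hedged NumPy sketch of the standard central-difference check for a mean-squared-error layer, showing how the discrepancy is usually measured for a given epsilon.

```python
import numpy as np

def mse(w, x, t):
    return np.mean((x @ w - t) ** 2)

rng = np.random.default_rng(0)
x, t = rng.normal(size=(5, 3)), rng.normal(size=5)
w = rng.normal(size=3)

analytic = 2.0 / len(t) * x.T @ (x @ w - t)      # closed-form dMSE/dw

eps = 1e-4                                        # typical choice; 1e-2 is much coarser
numeric = np.zeros_like(w)
for i in range(len(w)):
    e = np.zeros_like(w)
    e[i] = eps
    numeric[i] = (mse(w + e, x, t) - mse(w - e, x, t)) / (2 * eps)

# Relative discrepancy between analytic and numeric gradients.
rel = np.linalg.norm(analytic - numeric) / (np.linalg.norm(analytic) + np.linalg.norm(numeric))
print(rel)    # ~1e-8 or smaller for a correct gradient at this epsilon
```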
2
votes
1 answer
Why does my manual derivative of Layer Normalization imply no gradient flow?
I recently tried computing the derivative of the layer norm function (https://arxiv.org/abs/1607.06450), an essential component of transformers, but the result suggests that no gradient flows through the operation, which can't be true.
Here's my…
Alex
- 23
- 5
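A hedged PyTorch check (no affine parameters assumed) comparing the usual closed-form layer-norm backward formula against autograd; the gradient is not zero, it merely sums to zero across the normalized dimension, which is easy to misread as "no gradient flow".

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 8, requires_grad=True)
eps = 1e-5

mu = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
xhat = (x - mu) / torch.sqrt(var + eps)

g = torch.randn_like(xhat)      # some upstream gradient dL/dxhat
xhat.backward(g)

# Manual backward: dL/dx = (g - mean(g) - xhat * mean(g * xhat)) / sqrt(var + eps)
manual = (g - g.mean(dim=-1, keepdim=True)
            - xhat * (g * xhat).mean(dim=-1, keepdim=True)) / torch.sqrt(var + eps)

print(torch.allclose(x.grad, manual, atol=1e-5))   # True
print(x.grad.sum(dim=-1))                          # ~0 per row, but not zero elementwise
```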
2
votes
1 answer
Gradient passthrough in PyTorch
I need to quantize the inputs, but the method I need to do so (bucketize) is not differentiable. I can of course detach the tensor, but then I lose the flow of gradients to earlier weights. I guess the question is quite simple: how do you continue…
user3023715
- 203
- 2
- 5
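A hedged sketch of the common straight-through workaround: use the quantized values in the forward pass, but let the backward pass treat the operation as the identity so gradients keep flowing to earlier weights.

```python
import torch

boundaries = torch.tensor([0.25, 0.5, 0.75])
x = torch.rand(10, requires_grad=True)

hard = torch.bucketize(x, boundaries).float()   # non-differentiable bucket indices
# Forward value equals `hard`; backward sees only the identity on `x`.
quantized = x + (hard - x).detach()

quantized.sum().backward()
print(x.grad)    # all ones: the gradient passed straight through to x
```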
2
votes
1 answer
How to choose appropriate epsilon value while approximating gradients to check training?
While approximating gradients, using the actual epsilon to shift the weights results in wildly big gradient approximations, as the "width" of the approximation triangle is disproportionately small. In Andrew Ng's course he uses 0.01, but I…
Dávid Tóth
- 145
- 5
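A hedged one-dimensional NumPy sketch (toy function assumed) of why epsilon has a sweet spot: too large and the finite-difference truncation error dominates, too small and floating-point round-off takes over.

```python
import numpy as np

f, df = np.sin, np.cos          # toy function with a known derivative
x0 = 1.0
for eps in [1e-1, 1e-2, 1e-4, 1e-7, 1e-12]:
    approx = (f(x0 + eps) - f(x0 - eps)) / (2 * eps)
    print(f"eps={eps:g}  abs error={abs(approx - df(x0)):.2e}")
# The error shrinks down to roughly eps ~ 1e-5, then grows again as
# cancellation in float64 dominates the central difference.
```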
2
votes
1 answer
Tensorflow.Keras: How to get the gradient for an output class w.r.t. a given input?
I have implemented and trained a sequential model using tf.keras. Say I am given an input array of size 8x8 and an output [0,1,0,...(rest all 0)].
How do I calculate the gradient w.r.t. the input for the given output?
model = ...
output =…
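A hedged TensorFlow sketch (model and shapes assumed to match the 8x8 input and 10-class output described above) of getting the gradient of one output class score with respect to the input tensor.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(8, 8)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

x = tf.random.normal((1, 8, 8))
target_class = 1                        # the "1" position in [0, 1, 0, ...]

with tf.GradientTape() as tape:
    tape.watch(x)                       # x is a plain tensor, not a tf.Variable
    probs = model(x)
    class_score = probs[:, target_class]

grad = tape.gradient(class_score, x)    # d(class score) / d(input), shape (1, 8, 8)
print(grad.shape)
```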
2
votes
1 answer
Vanishing Gradient vs Exploding Gradient as Activation function?
ReLU is used as an activation function that serves two purposes:
Breaking linearity in a DNN (i.e. introducing non-linearity).
Helping to handle the vanishing gradient problem.
For the exploding gradient problem, we use the gradient clipping approach, where we set a max threshold limit of…
vipin bansal
- 1,282
- 11
- 19
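A hedged PyTorch sketch of the clipping step the excerpt refers to, in its two common variants; note that neither is an activation function, both act on gradients after `backward()`.

```python
import torch
import torch.nn as nn

layer = nn.Linear(10, 10)
loss = (layer(torch.randn(4, 10)) ** 2).sum() * 1e3   # exaggerated loss scale
loss.backward()

# Option 1: clip every gradient element to [-0.5, 0.5].
torch.nn.utils.clip_grad_value_(layer.parameters(), clip_value=0.5)
# Option 2: rescale so the global gradient norm is at most 1.0.
torch.nn.utils.clip_grad_norm_(layer.parameters(), max_norm=1.0)

print(torch.cat([p.grad.flatten() for p in layer.parameters()]).norm())  # <= 1.0
```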
2
votes
1 answer
What does it mean for a method to be invariant to diagonal rescaling of the gradients?
In the paper which describes Adam: a method for stochastic optimization, the author states:
The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the…
dhulmul
- 121
- 2
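A hedged sketch of the usual reading of that claim (with Adam's $\epsilon$ ignored and all operations elementwise): if every gradient is rescaled by a fixed diagonal matrix $D = \operatorname{diag}(d_1,\dots,d_n)$ with $d_i > 0$, then the bias-corrected first moment becomes $\hat m_t \to D \hat m_t$ and the second moment $\hat v_t \to D^2 \hat v_t$, so the update
$$ \Delta\theta_t = -\alpha\,\frac{\hat m_t}{\sqrt{\hat v_t}} \;\longrightarrow\; -\alpha\,\frac{D\hat m_t}{\sqrt{D^2\hat v_t}} = -\alpha\,\frac{\hat m_t}{\sqrt{\hat v_t}}, $$
i.e. the step sizes do not change, which is what "invariant to diagonal rescaling of the gradients" refers to.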
1
vote
0 answers
Which Neural Network or Gradient Boosting framework is the simplest for Custom Loss Functions?
I need to implement a custom loss function.
The function is relatively simple:
$$-\sum \limits_{i=1}^m [O_{1,i} \cdot y_i-1] \ \cdot \ \operatorname{ReLu}(O_{1,i} \cdot \hat{y_i} - 1)$$
With $O$ being some external attribute specific to each case.
I…
Borut Flis
- 199
- 3
- 7
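A hedged PyTorch sketch of that loss (variable names are assumptions read off the formula: `O` the external per-case attribute, `y` the target, `y_hat` the model output); any framework with eager autograd makes such a custom loss straightforward.

```python
import torch
import torch.nn.functional as F

def custom_loss(y_hat, y, O):
    # -sum_i (O_i * y_i - 1) * relu(O_i * y_hat_i - 1)
    return -torch.sum((O * y - 1) * F.relu(O * y_hat - 1))

# Tiny usage example with made-up values.
O = torch.tensor([1.5, 0.8, 2.0])
y = torch.tensor([1.0, 0.0, 1.0])
y_hat = torch.tensor([0.9, 0.2, 0.7], requires_grad=True)

loss = custom_loss(y_hat, y, O)
loss.backward()                  # gradients are nonzero wherever O * y_hat > 1
print(loss.item(), y_hat.grad)
```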
1
vote
1 answer
Vanishing gradient and zero gradient
There is a well-known problem of vanishing gradients in backpropagation training of feedforward neural networks (FNNs) (here we don't consider the vanishing gradient of recurrent neural networks).
I don't understand why a vanishing gradient does not mean the…
user6703592
- 127
- 5
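A hedged PyTorch sketch of the distinction the question is after: in a deep sigmoid network the early-layer gradients can be tiny (vanishing) while the loss is still far from any minimum, so a near-zero gradient in those layers does not mean training has reached an optimum.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layers = []
for _ in range(20):
    layers += [nn.Linear(32, 32), nn.Sigmoid()]
net = nn.Sequential(*layers, nn.Linear(32, 1))

x, t = torch.randn(64, 32), torch.randn(64, 1)
loss = nn.functional.mse_loss(net(x), t)
loss.backward()

first = net[0].weight.grad.abs().mean().item()
last = net[-1].weight.grad.abs().mean().item()
print(f"loss={loss.item():.3f}  first-layer grad={first:.2e}  last-layer grad={last:.2e}")
# The first-layer gradient is typically orders of magnitude smaller than the
# last-layer one, even though the loss itself is nowhere near zero.
```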
1
vote
1 answer
Can mini-batch gradient descent outperform batch gradient descent?
As I was going through the second course of Andrew Ng's deep learning specialization, I came across a sentence that said:
With a well-tuned mini-batch size, usually it outperforms either gradient descent or stochastic gradient descent…
mitra mirshafiee
- 153
- 3