Why does this help? Part of the reason lies in the properties of the activation function.
Most activation functions have their most interesting behavior around 0. For instance, the ReLU activation function switches from $f(x)=0$ to $f(x)=x$ at $x=0$. The sigmoid activation function has most of its interesting behavior near $x=0$, and plateaus for large positive and large negative values of $x$. Therefore, it's useful to have the inputs to the activation function centered at 0: transforming the inputs to have mean 0 ensures that most values land near the interesting part of the activation function.
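To make that plateau concrete, here's a small illustrative sketch (using NumPy; not from the original answer) that evaluates the sigmoid and its gradient at a few points. Near 0 the gradient is substantial; far from 0 it's essentially zero, so learning signals barely propagate:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# Near x = 0 the sigmoid is steep (gradient 0.25); far from 0 it is essentially flat.
for x in [0.0, 2.0, 10.0, 100.0]:
    print(f"x = {x:6.1f}   sigmoid(x) = {sigmoid(x):.6f}   gradient = {sigmoid_grad(x):.2e}")
```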
Suppose you didn't transform the inputs, and the inputs had values in the range [100, 101]. Then with reasonable initial values of the weights, the input to the activation function would be some huge value (something in the vicinity of 100 times the number of inputs to that neuron). As a result, the activation function would be operating in a part of the input space where it does nothing interesting (it's either flat or it's linear), and you wouldn't be taking advantage of the power of the neural network -- it'd be like trying to classify using a neural network where you've omitted the activation function entirely. That's not going to work well: the activation function is an essential part of the neural network.
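Here's a hypothetical sketch of that effect (the weight initialization and sizes are assumptions for illustration, not something specified above): with unscaled inputs the pre-activation value is typically far from 0, while simply centering the same inputs keeps it small.

```python
import numpy as np

rng = np.random.default_rng(0)

n_inputs = 50
x_unscaled = rng.uniform(100.0, 101.0, size=n_inputs)   # raw inputs in [100, 101]
x_centered = x_unscaled - x_unscaled.mean()             # same inputs, shifted to mean 0

# An assumed "reasonable" random initialization: small weights around 0.
w = rng.normal(0.0, 0.1, size=n_inputs)

print("pre-activation, unscaled inputs:", w @ x_unscaled)   # typically far from 0 (tens or more)
print("pre-activation, centered inputs:", w @ x_centered)   # typically close to 0
```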
Now with sufficient learning, the neural network could adapt to the lack of scaling by choosing suitable weights that basically do the rescaling for you. But initially the network is going to do very poorly, because the initial weights will be poorly suited to the unscaled inputs, and the optimization method (stochastic gradient descent) will have a hard time figuring out how to choose better weights: you're so far away from a good part of the weight space that any small change to the weights makes little improvement to overall performance.
So, rescaling is a heuristic that tends to make stochastic gradient descent more effective, and helps training converge more quickly and more reliably.
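One common way to do this rescaling is to standardize each input feature to mean 0 and unit variance. Here's a minimal sketch (the helper name `standardize` is mine, not from the text); note that the mean and standard deviation are computed on the training set and then reused for any later data:

```python
import numpy as np

def standardize(X, eps=1e-8):
    """Rescale each feature (column) to mean 0 and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / (std + eps), mean, std

# Fit the scaling on training data...
X_train = np.random.default_rng(0).uniform(100.0, 101.0, size=(1000, 5))
X_train_scaled, mean, std = standardize(X_train)

# ...then apply the *same* transform to new data before feeding it to the network.
X_test = np.random.default_rng(1).uniform(100.0, 101.0, size=(200, 5))
X_test_scaled = (X_test - mean) / (std + 1e-8)
```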