9

Can every continuous piecewise linear function $[-1,1]^k \rightarrow \mathbb{R}^n$ be written as a composition of the following building blocks:

  • Affine map: $x \mapsto Ax + b$ for some matrix $A$ and vector $b$
  • Relu activation: $(x_1, x_2, ...) \mapsto (\max(0, x_1), \max(0, x_2),...)$

If so, how many composition factors are needed? Can every such function be represented by a network with "only one hidden layer":

$$ \text{affine} \circ \text{relu} \circ \text{affine} \circ \text{relu} \circ \text{affine} $$

By piecewise linear, I mean that there exists a decomposition of the domain $[-1,1]^k$ into finitely many polytopes such that the restriction of the function to each polytope is affine.
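
For concreteness, here is a minimal numpy sketch of the two building blocks and of the one-hidden-layer composition above; as a sanity check, it builds the 1D hat function $x \mapsto \max(0, 1-|x|)$ with a single hidden layer of width 3.

```python
import numpy as np

def affine(A, b):
    """The affine map x -> A x + b."""
    return lambda x: A @ x + b

def relu(x):
    """Coordinatewise max(0, .)."""
    return np.maximum(0.0, x)

def one_hidden_layer(A1, b1, A2, b2):
    """The composition affine ∘ relu ∘ affine asked about above."""
    inner, outer = affine(A1, b1), affine(A2, b2)
    return lambda x: outer(relu(inner(x)))

# Example: the hat function max(0, 1 - |x|) on R^1 equals
# relu(x + 1) - 2 relu(x) + relu(x - 1), i.e. one hidden layer of width 3.
A1, b1 = np.ones((3, 1)), np.array([1.0, 0.0, -1.0])
A2, b2 = np.array([[1.0, -2.0, 1.0]]), np.zeros(1)
hat = one_hidden_layer(A1, b1, A2, b2)
assert np.isclose(hat(np.array([0.0]))[0], 1.0)
assert np.isclose(hat(np.array([0.5]))[0], 0.5)
assert np.isclose(hat(np.array([2.0]))[0], 0.0)
```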

Jan
  • 1,969
  • Did you try a case when there are just two polytopes? – AHusain Jun 15 '21 at 07:28
  • @AHusain great question. My answer below gives tools that prove the case of two polytopes (single separating hyperplane). – Jan Jun 15 '21 at 08:59
  • 1
    Not every such function can be built with only one hidden layer. In particular, for k > 1 every compactly supported ReLU neural network with 1 hidden layer is 0 everywhere. On the other hand, deep ReLU NNs can build all continuous p.w. affine linear functions, see https://arxiv.org/abs/1807.03973. The main idea of that paper is that it is easy to build first-order finite elements by ReLU nets. You can therefore build everything that is a sum of linear finite elements also as a ReLU neural network. This includes all continuous piecewise affine functions (Theorem 5.2 of the linked paper). – pcp Jun 20 '21 at 20:25
  • Thanks a lot! I think it would be great if you could turn this comment into an answer, so it is easier to discover! – Jan Jun 22 '21 at 05:36
  • good idea. I added an answer. – pcp Jun 23 '21 at 07:12

2 Answers

6

Call a function $\mathbb{R}^k \rightarrow \mathbb{R}^n$ representable if it is a composition of relus and affine maps. We want to show that the set of (restrictions to $[-1,1]^k$ of) representable functions contains the set of all piecewise linear functions. (The other direction is easy.)

Let us show that the space of representable functions is closed under a couple of useful operations. Later we try to show that these operations suffice to generate all piecewise linear functions.

  • Every affine function is representable
  • The copy operation $f:\mathbb{R}^n \rightarrow \mathbb{R}^{2n}$ given by $x \mapsto (x, x)$ is representable
  • If $f:\mathbb{R}^{k_1} \rightarrow \mathbb{R}^{n_1}$ and $g:\mathbb{R}^{k_2} \rightarrow \mathbb{R}^{n_2}$ are both representable, so is their cartesian product $(x,y) \mapsto (f(x), g(y))$
  • If $f,g: \mathbb{R}^k \rightarrow \mathbb{R}^n$ are representable so is their sum $f+g: \mathbb{R}^{k} \rightarrow \mathbb{R}^n$
  • $f:\mathbb{R}^k \rightarrow \mathbb{R}^n$ is representable if and only if each of its coordinate functions $f_i:\mathbb{R}^k \rightarrow \mathbb{R}$ is representable.
  • Halfspace projections are representable. More precisely, let $A \in \mathbb{R}^{1 \times k}$ and $b \in \mathbb{R}$. Denote by $H$ the set of all points that satisfy $Ax \le b$. Then $\mathrm{proj}(x) = \operatorname{argmin}_{h \in H} \|x-h\|$ is representable.

Proof: By composing with affine maps, we can reduce to the case $A = (-1,0,\dots,0)$ and $b=0$, i.e. $H = \{x : x_1 \ge 0\}$. Then the cartesian product function $(\mathrm{relu}, \mathrm{id}, \dots, \mathrm{id})$ does the job.
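
For concreteness, a small numerical sketch (numpy assumed) of this projection as an explicit affine $\circ$ relu $\circ$ affine composition; the extra $\pm$-copies of the remaining coordinates realize the identity via $t = \mathrm{relu}(t) - \mathrm{relu}(-t)$.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def halfspace_projection(k):
    """Projection of R^k onto {x : x_1 >= 0}, as affine ∘ relu ∘ affine."""
    # Inner affine map: x -> (x_1, x_2, -x_2, ..., x_k, -x_k).
    A1 = np.zeros((2 * k - 1, k))
    A1[0, 0] = 1.0
    for j in range(1, k):
        A1[2 * j - 1, j] = 1.0
        A1[2 * j, j] = -1.0
    # Outer affine map: (u_1, u_2, v_2, ...) -> (u_1, u_2 - v_2, ...),
    # which recovers x_j = relu(x_j) - relu(-x_j) for j >= 2.
    A2 = np.zeros((k, 2 * k - 1))
    A2[0, 0] = 1.0
    for j in range(1, k):
        A2[j, 2 * j - 1] = 1.0
        A2[j, 2 * j] = -1.0
    return lambda x: A2 @ relu(A1 @ x)

proj = halfspace_projection(3)
print(proj(np.array([-2.0, 3.0, -4.0])))  # [ 0.  3. -4.]
print(proj(np.array([2.0, -1.0, 0.5])))   # [ 2. -1.  0.5]
```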

  • If $f,g$ are representable and coincide along a hyperplane, then so is the function obtained by glueing them together along that hyperplane. More precisely: Let $f,g:\mathbb{R}^k \rightarrow \mathbb{R}^n$, $A \in \mathbb{R}^{1 \times k}$ and $b \in \mathbb{R}$. Assume $f(x) = g(x)$ for all $x$ that satisfy $(Ax = b)$. Then the piecewise linear function $r$ given by:

$$ \begin{align} r(x) &= f(x) \text{ for } Ax \le b \\ r(x) &= g(x) \text{ for } Ax \ge b \end{align} $$

is representable.

Proof: As a warm-up we assume that $f,g$ vanish on the $(Ax=b)$ hyperplane. Then we have $r(x) = f(\mathrm{proj}_-(x)) + g(\mathrm{proj}_+(x))$, where $\mathrm{proj}_-$ and $\mathrm{proj}_+$ are the projections onto the halfspaces $(Ax \le b)$ and $(Ax \ge b)$. Indeed, for $Ax \le b$ the first summand equals $f(x)$ and the second vanishes because $\mathrm{proj}_+(x)$ lies on the hyperplane; the case $Ax \ge b$ is symmetric. This is a composition and sum of representable functions.

In case $f,g$ do not vanish on the hyperplane we can decompose them as $f = f_0 + \mathrm{rest}$ and $g = g_0 + \mathrm{rest}$ where $\mathrm{rest}(x) := f(\mathrm{proj}_-(\mathrm{proj}_+(x)))$; note that $\mathrm{proj}_- \circ \mathrm{proj}_+$ is the orthogonal projection onto the hyperplane $(Ax=b)$, on which $f$ and $g$ agree. Now $f_0$ and $g_0$ vanish on the hyperplane, and applying the warm-up to them gives $r(x) = f_0(\mathrm{proj}_-(x)) + g_0(\mathrm{proj}_+(x)) + \mathrm{rest}(x)$.
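
Here is a small numerical check of this construction (numpy assumed), for the toy example $f(x) = x_2 - x_1$ and $g(x) = x_2 + 2x_1$, which agree on the hyperplane $\{x_1 = 0\}$ in $\mathbb{R}^2$.

```python
import numpy as np

relu = lambda t: np.maximum(0.0, t)

f = lambda x: x[1] - x[0]          # piece used on {x_1 <= 0}
g = lambda x: x[1] + 2.0 * x[0]    # piece used on {x_1 >= 0}; f = g on {x_1 = 0}

proj_minus = lambda x: np.array([-relu(-x[0]), x[1]])  # projection onto {x_1 <= 0}
proj_plus = lambda x: np.array([relu(x[0]), x[1]])     # projection onto {x_1 >= 0}

rest = lambda x: f(proj_minus(proj_plus(x)))   # = f on the hyperplane {x_1 = 0}
f0 = lambda x: f(x) - rest(x)                  # vanishes on {x_1 = 0}
g0 = lambda x: g(x) - rest(x)                  # vanishes on {x_1 = 0}

# The warm-up construction applied to f0 and g0, plus rest:
r = lambda x: f0(proj_minus(x)) + g0(proj_plus(x)) + rest(x)

for x in [np.array([-1.5, 2.0]), np.array([0.0, 2.0]), np.array([3.0, -1.0])]:
    assert np.isclose(r(x), f(x) if x[0] <= 0 else g(x))
```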

To see that every piecewise linear function can be built from these operations, let $f$ be such a function. Then there exists a finite collection of hyperplanes such that the domain is covered by halfspace intersections on which $f$ is affine. I think we can represent $f$ by induction, using the glueing rule, but I have trouble spelling this out rigorously. It is a gap in the proof.

Jan
  • 1,969
2

Long version of my short comment:

First of all, not every piecewise affine linear function can be built by a ReLU neural network with only one hidden layer. The reason is that a compactly supported piecewise affine function, such as $$ \mathbb{R}^d \ni x \mapsto \max\{0, 1 - \max_{i=1, \dots, d} |x_i| \} $$ cannot be represented by a sum of ReLUs: this function is smooth outside of a compact set, whereas a sum of ReLUs is either affine linear or has at least one unbounded line along which it is not smooth. (This is of course something one would need to prove in more detail. A proof can be found in Theorem 4.1 of https://arxiv.org/pdf/1807.03973.pdf.)
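
For concreteness (my paraphrase, not a quotation from the paper): a ReLU network with a single hidden layer of width $m$ computes a function of the form

$$ x \mapsto c_0 + \sum_{i=1}^{m} c_i \max(0,\, a_i \cdot x + b_i), \qquad a_i \in \mathbb{R}^d,\; b_i \in \mathbb{R},\; c_0, c_i \in \mathbb{R}^n, $$

and each summand with $a_i \neq 0$ and $c_i \neq 0$ fails to be smooth exactly along the unbounded hyperplane $\{x : a_i \cdot x + b_i = 0\}$. For $d > 1$ such a hyperplane cannot be contained in a compact set, which is what the smoothness argument above exploits.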

On the other hand, it was shown in https://arxiv.org/pdf/1807.03973.pdf that deep ReLU neural networks can represent linear finite elements. This is because one can write these hat functions as a combination of max and min operations. I can only do a worse job than the authors themselves in explaining how this is done. Their paper also has a lot of nice illustrations. Therefore I think it is best to just refer to Chapter 3 of https://arxiv.org/pdf/1807.03973.pdf.
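
To give at least a flavour of the idea (this is only the standard identity that the construction rests on, not the authors' exact argument): the maximum and minimum of two numbers are themselves tiny ReLU networks, so nested max/min expressions, and hence hat functions, become deep ReLU networks.

```python
import numpy as np
from functools import reduce

relu = lambda t: np.maximum(0.0, t)

relu_max = lambda a, b: a + relu(b - a)   # = max(a, b)
relu_min = lambda a, b: a - relu(a - b)   # = min(a, b)
relu_abs = lambda t: relu(t) + relu(-t)   # = |t|

def pyramid(x):
    """max(0, 1 - max_i |x_i|), built only from relu and affine operations
    (several layers deep, since the maxima are nested)."""
    return relu(1.0 - reduce(relu_max, (relu_abs(t) for t in x)))

print(pyramid(np.array([0.25, -0.5, 0.1])))   # 0.5
print(pyramid(np.array([2.0, 0.0, 0.0])))     # 0.0 outside [-1, 1]^3
```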

From the construction of hat functions, it follows essentially directly that all continuous piecewise linear functions can also be represented by deep ReLU neural networks, since every such function is a sum of hat functions. This is Theorem 5.2 of the work cited above.
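
As a toy 1D illustration of this last statement (grid and function chosen arbitrarily, numpy assumed): a continuous piecewise linear function whose kinks lie on the grid nodes is exactly the sum of its nodal values times the corresponding hat functions.

```python
import numpy as np

nodes = np.array([-1.0, -0.5, 0.0, 0.25, 1.0])   # an arbitrary grid on [-1, 1]

def hat(j, x):
    """Nodal hat function: 1 at nodes[j], 0 at all other nodes, affine in between."""
    y = np.zeros_like(x)
    if j > 0:                                    # rising edge
        m = (x >= nodes[j - 1]) & (x <= nodes[j])
        y[m] = (x[m] - nodes[j - 1]) / (nodes[j] - nodes[j - 1])
    if j < len(nodes) - 1:                       # falling edge
        m = (x >= nodes[j]) & (x <= nodes[j + 1])
        y[m] = (nodes[j + 1] - x[m]) / (nodes[j + 1] - nodes[j])
    return y

# A piecewise linear function whose kinks (at 0 and 0.25) lie on grid nodes:
f = lambda x: np.abs(x) - 0.5 * np.maximum(0.0, x - 0.25)

xs = np.linspace(-1.0, 1.0, 201)
as_sum_of_hats = sum(f(nodes[j]) * hat(j, xs) for j in range(len(nodes)))
assert np.allclose(as_sum_of_hats, f(xs))
```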

pcp
  • 1,600