32

I have an input which is a list, and the output is the maximum of the elements of that list.

Can machine learning learn a function that always selects the maximum of the elements present in the input?

This might seem like a pretty basic question, but it might give me an understanding of what machine learning can do in general. Thanks!

Peter
user78739

7 Answers

39

Maybe, but note that this is one of those cases where machine learning is not the answer. There is a tendency to try to shoehorn machine learning into cases where, really, a bog-standard rules-based solution is faster, simpler, and just generally the right choice :P

Just because you can, doesn't mean you should
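
For reference, the rules-based solution is a one-liner; everything below is purely a learning exercise:

xs = [3, 1, 4, 1, 5, 9, 2, 6]
print(max(xs))            # 9: the maximum itself
print(xs.index(max(xs)))  # 5: its index, which is what the models below predict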

Edit: I originally wrote this as "Yes, but note that..." but then started to doubt myself, having never seen it done. I tried it out this afternoon and it's certainly doable:

import numpy as np
from keras.models import Model
from keras.layers import Input, Dense, Dropout
from keras.utils import to_categorical
from sklearn.model_selection import train_test_split
from keras.callbacks import EarlyStopping

# Create an input array of 50,000 samples of 20 random numbers each
x = np.random.randint(0, 100, size=(50000, 20))

# And a one-hot encoded target denoting the index of the maximum of the inputs
y = to_categorical(np.argmax(x, axis=1), num_classes=20)

# Split into training and testing datasets
x_train, x_test, y_train, y_test = train_test_split(x, y)

# Build a network; probably needlessly complicated, since it needs a lot of
# dropout to perform even reasonably well.

i = Input(shape=(20, ))
a = Dense(1024, activation='relu')(i)
b = Dense(512, activation='relu')(a)
ba = Dropout(0.3)(b)
c = Dense(256, activation='relu')(ba)
d = Dense(128, activation='relu')(c)
o = Dense(20, activation='softmax')(d)

model = Model(inputs=i, outputs=o)

es = EarlyStopping(monitor='val_loss', patience=3)

model.compile(optimizer='adam', loss='categorical_crossentropy')

model.fit(x_train, y_train, epochs=15, batch_size=8, validation_data=(x_test, y_test), callbacks=[es])

print(np.where(np.argmax(model.predict(x_test), axis=1) == np.argmax(y_test, axis=1), 1, 0).mean())

Output is 0.74576, so it's correctly finding the max about 74.6% of the time. I have no doubt that could be improved, but as I say, this is not a use case I would recommend for ML.

EDIT 2: Actually, I re-ran this this morning using sklearn's RandomForestClassifier, and it performed significantly better:

from sklearn.ensemble import RandomForestClassifier

# instantiation of the arrays is identical
rfc = RandomForestClassifier(n_estimators=1000, verbose=1)
rfc.fit(x_train, y_train)

yhat_proba = rfc.predict_proba(x_test)


# We have some annoying transformations to do because this .predict_proba() call returns a list of 20 arrays (one per output column), each of shape (12500, 2).

for i in range(len(yhat_proba)):
    yhat_proba[i] = yhat_proba[i][:, 1]

pyhat = np.reshape(np.ravel(yhat_proba), (12500,20), order='F')

print(np.where(np.argmax(pyhat, axis=1) == np.argmax(y_test, axis=1), 1, 0).mean())

And the score here is 94.4% of samples with the max correctly identified, which is pretty good indeed.

Dan Scally
28

Yes. Very importantly, YOU decide the architecture of a machine learning solution. Architectures and training procedures don't write themselves; they must be designed or templated, and training then follows as a means of discovering a parameterization of the architecture that fits a given set of data points.

You can construct a very simple architecture that actually includes a maximum function:

net(x) = a * max(x) + b * min(x)

where a and b are learned parameters.

Given enough training samples and a reasonable training routine, this very simple architecture will learn very quickly to set a to 1 and b to 0 for your task.
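
A minimal sketch of that claim (my own illustration; the sample sizes are arbitrary): since the architecture is linear in a and b, even plain least squares serves as the training routine and recovers a = 1, b = 0:

import numpy as np

# Fit net(x) = a*max(x) + b*min(x) to the target max(x) by least squares
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(50_000, 20))

# The only two "features" the architecture exposes: max(x) and min(x)
F = np.column_stack([X.max(axis=1), X.min(axis=1)])
y = X.max(axis=1)  # the target function we want to learn

(a, b), *_ = np.linalg.lstsq(F, y, rcond=None)
print(a, b)  # prints a = 1 and b = 0 (up to floating-point noise)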

Machine learning often takes the form of entertaining multiple hypotheses about featurization and transformation of input data points, and learning to preserve only those hypotheses that are correlated with the target variable. The hypotheses are encoded explicitly in the architecture and sub-functions available in a parameterized algorithm, or as the assumptions encoded in a "parameterless" algorithm.

For example, the choice to use dot products and nonlinearities, as is common in vanilla neural network ML, is somewhat arbitrary; it expresses the encompassing hypothesis that a function can be constructed using a predetermined compositional network structure of linear transformations and threshold functions. Different parameterizations of that network embody different hypotheses about which linear transformations to use. Any toolbox of functions can be used, and a machine learner's job is to discover, through differentiation or trial and error or some other repeatable signal, which functions or features in its array best minimize an error metric.

In the example given above, the learned network simply reduces to the maximum function itself, whereas an undifferentiated network could alternatively "learn" a minimum function. These functions can be expressed or approximated via other means, as in the linear or neural net regression function in another answer. In sum, it really depends on which functions or LEGO pieces you have in your ML architecture toolbox.

pygosceles
7

Yes, machine learning can learn to find the maximum in a list of numbers.

Here is a simple example of learning to find the index of the maximum:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Create training pairs where the input is a list of numbers and the output is the argmax
training_data = np.random.rand(10_000, 5)  # Each list is 5 elements; 10K examples
training_targets = np.argmax(training_data, axis=1)

# Train a decision tree with scikit-learn
clf = DecisionTreeClassifier()
clf.fit(training_data, training_targets)

# Let's see if the trained model can correctly predict the argmax for new data
test_data = np.random.rand(1, 5)
prediction = clf.predict(test_data)
assert prediction == np.argmax(test_data)  # The test passes - the model has learned argmax

Brian Spiering
4

Learning algorithms

Instead of learning a function as a calculation done by a feed-forward neural network, there's a whole research domain on learning algorithms from sample data. For example, one might use something like a Neural Turing Machine, or some other method where the execution of an algorithm is controlled by machine learning at its decision points. Toy algorithms like finding a maximum, sorting a list, reversing a list, or filtering a list are commonly used as examples in algorithm-learning research.
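
As a toy illustration of that idea (my own sketch, far simpler than a Neural Turing Machine): below, a hand-written running-max loop delegates its single decision point, "is the new element larger than the current best?", to a learned classifier.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Train a classifier to answer "is the first number bigger than the second?"
rng = np.random.default_rng(0)
pairs = rng.uniform(0, 1, size=(10_000, 2))
clf = LogisticRegression().fit(pairs, pairs[:, 0] > pairs[:, 1])

def learned_max(xs):
    # The algorithm's control flow is fixed; only the comparison at the
    # decision point is machine-learned.
    best = xs[0]
    for x in xs[1:]:
        if clf.predict([[x, best]])[0]:
            best = x
    return best

xs = rng.uniform(0, 1, size=20)
print(learned_max(xs), xs.max())  # usually agree; the learned comparator isn't exact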

Peteris
3

I will exclude educated designs from my answer. No, it is not possible to use an out-of-the-box machine learning (ML) approach to represent the maximum function for arbitrary lists with arbitrary precision. ML is a data-based method, and you will clearly not be able to approximate a function in regions where you have no data points: the space of possible observations (which is infinite) cannot be covered by finitely many observations.

My statement has a theoretical foundation in Cybenko's universal approximation theorem for neural networks. I will quote the theorem from Wikipedia:

In the mathematical theory of artificial neural networks, the universal approximation theorem states that a feed-forward network with a single hidden layer containing a finite number of neurons can approximate continuous functions on compact subsets of $\mathbb{R}^n$, under mild assumptions on the activation function. The theorem thus states that simple neural networks can represent a wide variety of interesting functions when given appropriate parameters; however, it does not touch upon the algorithmic learnability of those parameters.

The most important part is the restriction to compact subsets of $\mathbb{R}^n$: it rules out approximating the maximum function on all of $\mathbb{R}^n$. This restriction manifests itself in the poor fit of the model in the most-upvoted answer.

If your space of observations is compact, then you might be able to approximate the maximum function with a finite data set. And as the top-voted answer made clear, you should not reinvent the wheel!
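
A small experiment (my own sketch; the architecture and ranges are arbitrary choices) makes the restriction visible: a regressor trained to output the maximum of inputs drawn from [0, 100] degrades badly on inputs drawn far outside that range.

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 100, size=(20_000, 5))
model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
model.fit(X_train, X_train.max(axis=1))

for lo, hi in [(0, 100), (900, 1000)]:
    X_test = rng.uniform(lo, hi, size=(1_000, 5))
    err = np.abs(model.predict(X_test) - X_test.max(axis=1)).mean()
    print(f"mean absolute error on [{lo}, {hi}]: {err:.1f}")
# The error off the training domain is typically far larger than on it:
# the fit is only trustworthy on (a compact region around) the training data.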

MachineLearner
1

Here's an expansion on my comment. To preface: @DanScally is absolutely right that there's no reason to use ML to find the maximum of a list. But your "it might give me an understanding of what machine learning can do in general" is reason enough to delve into this.

You ask about more general machine learning, but I'll focus on neural networks. In that context, we must first ask whether the actual functions produced by a neural network can approximate (or evaluate exactly) $\max$, and only then can we further inquire whether any of the (common?) training methods can fit a NN approximating $\max$.


The comments and @MachineLearner's answer brought up universal approximation theorems: on a bounded domain, a neural network can approximate any reasonably nice function like $\max$, but we can't expect a priori to approximate $\max$ on arbitrary inputs, nor to compute $\max$ exactly anywhere.

But it turns out that a neural network can exactly sort arbitrary input numbers. Indeed, $n$ $n$-bit integers can be sorted by a network with just two hidden layers of quadratic size (Depth Efficient Neural Networks for Division and Related Problems, Theorem 7 on page 955); many thanks to @MaximilianJanisch in this answer for finding the reference.

I'll briefly describe a simplification of the approach in that paper to produce the $\operatorname{argmax}$ function for $n$ arbitrary distinct inputs. The first hidden layer consists of $\binom{n}{2}$ neurons, each representing the indicator variable $\delta_{ij} = \mathbf{1}(x_i < x_j)$, for $i<j$. These are easily built as $x_j-x_i$ with a step activation function. The next layer has $n$ neurons, one for each input $x_i$; start with the sum $\sum_{j<i} \delta_{ji} + \sum_{j>i} (1-\delta_{ij})$; that is, the number of $j$ such that $x_i>x_j$, and hence the position of $x_i$ in the sorted list. To complete the argmax, just threshold this layer.
At this point, if we could multiply, we'd get the actual maximum value pretty easily. The solution in the paper is to use the binary representation of the numbers, at which point binary multiplication is the same as thresholded addition. To just get the argmax, it suffices to have a simple linear function multiplying the $i$th indicator by $i$ and summing.
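
To make the construction concrete, here is a forward-pass-only numpy sketch (mine, with the weights fixed by hand rather than learned) of the argmax network just described:

import numpy as np

def step(z):
    # Heaviside step activation
    return (z > 0).astype(float)

def argmax_net(x):
    n = len(x)
    # Hidden layer 1: for each pair i < j, delta[(i, j)] = 1(x_i < x_j),
    # computed as step(x_j - x_i)
    delta = {(i, j): step(x[j] - x[i]) for i in range(n) for j in range(i + 1, n)}
    # Hidden layer 2: rank of x_i = number of j with x_i > x_j
    rank = np.array([
        sum(delta[(j, i)] for j in range(i))
        + sum(1 - delta[(i, j)] for j in range(i + 1, n))
        for i in range(n)
    ])
    # Output: threshold the ranks to pick out the maximum (rank n - 1),
    # then dot with the index vector to read off the argmax
    return int(step(rank - (n - 1) + 0.5) @ np.arange(n))

x = np.random.permutation(100)[:20].astype(float)  # distinct inputs
assert argmax_net(x) == np.argmax(x)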


Finally, for the subsequent question: can we train a NN into this state? @DanScally got us started; maybe knowing the theoretical architecture can help us cheat our way into the solution? (Note that if we can learn/approximate the particular set of weights above, the net will actually perform well outside the range of the training samples.)

Notebook in github / Colab

Changing things just a little bit, I get a better test score (0.838), and even testing on a sample outside the original training range gives a decent score (0.698). Using inputs scaled to $[-1,1]$ gets the test score up to 0.961, with an out-of-range score of 0.758. But I'm scoring with the same method as @DanScally, which seems a little dishonest: the identity function will score perfectly on this metric.

I also printed out a few coefficients to see whether anything close to the exact fit described above appears (not really), and a few raw outputs, which suggest the model is too timid in predicting a maximum, erring on the side of predicting that none of the inputs is the maximum. Maybe modifying the objective could help, but at this point I've put in too much time already; if anyone cares to improve the approach, feel free to play (in Colab if you like) and let me know.

Ben Reiniger
0

Yes, even machine learning as simple as ordinary linear least squares can do this, with some applied cleverness.

(But most would consider this quite horrible overkill.)

I will assume we want to find the maximum of the absolute values of the input vector $\bf r$; a numpy sketch of the whole recipe follows the list:

  1. Select a function that is monotonically decreasing in the absolute value, for example $$f(x) = \frac{1}{x^2}$$
  2. Build the diagonal matrix of $f({\bf r})$. Let us call it $\bf C_r$.
  3. Build a row vector full of ones, $\bf S$.
  4. Build and solve the equation system $${\bf p} = (\epsilon {\bf I}+10^3\,{\bf S}^t{\bf S}+{\bf C_r})^{-1}(10^3\,{\bf S}^t)$$
  5. The resulting vector $\bf p$ is (approximately) a probability measure (it sums to 1); we can reweigh it nonlinearly to sharpen it, for example $$p_i \leftarrow \frac{p_i^k}{\sum_j |p_j|^k}$$
  6. Calculate the scalar product of $\bf p$ with the index vector and round.
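
Here is my reading of that recipe as a numpy sketch ($\epsilon$ and $k$ are illustrative choices); for the $\bf r$ below, the largest absolute value sits at index 5:

import numpy as np

r = np.array([3.0, -1.0, 4.0, 1.0, -5.0, 9.0, 2.0, -6.0])
n = len(r)

C_r = np.diag(1.0 / r**2)  # steps 1-2: diagonal matrix of f(r) = 1/r^2
S = np.ones((1, n))        # step 3: row vector of ones
eps = 1e-6

# step 4: p = (eps*I + 1e3*S^t S + C_r)^(-1) (1e3 * S^t)
p = np.linalg.solve(eps * np.eye(n) + 1e3 * S.T @ S + C_r, 1e3 * S.T).ravel()

# step 5: p sums to roughly 1; sharpen it nonlinearly
k = 4
p = p**k / np.sum(np.abs(p)**k)

# step 6: scalar product with the index vector, then round
print(round(float(p @ np.arange(n))))  # 5, the index of max |r|
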
mathreadler