A project of ‘Martinis and Research’ - a Sydney-based collaboration of ML enthusiasts, namely Varun Nayyar, Ben Jelliffe and Aidan Morrison

Introduction, and Activation Functions

Let’s break it right down

Neural networks are pretty cool right now. Especially deep ones, which are great for doing traditionally difficult high-dimensional tasks, like image classification. But figuring out how to design a good network and associated training regime for a given task is tricky. Many wouldn’t know where to start.

So let’s start at the very beginning… A simple feed-forward neural network with one hidden layer, trying to predict a simple, noiseless function (say, a sine wave) in just one dimension.
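To make that concrete, here’s a minimal sketch of the setup, assuming PyTorch (the project’s actual code may be organised differently): a one-hidden-layer network, noiseless sine data, and a plain SGD training loop. The width, learning rate and epoch count are illustrative placeholders, not the values used in the plots below.

```python
import torch
import torch.nn as nn

def make_net(width=10, activation=nn.Tanh):
    # One input, one hidden layer of `width` neurons, one output.
    # `activation` is the class we swap out per experiment.
    return nn.Sequential(
        nn.Linear(1, width),
        activation(),
        nn.Linear(width, 1),
    )

# Noiseless 1-d training data: x in [-pi, pi], y = sin(x)
x = torch.linspace(-torch.pi, torch.pi, 200).unsqueeze(1)
y = torch.sin(x)

net = make_net()
opt = torch.optim.SGD(net.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for epoch in range(2000):
    opt.zero_grad()
    loss = loss_fn(net(x), y)
    loss.backward()
    opt.step()
```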

Sounds boring? Great! We can help make neural networks ‘uncool’ again, and maybe learn something that helps in the complex cases too.

Why swarm?

Because we can. Generalisation is hard. There’s always a chance that the particular network configuration you’ve achieved is a fluke. Erratic behaviour is common enough, and people only share their best results. So while we’ve got a case simple enough to train as many different networks as we like, let’s do that. We’ll train a whole bunch of ‘bees’: fresh networks that share the same architecture and the same training regime. The random starting weights and biases are all that differentiate each bee. By training a whole ‘swarm’ of bees, we can get a sense of how the architecture and regime perform on average, rather than judging from a single, possibly spurious, result.
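As a sketch of the swarm idea (again assuming PyTorch, and reusing `make_net`, `x` and `y` from the snippet above; `train_bee` is a hypothetical helper name): each bee gets its own random seed, and nothing else changes.

```python
def train_bee(seed, width=10, activation=torch.nn.Tanh, epochs=2000):
    torch.manual_seed(seed)              # only the random init differs per bee
    net = make_net(width, activation)    # make_net, x, y as sketched above
    opt = torch.optim.SGD(net.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(net(x), y)
        loss.backward()
        opt.step()
    return net, loss.item()

swarm = [train_bee(seed) for seed in range(50)]   # 50 bees, one seed each
losses = [loss for _, loss in swarm]              # compare the final losses
```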

You can see that some of those bees in some of those swarms end up fitting the curve remarkably well. (The little ball represents the total loss for each bee. Lower is better.)

The left-hand column shows us the most basic case, where we have only 1 neuron. Then all the ‘network’ can do is fit a weight and bias around the activation function itself. And what are those activation functions? Well, in this case we’re just contrasting three. The traditional ReLU is on the bottom: a straight, flat line below zero that tilts up into a straight incline with slope one. The other two are based on the hyperbolic tangent: ‘tanh’ on top, and ‘x-tanh’ in the middle. Being smoother, they fit the rounded shape of the sine curve much better. But apart from being smooth, they also have other nuanced traits. Notice that the tanh (top right) tends to fit the central slope most quickly, but even with a width of 10 it struggles to capture the symmetric nature of the dip and hump either side of the curve. The ‘x-tanh’ activations, whilst slower to reach the top and bottom of the dip and hump, fit the shape much better once they get there.
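For concreteness, here’s how the three activations might be written down, with the caveat that ‘x-tanh’ is assumed here to mean f(x) = x·tanh(x); the project may define it differently.

```python
import torch
import torch.nn as nn

class XTanh(nn.Module):
    """Assumed definition of 'x-tanh': f(x) = x * tanh(x)."""
    def forward(self, x):
        return x * torch.tanh(x)

# The classes being compared; make_net(activation=...) instantiates one per network.
activations = {
    "tanh": nn.Tanh,    # steepest around zero, saturates towards +/-1
    "x-tanh": XTanh,    # flat near zero, roughly |x| far from zero
    "relu": nn.ReLU,    # flat below zero, straight incline (slope one) above
}

# With a single hidden neuron, the network output is essentially one of these
# curves, shifted and scaled by the learned weights and biases.
```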

Let’s examine the actual functions to explore why.

Activation Functions

Let’s have a look at what the activation functions look like on their own.

It’s easy to see that the ‘tanh’ line fits the greatest part of the sine curve we’re approximating. Let’s have a look at how these shapes show up in the initial (random) starting positions of all the lines.

Here we can see what it looks like to randomly select curves from the possibilities created by a single one of the activation functions with just one weight and bias. The ReLU curves (bottom left) have just a single kink. The tanh curves (top) have a slope that’s steepest somewhere near the middle of the line. The x-tanh curves are flat somewhere in the middle, and steeper towards the ends. As more neurons are added in the middle and right columns, you can directly see how additional ‘kinks’ appear in the ReLU curves, and how the tanh-based functions can take on more complex shapes.
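A sketch of how those initial curves could be drawn, reusing the hypothetical `make_net` and `XTanh` from the earlier snippets: sample a fresh, untrained network per seed and plot its output before any training happens.

```python
import matplotlib.pyplot as plt
import torch

xs = torch.linspace(-torch.pi, torch.pi, 200).unsqueeze(1)
for seed in range(20):
    torch.manual_seed(seed)
    net = make_net(width=1, activation=XTanh)   # helpers from the sketches above
    with torch.no_grad():
        # Each untrained net is just a randomly scaled/shifted activation curve.
        plt.plot(xs.squeeze(), net(xs).squeeze(), alpha=0.5)
plt.title("Random starting curves: width 1, x-tanh")
plt.show()
```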

Finally, let’s see an animation of how the training evolves over time.

The evolution of swarms over training


Now that’s kinda cool. (Oops, I mean, really un-cool.) You can see a range of effects that beg further exploration. Some of the bees (individual lines) bounce around considerably, and their loss seems to oscillate. Some of them get into a good position fairly early on, and continue to improve quickly. Others find themselves in a poorer position, and struggle to catch up.
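If you’d like to produce a similar animation yourself, one sketch (extending the hypothetical `train_bee` above, and reusing `make_net`, `x` and `y`) is to record each bee’s prediction and loss every few epochs, then plot the stored frames afterwards.

```python
def train_bee_with_history(seed, width=10, activation=torch.nn.Tanh,
                           epochs=2000, every=10):
    torch.manual_seed(seed)
    net = make_net(width, activation)        # make_net, x, y as sketched earlier
    opt = torch.optim.SGD(net.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()
    frames = []
    for epoch in range(epochs):
        opt.zero_grad()
        loss = loss_fn(net(x), y)
        loss.backward()
        opt.step()
        if epoch % every == 0:
            with torch.no_grad():
                # Store the current fit and loss as one animation frame.
                frames.append((epoch, net(x).squeeze(1), loss.item()))
    return net, frames
```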

That said, given enough epochs, it looks as though almost all of these networks will eventually find a decent fit. That’s a luxury we can afford with such a simple problem of 1-d noiseless data, where training is cheap. But avoiding getting stuck with a poor performer that still hasn’t found its way after a decent amount of training, or worse, picking an architecture or training regime that doesn’t (quickly) converge to a good fit, are very much enduring challenges in machine learning engineering.

In a subsequent post, we’ll explore how the number of epochs, momentum, and learning rate interact with our basic swarm examples.

Code for this post can be found here, but if that link breaks (which it may, temporarily, as we tidy up), and even if it doesn’t, check out the code for the whole project.