Neural networks are pretty cool right now. Especially deep ones, which excel at traditionally difficult, high-dimensional tasks like image classification. But figuring out how to design a good network and an associated training regime for a given task is tricky, and many wouldn’t know where to start.
So let’s start at the very beginning… A simple feed-forward neural network with one hidden layer, trying to predict a simple, noiseless function (say, a sine wave) in just one dimension.
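The post doesn’t tie itself to a particular framework, but here’s a minimal sketch of that setup in PyTorch. The x-range, sample count, and hidden width are illustrative assumptions, not the exact settings used for the figures.

```python
import torch
import torch.nn as nn

# Target: a noiseless sine wave sampled on a 1-D grid.
# The range and number of samples are illustrative choices.
x = torch.linspace(-3.14, 3.14, 200).unsqueeze(1)   # shape (200, 1)
y = torch.sin(x)                                     # shape (200, 1)

# One hidden layer: 1 input -> `width` hidden units -> 1 output.
width = 10
net = nn.Sequential(
    nn.Linear(1, width),   # weight and bias applied to the scalar input
    nn.Tanh(),             # swap in nn.ReLU() or an x-tanh module to compare
    nn.Linear(width, 1),   # combine the hidden units into one prediction
)

print(net(x).shape)  # torch.Size([200, 1])
```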
Sounds boring? Great! We can help make neural networks ‘uncool’ again, and maybe learn something that helps in the complex cases too.
Why train lots of networks instead of just one? Because we can. Generalisation is hard, and there’s always a chance that the particular network configuration you’ve arrived at is a fluke: erratic behaviour is common enough, and people only share their best results. So, while we’ve got a case simple enough to train as many different networks as we like, let’s do exactly that. We’ll train a whole bunch of ‘bees’: new networks that share the same architecture and the same training regime, where the random starting weights and biases are all that differentiate each bee. By training a whole ‘swarm’ of bees, we can get a better sense of how the regime and architecture work on average, rather than judging from a single, possibly spurious result.
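As a rough sketch of what training a swarm could look like: repeat an identical regime over many random seeds, so only the initial weights and biases differ between bees. The optimiser, learning rate, and epoch count here are placeholder choices, not necessarily what was used for the figures.

```python
import torch
import torch.nn as nn

x = torch.linspace(-3.14, 3.14, 200).unsqueeze(1)
y = torch.sin(x)

def train_bee(seed, width=10, epochs=1000, lr=1e-2):
    """Train one 'bee': same architecture and regime, only the seed differs."""
    torch.manual_seed(seed)                      # the only source of variation
    net = nn.Sequential(nn.Linear(1, width), nn.Tanh(), nn.Linear(width, 1))
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(net(x), y)
        loss.backward()
        opt.step()
    return net, loss.item()

# A 'swarm' is just many bees trained under the identical regime.
swarm = [train_bee(seed) for seed in range(20)]
losses = [loss for _, loss in swarm]
print(f"best {min(losses):.4f}, worst {max(losses):.4f}")
```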
You can see that some of those bees in some of those swarms end up fitting the curve remarkably well. (The little ball represents the total loss for each bee. Lower is better.)
The left-hand column shows us the most basic case, where we have only 1 neuron. Then, all that the ‘network’ can do is fit a weight and bias around the activation function itself. And what are those activation functions? Well, in this case we’re just contrasting three. The traditional ReLU is on the bottom: just a flat line that kinks upward into a straight incline (with slope one). The others are based on the hyperbolic tangent: ‘tanh’ on top, and ‘x-tanh’ in the middle. Being smoother, they fit the rounded shape of the sine curve much better. But apart from being smooth, they also have other nuanced traits. Notice that tanh (top right) tends to fit the central slope most quickly, but even with a width of 10 it struggles to capture the symmetric dip and hump on either side of the curve. The ‘x-tanh’ activations, whilst slower to reach the top and bottom of the dip and hump, fit the shape much better once they get there.
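For reference, here’s one way those three activations could be written, assuming ‘x-tanh’ means x·tanh(x) (which matches the flat-in-the-middle, steep-at-the-ends shape described further down):

```python
import torch
import torch.nn as nn

class XTanh(nn.Module):
    """x * tanh(x): roughly quadratic (flat) near zero, roughly |x| far from it.
    Assumes 'x-tanh' means this product; the post doesn't spell it out."""
    def forward(self, x):
        return x * torch.tanh(x)

# The three hidden-layer activations being contrasted.
activations = {
    "relu":   nn.ReLU(),   # flat, then a straight incline with slope one
    "tanh":   nn.Tanh(),   # smooth S-curve, steepest in the middle
    "x-tanh": XTanh(),
}

# Any of these can be dropped into the hidden layer of the one-layer network.
net = nn.Sequential(nn.Linear(1, 10), activations["x-tanh"], nn.Linear(10, 1))
```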
To explore why, let’s have a look at what the activation functions look like on their own.
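Something like that comparison plot can be sketched as follows (the x-range is an arbitrary choice):

```python
import numpy as np
import matplotlib.pyplot as plt

xs = np.linspace(-3.14, 3.14, 400)

plt.plot(xs, np.sin(xs), "k--", label="sin (target)")
plt.plot(xs, np.maximum(0, xs), label="relu")
plt.plot(xs, np.tanh(xs), label="tanh")
plt.plot(xs, xs * np.tanh(xs), label="x-tanh")
plt.legend()
plt.show()
```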
It’s easy to see that the ‘tanh’ line fits the largest part of the sine curve we’re approximating. Now let’s have a look at how these shapes show up in the initial (random) starting positions of all the lines.
Here we can see what it looks like to randomly sample curves from the family of functions created by a single one of the activation functions with just one weight and bias. The ReLU curves (on the bottom left) have just a single kink. The tanh curves (on top) have a slope that’s steepest somewhere near the middle of the line. The x-tanh curves are flat somewhere in the middle, and steep(er) towards the ends. As more neurons are added in the middle and right columns, you can see directly how additional ‘kinks’ appear in the ReLU curves, and how the tanh-based functions can take on more complex shapes.
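To reproduce something like this, sample a random weight and bias for a single hidden neuron and plot the resulting curve for each activation. The standard-normal initialisation below is a guess at a typical scheme, not necessarily the one used here:

```python
import numpy as np
import matplotlib.pyplot as plt

xs = np.linspace(-3.14, 3.14, 400)
acts = {
    "relu":   lambda z: np.maximum(0, z),
    "tanh":   np.tanh,
    "x-tanh": lambda z: z * np.tanh(z),
}

rng = np.random.default_rng(0)
fig, axes = plt.subplots(1, 3, figsize=(12, 3), sharey=True)
for ax, (name, act) in zip(axes, acts.items()):
    for _ in range(10):
        # One hidden neuron: w_out * act(w_in * x + b_in) + b_out.
        w_in, b_in, w_out, b_out = rng.normal(size=4)
        ax.plot(xs, w_out * act(w_in * xs + b_in) + b_out, alpha=0.6)
    ax.set_title(name)
plt.show()
```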
Finally, let’s see an animation of how the training evolves over time.
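One way such an animation could be put together, assuming matplotlib’s FuncAnimation and a prediction snapshot recorded every few epochs (all training settings below are arbitrary):

```python
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

x = torch.linspace(-3.14, 3.14, 200).unsqueeze(1)
y = torch.sin(x)

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(1, 10), nn.Tanh(), nn.Linear(10, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

# Record the prediction every few epochs as a training snapshot.
snapshots = []
for epoch in range(1000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(net(x), y)
    loss.backward()
    opt.step()
    if epoch % 20 == 0:
        snapshots.append(net(x).detach().squeeze().numpy())

fig, ax = plt.subplots()
ax.plot(x.squeeze().numpy(), y.squeeze().numpy(), "k--", label="sin (target)")
(line,) = ax.plot([], [], label="prediction")
ax.legend()

def update(frame):
    # Show the network's fit at successively later points in training.
    line.set_data(x.squeeze().numpy(), snapshots[frame])
    return (line,)

anim = FuncAnimation(fig, update, frames=len(snapshots), interval=100)
plt.show()
```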