backpropagation in grasshopper! (in ironpython actually)
(sorry about the small text in the gif, i’ll make a nicer one in the near future)
I recently went on a quest for a full understanding of backpropagation (used in training neural networks (Machine Learning/AI)), and came upon this amazing blog post detailing the implementation of backpropagation in Python.
I spent one day following the code line by line, but still wasn’t able to grasp the structure of how it worked, so I did what I usually do when I can’t understand difficult theories: break it up into bits and implement them in grasshopper!
The result of the deconstructed backpropagation algorithm looks pretty straightforward in hindsight, but then hindsight is always 20/20.
Actually running the loop through 500 epochs (as mentioned in the blog post) took hazardously long, so the definition doesn’t do ‘real’ learning, but even after 25 epochs one could see the accuracy actually climbing.
By comparison, here’s a gif of the same implementation in pure(r) Python (it’s IronPython, inside grasshopper, inside rhino3D). Note that both implementations took out cross validation, and instead did a simple split into a 66% train set and a 33% test set:
computing the same dataset for 25 epochs took 0.8 seconds. T_T
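For anyone wondering what a ‘simple split’ looks like in code, here is a minimal sketch. It is not the code from the blog post or from my definition: the function name `train_test_split`, the seed and the made-up rows are all assumptions for illustration.

```python
import random

def train_test_split(rows, train_fraction=0.66, seed=1):
    """Shuffle a copy of the rows, then cut it into a train part and a test part."""
    shuffled = list(rows)                      # copy, so the original order is untouched
    random.Random(seed).shuffle(shuffled)      # fixed seed keeps the split repeatable
    cut = int(len(shuffled) * train_fraction)  # index where the train set ends
    return shuffled[:cut], shuffled[cut:]

# Made-up dataset: each row is [feature, class label] (hypothetical values)
rows = [[i, i % 3] for i in range(210)]
train, test = train_test_split(rows)
print("%d rows to train on, %d rows to test on" % (len(train), len(test)))
```

The network only ever learns from `train`, and the accuracy is measured on `test`, which it has never seen.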
However, the grasshopper definition was, among other things, slow enough for the simple human mind to make some really important observations.
Here are a few interesting observations so far:
First, neural networks are in fact… a dictionary of weights! That’s it! (oh wait, see update) (It was so mindblowingly simple to me I had to walk around for a few minutes wondering if I missed something.) In actual fact, it is a network of weights that update incrementally over many iterations until they resemble an ‘abstraction’ of the dataset. They are then used to extrapolate data (among other things). The dictionary part is just down to the code being written in Python.
(update 20170820: actually, it is a dictionary of weights and an ‘activation function’, as I found out just a few days after this post but didn’t get around to updating it. An activation function takes the weighted sum of a neuron’s inputs and puts it on a sloped graph (like a sigmoid function/tanh, or even y=mx+b, which makes a sloped line); the slope of that graph is what tells the network whether nudging a weight up or down is a good thing.)
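To make the ‘dictionary of weights plus an activation function’ idea concrete, here is a rough sketch of a single neuron. The key names (`"weights"`, `"bias"`) and all the numbers are my own assumptions for illustration, not something lifted from the blog post.

```python
import math

def sigmoid(x):
    """Squash any number onto an S-shaped curve between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_slope(output):
    """Slope of the sigmoid, written in terms of the sigmoid's own output."""
    return output * (1.0 - output)

# One neuron really is just a dictionary of numbers (values made up)
neuron = {"weights": [0.5, -0.25], "bias": 0.1}
inputs = [1.0, 2.0]

# Weighted sum of the inputs, then the activation function on top
weighted_sum = neuron["bias"] + sum(w * x for w, x in zip(neuron["weights"], inputs))
output = sigmoid(weighted_sum)
slope = sigmoid_slope(output)

print("weighted sum: %.3f, output: %.3f, slope: %.3f" % (weighted_sum, output, slope))
```

The `slope` value is the bit that gets used later during backpropagation: it says how strongly the output reacts when the weighted sum is nudged.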
Think of training a neural network as ‘finding a polynomial(ish) function that fits a given set of numbers’. Once you find that function, you can extrapolate from it to guess new data, like below:
I found this to be extremely valuable in helping me intuitively grasp what a neural network ‘looks like’ (the gif above is the implementation of it in python).
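Here is a tiny sketch of the ‘fit a function, then extrapolate’ idea, using the simplest polynomial there is: a straight line. Note the swap: this uses the textbook least-squares formula to find the line in one go, whereas a neural network gets there by guessing again and again. The data points and the name `fit_line` are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept to the given points."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up numbers that roughly follow y = 2x + 1
xs = [0, 1, 2, 3, 4]
ys = [1.0, 3.1, 4.9, 7.2, 8.8]

slope, intercept = fit_line(xs, ys)
guess_at_5 = slope * 5 + intercept  # extrapolate to an x the fit never saw

print("y = %.2fx + %.2f, so the guess for x=5 is %.2f" % (slope, intercept, guess_at_5))
```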
In the same way, a neural network makes a best guess at this ‘function’ again and again, and each guess is tested against a given result (labels/classes/actual output that you use to compare). At every iteration an error is calculated (error = how far away the current mapping/function is from the right answer) and backpropagated (see next paragraph). Note that nothing is absolute. The neural network’s ‘guess’ is a set of probabilities: 10% says it’s A, 63% says it’s B, and 27% says it’s C, so the best guess is B.
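In code, the A/B/C example above might look something like this. It is only a sketch, and it assumes (my assumption, not stated above) that the right answer for this row really is B, written as a 1.0 for B and 0.0 for the others.

```python
# The network's 'guess': one probability per class (numbers from the text above)
guess = {"A": 0.10, "B": 0.63, "C": 0.27}

# The best guess is simply the class with the highest probability
best = max(guess, key=guess.get)

# Assumed right answer for this row: B
expected = {"A": 0.0, "B": 1.0, "C": 0.0}

# Error per output = how far each probability is from where it should be
errors = dict((label, expected[label] - guess[label]) for label in guess)

print("best guess: %s" % best)
print("error on B: %.2f" % errors["B"])
```

Even though B is the correct best guess here, it still carries an error, because the network was only 63% sure. That leftover error is what gets passed backwards.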
Secondly, backpropagation means ‘pass that error from the result back layer by layer (in the opposite direction) and find out which neuron is responsible for how much of that error’, and is essentially the ‘blame game’ played by the neurons in the network.
An anecdotal way of thinking about training goes like this:
‘The output neuron at the end of the line gets a result from his team of hard-working neurons and sees that it’s off by -0.34, so the weights (which are numbers) need to rearrange themselves so that the result moves upward by 0.34 to get it right. He shouts back down the line: ‘oi! which of you got it wrong?’ and the neurons huddle together and start assigning blame to each other, the first one saying, ‘i’m only responsible for 0.1 parts of this result, check the neuron before me, he gave me 0.9 parts of the paperwork already’ and the second goes ‘i only did 0.3 parts of this, check the guy before me’ and so on and so forth. Then the output neuron sees how many parts of the blame fall on which neuron and finds that, among all of them, the second guy (0.3) actually contributed the most error. And he gave him a good whipping. XD The next time round, the second neuron was more careful in doing his paperwork.’
And that, I believe, is my current understanding of how backpropagation works.
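For the code-minded, the blame game can be sketched roughly like this. It is my own simplified version and not the blog post’s code: it assumes sigmoid neurons stored as dictionaries, and the network shape, the names (`forward`, `assign_blame`, `"delta"`) and all the numbers are made up for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(network, inputs):
    """Pass the inputs through every layer, remembering each neuron's output."""
    for layer in network:
        outputs = []
        for neuron in layer:
            total = neuron["bias"] + sum(w * x for w, x in zip(neuron["weights"], inputs))
            neuron["output"] = sigmoid(total)
            outputs.append(neuron["output"])
        inputs = outputs  # this layer's outputs feed the next layer
    return inputs

def assign_blame(network, expected):
    """Walk the layers backwards and store each neuron's share of the error as 'delta'."""
    for i in reversed(range(len(network))):
        for j, neuron in enumerate(network[i]):
            if i == len(network) - 1:
                # Output layer: compare directly with the right answer
                error = expected[j] - neuron["output"]
            else:
                # Hidden layer: blame arrives through the weights that used my output
                error = sum(n["weights"][j] * n["delta"] for n in network[i + 1])
            # Scale the blame by the slope of the sigmoid at this neuron's output
            neuron["delta"] = error * neuron["output"] * (1.0 - neuron["output"])

# Made-up network: 2 inputs -> 2 hidden neurons -> 1 output neuron
network = [
    [{"weights": [0.2, -0.4], "bias": 0.1}, {"weights": [0.7, 0.3], "bias": -0.2}],
    [{"weights": [0.5, -0.6], "bias": 0.05}],
]

forward(network, [1.0, 0.5])
assign_blame(network, [1.0])

for layer in network:
    print(["%.4f" % neuron["delta"] for neuron in layer])
```

Each `"delta"` is one neuron’s share of the blame. A separate step (not shown here) then nudges every weight in proportion to its neuron’s delta.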
Third, this might be obvious to some people, but I also found that it helps to think of the layers of the neural network as the ‘lines’ in the network graph rather than the ‘dots’, at least in the process of coding it up. That is because we type up each network layer as a function that maps something to something else (input > hidden, hidden > output), and not as a container that stores an ‘abstraction’ or mapping. I realised this after reading something like this:
and wondering ‘what do i need to define as my input layer!? you said there’s an input layer, where is it? it’s a magical layer that doesn’t exist? or it’s supposed to do neuron = input? and what is the hidden layer doing? it’s supposed to make an abstraction of the raw input? I don’t know, it just looks like it’s sitting there accepting neurons from the input layer…’ (that monologue didn’t go anywhere)
Hence I posit that neural network maps be labeled this way in the future:
Then it would be quite clear that the hidden layer works by mixing all the raw inputs together and outputting an abstraction, and then the output layer uses the 3 (or more) abstractions of the data to make a slightly educated guess. Backpropagate and repeat.
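The ‘layers are the lines, not the dots’ idea can be written down quite literally: each layer is a function from one list of numbers to another. This is only a sketch under my own assumptions (sigmoid activation, no bias terms to keep it short, made-up weights, and the hypothetical helper name `make_layer`).

```python
import math

def make_layer(weight_rows):
    """Build a layer: a function that mixes its inputs once per row of weights."""
    def layer(values):
        return [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, values))))
                for row in weight_rows]
    return layer

# input > hidden: 3 raw inputs mixed into 3 abstractions
input_to_hidden = make_layer([[0.2, -0.4, 0.1],
                              [0.7, 0.3, -0.5],
                              [0.1, 0.1, 0.1]])

# hidden > output: 3 abstractions mixed into 2 output scores
hidden_to_output = make_layer([[0.5, -0.6, 0.2],
                               [-0.3, 0.8, 0.4]])

raw_input = [1.0, 0.5, 2.0]             # there is no 'input layer' object, just the data
abstraction = input_to_hidden(raw_input)
guess = hidden_to_output(abstraction)

print(["%.3f" % value for value in abstraction])
print(["%.3f" % value for value in guess])
```

Written this way, nothing has to be defined for the ‘input layer’ at all. The only things that exist are the two mappings and the numbers flowing through them.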
Fourth, some observations.
Learning rate. I realised that the neural network’s accuracy basically just ‘sat there’ for the first 20 or so epochs because the weights in the neurons were changing so slowly that they didn’t manage to tip the balance of probabilities (e.g. iteration 1 neuron: ‘oh, i got that wrong. lemme change it a bit’ -> iteration 2: ‘oh, it’s still wrong, lemme change it a bit again…’ and so on, for up to 20 epochs). Perhaps just plodding along at the same pace wasn’t the most efficient way of doing things. This warrants further investigation.
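Here is a toy sketch of that ‘lemme change it a bit’ loop. It is not my grasshopper definition and not the blog post’s network: it is a single sigmoid neuron with one weight that starts off badly wrong, and the starting weight, the learning rates and the name `updates_until_flip` are all made up for illustration.

```python
import math

def updates_until_flip(learning_rate, weight=-2.0, max_updates=10000):
    """Count the weight updates needed before the neuron's output passes 0.5."""
    x, target = 1.0, 1.0  # one made-up training example
    for updates_done in range(max_updates):
        output = 1.0 / (1.0 + math.exp(-weight * x))
        if output > 0.5:
            return updates_done  # the balance has finally tipped
        # Error scaled by the slope of the sigmoid, as in backpropagation
        delta = (target - output) * output * (1.0 - output)
        # The learning rate decides how big 'a bit' is
        weight += learning_rate * delta * x
    return None  # never tipped within max_updates

slow = updates_until_flip(0.1)
fast = updates_until_flip(1.0)
print("learning rate 0.1 needs %d updates, learning rate 1.0 needs %d" % (slow, fast))
```

In this toy case the bigger learning rate tips the balance far sooner. A real network is not this well behaved, and a rate that is too large can overshoot, so take it as an intuition rather than a recipe.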
More Hidden Layers seem a bit counterintuitive at first. Until one reads Geoff Hinton’s slide in this post. Then one realizes he might just need to feed More Data!
Initial weights matter quite a bit. Starting off on the wrong foot means there’s a lot more distance to cover to get to a good accuracy, and more distance means more computation time. Is this where Naive Bayes comes in? By having better initial weights, the network converges faster? With the caveat of exposing the network to bias? Will try this very soon.
So, my current takeaway from the past two months of reading and practising machine learning:
- Neural Networks are: a set of numbers (called weights) that get updated n times, with an error check that tells the function how best to change the numbers to get fewer errors the next time round.
- I guess an article called ‘weight updater with error checking’ sounds like something headed for obscurity, so Neural Networks via Backpropagation is used instead. I mean, I wouldn’t want to name a building I designed a ‘terraced house with front porch’ for the same reasons.
- Of course, network algorithms are the secret sauce that I have yet to learn, so I expect this article to be updated as I go along!
All above stuff is purely anecdotal. I hope more knowledgeable readers might point out in the comments below the bits that I intuited wrongly, or bits of facts that have been peddled falsely or might lead to horrible misunderstandings down the road. I will update my beliefs *wink wink Bayesian Inference wink wink* based on these comments so as to arrive at a more accurate frame of understanding.