Convolutions, I realised recently, loosely mean ‘do matrix multiplications over an entire image’, which hit me like a soft plastic hammer as I muttered a small ‘oh’ in the corner of a coffee shop nearby.
The cool thing seems to me to be the fact that the ‘image’ is actually an image as understood by my ageing laptop (and anyone else’s, really), as a bunch of numbers in a 2d array (or 3d array, if it is in colour). so convolutions are.. matrix multiplications in 2d. hmm.. I guess that means convolutions also happen in 1d and 3d and stuff…(what happens when you convolve an ASCIIgrid dataset of the world?) anyway here’s a cat:
google image search wizardry has decided that she is the world’s cutest kitten. (source)
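before doing anything to the kitten: here's a minimal sketch of what that 'matrix multiplication over an entire image' literally is. strictly it's an elementwise multiply-and-sum over every small patch of the image, slid across the whole thing. the 5x5 'image' and the blur kernel below are numbers I made up, not the actual kitty:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image, multiplying and summing at each stop.
    'valid' style: the output shrinks by (kernel size - 1) on each axis."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i:i + kh, j:j + kw]   # a kernel-sized chunk of image
            out[i, j] = np.sum(patch * kernel)  # multiply elementwise, then sum
    return out

# a tiny fake 'kitten': a 5x5 grid of numbers
image = np.arange(25, dtype=float).reshape(5, 5)

# a 3x3 averaging (blur) kernel: every output pixel is the mean of its patch
kernel = np.full((3, 3), 1.0 / 9.0)

blurred = convolve2d(image, kernel)
print(blurred.shape)  # (3, 3)
```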
so what sort of things do you do to the kitten? in fact, it seems a lot of the filters we use in Photoshop are convolutions. so it's a filter? yes, and it's also called a kernel. google also says a kernel is a small matrix that you plaster all over the kitty to make her look different (no, not better, just.. different).
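for instance, here's a toy sketch of plastering a classic edge-detection kernel over an image (the 5x6 'image' with a hard vertical edge is made up, not the kitten). it uses scipy's `correlate2d`, which does the multiply-and-sum without flipping the kernel; that unflipped version is what deep-learning folks usually mean when they say 'convolution':

```python
import numpy as np
from scipy.signal import correlate2d

# an 'image' with a sharp vertical edge: dark on the left, bright on the right
image = np.zeros((5, 6))
image[:, 3:] = 1.0

# a classic vertical-edge kernel: responds when right side > left side
edge_kernel = np.array([[-1, 0, 1],
                        [-1, 0, 1],
                        [-1, 0, 1]], dtype=float)

# 'valid' mode: only positions where the kernel fully fits on the image
response = correlate2d(image, edge_kernel, mode='valid')
print(response)  # the nonzero columns sit exactly where the edge is
```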
try this tutorial, which does the kitty transformation in tensorflow (rather than with hand-written loops; the image reads better as a numpy array than as a nested list). Or go change the numbers in this one. below is what I got from the tf tutorial:
some randomly generated kernels. yes they’re kinda creepy. sorry kitty.
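if you want to make your own creepy kernels without the tutorial, here's a sketch using only numpy; the seed, the 8x8 stand-in image, and the three random kernels are all my own made-up choices:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)   # fixed seed so the creepiness is reproducible
image = rng.random((8, 8))       # a stand-in for the kitten, as a 2d array

# three randomly generated 3x3 kernels, like the ones above
kernels = rng.normal(size=(3, 3, 3))

# every 3x3 patch of the image at once, shape (6, 6, 3, 3)
patches = sliding_window_view(image, (3, 3))

# multiply-and-sum each patch against each kernel: one 6x6 map per kernel
filtered = np.einsum('ijkl,nkl->nij', patches, kernels)
print(filtered.shape)  # (3, 6, 6)
```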
What are Convolutional Neural Networks then?
They’re kernels that learn to write themselves.
This one hit me like an iron hammer, and I was happy for quite a while after that.
What? Why do neural networks want to write their own blur filters? Oh no no, it's not a blur filter to the network, because of the way you train them. I ignominiously coin them Intelligence filters. haha. Filters in a Convolutional Neural Network learn the best way to 'see' something, so that the network can make a good guess at what it is.
In a Convolutional Neural Network, you stack filters in layers, one in front of another, each layer representing more and more of the image (actually you stack neurons arranged like filters), put all of that in front of your regression / classification network, and train the whole thing with backpropagation (very loosely speaking, of course).
so if you look at the above image, what is going on is:
input image > squeeze > squeeze > squeeze > squeeze > connect them all together > classify
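that squeezing pipeline can be sketched in plain numpy. the 28x28 input size and the untrained random 3x3 kernels below are my own assumptions, and a real CNN keeps many filter channels per layer (plus pooling and so on), but the shape bookkeeping is the same idea:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((28, 28))            # input image (think tiny greyscale kitten)

shapes = [x.shape]
for _ in range(4):                  # squeeze > squeeze > squeeze > squeeze
    k = rng.normal(size=(3, 3))     # an untrained, made-up 3x3 kernel
    h, w = x.shape
    out = np.zeros((h - 2, w - 2))  # 'valid' mode: each side shrinks by 2
    for i in range(h - 2):
        for j in range(w - 2):
            # multiply-and-sum the patch, then ReLU (clip negatives to 0)
            out[i, j] = max(np.sum(x[i:i + 3, j:j + 3] * k), 0.0)
    x = out
    shapes.append(x.shape)

flat = x.reshape(-1)                # 'connect them all together'
print(shapes, flat.shape)           # each squeeze shrinks the feature map
```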
The backpropagation goes all the way through, so the filters that do the squeezing (extracting the 'essence' of a small part of the image) learn where they went wrong and how to change to make themselves less wrong.
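here's a toy sketch of a single kernel writing itself via gradient descent. everything in it is made up (the 10x10 image, the 'answer' edge kernel, the learning rate), and a real CNN gets its error signal from a classification loss rather than a known target, but the 'find out where you're wrong, nudge yourself to be less wrong' loop is the same idea:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((10, 10))  # a stand-in training image

# the 'answer': a vertical-edge kernel the filter is supposed to discover
target_kernel = np.array([[-1.0, 0.0, 1.0]] * 3)

def correlate(img, k):
    """Multiply-and-sum a 3x3 kernel over every patch ('valid' mode)."""
    h, w = img.shape[0] - 2, img.shape[1] - 2
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(img[i:i + 3, j:j + 3] * k)
    return out

target = correlate(image, target_kernel)  # what the output *should* be

kernel = rng.normal(size=(3, 3))  # the filter starts as random noise
lr = 0.002
for step in range(3000):          # plain gradient descent on squared error
    err = correlate(image, kernel) - target   # how wrong each output pixel is
    grad = np.zeros((3, 3))
    for i in range(err.shape[0]):
        for j in range(err.shape[1]):
            grad += err[i, j] * image[i:i + 3, j:j + 3]
    kernel -= lr * grad           # nudge the kernel to be less wrong

# the kernel has rewritten itself to be (nearly) the edge detector
print(np.abs(kernel - target_kernel).max())
```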
As a result, after training it learns which bits are important (almost white, because white means 1 here, which means activated), and because the neurons are laid out in 2d and it's learning about images, we humans get to see which features of the image the neurons are interested in.
I haven’t actually trained my own CNN yet, because I like knowing things in as much detail as I can before doing them, but when I do I’ll update this post!