This is a series of tutorials, in which we develop the mathematical and algorithmic underpinnings of deep neural networks from scratch and implement our own neural network library in Python, mimicking the TensorFlow API.
“I do not assume that you have any pre-knowledge about machine learning or neural networks. However, you should have some pre-knowledge of calculus, linear algebra, fundamental algorithms and probability theory on an undergraduate level.”
By the end of this text, you will have a deep understanding of the math behind neural networks and how deep learning libraries work under the hood.
“I have tried to keep the code as simple and concise as possible, favouring conceptual clarity over efficiency. Since our API mimics the TensorFlow API, you will know how to use TensorFlow once you have finished this text, and you will know how TensorFlow works under the hood conceptually (without all the overhead that comes with an omnipotent, maximally efficient machine learning API).”
The full source code of the API can be found at https://github.com/danielsabinasz/TensorSlow. You can also find a Jupyter Notebook there, which is equivalent to this blog post but allows you to fiddle with the code.
We shall start by defining the concept of a computational graph, since neural networks are a special form thereof. A computational graph is a directed graph where the nodes correspond to operations or variables. Variables can feed their value into operations, and operations can feed their output into other operations. This way, every node in the graph defines a function of the variables.
The values that are fed into the nodes and come out of the nodes are called tensors, which is just a fancy word for a multi-dimensional array. Hence, it subsumes scalars, vectors and matrices as well as tensors of a higher rank.
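To make the idea concrete, here is a minimal sketch of a computational graph in plain Python. The class names (`Variable`, `Operation`, `Add`, `Multiply`) and the `evaluate` method are illustrative assumptions, not the API of the library developed in this series, and the values are scalars rather than general tensors to keep the example short.

```python
class Variable:
    """A leaf node that simply holds a value."""

    def __init__(self, value):
        self.value = value

    def evaluate(self):
        return self.value


class Operation:
    """A node that computes its output from the outputs of its input nodes."""

    def __init__(self, *input_nodes):
        self.input_nodes = input_nodes

    def compute(self, *inputs):
        raise NotImplementedError

    def evaluate(self):
        # Recursively evaluate the inputs, then apply this node's operation.
        inputs = [node.evaluate() for node in self.input_nodes]
        return self.compute(*inputs)


class Add(Operation):
    def compute(self, a, b):
        return a + b


class Multiply(Operation):
    def compute(self, a, b):
        return a * b


# Build the graph for z = a * x + b (hypothetical example values).
a = Variable(2.0)
x = Variable(3.0)
b = Variable(1.0)
z = Add(Multiply(a, x), b)

result = z.evaluate()
print(result)  # 7.0
```

Each node here defines a function of the variables: changing `x.value` and calling `z.evaluate()` again yields a new output without rebuilding the graph.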
In this video, I illustrate the concept of a computation graph and explain how to create one using TensorFlow.
Perceptrons are a miniature form of neural network and a basic building block of more complex architectures.
Great, so now we are able to classify points using a linear classifier and compute the probability that the point belongs to a certain class, provided that we know the appropriate parameters for the weight matrix W and bias b. The natural question that arises is how to come up with appropriate values for these. In the red/blue example, we just looked at the training points and guessed a line that nicely separated the training points. But generally, we do not want to specify the separating line by hand. Rather, we just want to supply the training points to the computer and let it come up with a good separating line on its own. But how do we judge whether a separating line is good or bad?
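As a rough illustration of classifying with known parameters, the sketch below computes a class probability for a two-class problem. It assumes a single weight vector `w` and a sigmoid output rather than a full weight matrix, and the parameter values are hand-picked for the example, not taken from the red/blue data.

```python
import math


def sigmoid(a):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))


def predict_probability(w, b, point):
    """Probability that `point` lies on the positive side of the line w.x + b = 0."""
    activation = sum(w_i * x_i for w_i, x_i in zip(w, point)) + b
    return sigmoid(activation)


# Hypothetical, hand-guessed parameters: the separating line is x1 + x2 = 1.
w = (1.0, 1.0)
b = -1.0

p_far_above = predict_probability(w, b, (2.0, 2.0))    # well above the line
p_on_line = predict_probability(w, b, (0.5, 0.5))      # exactly on the line
p_far_below = predict_probability(w, b, (-1.0, -1.0))  # well below the line

print(p_far_above, p_on_line, p_far_below)
```

Points far from the line get probabilities close to 0 or 1, while a point exactly on the line gets 0.5.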
Gradient Descent and Backpropagation
Generally, if we want to find the minimum of a function, we set the derivative to zero and solve for the parameters. It turns out, however, that it is impossible to obtain a closed-form solution for W and b. Instead, we iteratively search for a minimum using a method called gradient descent.
As a visual analogy, imagine yourself standing on a mountain and trying to find the way down. At every step, you walk in the steepest downhill direction, since this direction is the most promising one to lead you towards the bottom.
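The mountain analogy can be sketched in a few lines of Python. The function being minimized, the learning rate, and the step count below are assumptions chosen for illustration; the loss of a real classifier would take the place of this simple bowl-shaped function.

```python
def gradient(x, y):
    """Partial derivatives of f(x, y) = (x - 3)^2 + (y + 1)^2."""
    return 2.0 * (x - 3.0), 2.0 * (y + 1.0)


def gradient_descent(start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, i.e. in the steepest downhill direction."""
    x, y = start
    for _ in range(steps):
        dx, dy = gradient(x, y)
        x -= learning_rate * dx
        y -= learning_rate * dy
    return x, y


minimum = gradient_descent(start=(0.0, 0.0))
print(minimum)  # close to the true minimum at (3, -1)
```

The gradient points in the direction of steepest ascent, so each update subtracts it. The learning rate controls the step size: too large and the iterates overshoot, too small and progress is slow.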
In this video, I explain the mathematics behind Linear Regression with Gradient Descent.
In our implementation of gradient descent, we have used a function compute_gradient(loss) that computes the gradient of a loss operation in our computational graph with respect to the output of every other node n (i.e. the direction of change for n along which the loss increases the most). We now need to figure out how to compute gradients.
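To illustrate what "the gradient of the loss with respect to the output of every node" means, here is a hand-written sketch for one tiny graph. It applies the chain rule backwards from the loss and checks one result against a finite-difference estimate. The squared-error loss, the function names, and the input values are assumptions made for this example; this is not the `compute_gradient` function itself.

```python
import math


def forward(w, x, b, t):
    """Forward pass through the graph: a = w*x + b, p = sigmoid(a), loss = (p - t)^2."""
    a = w * x + b
    p = 1.0 / (1.0 + math.exp(-a))
    loss = (p - t) ** 2
    return a, p, loss


def backward(w, x, b, t):
    """Gradient of the loss with respect to each node, from the loss back to the inputs."""
    a, p, _ = forward(w, x, b, t)
    d_loss = 1.0                       # d loss / d loss
    d_p = d_loss * 2.0 * (p - t)       # through the squared error
    d_a = d_p * p * (1.0 - p)          # through the sigmoid
    d_w = d_a * x                      # through the multiplication w * x
    d_b = d_a                          # through the addition of b
    return {"p": d_p, "a": d_a, "w": d_w, "b": d_b}


def numeric_gradient_w(w, x, b, t, eps=1e-6):
    """Central finite-difference estimate of d loss / d w, used as a sanity check."""
    loss_plus = forward(w + eps, x, b, t)[2]
    loss_minus = forward(w - eps, x, b, t)[2]
    return (loss_plus - loss_minus) / (2.0 * eps)


# Hypothetical example values.
grads = backward(w=0.5, x=2.0, b=-1.0, t=1.0)
check = numeric_gradient_w(w=0.5, x=2.0, b=-1.0, t=1.0)
print(grads["w"], check)
```

Each gradient is obtained by multiplying the gradient of the node that consumes it by a local derivative, which is why the computation proceeds from the loss backwards.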
Backpropagation as simple as possible, but no simpler. Perhaps the most misunderstood part of neural networks, backpropagation of errors is the key step that allows artificial neural networks (ANNs) to learn. In this video, I give the derivation and thought processes behind backpropagation, using high-school-level calculus.
In this video, I move beyond the Simple Perceptron and discuss what happens when you build multiple layers of interconnected perceptrons (“fully-connected network”) for machine learning.
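As a small illustration of stacking layers, the sketch below runs a forward pass through a two-layer fully-connected network built from sigmoid units. The weights are hand-picked assumptions (not learned, and not taken from the video) so that the network computes XOR, a function that a single perceptron cannot represent.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def dense(inputs, weights, biases):
    """One fully-connected layer: every neuron sees every input.

    weights[j] holds the incoming weights of neuron j.
    """
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


# Hand-picked parameters (hypothetical): the hidden neurons approximate OR and NAND,
# and the output neuron approximates AND of the two hidden activations.
HIDDEN_WEIGHTS = [[20.0, 20.0], [-20.0, -20.0]]
HIDDEN_BIASES = [-10.0, 30.0]
OUTPUT_WEIGHTS = [[20.0, 20.0]]
OUTPUT_BIASES = [-30.0]


def network(x1, x2):
    hidden = dense([x1, x2], HIDDEN_WEIGHTS, HIDDEN_BIASES)
    output = dense(hidden, OUTPUT_WEIGHTS, OUTPUT_BIASES)
    return output[0]


predictions = {
    (x1, x2): round(network(x1, x2))
    for x1 in (0.0, 1.0)
    for x2 in (0.0, 1.0)
}
print(predictions)
```

The output of the hidden layer becomes the input of the output layer, which is all that "multiple layers" means structurally. The hidden layer re-represents the inputs so that the final perceptron can separate them with a single line.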
It is now time to say goodbye to our own toy library and start to get professional by switching to the actual TensorFlow.
As we’ve learned already, TensorFlow conceptually works exactly the same as our implementation. So why not just stick to our own implementation? There are a couple of reasons:
- TensorFlow is the product of years of effort in providing efficient implementations for all the algorithms relevant to our purposes. Fortunately, there are experts at Google whose everyday job is to optimize these implementations. We do not need to know all of these details. We only have to know what the algorithms do conceptually (which we do now) and how to call them.
- TensorFlow allows us to train our neural networks on the GPU (graphics processing unit), resulting in an enormous speed-up through massive parallelization.
- Google is now building Tensor Processing Units (TPUs), which are integrated circuits specifically built to run and train TensorFlow graphs, resulting in an even greater speed-up.
- TensorFlow comes pre-equipped with a lot of neural network architectures that would be cumbersome to build on our own.
- TensorFlow comes with a high-level API called Keras that allows us to build neural network architectures much more easily than by defining the computational graph by hand, as we have done up until now. We will learn more about Keras in a later lesson.
So let’s get started. Installing TensorFlow is very easy.