Deep Learning in JS – Ashi Krishnan – JSConf EU 2018


Hi, everyone. I’m Ashi, and I will be your
guide today as we explore the world of deep learning and JavaScript.
I’m not a machine-learning expert, sorry about that, but my mom is. She’s an audiologist
by training, and she did work in digital filters for hearing aids, and then later worked on
the acoustic model of a speech recogniser. I remember working in her lab one summer and
people were saying these strange and intimidating words, and there were all of these odd things
up on the wall, and I was overwhelmed. At the time I didn’t understand basically any
of it. So this has also been a process of getting closer to her, where I can go to her and say, “I know what cross entropy is”, and she’s like, “That’s great, can you explain it to me?!” I think I can explain it to you, well enough for you
to use it. I’ve been learning deep learning out of curiosity with some excitement for
the future, and also with the sense of existential terror that the robots are coming and they’re
going to consume all of our jobs, and possibly our societies, and maybe ourselves. There
might be this matrix-pod situation that’s going to happen. I have some exciting news
which is that the robots are very, very impressive, and they’re also kind of stupid. Like stupid
in really fundamental ways. And so they’re probably not going to take your job – at least
not within the next year or two – but they’re going to change it, and change it quite dramatically.
And so now, it’s a really exciting time to be getting into this field. There’s a huge
amount of research and a lot of new tooling available to us. Let’s dive in. Before we
dive in, I want to give a single definition. I gave this talk, and a student was like,
“So you said tensor like 100 times, and that’s a scary word. I feel tense right now.” A tensor,
if you look it up on Wikipedia, is a numeric field that is closed over some free operation. I would say that a tensor is a block of numbers. We can have a block of numbers that is actually just a single number: that is a rank 0 tensor, or a scalar. We can have a line of numbers, a vector, which is a rank 1 tensor. Matrices, sheets of squares or rectangles of numbers, are rank 2 tensors. We can have prisms of numbers, which are rank 3 tensors, and on and on into shapes that become progressively harder to draw, so I’m not going to draw them.
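To make that concrete, here is a minimal sketch using the TensorFlow.js API (assuming it has been imported as tf, for example via import * as tf from '@tensorflow/tfjs'):

const scalar = tf.scalar(3);                          // rank 0: a single number
const vector = tf.tensor1d([1, 2, 3]);                // rank 1: a line of numbers
const matrix = tf.tensor2d([[1, 2], [3, 4]]);         // rank 2: a sheet of numbers
const prism  = tf.tensor3d([[[1], [2]], [[3], [4]]]); // rank 3: a block of numbers
console.log(scalar.rank, vector.rank, matrix.rank, prism.rank); // 0 1 2 3
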
I’ve just defined tensors for you because I’m about to talk about TensorFlow, which is the state-of-the-art machine-learning framework, and now available in JavaScript. So, let’s break down what is available. Using the C++ API or the Python API, you get a large suite of math operations on the CPU; on the GPU, where you can do more operations at a slightly lower precision in parallel; and then on the TPU, which is like a GPU but with even more, even crappier compute units. It is special-purpose hardware that Google
made that is optimised for doing machine-learning particularly. It turns out that machine-learning
is a large stack of really simple operations, and so being able to parallelise over simple
operations is ideal. The JavaScript bindings currently give us CPU computation under Node,
and then the web bindings use WebGL to perform math. Soon, the TensorFlow team promises, the Node bindings will use the C++ backend, which means we should have performance parity with
the Python libraries. Currently, the web bindings that use the GPU are half the performance
of the C++ library which is unfortunate, but you can do it in a browser, so that’s pretty
cool. The other important part about doing machine-learning research and developing these
models is the ecosystem around the core processing libraries that we are using, and the ecosystem
in Python is enormous, and the ecosystem in JavaScript is sad. And, that’s okay. If any
of the Propel folks or anyone doing scientific computation in JavaScript is here, I want
to say your work is wonderful, and I’m really looking forward to it, and the size of the
community is currently small, but, if the history of JavaScript frameworks is any indication,
we will quickly build up a large and interesting, and powerful ecosystem of software. It’s just
currently the case that, if you want to build your own extremely large deep-learning models
and train them on the kinds of data sets that you might need to train on multiple computers
in order to access, then you’re probably going to be doing that in Python in the cloud, but
you can take those models, and this is the exciting thing about tensorflow.js – you can
take them and run them in the browser. It means you can leverage the power of machine-learning
in the browser without sending all of your users’ data off to some provider in the sky,
and you can also continue to train those models locally. We can do something called “transfer
learning” where we cut off the last bit of the model, and we adapt it while not having
to retrain all of the model’s deep layers, in order to give users the advantages of machine learning without the privacy implications or the entanglement of surveillance. I just said “model”, like, 500 times. What exactly are models? Let’s say
we’ve got this phenomenon happening in the world, and this is a snake, or it’s a drawing
of a snake. We want to model it. We want to understand it in some way. We want a simplified
version of it. That’s what a model is: it is a simplified version of the world turned
into math. So, in this case, we are going to turn our snake into a squiggle. With machine-learning,
we go through a training process where we want to find the set of model parameters
that lets us fit the world as best we can. We can imagine trying different sets of parameters,
like different squiggles, kind of at random until we find one that works on this snake.
It is not ideal. We could sit here all day. We don’t have a great metric for how well
we are doing, and we don’t have a sense that we are making forward progress. So what we
would really like is to find a way to pick some set of parameters, some squiggle, and iteratively improve it, doing what we do naturally, improving on our own knowledge of the situation, until we find a good fit. We can do that through a process
called stochastic gradient descent. If you’re a machine-learning expert in the audience,
there are a variety of gradient descent techniques; let’s look at the simplest now. Let’s say I have a splatter of paint and I want to model it. If I want to model a splatter of paint, I would almost certainly not do it as a line, but I will do it as a line, because with only two parameters it is easy to visualise all the various things we need to visualise. So I’m going to model the splatter of paint as a line, and we’re going to be
happy about it. First, I’m going to throw a co-ordinate system under it, and I’ve turned these into x, y points. I’m going to dig back into my suppressed memories of high-school algebra to remember that the equation for a line is y = mx + b, where m is the slope of the line and b is the y-intercept. If I pick random values for those two parameters, I’m going to get a line; any two random values will get me a line.
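As a sketch (the names here are mine, not the talk’s), the whole model is just those two numbers and the line they describe:

// A hypothetical line model: two parameters, m and b.
const line = (m, b) => x => m * x + b;

// Any two random values for the parameters get us a line.
let m = Math.random();
let b = Math.random();
line(m, b)(2); // the model's (probably terrible) guess for y at x = 2
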
This line is not a very good line. So, this point is way off, and these two points are pretty off, and, if we go through and figure out that off-ness for the entire set of examples, then what we are looking at is a quantity called “loss”. Loss, like that sensation you feel at the end of a long relationship, is a measure of how badly we did, how poorly our model fit the data. It is machine-learning self-flagellation. A common kind of loss that we use, particularly for regression, which is what we are doing right now, is called mean squared error: we take the average of the squared difference between the model’s prediction and the ground truth. If we were to write it in JavaScript, it would look something like this: we reduce over the data, find the difference between each data point as our model predicted it and the actual value of that data point, square it, and divide by the length. That gives us a function that we can pass model parameters into.
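Roughly, as a sketch under my own naming, where data is an array of {x, y} points and line is the little model from above:

// Mean squared error: the average of the squared difference between
// what the line predicts and the ground truth, over every example.
const loss = (m, b, data) =>
  data.reduce((sum, point) => {
    const error = line(m, b)(point.x) - point.y; // how far off we were
    return sum + error * error;                  // square it
  }, 0) / data.length;                           // average over the examples
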
Any two model parameters are going to yield a particular loss with respect to this data. And because we have two of them, I can visualise it on a plane, and say this axis is going to be the slope of our line – the slopiness – and that one how high up the y-axis it sits. For some given set of model parameters – in fact, for every given set of model parameters – there will be some loss. So what we can do now is figure out what that loss is and poke around: what if my line was slopier? What if it was less slopy? What about higher up or lower down? In one of those directions, we will be reducing loss, and so we’re going to take a step in that direction along both axes. We will do it again. More slopy, less slopy, higher or lower. Again and again. Each step, we’re using loss to point us in the direction of movement. Loss is showing us where to go, and it’s revealing for us a landscape of loss. What we are really doing is finding the slope of this landscape at each point; the general mathematical term for the slope of a landscape is its gradient, so the process that we are doing is gradient descent. We are rolling down this landscape like raindrops, into the valleys that are closest to the ground truth.
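That poking around might look something like this, a sketch using finite differences, with a nudge size and learning rate I have made up, reusing the loss function and the {x, y} data from above:

const nudge = 1e-3;        // how far we poke each parameter to estimate the slope
const learningRate = 1e-2; // how big a step we take downhill

// One step of gradient descent: slopier or less slopy? higher or lower?
// Move both parameters in the direction that reduces loss.
function descend(m, b, data) {
  const here = loss(m, b, data);
  const slopeM = (loss(m + nudge, b, data) - here) / nudge;
  const slopeB = (loss(m, b + nudge, data) - here) / nudge;
  return [m - learningRate * slopeM, b - learningRate * slopeB];
}

// Again and again, rolling down the loss landscape like a raindrop.
for (let i = 0; i < 1000; i++) [m, b] = descend(m, b, data);
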
So there are a lot of ways we might tweak this process. One is to notice that, if we are computing loss against all of the examples, all of the splatters of paint, then it’s going to take a while. It’s not going to take that long for a line and x, y points, but, if we have much larger models, then it can get quite expensive to compute loss, so we might just grab a handful of examples, randomly. Stochastically, you might say, if you’re the sort of person who says “stochastically” rather than “randomly”, and that gives us stochastic gradient descent. Other parameters we might choose are, for example, the size of the step we take; that’s called the learning rate. These quantities – the size of the batch, like the number of examples we look at, or the learning rate – are not learned; we don’t train them, and so they’re not called model parameters but rather hyperparameters, which is a very exciting word, I think. The model doesn’t learn them during training; we set them manually when we train the model, typically by running hundreds of experiments and staring at graphs until our eyeballs bleed. Okay. So that is a line. It’s a very simple, very simple function,
probably not very useful, right? There are other functions that we might use for deep
learning. For example, we might use one of a set of sigmoidal functions. These simulate
the neuron. Down here, the neuron is firing; here, it is not. And they’re smooth because they – because that way, they’re differentiable at every point. [Sound distortion]. It’s a hard function and a complicated function: it’s the max of x and zero. That’s it. That function is pretty easy to think about. We could imagine writing it in less than one line of JavaScript.
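Here it is, in rather less than one line:

const relu = x => Math.max(0, x); // the rectifier: just the max of x and zero
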
It turns out that that simplicity, and that ease of computability, makes it perfect for deep learning, where, again, we’re not doing very interesting or complicated operations, but we sure are doing a lot of them. We can imagine stacking up these rectifiers: here we are going to have two layers of four neurons each, densely interconnected. Because they’re densely interconnected, we’re going to say that each of the neurons in the second layer gets fed by all of the neurons in the first layer. So this one, for example: its input is going to be the weighted sum of inputs from all of the neurons in the previous layer, which, if you think about it, because of the shape of this function, means that what we are really doing is nesting if statements. We’re nesting if statements with conditionals whose values depend on the output of previous if statements, and whose thresholds are basically entirely hard-coded.
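As a sketch in plain JavaScript (my own rendering, not framework code), a densely interconnected layer of rectifiers is nothing but weighted sums fed through relu:

// weights[j][i] is the connection from input neuron i to output neuron j;
// each output is the rectified weighted sum of all inputs, plus a bias.
const dense = (inputs, weights, biases) =>
  weights.map((row, j) =>
    relu(row.reduce((sum, w, i) => sum + w * inputs[i], biases[j])));

Those weight values are exactly the hard-coded thresholds that training eventually figures out.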
Next time you see researchers at Google have created a deep neural network that does some impressive thing, just think: researchers at Google have figured out how to hard-code 50 million random values in order to do some impressive thing, which is basically what is going on. The impressive part, obviously, is that the training process figures out those hard-coded values for us, but, at the end, the thing the model is doing is basically a giant pile of spaghetti code, which, fortunately, models our brains pretty well. Even for a
model like this, a relatively small one, if we think about the number of interconnections
between these two layers of neurons, we see that we’ve got 16 of them. For a line, we
had two parameters, and we were able to think about its loss landscape. This model has 16
parameters, and I don’t know about you, but I have a really hard time visualising 17-dimensional
surfaces. It gets worse. What we are seeing revealed here is a visualisation of the loss
landscape for ResNet, which is an image classifier. ResNet has about 60 million parameters, which means this is a heavy approximation. These folks have done some interesting projection
in order to get it even to resemble something three-dimensional. It has been said of the
terrifying things that live at the base of the sea and will one day wake to consume the
world that they have length, width, depth, and several other things, and perhaps this
is what Lovecraft was talking about. The good news is that you don’t have to train those
models. You don’t even have to think about them or hold their loss landscapes in your
head, because you can npm install them! And, of course, if you want to train those models, I highly encourage it. We’re going to look at an example of transfer
learning where we take a pretrained model and then train it to do something else. It
lets us leverage all of the training time on the larger, in this case, image-recognition
model, and then use it for a different problem space. So we’re going to do transfer learning,
and what we’re going to do, this is an example, you can pull it up on GitHub. I’m going to
play Pac-Man using my elephant friend Tallula. The way this works is I pick a bunch of examples
using my webcam that represent the images for up, down, left, and right. I’m rotating
to the left. I’m trying to be in the frame, trying not to be in the frame, get a representative
sense, or give the network a representative sense of where and how I’m going to be holding
her, which, as you can see, I do not. We’re going to train it; the loss gets pretty low. Then,
when I play, the network is going to highlight in yellow which direction it thinks I’m moving
in, and we can see that it works pretty well, at least until I start getting stressed about
Pac-Man and not holding Tallula in the same way I was during training. If you want to
ruin a friendship, using your friend as a controller is a pretty good way to do it!
Now I’m eaten. I’ve been eaten. And I’m happy to report that we are still friends! Thing zero we do is npm install everything, including TensorFlow. Thing one we do is import TensorFlow. And then we are going to load up the model, a model that you can also npm install; this particular one is served off the web somewhere. And because we are doing transfer learning, we’re going to do a little bit of surgery on the model: we’re going to pull out this layer, conv_pw_13_relu, whatever that means, and then we are going to construct a new model that has the same inputs as MobileNet but outputs that low-but-not-final layer. The actual final layer of MobileNet is going to be, like, 200 probabilities, namely: the probability that this photo contains a cat, the probability that this photo contains a cow, the probability that this photo contains a laptop, and on and on throughout whatever classes of images MobileNet has been trained to recognise. We want something before that, something where the image has been kind of reshaped into some arbitrary chunk of interesting data but has not yet been winnowed down to what it contains.
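In code, the surgery looks something like this, a sketch in the spirit of the demo: MOBILENET_MODEL_URL stands in for wherever the model’s JSON is served, the layer name is quoted from memory, and in older versions of TensorFlow.js loadLayersModel was called loadModel:

// Inside an async function: load the pretrained MobileNet served off the web.
const mobilenet = await tf.loadLayersModel(MOBILENET_MODEL_URL);

// Pull out that low-but-not-final layer...
const layer = mobilenet.getLayer('conv_pw_13_relu');

// ...and build a new model with the same inputs as MobileNet,
// but that intermediate activation as its output.
const truncatedMobileNet = tf.model({
  inputs: mobilenet.inputs,
  outputs: layer.output,
});
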
We’ll look at that more in a second. When I’m adding examples, this is what is happening: it’s adding them to the controller dataset, which is building up a dataset of examples. Then we construct our model. Our model is going to take the output of that MobileNet layer, it’s going to flatten it, and it’s going to run it through a configurable number – let’s call it 100 – of densely interconnected relu neurons, and then, at the end, we will have a softmax layer. That is a different activation function, which is useful for when you want a probability distribution, and, in this case, we do want a probability distribution. The number of classes is going to be four, because we have up, down, left, and right, and that’s all we are trying to decide between. The output of our network is going to be the relative probabilities that I am holding her up, left, right, or so on.
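The model itself is a small sequential stack, something like this (a sketch in the spirit of the demo code, not a verbatim copy; the 100 is the configurable unit count):

const NUM_CLASSES = 4; // up, down, left, right

// A small classifier head on top of the truncated MobileNet's output.
const model = tf.sequential({
  layers: [
    // Flatten the intermediate activation into a vector.
    tf.layers.flatten({
      inputShape: truncatedMobileNet.outputs[0].shape.slice(1),
    }),
    // A configurable number of densely interconnected relu neurons.
    tf.layers.dense({ units: 100, activation: 'relu' }),
    // Softmax gives us a probability distribution over the four directions.
    tf.layers.dense({ units: NUM_CLASSES, activation: 'softmax' }),
  ],
});
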
So then we configure an optimiser. We’re not using plain stochastic gradient descent; we are using Adam, which is a stochastic gradient technique that is better: it is a little bit smarter about how it decides the steps it takes. We are going to compile this model with a loss function, and that loss function is categorical cross-entropy. The reason being: if we have this example, an example of me holding Tallula upside-down to indicate down, and the network predicts this – which is technically predicting that I’m holding her right – how bad is that, really? Because it’s pretty close, even though the prediction is wrong. If these were flipped, it would still be kind of wrong that it thought there was a ten per cent chance that I was holding her correctly, you know? Answering that question is what categorical cross-entropy does: it measures how much this model confused the different probability classes. And now you know.
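As a sketch of that idea (the formula itself, not the talk’s code), categorical cross-entropy only looks at how much probability the model put on the true class:

// labels is one-hot, e.g. [0, 1, 0, 0] for "down";
// predictions is the softmax output over up, down, left, right.
const crossEntropy = (labels, predictions) =>
  -labels.reduce((sum, label, i) => sum + label * Math.log(predictions[i]), 0);

crossEntropy([0, 1, 0, 0], [0.1, 0.6, 0.2, 0.1]); // ~0.51: fairly sure, and right
crossEntropy([0, 1, 0, 0], [0.6, 0.1, 0.2, 0.1]); // ~2.30: confidently confused
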
And then finally, we call fit. This actually goes and starts dispatching work off to the GPU, and we get these callbacks every time a batch is finished, every time we have computed our loss for a batch, updated the weights, and taken a step.
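The compile-and-fit part, roughly, again as a sketch: xs and ys stand for the collected embeddings and one-hot labels from the controller dataset, and the learning rate, batch size, and epoch count are made-up hyperparameters:

// Adam as the optimiser, categorical cross-entropy as the loss.
model.compile({
  optimizer: tf.train.adam(0.0001),
  loss: 'categoricalCrossentropy',
});

// fit() starts dispatching work to the GPU; onBatchEnd fires after each
// batch: loss computed, weights updated, one step taken.
await model.fit(xs, ys, {
  batchSize: 32,
  epochs: 20,
  callbacks: {
    onBatchEnd: async (batch, logs) => console.log('loss:', logs.loss.toFixed(5)),
  },
});
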
All right. To play the game, we ask MobileNet to do its prediction, we run our model to give us one of four probability classes, and then we figure out which one is the most likely, and we do it. And that’s Pac-Man. That is transfer learning with TensorFlow.js.
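The play loop is just prediction plus an argmax, as a sketch where image stands for whatever tensor we grab from the webcam each frame:

// Run the frame through truncated MobileNet, then through our little
// classifier, and take the most probable of the four directions.
async function predictDirection(image) {
  const embedding = truncatedMobileNet.predict(image);
  const probabilities = model.predict(embedding);
  const classId = (await probabilities.as1D().argMax().data())[0];
  return ['up', 'down', 'left', 'right'][classId];
}
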
I would like us to go back and understand what it is we are getting out of MobileNet, and, to do that, I’m going to load up MobileNet, load up that JSON file we loaded earlier, in the browser. Here, we will see that this is a Keras model; Keras lets us describe a deep-learning system as a bunch of layers, and here are its layers. Come on. Click. So
a deep-learning network that recognises images typically looks something like this. We’ve
got convolutional layers, and normalisation layers, and activation layers. The activation
layers, we know what they look like – those are the relu layers we saw earlier. Normalisation layers make sure our values are between zero and one-ish, and they do it across single batches, which is why they’re called batch normalisation layers. Convolutional layers have these configuration parameters,
and, like many things in machine-learning, they sound hard but they’re not very hard.
Convolutions are basically Photoshop filters. If we have a whole bunch of input pixels,
a convolutional layer will grab some chunk of those pixels, run it through a filter,
and output it. It will walk the filter over the entire image producing an output image.
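As a sketch in plain JavaScript (not the tfjs op), a three-by-three convolution over a greyscale image is just that walk:

// Walk a 3x3 filter over a greyscale image (a 2D array of numbers).
// Each output pixel is the weighted sum of the 3x3 patch under the filter,
// and the same filter weights are used across the entire image.
function convolve3x3(image, filter) {
  const out = [];
  for (let y = 0; y + 3 <= image.length; y++) {
    const row = [];
    for (let x = 0; x + 3 <= image[0].length; x++) {
      let sum = 0;
      for (let fy = 0; fy < 3; fy++)
        for (let fx = 0; fx < 3; fx++)
          sum += image[y + fy][x + fx] * filter[fy][fx];
      row.push(sum);
    }
    out.push(row);
  }
  return out; // slightly smaller than the input: we never slide off the edge
}
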
You will notice if we do this without allowing the filter to slide off the edge, then we
will get something slightly smaller. We can decide what it is we want. That’s one of the
many tuneable parameters. Convolutions come in all kinds of shapes and sizes. This one
is three by three. The key thing here is this filter which is the same across the entire
image, and it is trainable, which means that actually, let’s just see what it means. So
I’ve gone and gutted MobileNet and put it here. This is what is happening in each of those
many, many convolutional layers that we saw before. Yes, it looks like a bunch of crappy
Photoshop filters, because it is a bunch of crappy Photoshop filters. The interesting
thing here is that it has started to do edge detection and other processing that mimics the kind of visual processing that happens in our visual cortex. This happens naturally when you
train a system that is able to create these isolated filters on an image-classification task over the kinds of image classes that we ourselves recognise: it starts to do the same kind of processing that we do, which is an interesting validation of
the model, I think. So I hope this makes deep learning a little bit less scary, the realisation
that it is just a big pile of operations, a bunch of spaghetti code that has been tweaked
by a very simple but big process. Our world is flush with information, and our interaction with it is increasingly mediated by machine-learning systems. The systems are
not perfect. They’ve been trained on whatever data we’ve given them, and, like us, they
internalise the biases of that data, and, just like us, they can be pressed into the
service of whoever wants to wield them. There is this proof that neural networks are universal
approximators, which means any function you give them, they can approximate to some level
of precision. If you believe our own cognition is a computable function, then we’re moving
into the world where the fundamental tasks of cognition are now a thing that we can train
a machine to do. So these are not real faces. These were dreamt up by a deep-learning network
whose loss function is another network. The two networks improve each other, learning
to dream up things that appear to be people. And this is not Barack Obama. This is machine-learning
Obama, synced to a recording of an actual Obama speech. There exist systems that can
generate speech that sounds like anyone as well. So, how do we cope with this? With the
world where we can’t trust our own eyes and ears? One way is to ignore it, and to say
that these technologies are not that good – yet. But, if cognition is a computable function,
then our societies and ourselves are games, and robots, it turns out, are very good at playing games. In the history of computation, we see these tides. So, first, all the important work was done on big mainframes. And then processors improved, and work moved to personal
computers. Then networks improved, and we put everything in the cloud. And now we are
seeing the tide go out again as we begin to realise what we’ve given away, how much power
there is in knowing everything about everyone, and how much danger there is in us relying
on black boxes outside of our control to feed us with knowledge. So my message to you is
that these technologies don’t have to be opaque, and they don’t have to be centralised, and
we can hold the power of robot minds in our pockets. We can use them to create: not just to create forgeries, but to discern truth. So this is just the beginning. Everything
we’ve seen here today, I think it’s quite impressive, and I think it is going to look
downright embarrassing in a few years, when you can talk with your robot assistant,
and the pattern of your voice will never leave your wrist. Thank you. [Cheering and Applause].
Here are some folks to follow if you’re interested in this.