# Deep Learning in JS – Ashi Krishnan – JSConf EU 2018

Hi, everyone. I’m Ashi, and I will be your

guide today as we explore the deep world of deep learning and JavaScript.

I’m not a machine-learning expert, sorry about that, but my mom is. She’s an audiologist

by training, and she did work in digital filters for hearing aids, and then later worked on

the acoustic model of a speech recogniser. I remember working in her lab one summer and

people were saying these strange and intimidating words, and there were all of these odd things

up on the wall, and I was overwhelmed. At the time I didn’t understand basically any

of it. So this process has also been impressive – a process of getting closer to her where

I can go to her and say, “I know what cross entropy is”, and she’s like, “That’s great,

can you explain it to me?!” I think I can explain it to you, and well enough for you

to use it. I’ve been learning deep learning out of curiosity with some excitement for

the future, and also with the sense of existential terror that the robots are coming and they’re

going to consume all of our jobs, and possibly our societies, and maybe ourselves. There

might be this matrix-pod situation that’s going to happen. I have some exciting news

which is that the robots are very, very impressive, and they’re also kind of stupid. Like stupid

in really fundamental ways. And so they’re probably not going to take your job – at least

not within the next year or two – but they’re going to change it, and change it quite dramatically.

And so now, it’s a really exciting time to be getting into this field. There’s a huge

amount of research and a lot of new tooling available to us. Let’s dive in. Before we

dive in, I want to give a single definition. I gave this talk, and a student was like,

“So you said tensor like 100 times, and that’s a scary word. I feel tense right now.” A tensor,

if you look it up on Wikipedia, is a numeric field that is closed over some free operation.

I would say that a tensor is a block of numbers. We can have a block of numbers that is actually

just a single number, that is a rank-0 tensor, or a scalar. We can have a line of

numbers, a vector, a rank-1 tensor. Matrices, squares or rectangles of numbers,

are rank-2 tensors. Prisms of numbers are rank-3 tensors, and on and on

that become progressively harder to draw, so I’m not going to draw them. I’ve just defined

tensors for you because I’m about to talk about TensorFlow, which is the state-of-the-art

machine-learning framework, and now available in JavaScript. So, let’s break down what is

available there. Using the C++ API and the Python API, we get a large class of math

operations: on the CPU; on the GPU, to be able to do more operations at a

slightly lower precision in parallel; and then on the TPU, which is like a GPU but with

even more, even crappier compute units. It is special-purpose hardware that Google

made that is optimised for doing machine-learning particularly. It turns out that machine-learning

is a large stack of really simple operations, and so being able to parallelise over simple

operations is ideal. The JavaScript bindings currently give us CPU computation under Node,

and then the web bindings use WebGL to perform math. Soon, the TensorFlow team

promises, the Node bindings will use the C++ backend, which means we should have performance parity with

the Python libraries. Currently, the web bindings that use the GPU are half the performance

of the C++ library which is unfortunate, but you can do it in a browser, so that’s pretty

cool. The other important part about doing machine-learning research and developing these

models is the ecosystem around the core processing libraries that we are using, and the ecosystem

in Python is enormous, and the ecosystem in JavaScript is sad. And, that’s okay. If any

of the Propel folks or anyone doing scientific computation in JavaScript is here, I want

to say your work is wonderful, and I’m really looking forward to it, and the size of the

community is currently small, but, if the history of JavaScript frameworks is any indication,

we will quickly build up a large and interesting, and powerful ecosystem of software. It’s just

currently the case that, if you want to build your own extremely large deep-learning models

and train them on the kinds of data sets that you might need to train on multiple computers

in order to access, then you’re probably going to be doing that in Python in the cloud, but

you can take those models, and this is the exciting thing about tensorflow.js – you can

take them and run them in the browser. It means you can leverage the power of machine-learning

in the browser without sending all of your user’s data off to some provider in the sky,

and you can also continue to train those models locally. We can do something called “transfer

learning” where we cut off the last bit of the model, and we adapt it while not having

to retrain all of the model’s deep layers in order to give users machine learning, the

advantages of machine-learning without the privacy implications or the entanglement of

surveillance. I just said “model”, like, 500 times. What exactly are models? Let’s say

we’ve got this phenomenon happening in the world, and this is a snake, or it’s a drawing

of a snake. We want to model it. We want to understand it in some way. We want a simplified

version of it. That’s what a model is: it is a simplified version of the world turned

into math. So, in this case, we are going to turn our snake into a squiggle. With machine-learning,

we go through the training process, where we want to find the set of model parameters

that lets us fit the world as best we can. We can imagine trying different sets of parameters,

like different squiggles, kind of at random until we find one that works on this snake.

It is not ideal. We could sit here all day. We don’t have a great metric for how well

we are doing, and we don’t have a sense that we are making forward progress. So what we

would really like is to find a way to pick some set of parameters, some squiggle, and

iteratively improve it, and do what we do naturally while improving on our own

knowledge of the situation until we find a good fit. We can do that through a process

called stochastic gradient descent. If you’re a machine-learning expert in the audience,

there are a variety of gradient descent techniques. Let’s look at the simplest now. Let’s say

I have a splatter of paint and I want to model it. If I want to model a splatter of paint,

I would almost certainly not do it as a line, but I will do it as a line because there are

two parameters, which makes it easy to visualise all the various things we need to visualise

for them. So I’m going to model the splatter of paint as a line, and we’re going to be

happy about it. First, I’m going to throw a co-ordinate system under it, and I’ve turned

these into X, Y points. I’m going to dig back into my suppressed memories of high school

algebra to remember that the equation for a line is y = mx + b. Here, m is the slope of

the line, and b is the y-intercept. If I pick random values for those two parameters, I’m going to get

a line. Indeed, any two random values will get me a line. This line is not a very good

line. So, at this point, it is way off, and these two points are pretty off, and, if we

go through and figure out that off-ness for the entire set of examples, then what we are

looking at is a quantity called “loss”. Loss, like that sensation you feel at the end of

a long relationship, is a measure of how badly we did, how poorly our model fit the data.

It is machine-learning self-flagellation. A common kind of loss that we use, particularly

for regression, which is what we are doing right now, is called mean squared error: it means

we take the average of the squared difference between the model and the ground truth. If

we were to write it in JavaScript, it would look something like this. We can reduce over

data, find the difference between that data point as our model predicted it, and the actual

value of that data point, square it, divide it by length, and then that gives us this

function, an inline function. We can pass in model parameters

here. And any two model parameters are going to yield a particular loss with respect to

this data. It means, because we have two of them, I can visualise it on a plane, and say

this is going to be the slope of our line – the slopiness – and how high up the y-axis

it is, and, for some given set of model parameters, in fact, for every given set of model parameters,

there will be some loss. So what we can do now is figure out what that loss is and poke

around with it. What if my line were slopier? What if it were less slopy? What

about higher up, or lower down? In one of those directions, we will be reducing loss, and

so we’re going to take a step in that direction along both axes. We will do it again. More

slopy, less slopy, higher or lower. Again and again. Each step, we’re using loss to

point us in the direction of movement. Loss is showing us where to go. And it’s revealing

for us a landscape of loss. What we are kind of doing is finding the

slope of this landscape at each point. The general mathematical term for the slope of the

landscape is its gradient, so the process that we are doing is gradient descent. We

are rolling down this landscape like rain drops into the valleys that are closest to

the ground truth. So there are a lot of ways we might tweak this process. One is to notice

that, if we are computing loss against all of the examples, all of the splatters of paint,

then it’s going to take a while. It’s not going to take that long for a line and X,

Y points, but, if we have much larger models, then it can be quite expensive to compute loss,

so we might just grab a handful of examples, randomly. Stochastically, you might say, if

you’re the kind of person who says “stochastically” rather than “randomly”, and that gives us stochastic gradient descent.
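The whole loop the talk has described – mean squared error as a reduce over the data, then stochastic steps downhill – can be sketched in a few lines of JavaScript. This is an illustration, not code from the talk: the data, learning rate, and batch size are invented, and the gradient is probed numerically (the "is it better slopier or less slopy?" question), rather than computed analytically.

```javascript
// Fit y = m*x + b to points by stochastic gradient descent.
// Mean squared error over a batch of {x, y} examples.
const loss = (m, b, batch) =>
  batch.reduce((sum, {x, y}) => sum + (m * x + b - y) ** 2, 0) / batch.length;

// Hyperparameters: set by us, not learned.
const learningRate = 0.01;
const batchSize = 4;
const eps = 1e-4; // probe distance for numerical gradients

// An invented splatter of paint: a noisy line around y = 2x + 1.
const data = Array.from({length: 100}, (_, i) => {
  const x = i / 10;
  return {x, y: 2 * x + 1 + (Math.random() - 0.5)};
});

let m = Math.random(), b = Math.random(); // a random initial squiggle

for (let step = 0; step < 5000; step++) {
  // Grab a random handful of examples: the "stochastic" part.
  const batch = Array.from({length: batchSize},
    () => data[Math.floor(Math.random() * data.length)]);
  // Is loss lower with the line slopier or less slopy? Higher or lower?
  const dm = (loss(m + eps, b, batch) - loss(m - eps, b, batch)) / (2 * eps);
  const db = (loss(m, b + eps, batch) - loss(m, b - eps, batch)) / (2 * eps);
  // Take a step downhill along both axes.
  m -= learningRate * dm;
  b -= learningRate * db;
}

console.log(m, b); // lands near m ≈ 2, b ≈ 1
```

Real frameworks compute the gradient analytically via backpropagation rather than by probing, but the descent loop itself is exactly this shape.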

Other parameters we might choose are for example the size of the step we take. That’s called

the learning rate. These quantities, the size of the batch, like the number of examples

we look at, or the learning rate, they’re not learned, we don’t train them, and so they’re

not called model parameters, but rather hyperparameters, which is a very exciting word,

I think. The model doesn’t learn them during training; we set them manually when

we train the model, typically by running hundreds of experiments, and staring at graphs until

our eyeballs bleed. Okay. So that is a line. It’s like a very simple, very simple function,

probably not very useful, right? There are other functions that we might use for deep

learning. For example, we might use one of a set of sigmoidal functions. These simulate

the neuron. Down here, the neuron is firing; here, it is not. And they’re smooth because

that way, they’re differentiable at every point.
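Both of these activation shapes fit in a line of JavaScript apiece. The names here are mine, and this is just a sketch of the math being described – the sigmoid above, and the rectifier, relu, that the talk turns to next:

```javascript
// Logistic sigmoid: smooth, squashes any input into (0, 1),
// differentiable at every point.
const sigmoid = x => 1 / (1 + Math.exp(-x));

// Rectified linear unit (relu): just the max of x and zero.
const relu = x => Math.max(0, x);

console.log(sigmoid(0)); // 0.5: halfway between not firing and firing
console.log(relu(-3));   // 0: the neuron is not firing
console.log(relu(2.5));  // 2.5: flat below zero, the identity above it
```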

[Sound distortion]. It looks like a hard function, a complicated function, but it’s just the max of x and zero. That’s it. That function

is pretty easy to think about. We could imagine writing it in less than one

line of JavaScript. It turns out that that simplicity, and that ease of computability

makes it perfect for deep learning, where, again, we’re not doing very interesting or

complicated operations, we sure are doing a lot of them. We can imagine stacking up

these rectifiers, and here we are going to have four and four neurons, densely interconnected,

in two layers. Because they’re densely interconnected, we’re going to say that each of the neurons

in the second layer gets fed by all of the neurons in the first layer. So this one, for

example, its input is going to be the weighted sum of inputs from all of the neurons in the

previous layer, which, if you think about it, because of the shape of this function,

what we are really doing is nesting if statements. We’re nesting if statements with conditionals

whose values depend on the output of previous if statements, and whose thresholds are basically

entirely hard-coded. Next time you see researchers at Google have created a deep neural network

that does some impressive thing, just think researchers at Google have figured out how

to hard code 50 million random values in order to do some impressive thing, which is basically

what is going on. The impressive part is obviously that the training process figures out those

hard-coded values for us, but, at the end, the thing the model is doing is basically

a giant pile of spaghetti code which fortunately models our brains pretty well. Even for a

model like this, a relatively small one, if we think about the number of interconnections

between these two layers of neurons, we see that we’ve got 16 of them. For a line, we

had two parameters, and we were able to think about its loss landscape. This model has 16

parameters, and I don’t know about you, but I have a really hard time visualising 17-dimensional

surfaces. It gets worse. What we are seeing revealed here is a visualisation of the loss

landscape for Resnet, which is an image classifier. Resnet has about 60 million parameters. It

means this is a heavy approximation. These folks have done some interesting projection

in order to get it even to resemble something three-dimensional. It has been said of the

terrifying things that live at the base of the sea and will one day wake to consume the

world that they have length, width, depth, and several other things, and perhaps this

is what Lovecraft was talking about. The good news is that you don’t have to train those

models. You don’t even have to think about them or hold their loss landscapes in your

head, because you can npm install them! Of course, if you want to

train those models, I highly encourage it. We’re going to look at an example of transfer

learning where we take a pretrained model and then train it to do something else. It

lets us leverage all of the training time on the larger, in this case, image-recognition

model, and then use it for a different problem space. So we’re going to do transfer learning,

and what we’re going to do, this is an example, you can pull it up on GitHub. I’m going to

play Pac-Man using my elephant friend Tallula. The way this works is I pick a bunch of examples

using my webcam that represent the images for up, down, left, and right. I’m rotating

to the left. I’m trying to be in the frame, trying not to be in the frame, get a representative

sense, or give the network a representative sense of where and how I’m going to be holding

her, which, as you can see, I do not. We’re going to train it, and the loss gets pretty low. Then,

when I play, the network is going to highlight in yellow which direction it thinks I’m moving

in, and we can see that it works pretty well, at least until I start getting stressed about

Pac-Man and not holding Tallula in the same way I was during training. If you want to

ruin a friendship, using your friend as a controller is a pretty good way to do it!

Now I’m eaten. I’ve been eaten. And I’m happy to report that we are still friends! Thing

we do is, thing zero we do is, npm install everything, including TensorFlow. Thing one

we do is import TensorFlow. And then, we are going to load up the model. Our model,

that you can also npm install; this particular model is served off the web somewhere. And

because we are doing transfer learning, we’re going to do a little bit of surgery on the

model, so we’re going to pull out this layer, conv_pw_13_relu, whatever that means, and then

we are going to construct a new model that has the same inputs as MobileNet but outputs

that low-but-not-final layer. The actual final layer of MobileNet is going to be, like,

200 probabilities, namely, the probability that this photo contains a cat. The probability

that this photo contains a cow. The probability that this photo contains a laptop, and on

and on, throughout whatever classes of images MobileNet has been trained to recognise.
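In TensorFlow.js, that bit of surgery looks roughly like this. This is a sketch, not the talk's exact code: `MOBILENET_URL` stands in for wherever the hosted model.json lives, and the load function is the current API name (the 2018 example used an earlier one):

```javascript
import * as tf from '@tensorflow/tfjs';

// Load the pretrained MobileNet that is served off the web somewhere.
const mobilenet = await tf.loadLayersModel(MOBILENET_URL);

// Pull out that low-but-not-final layer...
const layer = mobilenet.getLayer('conv_pw_13_relu');

// ...and construct a new model with the same inputs as MobileNet,
// but which outputs that layer's activations instead of the
// final class probabilities.
const truncated = tf.model({inputs: mobilenet.inputs, outputs: layer.output});
```

Running images through `truncated` now yields that reshaped chunk of interesting data, which the small trainable head is fed with.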

We want something before that, something where the image has been kind of reshaped into some

arbitrary chunk of interesting data, but has not yet been winnowed down to what it contains.

We’ll look at that more in a second. When I’m adding examples, this is what is happening,

we’re adding to the controller data set, which is building up a data set of examples. Then we construct

our model. Our model is going to take the output of that layer of MobileNet, it’s going

to flatten it, and it’s going to run it through a configurable number – let’s call this 100

– densely interconnected relu neurons, and, then, at the end, we will have a softmax

layer. It is a different activation function which is useful for when you want a probability

distribution, so, in this case, we want the probability distribution. Num classes is going

to be four, because we have up, down, left, and right, and that’s all we are trying to

decide between. The output of our network is going to be the relative probabilities

that I am holding her up, left, right, or so on. So then we configure an optimiser.
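The softmax activation just mentioned, and the categorical cross-entropy loss the talk gets to in a moment, are both small enough to sketch in plain JavaScript. This is the underlying math, not the TensorFlow.js implementation, and the scores and labels are invented:

```javascript
// Softmax: turn four raw scores into a probability distribution
// over up, down, left, and right.
const softmax = scores => {
  const max = Math.max(...scores); // subtract the max for numerical stability
  const exps = scores.map(s => Math.exp(s - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
};

// Categorical cross-entropy: how much did the prediction confuse
// the classes, given a one-hot ground-truth label?
const crossEntropy = (predicted, truth) =>
  -truth.reduce((sum, t, i) => sum + t * Math.log(predicted[i]), 0);

const probs = softmax([2.0, 0.5, 0.1, -1.2]);
console.log(probs); // four probabilities summing to 1, largest first

const down = [0, 1, 0, 0]; // ground truth: I'm holding her down
console.log(crossEntropy(probs, down));                   // high: it mostly predicted "up"
console.log(crossEntropy([0.05, 0.9, 0.04, 0.01], down)); // low: confident and correct
```

A confident wrong answer is punished much harder than an unsure one, which is exactly the "how bad is that, really?" question the loss has to answer.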

We’re not using stochastic gradient descent, we are using Adam which is a stochastic gradient

technique which is better. It is a little bit smarter about how it decides the steps

it takes. We are going to compile this model with a loss function, and that loss function

is categorical cross-entropy. The reason being, if we have this example, an example

of me holding Tallula upside-down to indicate down, and the network predicts this – which

is technically predicting that I’m holding her right – how bad is that, really? Because,

this is, like, it’s pretty close – the prediction is wrong. If these were flipped, it would

still be kind of wrong that it thought that there was a ten per cent chance that I was

holding her correctly, you know? Answering that question is what categorical cross-entropy

does. It is how much did this model confuse different probability classes? And now you

know. And then finally, we call fit. This actually goes and starts dispatching stuff off

on the GPU, and we get these callbacks every time a batch is finished. So every time we

have computed our loss for a batch, we have updated the weights, and then we’ve

taken a step. All right. To play the game, we ask mobile net to do its prediction, we

run our model to give us one of four probability classes, and then we figure out which one

is the most likely, and we do it. And that’s Pac-Man. That is transfer learning with

TensorFlow.js. I would like us to go back and understand what it is we are getting out of

MobileNet, and, to do so, I’m going to load up MobileNet, load up that JSON file we loaded

up earlier in the browser. Here, we will see that this is a Keras model that lets us describe

a deep-learning system as a bunch of layers, and here, it is layers. Come on. Click. So

a deep-learning network that recognises images typically looks something like this. We’ve

got convolutional layers, and normalisation layers, and activation layers. The activation

layers we know what they look like – those are the relu layers we saw earlier. Normalisation

layers make sure our values are between zero and one-ish, and they do it across single batches,

which is why they’re called batch-norm layers. Convolutional layers have configuration parameters,

and, like many things in machine-learning, they sound hard but they’re not very hard.

Convolutions are basically Photoshop filters. If we have a whole bunch of input pixels,

a convolutional layer will grab some chunk of those pixels, run it through a filter,

and output it. It will walk the filter over the entire image producing an output image.
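That walk can be sketched in plain JavaScript. This is an illustration, not TensorFlow.js code: a single "valid" convolution, where the filter never slides off the edge, so the output comes out slightly smaller, and the filter is a classic vertical-edge detector rather than a learned one:

```javascript
// Convolve a grayscale image (2D array) with a filter.
// "Valid" convolution: the filter never slides off the edge,
// so an H x W input and a 3x3 filter yield an (H-2) x (W-2) output.
const convolve = (image, filter) => {
  const out = [];
  for (let y = 0; y + filter.length <= image.length; y++) {
    const row = [];
    for (let x = 0; x + filter[0].length <= image[0].length; x++) {
      let sum = 0;
      for (let fy = 0; fy < filter.length; fy++)
        for (let fx = 0; fx < filter[0].length; fx++)
          sum += image[y + fy][x + fx] * filter[fy][fx];
      row.push(sum);
    }
    out.push(row);
  }
  return out;
};

// A classic vertical-edge filter. In a convolutional layer,
// these nine numbers are exactly what training adjusts.
const edge = [
  [-1, 0, 1],
  [-1, 0, 1],
  [-1, 0, 1],
];

// A tiny 4x5 "image": dark on the left, bright on the right.
const image = [
  [0, 0, 9, 9, 9],
  [0, 0, 9, 9, 9],
  [0, 0, 9, 9, 9],
  [0, 0, 9, 9, 9],
];

console.log(convolve(image, edge));
// A 2x3 output, with big values where the dark-to-bright edge sits
```

The same filter is applied at every position, which is why a convolutional layer has so few parameters compared to a dense one.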

You will notice if we do this without allowing the filter to slide off the edge, then we

will get something slightly smaller. We can decide what it is we want. That’s one of the

many tuneable parameters. Convolutions come in all kinds of shapes and sizes. This one

is three by three. The key thing here is this filter which is the same across the entire

image, and it is trainable, which means that actually, let’s just see what it means. So

I’ve gone and gutted MobileNet. Put it here. This is what is happening in each of those

many, many convolutional layers that we saw before. Yes, it looks like a bunch of crappy

Photoshop filters, because it is a bunch of crappy Photoshop filters. The interesting

thing here is that it has started to do edge detection and other processing that mimics the visual

abstraction that happens in our visual cortex. This happens naturally when you

train a system that is able to create these isolated filters on an image classification

task that mimics the kinds of image classes that we ourselves recognise: it starts

to do the same kind of processing that we do, which is an interesting validation of

the model, I think. So I hope this makes deep learning a little bit less scary, the realisation

that it is just a big pile of operations, a bunch of spaghetti code that has been tweaked

by a very simple but big process. Our world is flush with information, and our interaction

with it is heavily mediated increasingly by machine-learning systems. The systems are

not perfect. They’ve been trained on whatever data we’ve given them, and, like us, they

internalise the biases of that data, and, just like us, they can be pressed into the

service of whoever wants to wield them. There is this proof that neural networks are universal

approximators, which means any function you give them, they can approximate to some level

of precision. If you believe our own cognition is a computable function, then we’re moving

into the world where the fundamental tasks of cognition are now a thing that we can train

a machine to do. So these are not real faces. These were dreamt up by a deep-learning network

whose loss function is another network. The two networks improve each other, learning

to dream up things that appear to be people. And this is not Barack Obama. This is machine-learning

Obama, synced to a recording of an actual Obama speech. There exist systems that can

generate speech that sounds like anyone as well. So, how do we cope with this? With the

world where we can’t trust our own eyes and ears? One way is to ignore it, and to say

that these technologies are not that good – yet. But, if cognition is a computable function,

then our societies and ourselves are games, and robots, it turns out, are very

good at playing games. In the history of computation, we see these tides. So, first, all important

work was done on big mainframes. And then processors improved, and work moved to personal

computers. Then networks improved, and we put everything in the cloud. And now we are

seeing the tide go out again as we begin to realise what we’ve given away, how much power

there is in knowing everything about everyone, and how much danger there is in us relying

on black boxes outside of our control to feed us with knowledge. So my message to you is

that these technologies don’t have to be opaque, and they don’t have to be centralised, and

we can hold the power of robot minds in our pockets. We can use them not just

to create forgeries, but to discern truth. This is just the beginning. Everything

we’ve seen here today, I think it’s quite impressive, and I think it is going to look

downright embarrassing in a few years when you can talk with your robot assistant,

and the pattern of your voice will never leave your wrist. Thank you. [Cheering and Applause].

Here are some folks to follow if you’re interested in this.
